How to Remove and Detect Duplicates in Spreadsheets using Python

Image for post
Image for post
Just for the thumbnail :)

Okay…so you might be wondering why the hell you need to remove duplicates in spreadsheets using Python when you can simply click on the Remove Duplicates button in Excel.

Image for post
Image for post
Remove Duplicates Easily in Excel

And you would be justified in asking this question. However, the remove duplicates button retains the FIRST unique value and removes all the rest.

What if you want to remove ALL duplicates? Or you want to remove all but the LAST duplicate? Or you just want to know which records are duplicates?

While these situations can all be hacked in Excel, they are much easier to handle in Python and here at Love Spreadsheets we are all about using the right tool for the job.

Pre-requisite

If you do not know how to run Python scripts or how to read and write files using Python and Pandas, then go through this tutorial first.

Intro to Reading and Writing Spreadsheets with Python

File

We will be using a sample file for this tutorial. You can download it here

We will be counting a record as duplicate if it has the same values in the Name, Address and Call Date columns. Using this criteria, the file has three sets of duplicates as seen below.

Image for post
Image for post
Apple called on 10/13/18, Natalie Portman and Gareth Bale are all duplicate records

Full Code

See the full code below to follow along

Create the File for Code

Open a text editor and create a file duplicates.py. Save this in the same folder as the Duplicates.xlsx file

Import Library

Line 1

Import the pandas library to read, remove duplicates and write the spreadsheets.

import pandas as pd

Read the File

Line 3

We are going to be reading the spreadsheet using pandas and storing the result in a data frame file_df

file_df = pd.read_excel("Duplicates.xlsx")

Keep only FIRST record of duplicate set

Lines 6–7

The first case we will go through is the same as Excel, where we want to remove all duplicates in a set except for the first. The code for this is one line with two important parameters.

file_df_first_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="first")

We call the drop_duplicates method on the file_df that we read in from the spreadsheet. In the drop_duplicates method there are two important parameters.

  1. subset — list of column names that we consider a record to be duplicate. In this case, we have specified duplicates are records that share the same Name, Address and Call Date.
  2. keep — this is the record that we should keep for the set of duplicates, and is the parameter we will be changing in the different scenarios.

These are both OPTIONAL. If you don’t specify them, then Pandas treats a record as duplicate on ALL columns and keeps only the FIRST record.

You can read more about this method here.

Now we will write this dataframe without duplicates and only the first row to a new file

file_df_first_record.to_excel("Duplicates_First_Record.xlsx", index=False)

This is what that file looks like

Image for post
Image for post
As you can see the first record of each set was kept

Keep only LAST record of duplicate set

Lines 10–11

This is exactly identical to the step above except we change the keep parameter to last

file_df_last_record = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep="last")

Then we write this dataframe with only the last row kept from a set of duplicates to a new file

file_df_last_record.to_excel("Duplicates_Last_Record.xlsx", index=False)

This is what the file looks like

Image for post
Image for post
Now the last records were kept for each set

Remove ALL records from a set of duplicates

Lines 14–15

Now let’s say we want to remove all records from a set of duplicates. In this case we change the keep parameter to False

file_df_remove_all = file_df.drop_duplicates(subset=["Name", "Address", "Call Date"], keep=False)

Now we write this dataframe with all duplicate sets removed to a file

file_df_remove_all.to_excel("Duplicates_All_Removed.xlsx", index=False)

This is what that file looks like

Image for post
Image for post
Now there are no records from any set of duplicates

Get Which Records Are Duplicates

Lines 18–20

Okay, so let’s say you want to get a list of all records that belong to a duplicate set. This is not as straightforward of a case but still very doable.

First, we get the a list of which rows contain duplicates or not.

duplicate_row_index = file_df.duplicated(subset=["Name", "Address", "Call Date"], keep=False)

Note: we are using a different method called duplicated. This gives us a list containing True OR False value corresponding to each row in the dataframe/spreadsheet.

In this method, the keep parameter specifies which rows we want to identified as duplicates. Since we pass FALSE, it identifies all rows that belong to a duplicate set. If we had specified first, it would have identified all rows EXCEPT the first row. You can read more about this method here.

Next, we get the actual records from the dataframe. The command below gives us all the rows that were identified as duplicates.

all_duplicate_rows = file_df[duplicate_row_index]

Finally, we write this to a spreadsheet. Here we use index=True because we want to get the row numbers as well.

all_duplicate_rows.to_excel("Duplicate_Rows.xlsx", index=True)

This is what that file looks like

Image for post
Image for post
All records that were part of a duplicate set

NOTE: Pandas index is 0-based AND considers column headers as separate. So in this file, since we had a header, Row 1 next to apple actually refers to Row 3 in the original file.

Next Up

Now that we have seen how to remove identical duplicates from spreadsheets using Python, the next step is how to tackle fuzzy duplicates.

Products and resources that magically simplify hard spreadsheet tasks. Check us out at www.lovespreadsheets.com

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store