TLDR
In this Mage Academy lesson on data cleaning, we’ll learn how to remove rows that repeat a value in a given column using Pandas.
Outline
When’s it necessary?
How to code
Magical no-code solution 🪄
When’s it necessary?
Duplicate data can skew prediction results, because repeated rows are counted more than once and carry extra weight.
Thus, for columns that should contain unique values, it’s important to find and remove any duplicate rows so that predictions generalize better and are more accurate.
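Before removing anything, it can help to check whether a column actually has repeats. Here is a minimal sketch, assuming a small made-up sample (the values and the “HP” column are for illustration only); it uses Pandas’ duplicated and value_counts to flag and count repeated names.

```python
import pandas as pd

# Hypothetical sample: "Name" should be unique, but Charizard appears 3 times.
sample = pd.DataFrame({
    "Name": ["Charmander", "Charizard", "Charizard", "Charizard", "Squirtle"],
    "HP": [39, 78, 78, 78, 44],
})

# True for every row whose "Name" already appeared earlier in the frame.
is_repeat = sample.duplicated(subset=["Name"])

# Number of rows that would go away if only first occurrences were kept.
n_repeats = int(is_repeat.sum())

# How many times each name occurs; counts above 1 indicate duplicates.
name_counts = sample["Name"].value_counts()

print(n_repeats)
print(name_counts[name_counts > 1])
```

If the repeat count is zero, the column is already unique and there’s nothing to clean.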
How to code
Take Kaggle’s Pokemon dataset, for example: the extra rows containing “Mega” versions of Pokemon aren’t needed to analyze the entire Pokemon index, since Megas are simply beefier copies of the same Pokemon.
Thiagoazen’s Pokemon dataset, ft. 3 Charizards
From scratch
While a built-in function (see next section) gets the job done, we will also present an algorithm that filters unique values of a column using a dictionary, just in case it shows up in an assignment or exam. 😉
By looking at the first ten rows of data, we can see several duplicates in the “Name” column that we need to remove (like Venusaur).
import pandas as pd
data = pd.read_csv("PokemonDb.csv")
data
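uniqueNames = {}
rowsToDrop = []
# Keep only the first occurrence of each name
for i, row in data.iterrows():
    if row["Name"] in uniqueNames:
        # Name already seen: mark this row for removal
        rowsToDrop.append(i)
    else:
        uniqueNames[row["Name"]] = True
# Drop all repeats at once instead of mutating data mid-iteration
data = data.drop(rowsToDrop)
data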
The complete code if you’d like to try it yourself:
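Here is a self-contained sketch of the from-scratch approach. It assumes a small inline sample in place of PokemonDb.csv; the helper name keep_first_by_column and the “Form” column are made up for illustration, and index labels are assumed to be unique.

```python
import pandas as pd


def keep_first_by_column(df, column):
    """Return a copy of df that keeps only the first row for each value in column."""
    seen = {}          # values already encountered
    rows_to_drop = []  # index labels of repeated rows
    for i, value in df[column].items():
        if value in seen:
            rows_to_drop.append(i)
        else:
            seen[value] = True
    # drop returns a new DataFrame; the input is left untouched
    return df.drop(rows_to_drop)


# Hypothetical stand-in for the Pokemon data
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Venusaur",
             "Charizard", "Charizard", "Charizard"],
    "Form": ["Base", "Base", "Base", "Mega", "Base", "Mega X", "Mega Y"],
})

deduped = keep_first_by_column(sample, "Name")
print(deduped)
```

Collecting the labels first and dropping them in one call avoids changing the DataFrame while looping over it.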
Built-in Pandas Function
The promised built-in function, drop_duplicates, deletes rows based on duplicates in the column name(s) that you specify in the subset parameter.
Image generated using carbon.now.sh
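As a rough sketch of how this looks in practice, assuming a made-up sample with a hypothetical “Form” column: by default drop_duplicates keeps the first occurrence and returns a new DataFrame, while the keep parameter switches to keeping the last occurrence or dropping every repeated value.

```python
import pandas as pd

# Hypothetical sample with repeated names
sample = pd.DataFrame({
    "Name": ["Venusaur", "Venusaur", "Charizard", "Charizard", "Charizard", "Squirtle"],
    "Form": ["Base", "Mega", "Base", "Mega X", "Mega Y", "Base"],
})

# Default (keep="first"): keep the first row seen for each name
first_only = sample.drop_duplicates(subset=["Name"])

# keep="last": keep the last row seen for each name
last_only = sample.drop_duplicates(subset=["Name"], keep="last")

# keep=False: drop every row whose name appears more than once
no_repeated = sample.drop_duplicates(subset=["Name"], keep=False)

print(first_only)
print(last_only)
print(no_repeated)
```

For the Pokemon case, the default is likely what you want, provided the base form is listed before its Mega versions.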
Magical no-code solution 🪄
Last, but definitely not least, Mage has a row transformation action that removes duplicates from your dataset! Try this if you’d like to leverage AI without learning the ins and outs of Pandas.
Start building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.