Company

Community

Blog

Careers

Blog

Growth

Data Cleaning - Remove columns

February 17, 2022 · 9 minute read

Jahnavi C.

Growth

TLDR

In this Mage Academy lesson on data cleaning, we’ll learn how to remove columns and discuss various scenarios in which it's necessary to remove columns from a dataset.

Glossary

Why is it necessary
Scenarios
How to code

Why is it necessary?

“Garbage in, garbage out”

When you’re building machine learning models, the quality of data is always a concern. It’s because when bad data is fed into machine learning algorithms, the predictions are bad. Therefore, most of the time is spent on cleaning the data. There are several ways to clean the data, and

removing columns

is a part of the data cleaning process.

Scenarios

Let's look at some of the most common scenarios when removing columns is essential.

Duplicated columns
Columns with high percentage of missing values
Columns with 1 unique value
Categorical
columns with unique values

If the same column appears twice or more, we can keep only one of the columns and remove the others as the data is the same in both columns.

For example, let’s take a look at the

Telco Customer Churn

dataset.

Step-1: Load the dataset using Python’s pandas library. We use the

read_csv()

function to read files that have the

.csv

extension.

Step-2: Look if there are any duplicate columns (i.e., the columns with the same name or if the data is the same in any 2 columns).

Let’s take a closer look at the columns “Lat Long,” “Latitude,” & “Longitude”.

The data in the “Lat Long” column is the same as data in the “Latitude” and “Longitude” columns.

Step-3: If there are duplicate columns, we can remove such columns from the dataset.

As the “Lat Long” column is a duplicated column we can remove it from the dataset.

The columns with a high percentage (i.e., >50) of missing values will be of no use while predicting the target or output, as they don’t contain any information about the data. So, we can remove such columns from the dataset.

Step-1: Load the

dataset

using Python’s pandas library. We use the

read_csv()

function to read files that have the

.csv

extension.

Step-2: Calculate the total count of missing values in each column. We’ll use Python’s pandas library to perform this operation. We’ll use

isnull() and sum()

methods to get the total count of missing values.

Step-3: Calculate the percentage of null values in each column.

Let's calculate the percentage of null values in the "weight" column, as that's the only column that has null values.

Step-4: If the percentage of null values is greater than 50, remove such columns from the dataset.

As the “weight” column has 90% missing values, we can remove the “weight” column from the dataset.

If the count of unique values is 1, this indicates that there isn’t much variation or all the values in the column are constant. Such columns don’t help in predicting the output, so we remove them from the dataset.

Step-1: Load the dataset using Python’s pandas library. We use the

read_csv()

function to read files that have the

.csv

extension.

Step-2: List all the column labels in the dataset using the

columns

method.

Step-3: Calculate the count of unique values in all columns using

nunique()

function.

Step-3: Remove the columns if the count of unique values is 1.

We can remove “Count,” “Country,” and “State” columns from the dataset as these columns each have only 1 unique value.

If all the values in the

categorical

columns are unique, i.e. If the count of unique values is the same as the number of records/data points, remove such columns from the dataset as they don’t help to predict the output.

Note: Don’t remove the column if it’s an ID column.

Step-1: Load the

dataset

using Python’s pandas library. We use the

read_csv()

function to read files that have the

.csv

extension.

Step-2: Check the number of records/data points in a dataset by using

shape

function

Step-3: List all the column labels in the dataset using the

columns

method.

Step-5: Filter categorical columns from the dataset using

dtypes

function. Categorical columns are of object data type.

Step-4: Calculate the count of unique values in categorical columns using the nunique

()

function.

Step-5: Remove columns that have the same number of unique values as the number of records.

Count of unique values in the “name” column is the same as the number of records in the dataset. So we can remove the “name” column from the dataset.

How to code

Step-1: Load

dataset

using

.read_csv()

function.

Step-2: Let’s remove “name” and “weight” columns using the

drop()

method.

Dataset after removing columns

If you're building models with Mage, removing columns is a breeze! The good news is that Mage also calculates the percentage of missing values by default and suggests removing a column that has a high percentage of missing values.

GIF

Want to learn more about machine learning (ML)? Visit

Mage Academy

! ✨🔮

Lesson

Start building for free

No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.

No need for a credit card to get started. Trying out Mage to build ranking models won’t cost a cent.

Get started for free

Mage AI

Harness the power of AI.

company

about us

blog

careers

contact

stay connected

Twitter

Facebook

Instagram

Dev.to

community

Join Discord chat

Mage AI

Harness the power of AI.

company

about us

blog

careers

contact

stay connected

Twitter

Facebook

Instagram

Dev.to

community

Join Discord chat

2025 Mage Technologies, Inc.

Security

SOC 2

2025 Mage Technologies, Inc.

Security

SOC 2