Data Cleaning - Variance

February 18, 2022 · 7 minute read

Jahnavi C.

TLDR

In this Mage Academy lesson on data cleaning, we’ll go over variance in detail and see how to identify and remove low variance columns from a dataset.

Glossary

Why is it necessary
Variance
How to code

Why is it necessary?

Data distribution

We can remove columns from a dataset if they aren’t useful for predicting the output. Low variance columns are a common example: because their values barely change from row to row, they carry little information and contribute little to the prediction. It’s therefore recommended to remove low variance columns.

Variance

Variance measures the spread of data, i.e., it measures how far each data point is from the mean.

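Variance is calculated from the squared differences between each data point and the mean: the squared differences are summed and divided by the number of data points (or by one less than that number, for a sample).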
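As a sketch of the formula in symbols, assuming the sample form of variance (dividing by n − 1, which is the default behaviour of pandas’ .var()); the population form divides by n instead. The symbols x_i, x̄ and n are notation introduced here for illustration:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```

Here x_i is a data point in the column, x̄ is the column mean, and n is the number of data points.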

Zero variance indicates that all the values in the column are constant.

Low variance indicates that most of the values in the column are similar and are very close to the mean.

High variance indicates that values in the column are not similar and are spread far from the mean.

For numerical data, we usually calculate variance by using a formula. For categorical data columns we don’t use a formula; instead, we visualize the distribution of the categories with the help of Python’s visualization libraries such as seaborn and matplotlib.

Zero variance indicates that every value in the column belongs to the same category.

Low variance indicates that nearly all the values in the column belong to the same category.

High variance indicates that the values in the column are not similar and are spread across different categories.

Let’s take one column and see how we calculate variance for numerical data.

Step-1: Calculate mean

Step-2: Find the difference between each data point and mean

Step-3: Square the difference values

Step-4: Sum all the squared difference values

Step-5: Calculate variance
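The five steps above can be sketched in plain Python. The scores below are made-up values for illustration, not the lesson’s data, and Step-5 assumes the sample form of variance (dividing by n − 1), which matches the default of pandas’ .var():

```python
# Hypothetical scores for one numerical column (illustrative values only)
scores = [70, 80, 90, 60, 100]

# Step-1: calculate the mean
mean = sum(scores) / len(scores)

# Step-2: find the difference between each data point and the mean
differences = [x - mean for x in scores]

# Step-3: square the difference values
squared = [d ** 2 for d in differences]

# Step-4: sum all the squared difference values
total = sum(squared)

# Step-5: divide by n - 1 (sample variance); divide by n for population variance
variance = total / (len(scores) - 1)
print(variance)
```

Dividing by len(scores) instead would give the population variance, which is slightly smaller for the same data.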

Let’s calculate variance for all the columns in the dataset that contain numerical data.

Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.

Step-2: Calculate the variance of each column using the .var() method.

Step-3: Remove columns if variance is low.

The variance of the “history” and “physics” columns is low compared to the variance of the “english” and “math” columns, so we can remove these two columns from the dataset.
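A minimal sketch of Step-2 and Step-3, assuming a small hand-built DataFrame in place of the lesson’s CSV file. The column names follow the lesson, but the values and the threshold of 1.0 are hypothetical; in practice the cutoff depends on the scale of your data:

```python
import pandas as pd

# Hypothetical marks data standing in for the lesson's CSV file
df = pd.DataFrame({
    "english": [35, 90, 60, 75, 20],
    "math": [40, 95, 55, 80, 30],
    "history": [70, 71, 69, 70, 70],
    "physics": [50, 50, 51, 49, 50],
})

# Step-2: sample variance (ddof=1 by default) of each numerical column
variances = df.var(numeric_only=True)
print(variances)

# Step-3: collect columns whose variance falls below a hypothetical threshold
threshold = 1.0
low_variance_cols = variances[variances < threshold].index.tolist()

# drop() returns a new DataFrame without the low variance columns
df_reduced = df.drop(columns=low_variance_cols)
print(df_reduced.columns.tolist())
```

Because variance depends on the units of each column, comparing raw variances is most meaningful when the columns share a similar scale, as exam marks do here.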

Now let’s follow similar steps for the columns that contain categorical data.

Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.

Step-2: Plot the distribution of the categorical columns using Python’s seaborn library. We’ll use the countplot() function to visualize the distribution.

Step-3: Drop columns if variance is low.

The variance of the “school” and “pass” columns is low, so we can remove these columns from the dataset.
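As a numeric companion to the count plots, the sketch below measures how much of each categorical column is taken up by its most frequent category. This is a swapped-in check using pandas’ value_counts(), not the lesson’s countplot() step. The data, the extra “grade” column, and the 80% cutoff are all hypothetical:

```python
import pandas as pd

# Hypothetical categorical data standing in for the lesson's CSV file
df = pd.DataFrame({
    "school": ["A"] * 9 + ["B"],
    "pass": ["yes"] * 9 + ["no"],
    "grade": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
})

# Share of rows taken by the most frequent category in each column
# (value_counts sorts in descending order, so the first entry is the largest)
dominant_share = {
    col: df[col].value_counts(normalize=True).iloc[0]
    for col in df.columns
}
print(dominant_share)

# Hypothetical cutoff: flag a column when one category covers more than 80% of rows
low_variance_cols = [col for col, share in dominant_share.items() if share > 0.8]
print(low_variance_cols)
```

With seaborn installed, a call such as sns.countplot(data=df, x="school") would draw the same counts as bars, and a column dominated by a single tall bar corresponds to a high dominant share here.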

How to code

We’ve seen that the “history,” “physics,” “school” and “pass” columns have low variance. So, we’ll use the .drop() method to remove these columns.
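A minimal sketch of the .drop() call, assuming a hand-built DataFrame with the column names from the lesson. The values are hypothetical and stand in for the lesson’s dataset:

```python
import pandas as pd

# Hypothetical dataset containing the columns named in the lesson
df = pd.DataFrame({
    "english": [35, 90, 60],
    "math": [40, 95, 55],
    "history": [70, 71, 69],
    "physics": [50, 50, 51],
    "school": ["A", "A", "A"],
    "pass": ["yes", "yes", "no"],
})

# Remove the low variance columns; drop() returns a new DataFrame
# and leaves the original df unchanged
df_clean = df.drop(columns=["history", "physics", "school", "pass"])
print(df_clean.columns.tolist())
```

Assigning the result to a new name keeps the original DataFrame available in case a dropped column turns out to be needed later.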

When you're building models with Mage, it’s easy to remove columns.

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮
