TLDR
In this Mage Academy lesson on data cleaning, we’ll go over variance in detail and see how to identify and remove low variance columns from a dataset.
Glossary
Why is it necessary
Variance
How to code
Why is it necessary?
Data distribution
We can remove a column from a dataset if it isn’t useful for predicting the output. Low variance columns fall into this category: their values barely change from row to row, so they carry little information and contribute little to the prediction. It's therefore recommended to remove them.
Variance
Variance measures the spread of data, i.e., it measures how far each data point is from the mean.
Variance is calculated as the average of the squared differences between each data point and the mean.
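As a sketch of the standard definitions, and assuming the notation $x_i$ for the data points, $N$ for their count, and $\mu$ (or $\bar{x}$) for the mean, the population and sample variance can be written as follows. The sample form divides by $N - 1$ instead of $N$.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2
\qquad
s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2
```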
Zero variance indicates that all the values in the column are constant.
Low variance indicates that most of the values in the column are similar and are very close to the mean.
High variance indicates that values in the column are not similar and are spread far from the mean.
Numerical data
Usually we calculate variance using a formula. For categorical data columns, however, we don’t use a formula; instead, we visualize the distribution of the categories with Python visualization libraries such as seaborn and matplotlib.
Zero variance indicates that every value in the column belongs to the same category.
Low variance indicates that most of the values in the column belong to a single category.
High variance indicates that the values in the column are spread across several categories.
Step-1: Calculate mean
Step-2: Find the difference between each data point and mean
Step-3: Square the difference values
Step-4: Sum all the squared difference values
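Step-5: Calculate variance by dividing the sum by the number of data points (or by one less than that for the sample variance, which is what pandas’ .var() computes by default)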
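The five steps above can be sketched in plain Python. The scores below are made-up values used only for illustration, and both the population form (divide by n) and the sample form (divide by n - 1) are shown.

```python
# Hypothetical exam scores; these values are invented for illustration.
scores = [70, 80, 90, 100]

# Step-1: calculate the mean.
mean = sum(scores) / len(scores)

# Step-2: find the difference between each data point and the mean.
differences = [x - mean for x in scores]

# Step-3: square the difference values.
squared = [d ** 2 for d in differences]

# Step-4: sum all the squared difference values.
total = sum(squared)

# Step-5: divide the sum to get the variance.
population_variance = total / len(scores)    # divide by n
sample_variance = total / (len(scores) - 1)  # divide by n - 1

print(mean, total, population_variance, sample_variance)
```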
Let’s calculate the variance of every column in the dataset that contains numerical data.
Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.
Step-2: Calculate the variance of each column using the .var() method.
Step-3: Remove columns if variance is low.
The variance of the “history” and “physics” columns is low compared to that of the “english” and “math” columns, so we can remove them from the dataset.
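The three steps above might look like the following minimal sketch. The column names come from the lesson, but the scores, the in-memory CSV text, and the `threshold` value are assumptions made for illustration; with a real file you would pass its path to `read_csv()` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the lesson's CSV file; the scores are invented.
csv_text = """english,math,history,physics
55,40,70,80
90,95,71,81
65,60,69,79
80,85,70,80
"""

# Step-1: load the dataset (use a file path instead of StringIO for a real file).
df = pd.read_csv(io.StringIO(csv_text))

# Step-2: variance of each numerical column (pandas uses the sample form, ddof=1).
variances = df.var(numeric_only=True)
print(variances)

# Step-3: remove columns whose variance falls below an assumed cutoff.
threshold = 5.0  # assumption: pick a cutoff that suits your own data
low_variance_columns = variances[variances < threshold].index.tolist()
df_reduced = df.drop(columns=low_variance_columns)
print(df_reduced.columns.tolist())
```

Variance depends on the scale of a column, so comparing raw variances like this is only meaningful when the columns share a similar scale, as exam scores do here.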
Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.
Step-2: Plot the distribution of categorical columns using Python’s seaborn library. We’ll use the countplot() function to visualize the distribution.
Step-3: Drop columns if variance is low.
The “school” and “pass” columns have low variance, so we can remove them from the dataset.
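A rough sketch of the categorical steps is shown here. The “school” and “pass” column names come from the lesson; the “section” column, the row values, the helper `plot_category_counts`, and the `dominance_cutoff` are hypothetical. Alongside the plot, the sketch counts how much of each column is taken up by its most common category, which is a numeric way of reading the same chart.

```python
import io

import pandas as pd

# Hypothetical stand-in for the lesson's CSV file; the rows are invented.
csv_text = """school,pass,section
A,yes,X
A,yes,Y
A,yes,X
A,yes,Y
A,no,X
B,yes,Y
A,yes,X
A,yes,Y
"""

# Step-1: load the dataset.
df = pd.read_csv(io.StringIO(csv_text))


def plot_category_counts(frame, column):
    """Step-2: draw one bar per category with seaborn's countplot.

    The plotting imports are local so the rest of the sketch runs
    even when seaborn and matplotlib are not installed.
    """
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.countplot(data=frame, x=column)
    ax.set_title(f"Distribution of '{column}'")
    plt.show()


# Share of rows taken by the most common category in each column.
top_share = {
    col: df[col].value_counts(normalize=True).iloc[0] for col in df.columns
}
print(top_share)

# Step-3: treat a column as low variance when one category dominates it.
dominance_cutoff = 0.8  # assumption: tune this for your own data
low_variance_columns = [
    col for col, share in top_share.items() if share >= dominance_cutoff
]
print(low_variance_columns)
```

Calling `plot_category_counts(df, "school")` would draw the chart for a single column; a chart with one very tall bar corresponds to a high share in `top_share`.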
How to code
We’ve seen that the “history,” “physics,” “school,” and “pass” columns have low variance, so we’ll use the .drop() method to remove them.
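A minimal sketch of that removal, assuming a small DataFrame whose values are invented for illustration and that carries the column names from the lesson:

```python
import pandas as pd

# Hypothetical dataset; only the column names come from the lesson.
df = pd.DataFrame(
    {
        "english": [55, 90, 65, 80],
        "math": [40, 95, 60, 85],
        "history": [70, 71, 69, 70],
        "physics": [80, 81, 79, 80],
        "school": ["A", "A", "A", "B"],
        "pass": ["yes", "yes", "yes", "no"],
    }
)

# Columns identified as low variance in the earlier steps.
columns_to_drop = ["history", "physics", "school", "pass"]

# drop() returns a new DataFrame and leaves the original untouched.
df_clean = df.drop(columns=columns_to_drop)
print(df_clean.columns.tolist())
```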
When you're building models with Mage, it’s easy to remove columns.