Guide to Churn Prediction: Part 2 — Exploring missing values
Jahnavi C.
Growth
TLDR
We will explore the Telco Customer Churn dataset to understand different types of features and find missing values in the dataset.
Outline
Recap
Exploratory Data Analysis
Dependent and Independent features
Missing values and its mechanisms
Conclusion
Recap
In part 1 of the series,
Guide to Churn Prediction, we’ve gone over and implemented the first
3
steps (Define the objective, Data gathering, Data cleaning) required to build a machine learning model on a
Telco Customer Churn
dataset and saved the
cleaned datasetfor further analysis.
Step 4: Exploratory Data Analysis (EDA)
We will perform Exploratory Data Analysis (EDA) on the
cleaned
Telco Customer dataset to gain
insights
from the data and to
identify patterns
in the data.
I heard many people saying that, in order
to analyze and understand the data, one should think like a detective
🕵️♀️
,
it’s
because
the more you understand the data, the better the machine learning models will be. So, let’s start exploring and try to get as much information as possible from the data by asking some basic questions like:
To skim the dataset
1. What are Dependent and Independent features? 2. Is the dependent feature categorical or numerical? 3. Are there missing values/data in the features? If so, what might be the reason?
Now without further ado, let’s start our investigation to find answers to the above questions.
Import libraries and load dataset
Let’s start with importing the necessary libraries and loading the cleaned dataset. Check out the link to
part 1to see how we cleaned the dataset.
1
2
3
import pandas as pd
df = pd.read_csv('cleaned_dataset.csv')
df
Cleaned dataset
4.1. What are Dependent and Independent features?
To identify dependent and independent features we’ll take a look at few records of dataset and meta information.
1
df.head() # displays first five rows and all the columns in the dataset
First 5 records of cleaned dataset
Meta information
Observations:
Dependent feature
: This is a feature that represents the objective, and the objective of this project is to predict customer churn (i.e., Yes or No). Based on the dataset and meta information, “Churn Label” is the only feature that indicates customer churn with values of “Yes” and “No.” Therefore, we can consider “Churn Label” as a dependent feature.
Independent features
: These features help us predict the output (i.e., “Churn Label”). Therefore, all features except for “Churn Label” will be considered as independent features or inputs.
4.2. Is the dependent feature categorical or numerical?
If the dependent feature is of an
object
data type, then it’s
categorical
, otherwise it’s
numerical
.
Note
: Sometimes categorical features are in the form of numbers. In such cases, we first check whether the numerical data is in continuous or discrete form. If the data is in discrete form and has unique values (1,2,3,4,5 or 0 and 1, etc.) or categories, we call that feature categorical.
So, let’s check data type of the dependent feature, i.e., “Churn Label”.
1
df['Churn Label'].dtype
Datatype of “Churn Label”
Observations:
The “Churn Label” column is of an object data type. This means the dependent feature is
categorical
.
Missing values
4.3. Are there missing values/data in the features? If so, what might be the reason?
If the feature values are NaN or missing, then those values are known as missing values or missing data. The pattern of occurrence of these missing values is known as
missing data mechanism
. Missing values can be categorized into different types based on the missing data mechanisms.
There are 3 types of
missing data mechanisms
:
Missing completely at random
(MCAR) : This indicates that there is no specific reason for why the values are missing. For eg: Human errors (i.e., data is missed, as the operators accidentally skipped adding the values.)
Missing at random
(MAR) : In this case, data will not be recorded due to certain known reasons i.e., the data is intentionally not provided by the data provider. For eg: People with higher or lower salaries generally don’t disclose their details.
Missing not at random
(MNAR) : In this case, data will not be recorded because data doesn’t exist due to some hidden reasons i.e., the data is not provided by the data collector. For eg: a. People didn’t fill in the salary details because they are not working. b. Temperature values are not recorded as the sensors are worn out over time.
Here’s a dataset with
Designation
and
Salary
as features displayed in a tabular form to show different missing data mechanisms.
Missing data mechanism
Steps to check for missing values:
Calculate the total count of missing values in each column using Pandas.
Calculate the percentage of missing values for the columns.
Discover which missing data mechanisms match.
Total count of missing values
Let’s first check if there are missing values in the columns by using
isnull()
and
any()
methods. If there are missing values in a column, then display the total count of missing values in those columns by using the
sum()
method.
1
2
3
nan_col = df.columns[df.isnull().any()]
for i in nan_col:
print(i, df[i].isnull().sum())
Display the columns and count of missing values
Observations:
1. In the “Total Charges” column, there are 11 missing values. 2. In the “Churn Reason” column, there are 5174 missing values.
Percentage of missing values
Calculate the percentage of missing values in Total Charges and Churn Reason features.
1
2
print(f"There are {df['Total Charges'].isnull().sum()/len(df['Total Charges']) * 100}% missing values in Total Charges feature")
print(f"There are {df['Churn Reason'].isnull().sum()/len(df['Churn Reason']) * 100}% missing values in Churn Reason feature")
Percentage of missing values
Discover which missing data mechanisms match
Identifying the missing data mechanism helps us choose the right techniques required to replace the NaN values with some data.
Discover which missing data mechanism matches the missing data in the “Total Charges” column.
Let’s observe all the records/rows of the dataset where the “Total Charges” column has missing data.
1
df[df['Total Charges'].isnull()]
Records where the “Total Charges” column is NaN
Observations:
When “Tenure Months” is 0, the “Total Charges” has a NaN value. So let’s take a closer look to see if the “Total Charges” values are influenced by the “Tenure Months”.
Create a new dataset called
df_new
with only “Tenure Months” and “Total Charges” as features.
1
2
df_new = df[['Tenure Months','Total Charges']]
df_new
A new dataset with features (“Tenure Months” and “Total Charges”)
Now, let’s check the count of each unique value in the “Tenure Months” column by using the
value_counts()
method.
1
df_new['Tenure Months'].value_counts()
Unique values and their count in the “Tenure Months” column
Observations:
There are 11 records/rows with value 0 in the “Tenure Months” column.
Let’s display all the rows/records where the “Tenure Months” column is 0.
1
df_new[df_new['Tenure Months']==0]
“Tenure Months” records with value 0
Observations:
The 11 records/rows that have NaN values in the “Total Charges” column have a corresponding value of 0 months in the “Tenure Months” column.
Reason: “Total Charges” column values are missing because the data doesn’t exist. There is a possibility that customers began using their services by the end of the quarter, or that they’re in the middle of a free trial period. Therefore, we can conclude that the missing values in the “Total Charges” column are following the
MNAR
(Missing Not At Random)
mechanism
.
Discover which missing data mechanism matches the missing data in the “Churn Reason” column.
We know that there are 5174 missing values in the “Churn Reason” column. Now, let’s inspect the missing data mechanism in the “Churn Reason” column. For this, let’s check the values of the “Churn Label” column when “Churn Reason” is NaN.
The reason for only looking at the “Churn Label” and “Churn Reason” columns is that we can deduce from the meta info that “Churn Reason” entries will only be available if a customer leaves the company. For reference, see the image below. The “Churn Reason” has data about the reasons why a customer left the company, and the data about the customers who have left the company is available in the “Churn Label” feature.
Meta information
Create a new dataset with the features “Churn Label” and “Churn Reason” for further analysis.
1
2
df_new = df[['Churn Label','Churn Reason']]
df_new
Display all the records of “Churn Label” and “Churn Reason”
Check the count of unique values in the “Churn Label” column.
1
df_new['Churn Label'].value_counts()
The number of unique values in “Churn Label”
Display only the rows that have “Churn Label” equal to “No”.
1
df_new[df_new['Churn Label']=='No']
“Churn Label” with value as “No”
Observations:
From the count of unique values in “Churn Label”, there are 5174 records equal to “No”.
The number of records in “Churn Label” with “No” is the same as the number of records in “Churn Reason” with “NaN”.
From the observations, we can conclude that the presence of missing values in the “Churn Reason” column is because
the data didn’t exist
, and the data didn’t exist because people didn’t churn (i.e., people are still using their services). Therefore, we can say that these missing values are following the
MNAR
(Missing Not At Random)
mechanism
.
Conclusion
Identifying the dependent and independent features in a dataset makes it easier when we analyze the features in depth. It’s also important to find the missing data and their mechanisms, as missing data affects the performance of the machine learning model.
That’s it for this blog. Next in this series, we’ll continue to perform EDA to analyze the features and dive deep into the dataset by answering questions like:
What are the summary statistics of numerical features?
What are the summary statistics of categorical features?
Are the numerical features normally distributed or is there any skewness?
How many categories are there in each categorical feature?
Are there any outliers in the features? If so, can we really treat them as outliers?
Which independent features are correlated with the dependent feature?
Is there any correlation between the independent features?
Thanks for reading!!
Start building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.
No need for a credit card to get started. Trying out Mage to build ranking models won’t cost a cent.