TLDR
In this Mage Academy lesson on model training, we’ll learn how to split our data for training and testing our machine learning models with k-fold cross-validation.
Glossary
Definition
Conceptual example
How to code
Magical no-code solution ✨🔮
Definition
K-fold cross-validation is a data partitioning technique which splits an entire dataset into k groups. Then, we train and test k different models using different combinations of the groups of data we just partitioned, and use the results from these k models to check the model’s overall performance and generality.
In the context of machine learning, a fold is a set of rows in a dataset. We use k to describe the number of groups we decide to partition the data into, so in an example of 20 rows, we can split them into 2 folds with 10 rows each, 4 folds with 5 rows each, or 10 folds with 2 rows each.
The entire dataset is randomly split into k equally-sized, independent folds, without reusing any of the rows in another fold.
We use k-1 folds for model training, and once that model is complete, we test it using the remaining 1 fold to obtain a score of the model’s performance.
We repeat this process k times, so we have k models and a score for each.
Lastly, we take the mean of the k scores to evaluate the model’s overall performance.
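As a minimal sketch of this whole procedure, scikit-learn’s cross_val_score helper trains and scores one model per fold and hands back all k scores; the feature matrix X, labels y, and the choice of classifier here are placeholders, not part of this lesson’s dataset:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# trains k=3 models, each tested on the one fold it never saw during training
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=3)
print(scores)         # one score per fold/model
print(scores.mean())  # mean of the k scores = overall performance estimate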
Conceptual example
To improve your understanding twice-fold 😏, consider this analogy about k-fold cross-validation with Twice, a K-pop girl group. Say we are trying to see how well a model can dance by inviting different subsets of Twice girls (called folds) as training and test samples.
Source: Twice Official Twitter
If the entire dataset has 9 girls, which are our data points, then we need to manually choose how many folds to split our data into. I’m going with 3 for our example, but there are strategies to pick the best k.
Since we need an equal amount of data in each fold, we randomly pick 3 girls from Twice for each of the three folds, with no overlaps:
With these 3 folds, we will train and evaluate 3 models (because we picked k=3), training each on 2 folds (k-1 folds) and using the remaining fold as a test set. We pick a different combination of folds for each of the 3 models we’re evaluating, as sketched in the code after this list:
Model 1: Trained on Fold 1 + Fold 2, Tested on Fold 3
Model 2: Trained on Fold 2 + Fold 3, Tested on Fold 1
Model 3: Trained on Fold 1 + Fold 3, Tested on Fold 2
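Here’s a quick sketch of those combinations in code, assuming we simply number the 9 members as data points 1 through 9:

import numpy as np
from sklearn.model_selection import KFold

members = np.arange(1, 10)  # our 9 data points
# 3 splits: each uses 2 folds (6 members) to train and 1 fold (3 members) to test
for train, test in KFold(n_splits=3).split(members):
    print('train:', members[train], 'test:', members[test])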
The performance scores would get skewed if the same Twice girls who taught you how to dance were also your judges. So whichever six girls (data points) the model learns from, the remaining three girls would judge and score it.
Now that we have 3 models and their scores, we can choose a model evaluation method (discussed in another lesson) to determine, generally, whether this model dances well. This also ensures that the opinions of all 9 judges/test samples are included in one metric.
The resulting evaluation metric would tell us whether we did a good job at dancing. So did we do a good job?
Nayeon is rooting for you!
How to code
Let’s try to evaluate how well a model learns to predict whether customers of a tourism company flake on their plans or not, using Tejashvi’s dataset. Maybe this model could tell us whether we’d follow through with our dreams of vacationing overseas this year, too?
import pandas as pd

# load the customer travel dataset and display it
df = pd.read_csv("Customertravel.csv")
df
Since scikit-learn takes numpy arrays, we’d first have to use Pandas to convert our data frame into a numpy array.
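For example, pandas’ built-in converter handles this in one line (np_array is the name the snippet below will use):

np_array = df.to_numpy()  # convert the dataframe into a numpy array for scikit-learn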
Then, we can use the “KFold” class to configure our evaluation. Our next step is to choose the number of folds to split our rows of data into. Above, we can see that our dataset has 954 rows, which divides nicely into 9 folds with 106 rows of data each.
This means we’d build and evaluate 9 models total, using 8 folds for training and the remaining 1 for scoring each time.
from sklearn.model_selection import KFold

# shuffle the data before splitting it into folds; random_state makes the shuffle reproducible
kfold = KFold(n_splits=9, shuffle=True, random_state=1)

model = 1
# display the indices of the rows used for training/testing in each split
for train, test in kfold.split(np_array):
    print('Model #%d:' % model)
    print('train: %s, test: %s' % (train, test))
    model += 1
Now that we’re done splitting our data into 9 folds, we’re ready to continue on to the next lesson: evaluating the model!
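If you’d like a rough preview of how those train/test indices get used, here’s a minimal sketch; the classifier choice and the numeric feature/label arrays X and y are assumptions (turning this dataset’s categorical columns into numbers, and scoring properly, are topics for the next lessons):

from sklearn.linear_model import LogisticRegression

scores = []
for train, test in kfold.split(np_array):
    # hypothetical: X and y are numeric feature/label arrays derived from df
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train], y[train])                 # train on 8 folds
    scores.append(clf.score(X[test], y[test]))  # score on the remaining fold
print(sum(scores) / len(scores))  # mean of the 9 scores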
Magical no-code solution ✨🔮
To skip all those configuration steps for k-fold cross-validation, Mage provides an easy, no-code experience for training and testing a dataset. Although we, as users, aren’t able to customize how the data is split, Mage uses an algorithm to decide. For this dataset, Mage decided on approximately a 9:1 training-to-testing split.
You can find further details about the training/test split under “Review > Statistics” on our Mage web application.
We just launched our new open source tool for building and running data pipelines that transform your data!
Start building for free
No need for a credit card to get started.
Trying out Mage to build ranking models won’t cost a cent.