Cross Validation with Code Examples

Xinqian Zhai
7 min read · Nov 4, 2022


Photo by Tai Bui on Unsplash

Summary:

  1. What is cross-validation?
  2. Why do we use cross-validation?
  3. What are the common types of cross-validation?
  4. How to apply cross-validation (with code)?
  5. Overfitting and underfitting in cross-validation
  6. What should we be aware of?

What is cross-validation?

Cross-validation is an evaluation technique used to assess the performance of a machine-learning model. It evaluates a single model using multiple train-test splits and returns one accuracy score per split. The process is essentially resampling: different subsets of the original dataset are used to train and evaluate the same model, and we average the resulting series of scores to judge whether the model we trained is a good predictor.
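
As a minimal sketch of that idea (the 5 splits, the iris dataset, and the logistic regression classifier here are just assumed examples), the loop below trains and scores the same kind of model on several different train-test splits and then averages the scores:

# a manual, simplified version of what cross-validation does
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
scores = []
# 5 different train-test splits of the same dataset
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])                  # train on this split's training folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # accuracy on the held-out fold
print(scores)           # one accuracy score per split
print(np.mean(scores))  # the average cross-validation accuracy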

Why do we use cross-validation?

First, cross-validation gives us a more stable and reliable model estimation. The estimated accuracy scores will vary depending on the data samples that end up in the training and test sets. Therefore, with cross-validation, instead of relying on a single specific training set to get the final accuracy score, we can obtain the average accuracy score of the model from a series of scores returned from multiple train-test splits.

Second, cross-validation can show model sensitivity. With a range of accuracy scores returned by cross-validation, we can do a worst-case or best-case scenario for model performance. Specifically, we can make a distribution plot of the accuracy scores to see how likely our model would perform poorly or very well on a new dataset.
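
For example, a minimal sketch of such a plot (the breast cancer dataset, the logistic regression classifier, and the 10 folds are just placeholders here):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X)
# 10-fold cross-validation returns 10 accuracy scores for the same model type
scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=10)
# distribution of the fold scores: how poorly or how well might the model do?
plt.hist(scores, bins=10)
plt.xlabel('Cross-validation accuracy')
plt.ylabel('Number of folds')
plt.show()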

What are the common types of cross-validation?

K-fold cross-validation.

Take K = 5 as an example. Randomly split the original dataset into 5 folds of equal size and repeat the training-and-testing process 5 times. Each time, one fold is used as the test set and the other four folds are used as the training set, giving one accuracy score per iteration. With all these scores, we can compute an average cross-validation accuracy score.

Image Source: SIADS542 by Kevyn Collins-Thompson

Since every data sample is used for both training and testing (in different iterations), K-fold cross-validation gives a less biased estimate of model performance.

However, if the original data samples are ordered or sorted by class label, we could end up training the model on an imbalanced training subset and evaluating it on an unrepresentative test set, so the resulting accuracy score would be misleading (a small demonstration follows the code example below).

Code example

# import libs
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# get cancer data
cancer = load_breast_cancer()
X_cancer = cancer['data']
y_cancer = cancer['target']
# normalize the data
scaler = MinMaxScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)
# apply classifier
clf = LogisticRegression()
# get cv scores
cv_scores = cross_val_score(clf, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
## final result ##
## Cross validation scores (5 folds): [0.95614035 0.96491228 0.97368421 0.95614035 0.96460177]
## The average cross validation score (5 folds): 0.9630957925787922
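
As mentioned above, plain K-fold can fail badly when the data is ordered by class. A small demonstration of the problem (the iris dataset is used here because its rows are sorted by class label): with 3 unshuffled folds, each test fold contains only a class the model never saw during training, so the scores collapse; shuffling before splitting fixes it.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold

X_iris, y_iris = load_iris(return_X_y=True)
# iris rows are sorted by class, so each unshuffled fold holds a single class
unshuffled_scores = cross_val_score(LogisticRegression(max_iter=1000), X_iris, y_iris, cv=KFold(n_splits=3))
print(unshuffled_scores)  # scores are (close to) 0 for every fold
# shuffling the samples before splitting fixes the problem
shuffled_scores = cross_val_score(LogisticRegression(max_iter=1000), X_iris, y_iris, cv=KFold(n_splits=3, shuffle=True, random_state=0))
print(shuffled_scores)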

Stratified K-fold cross-validation.

To address the potential imbalance problem, we can use stratified cross-validation. Basically, the data samples are rearranged to ensure that the proportion of classes in each subset is as close as possible to the actual proportion of classes in the entire dataset.

Image Source: SIADS542 by Kevyn Collins-Thompson

In this way, each subset is a good representation of the entire dataset.

Code example

# import libs
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
# get iris data
iris = load_iris()
X_iris = iris['data']
y_iris = iris['target']
# stratified 3-fold splits
skf = StratifiedKFold(n_splits=3)
skf.get_n_splits(X_iris, y_iris) # 3 iterations
# apply classifier
clf = LogisticRegression()
# get stratified cv scores
cv_skf_scores = cross_val_score(clf, X_iris, y_iris, cv = skf)
print('Cross validation scores (3 folds): {}'.format(cv_skf_scores))
print('The average cross validation score (3 folds): {}'.format(np.mean(cv_skf_scores)))
## final result ##
## Cross validation scores (3 folds): [0.98 0.96 0.98]
## The average cross validation score (3 folds): 0.9733333333333333

It should be noted that this is just for demonstration. In fact, we don’t need to create the stratified splitter ourselves: when the cv parameter of cross_val_score is an integer and the estimator is a classifier (y is binary or multiclass), StratifiedKFold is used automatically. That is, we can directly replace cv = skf with cv = 3 and get the same result, as the snippet below shows.
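
A minimal check of that behavior (reusing clf, X_iris, and y_iris from the example above):

# an integer cv with a classifier defaults to stratified splitting
cv_int_scores = cross_val_score(clf, X_iris, y_iris, cv = 3)
print(cv_int_scores)  # expected to match the stratified scores above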

Leave-one-out cross-validation.

Leave-one-out cross-validation is K-fold cross-validation where K equals the total number (n) of data samples in the dataset. As the name implies, only one data sample is left out as the test set, and all the remaining data samples are used as the training set. After iterating K = n times, we can compute the average cross-validation accuracy score.

Image Source: SIADS542 by Kevyn Collins-Thompson

If we leave out P samples (for a chosen P) instead of a single sample as the test set, this kind of cross-validation is called leave-P-out cross-validation.

Code example

# import libs
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import LeavePOut
# get iris data
iris = load_iris()
X_iris = iris['data']
y_iris = iris['target']
# leave one out splits
loo = LeaveOneOut()
loo.get_n_splits(X_iris) # 150 iterations
# leave P out splits
lpo = LeavePOut(2)
lpo.get_n_splits(X_iris) # 11175 iterations
# apply classifier
clf = LogisticRegression()
# get leave-one-out cv scores
cv_loo_scores = cross_val_score(clf, X_iris, y_iris, cv = loo)
print('Cross validation scores: {}'.format(cv_loo_scores))
print('The average cross validation score: {}'.format(np.mean(cv_loo_scores)))
## leave-one-out result ##
## Cross validation scores: [1. ... 1. 1.]
## The average cross validation score: 0.9666666666666667
# get leave-p-out cv scores (note: this takes a long time to run!)
cv_lpo_scores = cross_val_score(clf, X_iris, y_iris, cv = lpo)
print('Cross validation scores: {}'.format(cv_lpo_scores))
print('The average cross validation score: {}'.format(np.mean(cv_lpo_scores)))
## leave-p-out result ##
## Cross validation scores : [1. 1. 1. ... 1. 1. 1.]
## The average cross validation score: 0.9652796420581655

Note that both leave-one-out and leave-P-out are exhaustive cross-validation techniques. They are best used on small datasets; otherwise they are very expensive to run, as the quick count below shows.
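
To see why, here is a quick count of how many model fits each technique needs for a dataset of n = 150 samples (leave-one-out fits n models, leave-P-out fits "n choose P" models):

from math import comb
n = 150
print(n)           # leave-one-out: 150 model fits
print(comb(n, 2))  # leave-2-out: 11175 model fits
print(comb(n, 3))  # leave-3-out: 551300 model fits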

Plot validation curves to see overfitting and underfitting

Below we use validation_curve() to get the training and cross-validation scores of an SVM model on the breast cancer dataset we used earlier, so we can see for which gamma values the model underfits or overfits.

Image by the author

As we can see, the model underfits when gamma is lower than 10⁻⁷ and overfits when gamma is higher than 10⁻⁴. A good gamma value lies somewhere in between, where both the training and cross-validation scores are high (≥ 0.9).

Code example

Reference: scikit-learn.org

# import libs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC
# get cancer data
cancer = load_breast_cancer()
X_cancer = cancer['data']
y_cancer = cancer['target']
# set gamma parameter values
param_range = np.logspace(-10, -2, 6)
# get training and test scores
train_score, test_score = validation_curve(
    SVC(random_state=0), X_cancer, y_cancer,
    param_name='gamma', param_range=param_range, cv=5
)
# get means and stds of training and test scores
train_score_mean = np.mean(train_score, axis = 1)
test_score_mean = np.mean(test_score, axis = 1)
train_score_std = np.std(train_score, axis = 1)
test_score_std = np.std(test_score, axis = 1)
# make validation curve plot
plt.figure(figsize = (8,6))
plt.title("Validation Curve with SVM on Breast Cancer Dataset")
plt.xlabel(r"gamma $\gamma$")
plt.ylabel("Accuracy Score")
plt.ylim(0.0, 1.1)
plt.semilogx(
    param_range, train_score_mean, label="Training score", color="blue", lw=2
)
plt.fill_between(
    param_range,
    train_score_mean - train_score_std,
    train_score_mean + train_score_std,
    alpha=0.2,
    color="blue",
    lw=2,
)
plt.semilogx(
    param_range, test_score_mean, label="Cross-validation score", color="green", lw=2
)
plt.fill_between(
    param_range,
    test_score_mean - test_score_std,
    test_score_mean + test_score_std,
    alpha=0.2,
    color="green",
    lw=2,
)
plt.legend(loc="best")
plt.show()

What should we be aware of?

Cross-validation can be computationally expensive, especially when we have a large dataset or a large number of folds. The model has to be trained once per fold, so it takes roughly K times as long as a single train-test split (although scikit-learn can evaluate the folds in parallel via the n_jobs parameter of cross_val_score).
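
For instance, a small sketch (reusing clf, X_cancer_scaled, and y_cancer from the first code example) of evaluating the folds in parallel:

# n_jobs=-1 uses all available CPU cores to evaluate the folds in parallel
cv_scores = cross_val_score(clf, X_cancer_scaled, y_cancer, cv = 5, n_jobs = -1)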

Also, cross-validation on its own is used for model evaluation, not model tuning. To tune model parameters, we should combine it with a search strategy such as grid search, for example scikit-learn's GridSearchCV, which runs cross-validation for every candidate parameter setting (a sketch follows below).
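
A minimal sketch of tuning with grid search (the candidate gamma values below are just assumed examples):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
# each candidate gamma is evaluated with 5-fold cross-validation
grid = GridSearchCV(SVC(random_state=0), param_grid={'gamma': [1e-7, 1e-6, 1e-5, 1e-4]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)  # the gamma with the best mean cross-validation score
print(grid.best_score_)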

Thanks for reading! If you like this article, please give me a clap 👏 and follow me for more! ☀️🌸😺


Xinqian Zhai

Graduate student at the University of Michigan and a new learner on the road to data science.