Breast Cancer Classification Using KNN Classifier

Xinqian Zhai
5 min read · Jan 4, 2022

Photo by Safar Safarov on Unsplash

This is my first tutorial on supervised machine learning classification. I will use the Breast Cancer Wisconsin (Diagnostic) dataset to classify whether a breast mass is malignant or benign, which could help with diagnosis. In this article, I will use KNN (K-Nearest Neighbors) as the classifier, train and evaluate it, and run an overfitting test at the end.

My next article will use more advanced classifiers, SVM (Support Vector Machine) and Logistic Regression, trained on the same Wisconsin dataset. Stay tuned for more!

0. Import useful libraries

First, let’s import all the useful libraries and set the random state value.

# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')
# Import the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (only works in a notebook)
%matplotlib inline
# Fixed seed so the train-test split is reproducible
RANDOM_STATE = 42

1. Data Preparation

Now, we need to load the Wisconsin dataset from scikit-learn and transform the raw data from a Bunch object into a data frame for easier data manipulation.

from sklearn.datasets import load_breast_cancer
# Load the dataset from scikit-learn.
cancer = load_breast_cancer()
# cancer.keys() # to see all the attributes
cancer_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
cancer_df['target'] = cancer.target
cancer_df.head()
First five rows of cancer_df data frame. Image by the Author

After loading the data, we use .info(), .shape, and .describe() to check whether anything needs to be cleaned up. Fortunately, the data from scikit-learn is quite clean, with no missing values and no wrong data types. It is ready for the next step.
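As a rough sketch of those checks (the variable names n_rows, n_cols and n_missing are mine, not part of the original notebook), the inspection could look like this. Note that .shape is an attribute, so it takes no parentheses.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Rebuild the data frame so this snippet runs on its own
cancer = load_breast_cancer()
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df["target"] = cancer.target

# .shape is an attribute: (number of rows, number of columns)
n_rows, n_cols = cancer_df.shape
# Total number of missing values across the whole frame
n_missing = int(cancer_df.isna().sum().sum())

print(n_rows, n_cols)
print(n_missing)
cancer_df.info()                       # column dtypes and non-null counts
print(cancer_df.describe().T.head())   # summary statistics, one row per feature
```

The frame has 30 feature columns plus the target column, and the missing-value count should come back as zero.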

Now, let’s briefly explore the class distribution to see whether the target classes are imbalanced. If the classes are heavily imbalanced, a model trained directly on the data may not learn enough from the minority class, leading to poor performance on that class.

Class distribution. Image by the Author

Not bad. In scikit-learn's encoding of this dataset, class 0 is malignant and class 1 is benign. There are more class 1 (benign) samples than class 0 (malignant) samples, 357 versus 212, but this is not considered heavily imbalanced, so the data can be used for analysis as is.
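The code behind the class distribution plot is not shown in the article, so here is a minimal sketch of how the counts could be computed. The names label_names, class_counts and class_share are my own. It reads the label meaning from cancer.target_names instead of hard-coding it.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# target_names is indexed by label: position 0 is 'malignant', position 1 is 'benign'
label_names = {i: str(name) for i, name in enumerate(cancer.target_names)}

# Count the samples per class, using readable names instead of 0/1
class_counts = pd.Series(cancer.target).map(label_names).value_counts()
class_share = (class_counts / class_counts.sum()).round(3)

print(class_counts)
print(class_share)
```

Roughly 63% of the samples are benign and 37% malignant, which is a mild imbalance.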

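As the last step of data preparation, we do the train-test split on the cleaned data to get it ready for model training. Since test_size is not given, train_test_split holds out 25% of the rows as the test set by default.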

from sklearn.model_selection import train_test_split

X = cancer_df.drop('target', axis=1)
y = cancer_df.target

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = RANDOM_STATE)
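One optional variation, which the split above does not use, is a stratified split. Passing stratify=y keeps the class proportions nearly identical in the training and test sets. The sketch below is only an illustration under that assumption. It will produce a different split, and therefore different scores, from the ones reported in this article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,    # same as the default when test_size is omitted
    stratify=y,        # assumption: not used in the article's split
    random_state=42,
)

print(X_train.shape, X_test.shape)
# Share of class 1 (benign) in the full data, the training set and the test set
print(y.mean(), y_train.mean(), y_test.mean())
```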

2. Apply KNN Classifier

First, let’s build a KNN classifier with an arbitrary number of neighbors as the parameter. Here I used 1.

from sklearn.neighbors import KNeighborsClassifier

def knn_model_fitting(X_train, X_test, y_train, y_test, n_neighbors):
    # Fit a KNN classifier and report its accuracy on the test set
    knn = KNeighborsClassifier(n_neighbors = n_neighbors)
    knn.fit(X_train, y_train)

    preds = knn.predict(X_test)
    score = knn.score(X_test, y_test)
    print('The mean accuracy of this KNN classifier is: {}'.format(score))

    return preds, knn

# here just set the # of neighbors = 1 to see the model performance
n_neighbors = 1
preds_default, knn_default = knn_model_fitting(X_train, X_test, y_train, y_test, n_neighbors)
# the result
# The mean accuracy of this KNN classifier is: 0.9300699300699301

A classifier with an accuracy of about 0.93, pretty good. Well, can we achieve an even better model performance by changing the number of nearest neighbors? Let’s tune the parameter and give it a try!
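To see why the number of neighbors matters, here is a small from-scratch sketch of the KNN idea using NumPy. It is not how scikit-learn implements it internally. The function knn_predict and the toy points are made up for illustration. The query point sits next to a single class 1 outlier inside a class 0 cluster, so the prediction flips between k = 1 and k = 3.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict the label of x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): class 0 near the origin, class 1 near (5, 5),
# plus one class 1 outlier at (1.5, 1.5) inside the class 0 region
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                  [1.5, 1.5], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])
x_new = np.array([1.4, 1.4])

pred_k1 = knn_predict(X_toy, y_toy, x_new, k=1)  # follows the single closest point
pred_k3 = knn_predict(X_toy, y_toy, x_new, k=3)  # the outlier is outvoted
print(pred_k1, pred_k3)
```

With k = 1 the prediction follows the lone outlier, while k = 3 lets the surrounding class 0 points outvote it. This is the intuition behind small k being more sensitive to noise.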

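def get_best_knn_neighbors():
    # Try k = 1..19 and keep the k with the highest test-set accuracy.
    # Reads X_train, X_test, y_train, y_test from the global scope.
    # Note: k is selected on the test set, so the resulting score is optimistic.
    best_score = 0
    best_neighbor = 0
    for k in range(1, 20):
        knn = KNeighborsClassifier(n_neighbors = k)
        knn.fit(X_train, y_train)
        score = knn.score(X_test, y_test)
        # strict '>' keeps the smallest k when several values tie
        if score > best_score:
            best_score = score
            best_neighbor = k

    return best_neighbor

get_best_knn_neighbors()
# the result
# 11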

The best number of neighbors for the KNN classifier turns out to be 11. Let’s fit the classifier with this parameter and see the model accuracy.

# fit knn classifier with best parameter
best_n_neighbors= get_best_knn_neighbors()
preds_tuned, knn_tuned = knn_model_fitting(X_train, X_test, y_train, y_test, best_n_neighbors)
# the result
# The mean accuracy of this KNN classifier is: 0.9790209790209791

Great! After tuning the parameter, the model accuracy increased from 0.93 to 0.98.
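One caveat about the search above is that it picks the number of neighbors by looking at test-set accuracy, so the test score is no longer a fully independent estimate. A common alternative, shown here only as a sketch and not as the method used in this article, is to choose k with cross-validation on the training data and touch the test set once at the end. The choice of 5 folds is my assumption, and the selected k may differ from 11.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Search k = 1..19 using 5-fold cross-validation on the training set only
param_grid = {"n_neighbors": list(range(1, 20))}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
cv_accuracy = search.best_score_             # mean accuracy across the folds
test_accuracy = search.score(X_test, y_test)  # evaluated once, after k is fixed

print(best_k, round(cv_accuracy, 3), round(test_accuracy, 3))
```

By default GridSearchCV refits the best model on the whole training set, so search.score uses that refitted classifier.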

3. Evaluate KNN Classifier

Let’s evaluate the KNN classifier with another tool, the confusion matrix, and compare the model performance before and after tuning.

from sklearn.metrics import ConfusionMatrixDisplay
# ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix,
# which was removed in scikit-learn 1.2
# make a confusion matrix with the untuned parameter
print('Before parameter tuning \nClass 0: Malignant \nClass 1: Benign')
ConfusionMatrixDisplay.from_estimator(knn_default, X_test, y_test);
# make another confusion matrix with the tuned parameter
print('After parameter tuning \nClass 0: Malignant \nClass 1: Benign')
ConfusionMatrixDisplay.from_estimator(knn_tuned, X_test, y_test);
Confusion matrix before and after parameter tuning. Image by the Author

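As we can see, both the number of false positives and the number of false negatives dropped after tuning the parameter (false positives: 6 to 2, false negatives: 4 to 1). Keep in mind that label 1 is benign in this dataset, so if class 1 is read as the positive class, a false positive is a malignant mass predicted as benign, which is usually the costlier mistake in a diagnostic setting. Finding a better value for n_neighbors clearly improved the model's performance on this test set.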
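Since the plots alone do not say which class is treated as positive, here is a hedged sketch that reads the cells of the matrix explicitly. It takes malignant (label 0) as the class of interest, which is my choice and not the article's. The names missed_malignant, false_alarms and malignant_recall are hypothetical.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

knn = KNeighborsClassifier(n_neighbors=11).fit(X_train, y_train)
preds = knn.predict(X_test)

# Rows are true labels, columns are predicted labels, both in the order [0, 1]
cm = confusion_matrix(y_test, preds, labels=[0, 1])

# With label 0 (malignant) as the class of interest:
missed_malignant = cm[0, 1]   # truly malignant, predicted benign
false_alarms = cm[1, 0]       # truly benign, predicted malignant
# Share of malignant cases that the classifier catches
malignant_recall = recall_score(y_test, preds, pos_label=0)

print(cm)
print(missed_malignant, false_alarms, round(malignant_recall, 3))
```

Reporting recall for the malignant class alongside accuracy makes it easier to see how many cancers the model would miss.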

4. Overfitting test

An overfitted model can have extremely high accuracy on the training set but considerably lower accuracy on the test set, so we want to check whether any setting shows this pattern. We can define a function that walks through the odd numbers from 1 to 19 as values of the n_neighbors parameter and compares training and test accuracy for each.

def overfitting_num_neighbors():
    # Reads X_train, X_test, y_train, y_test from the global scope.
    k_scores = []
    for k in range(1, 20, 2):  # odd k: 1, 3, ..., 19
        knn = KNeighborsClassifier(n_neighbors = k)
        knn.fit(X_train, y_train)
        train_score = knn.score(X_train, y_train)
        test_score = knn.score(X_test, y_test)
        k_scores.append((k, train_score, test_score))

    # highest training accuracy: the most likely overfitting candidate
    k_overfitting = max(k_scores, key = lambda x: x[1])
    # highest test accuracy
    k_best_performance = max(k_scores, key = lambda x: x[2])

    return k_overfitting, k_best_performance

overfit, best_performance = overfitting_num_neighbors()
print(overfit)
print(best_performance)
# the result
# (n_neighbors, training score, test score)
# (1, 1.0, 0.9300699300699301)
# (11, 0.9342723004694836, 0.9790209790209791)

With n_neighbors = 1, we did find overfitting. The training accuracy is 1.0, which is expected because each training point is its own nearest neighbor, while the accuracy on the test data is noticeably lower (0.93). With n_neighbors = 11, we achieved the highest accuracy on the test data, 0.98, with a lower accuracy of 0.93 on the training data. This is what we want: a model with high test accuracy and no sign of overfitting, which suggests the trained classifier can generalize well to new, unseen data.
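The comparison above relies on a single train-test split. As a complementary sketch, and an assumption on my part, not what the article does, scikit-learn's validation_curve can compute training and validation accuracy for each k using cross-validation on the training data. The gap between the two is then less dependent on one particular split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

k_values = list(range(1, 20, 2))  # odd k: 1, 3, ..., 19

# Each result has one row per k and one column per fold (5 folds assumed here)
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(), X_train, y_train,
    param_name="n_neighbors", param_range=k_values,
    cv=5, scoring="accuracy",
)

mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
gaps = mean_train - mean_val  # a large gap hints at overfitting

for k, tr, va, gap in zip(k_values, mean_train, mean_val, gaps):
    print(f"k={k:2d}  train={tr:.3f}  validation={va:.3f}  gap={gap:.3f}")
```

A large gap at k = 1 that shrinks as k grows would be consistent with the overfitting observed above.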

In my next article, I will use the same dataset (Wisconsin) to do another classification practice using SVC and Logistic Regression as classifiers. Stay tuned for more!

Written by Xinqian Zhai

Graduate student at the University of Michigan and a new learner on the road to data science.
