Breast Cancer Classification Using SVC and Logistic Regression Classifiers
In this article, I will continue doing supervised learning classification on the Breast Cancer Wisconsin (Diagnostic) dataset, but this time with two other classifiers: a Support Vector Machine classifier (SVC) and a Logistic Regression classifier, to help diagnose patients.
Click here to see Breast Cancer Classification Using KNN (K Nearest Neighbor) Classifier.
0. Import useful libraries
First, let’s import all the useful libraries and set the random state value.
# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

# Import some necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

RANDOM_STATE = 42
1. Data Preparation
Now, we load the Wisconsin data set from scikit-learn and transform the raw data from a Bunch object into a data frame for easier manipulation.
from sklearn.datasets import load_breast_cancer

# Load the dataset from scikit-learn.
cancer = load_breast_cancer()
# cancer.keys() # to see all the attributes

cancer_df = pd.DataFrame(cancer.data, columns = cancer.feature_names)
cancer_df['target'] = cancer.target
cancer_df.head()
Let’s do some simple data exploration to see if there are any missing values, wrong data types, imbalanced classes, etc.
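A minimal sketch of these checks might look like the following.
# check for missing values and column data types
print(cancer_df.isnull().sum().sum())   # 0 means no missing values
print(cancer_df.dtypes.unique())        # all columns are numeric
# check the class distribution (0 = malignant, 1 = benign)
print(cancer_df['target'].value_counts())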
Since the data set is clean and the class distribution is relatively balanced, we can proceed to the train-test split and get ready to train the different classifiers.
from sklearn.model_selection import train_test_split
X = cancer_df.drop('target', axis=1)
y = cancer_df.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = RANDOM_STATE)
2. Create dummy classifiers
First, let’s make two dummy classifiers as a sanity check. A dummy classifier does not learn anything from the features; it simply provides baseline scores that any real classifier should beat. In our case, we will use two dummy classifiers:
- Dummy A (Uniform): a dummy classifier that “generates predictions uniformly at random from the unique classes observed in the target data set”. (scikit-learn documentation).
- Dummy B (Most Frequent): a dummy classifier that “always returns the most frequent class label in the observed y data set”. (scikit-learn documentation).
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

def make_dummy_classifiers(X_train, y_train, X_test, y_test):
    # make two dummy classifiers and fit the training data
    dummy_A = DummyClassifier(strategy = 'uniform',
                              random_state = RANDOM_STATE)
    dummy_A.fit(X_train, y_train)
    dummy_B = DummyClassifier(strategy = 'most_frequent',
                              random_state = RANDOM_STATE)
    dummy_B.fit(X_train, y_train)

    # get the predicted classes
    dummy_A_pred = dummy_A.predict(X_test)
    dummy_B_pred = dummy_B.predict(X_test)

    # get precision, recall, and accuracy scores for each dummy classifier
    preA, recA, accA = (precision_score(y_test, dummy_A_pred),
                        recall_score(y_test, dummy_A_pred),
                        dummy_A.score(X_test, y_test))
    preB, recB, accB = (precision_score(y_test, dummy_B_pred),
                        recall_score(y_test, dummy_B_pred),
                        dummy_B.score(X_test, y_test))
    return (preA, recA, accA), (preB, recB, accB)
Then, let’s evaluate the two dummy classifiers by checking their precision, recall, and accuracy scores.
dummy_A_scores, dummy_B_scores = make_dummy_classifiers(X_train, y_train, X_test, y_test)

print('Dummy classifier A (uniform) scores:')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(dummy_A_scores[0], dummy_A_scores[1], dummy_A_scores[2]))
print('-----------------------')
print('Dummy classifier B (most_frequent) scores:')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(dummy_B_scores[0], dummy_B_scores[1], dummy_B_scores[2]))

# the result:
Dummy classifier A (uniform) scores:
Precision: 0.6708860759493671,
Recall: 0.5955056179775281,
Accuracy: 0.5664335664335665
-----------------------
Dummy classifier B (most_frequent) scores:
Precision: 0.6223776223776224,
Recall: 1.0,
Accuracy: 0.6223776223776224
These dummy scores are the baselines for our real classifiers. If the precision, recall, or accuracy score of a real classifier is close to that of the dummies, the classifier performs no better than a random guess.
3. Apply SVC classifier
First, let’s build an SVC classifier and fit it with the training data.
from sklearn.svm import SVC

# make a function using default parameter values
def make_SVC_classifier(X_train, X_test, y_train, y_test,
                        kernel = 'rbf', C = 1, gamma = 'scale',
                        random_state = RANDOM_STATE):
    svc = SVC(kernel = kernel, C = C, gamma = gamma, random_state = random_state)
    svc.fit(X_train, y_train)
    svc_preds = svc.predict(X_test)
    preSVC, recSVC, accSVC = (precision_score(y_test, svc_preds),
                              recall_score(y_test, svc_preds),
                              svc.score(X_test, y_test))
    return (preSVC, recSVC, accSVC), svc, svc_preds
Then, let’s evaluate the SVC classifier by checking its precision, recall, and accuracy scores,
# get the precision, recall, and accuracy scores
svc_scores, svc, svc_preds = make_SVC_classifier(X_train, X_test, y_train, y_test)

print('SVC classifier scores:')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(svc_scores[0], svc_scores[1], svc_scores[2]))

# the result:
SVC classifier scores:
Precision: 0.9361702127659575,
Recall: 0.9887640449438202,
Accuracy: 0.951048951048951
and plotting the precision-recall curve.
# plot the precision recall curve chart
from sklearn.metrics import plot_precision_recall_curve

# note: plot_precision_recall_curve was removed in scikit-learn 1.2;
# on newer versions, use PrecisionRecallDisplay.from_estimator instead
plot_precision_recall_curve(svc, X_test, y_test);
4. Apply Logistic Regression classifier
Next, let’s build a Logistic regression classifier,
from sklearn.linear_model import LogisticRegression

def make_logreg_classifier(X_train, X_test, y_train, y_test,
                           penalty = 'l2', C = 1,  # default parameter values
                           random_state = RANDOM_STATE):
    lr = LogisticRegression(penalty = penalty, C = C, random_state = random_state)
    lr.fit(X_train, y_train)
    lr_preds = lr.predict(X_test)
    prelr, reclr, acclr = (precision_score(y_test, lr_preds),
                           recall_score(y_test, lr_preds),
                           lr.score(X_test, y_test))
    return (prelr, reclr, acclr), lr, lr_preds
and evaluate the Logistic Regression classifier by checking its precision, recall, and accuracy scores,
lr_scores, lr, lr_preds = make_logreg_classifier(X_train, X_test, y_train, y_test)
print('Logistic Regression classifier scores:')
print('Precision: {}, Recall: {}, Accuracy: {}'.format(lr_scores[0], lr_scores[1], lr_scores[2]))

# the result:
Logistic Regression classifier scores:
Precision: 0.9666666666666667,
Recall: 0.9775280898876404,
Accuracy: 0.965034965034965
and plotting the precision-recall curve too.
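Reusing the plot_precision_recall_curve helper imported earlier, a minimal sketch would be (on scikit-learn 1.2+, use PrecisionRecallDisplay.from_estimator(lr, X_test, y_test) instead):
# plot the precision recall curve for the logistic regression classifier
plot_precision_recall_curve(lr, X_test, y_test);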
Overall, both classifiers perform quite well on the Wisconsin dataset. The recall of the SVC classifier is slightly higher than that of the Logistic Regression classifier (recall: 0.99 vs. 0.98), but its precision and accuracy are both lower (precision: 0.94 vs. 0.97, accuracy: 0.95 vs. 0.97). So you could choose either of them as the final classifier.
One note on this setup: we care more about the recall rate than the precision rate; this is the classic precision-recall tradeoff. In a cancer detection scenario, false negatives are especially undesirable, because a patient who does have malignant cells but is flagged as healthy may go untreated. Keep in mind, though, that in this dataset the positive class (target = 1) is benign, so the recall scores reported above measure recall on the benign class; to measure how many malignant cases are caught, compute recall with pos_label=0, as sketched below. You can find more about the definitions of precision and recall on Wikipedia.
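Since the positive class here is benign, a minimal sketch of checking recall on the malignant class directly, using the predictions computed above, could be:
from sklearn.metrics import confusion_matrix
# recall on the malignant class (label 0) for both fitted models
print('SVC malignant recall:', recall_score(y_test, svc_preds, pos_label=0))
print('LogReg malignant recall:', recall_score(y_test, lr_preds, pos_label=0))
# confusion matrix with rows/columns ordered as [malignant (0), benign (1)]
print(confusion_matrix(y_test, lr_preds))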
Click here to see Breast Cancer Classification Using KNN (K Nearest Neighbor) Classifier.