Select Columns and Add New Columns in an ML Pipeline with Code Example
In this post, we will discuss how to create classes to select specific columns/features, add new columns/features, and combine them to build a neat pipeline using Python and Sklearn. In a later article, I will continue the pipeline topic with code examples showing how to use ColumnTransformer and FeatureUnion.
Without further ado, let’s dig into it.
0. Prepare data
First, let’s prepare the data. Here I’ll use a small sample of data from my recent running race ranking prediction project to create a small dataset for illustration. The X_train here has only 4 columns/features, including the participant’s gender, age, distance, and distance unit. The target variable y_train is the derived race duration (in minutes).
# get the features X and the target y
from sklearn.model_selection import train_test_split

X = train_df.drop('time', axis=1)
y = train_df['time']

# split the data into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(X_train.shape, y_train.shape)
X_train.sample(3)
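The `train_df` above comes from my race-ranking project and isn't shown here. If you want to follow along, a tiny hypothetical stand-in with the same four features plus the `time` target (all values made up for illustration) is enough to make every snippet in this post runnable:

```python
# hypothetical stand-in for train_df (values are illustrative only)
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.DataFrame({
    'gender':   ['M', 'F', 'M', 'F', 'M', 'F'],
    'age':      [36, 30, 41, 28, 36, 30],
    'distance': [5, 8, 5, 8, 5, 8],
    'unit':     ['km'] * 6,
    'time':     [25.0, 45.0, 27.5, 48.0, 24.0, 46.5],  # minutes
})

X = train_df.drop('time', axis=1)
y = train_df['time']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
```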
1. Select custom columns
With the data ready, let’s create a simple ColumnSelector class to select a custom subset of columns from a given dataset. The advantage is that once the class is set up, it can be used directly as a step in a pipeline, as we will see later in this article.
# first, create a custom column selector to select specific columns
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    '''select specific columns of a given dataset'''
    def __init__(self, subset):
        self.subset = subset
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.loc[:, self.subset]

customized_cols = ['distance', 'unit']

## test the ColumnSelector class
selected_cols = ColumnSelector(customized_cols)
selected_cols.transform(X_train).head()
Nice, we’ve successfully selected the columns/features we are interested in and we’re ready to perform other transformations on the reduced data.
But wait, what if we need different transformations for different data types? For example, one-hot encoding for the unit column and scaling for the distance column? One solution is to create two classes to differentiate the data type.
2. Select all numeric and all categorical columns
Next, let’s create a NumColSelector class to select all the numeric columns, and a CatColSelector class to select all the categorical columns in the training data.
# second, create a numeric-column selector and a categorical-column selector
class NumColSelector(BaseEstimator, TransformerMixin):
    '''select all numeric columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='number')

class CatColSelector(BaseEstimator, TransformerMixin):
    '''select all categorical columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='object')
## test the NumColSelector class
NumColSelector().transform(X_train).head()
# CatColSelector().transform(X_train).head()
Nice. We can easily separate categorical and numeric columns using these two classes and do different data transformations on them later.
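As a quick sketch of that idea, a selector can be chained with a dtype-appropriate transformer, e.g. CatColSelector followed by one-hot encoding (the toy X_train values here are made up, and the class is repeated so the snippet runs on its own):

```python
# minimal sketch: chain a categorical-column selector with one-hot encoding
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class CatColSelector(BaseEstimator, TransformerMixin):
    '''select all categorical columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='object')

cat_pipe = Pipeline([
    ('get_cat_cols', CatColSelector()),
    ('one_hot', OneHotEncoder(handle_unknown='ignore')),
])

X_train = pd.DataFrame({'distance': [5, 8, 5], 'unit': ['km', 'km', 'mi']})
encoded = cat_pipe.fit_transform(X_train)
print(encoded.shape)  # one column per category of 'unit'
```

The numeric counterpart works the same way, with NumColSelector followed by, say, a scaler.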
What if we want to include a new column/feature that was not in the original training data? For example, we want to create a new feature to describe the median age of each distance group, which might be useful for predicting the target variable time.
3. Add a new column
Now, let’s create another class, AgeMedianByDistGroup, to add this new column/feature.
# create a class to add a new feature age_median_by_dist_group
import pandas as pd

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        # compute the lookup table once, from the training data
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        # return a new frame instead of mutating X in place
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')
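A standalone check shows the new column in action (the class body is condensed here so the snippet runs on its own, and the toy ages are illustrative):

```python
# standalone check of the AgeMedianByDistGroup feature on toy data
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')

toy_train = pd.DataFrame({
    'distance': [5, 5, 5, 8, 8, 8],
    'age':      [30, 36, 42, 25, 30, 44],
})
out = AgeMedianByDistGroup(toy_train).transform(toy_train)
print(out['age_median_by_dist_group'].tolist())
# → [36.0, 36.0, 36.0, 30.0, 30.0, 30.0]
```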
As we can see, the median age for running 5km is 36, and the median age for 8km is 30.
4. Set up the final pipeline
Finally, let’s build a pipeline
by wiring all together using the classes we created previously. Basically, we will perform the following steps to the training data:
- Step 1: Add a new column/feature with AgeMedianByDistGroup
- Step 2: Select all numeric columns/features with NumColSelector
- Step 3: Fill all NaN values with the median using SimpleImputer
- Step 4: Scale values to the [0, 1] range with MinMaxScaler
# build the final pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([
    ('add_new_col', AgeMedianByDistGroup(X_train)),
    ('get_num_cols', NumColSelector()),
    ('fix_nan', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scale_data', MinMaxScaler())
])
pipe.fit(X_train)
The final pipeline looks straightforward, right? One of the biggest advantages of using pipelines is that they automate the process and keep our code clean and organized.
With this series of steps in place, the test dataset and any holdout dataset can be transformed with the same pipeline, without repeated code. The final transformed training data is a normalized array with no NaN values in our selected columns/features.
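For completeness, here is a condensed, self-contained sketch of the whole flow, including reusing the fitted pipeline on the held-out split (all data values are hypothetical stand-ins, and the class bodies are compressed versions of the ones above):

```python
# end-to-end sketch: build, fit, and reuse the full pipeline
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class NumColSelector(BaseEstimator, TransformerMixin):
    '''select all numeric columns of a given dataset'''
    def fit(self, X, y=None): return self
    def transform(self, X, y=None): return X.select_dtypes(include='number')

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None): return self
    def transform(self, X, y=None):
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')

# hypothetical stand-in data (note the deliberate NaN in 'age')
train_df = pd.DataFrame({
    'gender':   ['M', 'F', 'M', 'F', 'M', 'F'],
    'age':      [36, 30, 41, np.nan, 36, 30],
    'distance': [5, 8, 5, 8, 5, 8],
    'unit':     ['km'] * 6,
    'time':     [25.0, 45.0, 27.5, 48.0, 24.0, 46.5],
})
X = train_df.drop('time', axis=1)
y = train_df['time']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

pipe = Pipeline([
    ('add_new_col', AgeMedianByDistGroup(X_train)),
    ('get_num_cols', NumColSelector()),
    ('fix_nan', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scale_data', MinMaxScaler()),
])

X_train_tr = pipe.fit_transform(X_train)
X_test_tr = pipe.transform(X_test)  # same steps, no repeated code
```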
Voila! We built several transformers and combined them in an elegant pipeline! We can use the prepared training data for later modeling.
Thanks for reading my article! I hope this walk-through helps you with your current or next project. My next post will discuss how to use ColumnTransformer and FeatureUnion for column transformations in a pipeline.
Follow me and stay tuned! 😺 🍁 ☃️