Select Columns and Add New Columns in an ML Pipeline with Code Example
In this post, we will discuss how to create classes to select specific columns/features, add new columns/features, and combine them to build a neat pipeline using Python and Sklearn. In a later article, I will continue the pipeline topic with code examples showing how to use ColumnTransformer and FeatureUnion.
Without further ado, let’s dig into it.
0. Prepare data
First, let’s prepare the data. Here I’ll use a small sample of data from my recent running race ranking prediction project to create a small dataset for illustration. The X_train here has only 4 columns/features, including the participant’s gender, age, distance, and distance unit. The target variable y_train is the derived race duration (in minutes).
# get the features X and the target y
from sklearn.model_selection import train_test_split

X = train_df.drop('time', axis=1)
y = train_df['time']

# split the data into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(X_train.shape, y_train.shape)
X_train.sample(3)
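The `train_df` above comes from my race-ranking project and isn't shown here. If you want to follow along, a tiny hypothetical stand-in with the same four features plus the `time` target (all values made up for illustration) is enough to make every snippet in this post runnable:

```python
# hypothetical stand-in for train_df (values are illustrative only)
import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.DataFrame({
    'gender':   ['M', 'F', 'M', 'F', 'M', 'F'],
    'age':      [36, 30, 41, 28, 36, 30],
    'distance': [5, 8, 5, 8, 5, 8],
    'unit':     ['km'] * 6,
    'time':     [25.0, 45.0, 27.5, 48.0, 24.0, 46.5],  # minutes
})

X = train_df.drop('time', axis=1)
y = train_df['time']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
```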
1. Select custom columns
With the data ready, let’s create a simple ColumnSelector class to select a custom subset of columns from a given dataset. The advantage is that once the class is set up, it can be used directly as a step in a pipeline, as we will see later in this article.
# first, create a custom column selector to select specific columns
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnSelector(BaseEstimator, TransformerMixin):
    '''select specific columns of a given dataset'''
    def __init__(self, subset):
        self.subset = subset
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.loc[:, self.subset]

customized_cols = ['distance', 'unit']

## test the ColumnSelector class
selected_cols = ColumnSelector(customized_cols)
selected_cols.transform(X_train).head()
Nice, we’ve successfully selected the columns/features we are interested in and we’re ready to perform other transformations on the reduced data.
But wait, what if we need different transformations for different data types? For example, one-hot encoding for the unit column and scaling for the distance column? One solution is to create two classes to differentiate the data type.
2. Select all numeric and all categorical columns
Next, let’s create a NumColSelector class to select all the numeric columns, and a CatColSelector class to select all the categorical columns in the training data.
# second, create a numeric-column selector and a categorical-column selector
class NumColSelector(BaseEstimator, TransformerMixin):
    '''select all numeric columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='number')

class CatColSelector(BaseEstimator, TransformerMixin):
    '''select all categorical columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='object')
## test the NumColSelector class
NumColSelector().transform(X_train).head()
# CatColSelector().transform(X_train).head()
Nice. We can easily separate categorical and numeric columns using these two classes and do different data transformations on them later.
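As a quick sketch of that idea, a selector can be chained with a dtype-appropriate transformer, e.g. CatColSelector followed by one-hot encoding (the toy X_train values here are made up, and the class is repeated so the snippet runs on its own):

```python
# minimal sketch: chain a categorical-column selector with one-hot encoding
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

class CatColSelector(BaseEstimator, TransformerMixin):
    '''select all categorical columns of a given dataset'''
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X.select_dtypes(include='object')

cat_pipe = Pipeline([
    ('get_cat_cols', CatColSelector()),
    ('one_hot', OneHotEncoder(handle_unknown='ignore')),
])

X_train = pd.DataFrame({'distance': [5, 8, 5], 'unit': ['km', 'km', 'mi']})
encoded = cat_pipe.fit_transform(X_train)
print(encoded.shape)  # one column per category of 'unit'
```

The numeric counterpart works the same way, with NumColSelector followed by, say, a scaler.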
What if we want to include a new column/feature that was not in the original training data? For example, we want to create a new feature to describe the median age of each distance group, which might be useful for predicting the target variable time.
3. Add a new column
Now, let’s create another class, AgeMedianByDistGroup, to add this new column/feature.
# create a class to add a new feature age_median_by_dist_group
import pandas as pd

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        # compute the lookup table once, from the training data
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        # return a new frame instead of mutating X in place
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')
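A standalone check shows the new column in action (the class body is condensed here so the snippet runs on its own, and the toy ages are illustrative):

```python
# standalone check of the AgeMedianByDistGroup feature on toy data
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None):
        return self
    def transform(self, X, y=None):
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')

toy_train = pd.DataFrame({
    'distance': [5, 5, 5, 8, 8, 8],
    'age':      [30, 36, 42, 25, 30, 44],
})
out = AgeMedianByDistGroup(toy_train).transform(toy_train)
print(out['age_median_by_dist_group'].tolist())
# → [36.0, 36.0, 36.0, 30.0, 30.0, 30.0]
```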
As we can see, the median age for running 5km is 36, and the median age for 8km is 30.
4. Set up the final pipeline
Finally, let’s build a pipeline
by wiring all together using the classes we created previously. Basically, we will perform the following steps to the training data:
- Step 1: Add a new column/feature with AgeMedianByDistGroup
- Step 2: Select all numeric columns/features with NumColSelector
- Step 3: Fill all NaN values with the median using SimpleImputer
- Step 4: Scale values to the [0, 1] range with MinMaxScaler
# build the final pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

pipe = Pipeline([
    ('add_new_col', AgeMedianByDistGroup(X_train)),
    ('get_num_cols', NumColSelector()),
    ('fix_nan', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scale_data', MinMaxScaler())
])
pipe.fit(X_train)
The final pipeline looks straightforward, right? One of the biggest advantages of using pipelines is that they automate the process and keep our code clean and organized.
With this series of steps in place, the test dataset and any holdout dataset can be transformed with the same pipeline, without repeated code. The final transformed training data is a normalized array with no NaN values in our selected columns/features.
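For completeness, here is a condensed, self-contained sketch of the whole flow, including reusing the fitted pipeline on the held-out split (all data values are hypothetical stand-ins, and the class bodies are compressed versions of the ones above):

```python
# end-to-end sketch: build, fit, and reuse the full pipeline
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class NumColSelector(BaseEstimator, TransformerMixin):
    '''select all numeric columns of a given dataset'''
    def fit(self, X, y=None): return self
    def transform(self, X, y=None): return X.select_dtypes(include='number')

class AgeMedianByDistGroup(BaseEstimator, TransformerMixin):
    '''get the median age of each distance group'''
    def __init__(self, train):
        self.age_median_by_dist_group = train.groupby('distance')['age'].median()
        self.age_median_by_dist_group.name = 'age_median_by_dist_group'
    def fit(self, X=None, y=None): return self
    def transform(self, X, y=None):
        return pd.merge(X, self.age_median_by_dist_group,
                        left_on='distance', right_index=True, how='left')

# hypothetical stand-in data (note the deliberate NaN in 'age')
train_df = pd.DataFrame({
    'gender':   ['M', 'F', 'M', 'F', 'M', 'F'],
    'age':      [36, 30, 41, np.nan, 36, 30],
    'distance': [5, 8, 5, 8, 5, 8],
    'unit':     ['km'] * 6,
    'time':     [25.0, 45.0, 27.5, 48.0, 24.0, 46.5],
})
X = train_df.drop('time', axis=1)
y = train_df['time']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

pipe = Pipeline([
    ('add_new_col', AgeMedianByDistGroup(X_train)),
    ('get_num_cols', NumColSelector()),
    ('fix_nan', SimpleImputer(missing_values=np.nan, strategy='median')),
    ('scale_data', MinMaxScaler()),
])

X_train_tr = pipe.fit_transform(X_train)
X_test_tr = pipe.transform(X_test)  # same steps, no repeated code
```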
Voila! We built several transformers and combined them in an elegant pipeline! We can use the prepared training data for later modeling.
Thanks for reading my article! I hope this walk-through helps you with your current or next project. My next post will discuss how to use ColumnTransformer and FeatureUnion for column transformations in a pipeline.
Follow me and stay tuned! 😺 🍁 ☃️