Create ColumnTransformer & FeatureUnion in Pipelines with Code Examples
In this post, we’ll discuss how to differentiate ColumnTransformer and FeatureUnion in an easy way, and show how to create ColumnTransformer, FeatureUnion, and their combination with pipelines using Python code examples.
ColumnTransformer & FeatureUnion
In the previous post, we discussed how to select and add new columns in an ML pipeline and built several customized ColumnSelector transformers. Now, let’s continue the pipeline topic with two new tools: ColumnTransformer and FeatureUnion.
ColumnTransformer and FeatureUnion are both estimators that can be fit on DataFrame columns to produce a transformer. They help us organize multiple transformers applied to our input dataset, usually in the data preprocessing step. However, they can sometimes be confusing due to their similar functionality.
So how do we distinguish them?
ColumnTransformer vs FeatureUnion
According to the sklearn documentation:
ColumnTransformer: This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.
FeatureUnion: Concatenates results of multiple transformer objects. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.
So, using ColumnTransformer, we can apply different transformers to different columns of the input dataset, and concatenate these transformations into one single transformer.
On the other hand, using FeatureUnion, we can apply different transformers to the same column (or the entire input) of the dataset in parallel, and combine these transformations into one single transformer.
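To make the contrast concrete, here is a minimal sketch (using made-up numeric columns, not the review dataset below): a ColumnTransformer routes different columns to different transformers, while a FeatureUnion feeds the same input to every transformer and stacks the outputs side by side.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# ColumnTransformer: column 'a' -> MinMaxScaler, column 'b' -> StandardScaler
ct = ColumnTransformer([
    ('mm', MinMaxScaler(), ['a']),
    ('std', StandardScaler(), ['b'])
])
print(ct.fit_transform(df).shape)   # (3, 2): one output column per input column

# FeatureUnion: BOTH scalers see BOTH columns, and the results are concatenated
fu = FeatureUnion([
    ('mm', MinMaxScaler()),
    ('std', StandardScaler())
])
print(fu.fit_transform(df).shape)   # (3, 4): 2 columns from each transformer
```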
Code examples
0. Prepare data
First, let’s create a small dataset with the shape of (3, 4). This sample dataset X_train will be used for all code samples, with each row representing a record of customer reviews.
# import the necessary libraries
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
# Create a small sample dataset
X_train = pd.DataFrame(data={'review': ['Good food good service.',
'Good food with friendly service.',
'Average food and bad service'],
'star': [5,4,2],
'meal_time': ['dinner','lunch','breakfast'],
'tip_%': [0.25, 0.18, np.nan]})
X_train.head()
1. Construct a ColumnTransformer
- Step 1: For the categorical column meal_time, encode it using OneHotEncoder
- Step 2: For numeric columns star and tip_%, create a Pipeline to first impute NaN values using SimpleImputer and then scale the result using MinMaxScaler
- Step 3: Pass through the remaining column (review)
- Step 4: Combine the results together using ColumnTransformer
cat_col = ['meal_time']
num_col = ['star','tip_%']
# make a pipeline to do imputing and scaling
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', MinMaxScaler())
])
# construct the ColumnTransformer
ColumnTransformation = ColumnTransformer(
    transformers=[
        # one-hot encode categorical cols
        # (sparse_output was named sparse before scikit-learn 1.2)
        ('cat_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), cat_col),
        # pipe-transform numeric cols
        ('num_pipe', num_pipe, num_col)
    ],
    # pass through the remaining cols
    remainder='passthrough'
)
# fit and transform the data
ColumnTransformation.fit_transform(X_train)
As we can see, the transformed data is an array with a shape of (3, 6). The original meal_time column has expanded into three one-hot columns, with values 1 and 0 representing presence. The star and tip_% columns are each rescaled from their original ranges into (0, 1), with the missing tip_% value first imputed as 0.
2. Construct a FeatureUnion
Now we’ll process the same column review in two different ways in parallel.
- Step 1: Get the word count of each review by creating a WordCounter class
- Step 2: Vectorize each review using TfidfVectorizer
- Step 3: Combine the two parallel results together using FeatureUnion
# create a WordCounter class
class WordCounter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # count the words in each text entry, returned as a 2D column vector
        num_word = X.apply(lambda x: len(x.split())).values.reshape(-1, 1)
        return num_word
# tfidf-vectorize the reviews
vectorizer = TfidfVectorizer(stop_words='english')
# make a pipeline to scale the WordCounter result into range (0,1)
word_cnt_pipe = Pipeline([
    ('word_counter', WordCounter()),
    ('scaler', MinMaxScaler())
])
# construct the FeatureUnion
FeatureUnionTransformation = FeatureUnion([
    ('vectorizer', vectorizer),
    ('word_counter', word_cnt_pipe)
])
# fit and transform the data
FeatureUnionTransformation.fit_transform(X_train['review'])
Here, we converted the raw review text into a set of TF-IDF weighted features. At the same time, we counted and scaled the number of words in each raw review text. After combining the two parts using FeatureUnion, we get numerical data on the same scale with a shape of (3, 7).
3. Construct a Mix of ColumnTransformer & FeatureUnion & Pipeline
After completing the first two parts, it is very straightforward to combine them here. We only need another ColumnTransformer to chain the previously built ColumnTransformer and FeatureUnion parts together to finish the final data preprocessing.
# make a final data preprocessor
preprocessor = ColumnTransformer([
    ('col_trans', ColumnTransformation, ['meal_time', 'star', 'tip_%']),
    ('feature_union', FeatureUnionTransformation, 'review')
])
# fit and transform the data
preprocessor.fit_transform(X_train)
Great! We built the final preprocessor by combining these two parts using another ColumnTransformer. The original sample dataset with the shape of (3, 4) was successfully transformed into the desired format with the shape of (3, 12).
The final pipeline workflow of the combination of ColumnTransformer, FeatureUnion, and Pipeline is shown below.
After getting familiar with the two transformers, we can add other transformations (and pipelines) to further extend the workflow.
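For example, the preprocessor can be chained with an estimator to form one end-to-end model. Below is a simplified, self-contained sketch: the labels y_train are hypothetical (say, whether the customer came back), and the word-counter branch is omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X_train = pd.DataFrame({'review': ['Good food good service.',
                                   'Good food with friendly service.',
                                   'Average food and bad service'],
                        'star': [5, 4, 2],
                        'meal_time': ['dinner', 'lunch', 'breakfast'],
                        'tip_%': [0.25, 0.18, np.nan]})
y_train = np.array([1, 1, 0])  # hypothetical labels: did the customer come back?

num_pipe = Pipeline([('imputer', SimpleImputer(strategy='constant', fill_value=0)),
                     ('scaler', MinMaxScaler())])
preprocessor = ColumnTransformer([
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'), ['meal_time']),
    ('num_pipe', num_pipe, ['star', 'tip_%']),
    ('tfidf', TfidfVectorizer(stop_words='english'), 'review')
])

# chain preprocessing and model into one estimator
model = Pipeline([('prep', preprocessor), ('clf', LogisticRegression())])
model.fit(X_train, y_train)
print(model.predict(X_train))
```

Because the whole workflow is a single estimator, it can be cross-validated or grid-searched as one unit.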
That’s all for now. Thanks for reading!
Follow me and stay tuned! 😺 🪴 🍄
Related Post: Select Columns and Add New Columns in an ML Pipeline with Code Example