Create ColumnTransformer & FeatureUnion in Pipelines with Code Examples

Xinqian Zhai
Jan 30, 2023 · 5 min read



In this post, we’ll discuss an easy way to tell ColumnTransformer and FeatureUnion apart, and show how to build a ColumnTransformer, a FeatureUnion, and a combination of the two with pipelines, using Python code examples.

ColumnTransformer & FeatureUnion

In the previous post, we discussed how to select and add new columns in an ML pipeline and built several customized ColumnSelectors. Now, let’s continue the pipeline topic with two new pieces: ColumnTransformer and FeatureUnion.

ColumnTransformer and FeatureUnion are both estimators that can be fit on DataFrame columns to produce a fitted transformer. They help us organize multiple transformers applied to our input dataset, usually in the data preprocessing step. However, they can be confusing at times because their functionality looks so similar.

So how to distinguish them?

ColumnTransformer vs FeatureUnion

According to the sklearn documentation:

ColumnTransformer: This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space. This is useful for heterogeneous or columnar data, to combine several feature extraction mechanisms or transformations into a single transformer.

FeatureUnion: Concatenates results of multiple transformer objects. This estimator applies a list of transformer objects in parallel to the input data, then concatenates the results. This is useful to combine several feature extraction mechanisms into a single transformer.

So, with ColumnTransformer, we apply different transformers to different columns (or column subsets) of the input dataset, and the outputs are concatenated side by side into one single transformer.

With FeatureUnion, on the other hand, we apply different transformers to the same column (or the entire input dataset) in parallel, and the outputs are likewise concatenated into one single transformer.
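To make the contrast concrete, here is a minimal, illustrative sketch (the column names col_a and col_b are placeholders, not from the dataset we use later): a ColumnTransformer takes (name, transformer, columns) triples, while a FeatureUnion takes only (name, transformer) pairs.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# ColumnTransformer: (name, transformer, columns) triples,
# so each transformer only sees its own column subset
ct_sketch = ColumnTransformer([
    ('scale_a', StandardScaler(), ['col_a']),
    ('scale_b', MinMaxScaler(), ['col_b'])
])

# FeatureUnion: (name, transformer) pairs,
# so every transformer sees the same input and the results are stacked side by side
fu_sketch = FeatureUnion([
    ('standard', StandardScaler()),
    ('minmax', MinMaxScaler())
])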

Code examples

0. Prepare data

First, let’s create a small dataset with the shape (3, 4). This sample dataset X_train will be used for all code samples, with each row representing one customer review record.

# import the necessary libraries
import numpy as np
import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

# Create a small sample dataset
X_train = pd.DataFrame(data={'review': ['Good food good service.',
                                        'Good food with friendly service.',
                                        'Average food and bad service'],
                             'star': [5, 4, 2],
                             'meal_time': ['dinner', 'lunch', 'breakfast'],
                             'tip_%': [0.25, 0.18, np.nan]})
X_train.head()
Raw sample dataset X_train

1. Construct a ColumnTransformer

  • Step 1: For the categorical column meal_time, encode it using OneHotEncoder
  • Step 2: For the numeric columns star and tip_%, create a Pipeline to first impute NaN values with SimpleImputer and then scale the result with MinMaxScaler
  • Step 3: Pass through the remaining column (review)
  • Step 4: Combine all the results using ColumnTransformer
cat_col = ['meal_time']
num_col = ['star', 'tip_%']

# make a pipeline to impute and then scale the numeric columns
num_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', MinMaxScaler())
])

# construct the ColumnTransformer
ColumnTransformation = ColumnTransformer(
    transformers=[
        # one-hot encode the categorical columns
        ('cat_ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'), cat_col),
        # pipe-transform the numeric columns
        ('num_pipe', num_pipe, num_col)
    ],
    # pass through the remaining columns
    remainder='passthrough'
)

# fit and transform the data
ColumnTransformation.fit_transform(X_train)
Dataframe after applying ColumnTransformer

As we can see, the transformed data is an array with the shape (3, 6). After adding the column names, the original meal_time column has expanded into three one-hot columns, with 1 and 0 indicating the presence or absence of each category. The star and tip_% columns have each been rescaled by MinMaxScaler to the range (0, 1), with tip_%’s missing value first imputed as 0.
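One way to attach those column names ourselves is to pull the generated feature names from the fitted ColumnTransformer. This is a small sketch that assumes scikit-learn 1.1 or newer, where all of the transformers used here implement get_feature_names_out:

# a sketch: label the transformed array with the generated feature names
# (assumes scikit-learn >= 1.1 for get_feature_names_out support)
transformed = ColumnTransformation.fit_transform(X_train)
pd.DataFrame(transformed, columns=ColumnTransformation.get_feature_names_out())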

2. Construct a FeatureUnion

Now we’ll process the same column review in two different ways in parallel.

  • Step 1: Get the word count of each review by creating a WordCounter class
  • Step 2: Vectorize each review using TfidfVectorizer
  • Step 3: Combine the two parallel results using FeatureUnion
# create a WordCounter class that counts the words in each review
class WordCounter(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        # count the words per review and reshape into a single-feature column
        num_word = X.apply(lambda x: len(x.split())).values.reshape(-1, 1)
        return num_word

# tfidf-vectorize the reviews
vectorizer = TfidfVectorizer(stop_words='english')

# make a pipeline to scale the WordCounter result into the range (0, 1)
word_cnt_pipe = Pipeline([
    ('word_counter', WordCounter()),
    ('scaler', MinMaxScaler())
])

# construct the FeatureUnion
FeatureUnionTransformation = FeatureUnion([
    ('vectorizer', vectorizer),
    ('word_counter', word_cnt_pipe)
])

# fit and transform the data
FeatureUnionTransformation.fit_transform(X_train['review'])
Dataframe after applying FeatureUnion

Here, we converted the raw review text into a series of TF-IDF weighted features. At the same time, we counted and scaled the number of words in each raw review text. After combining the two parts using FeatureUnion, we get numerical data on the same (0, 1) scale with a shape of (3, 7).
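Note that because TfidfVectorizer returns a sparse matrix, the FeatureUnion output here is sparse as well. Here is a small sketch of turning it into the labeled DataFrame shown above; the word_count column name is my own placeholder, and get_feature_names_out assumes scikit-learn 1.0 or newer:

# a sketch: densify the sparse union output and attach readable column names
# ('word_count' is a made-up label for the WordCounter/MinMaxScaler feature)
union_result = FeatureUnionTransformation.fit_transform(X_train['review'])
tfidf_names = list(vectorizer.get_feature_names_out())
pd.DataFrame(union_result.toarray(), columns=tfidf_names + ['word_count'])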

3. Construct a Mix of ColumnTransformer & FeatureUnion & Pipeline

After completing the first two parts, it is very straightforward to combine them here. We only need another ColumnTransformer to chain the previously built ColumnTransformer and FeatureUnion together and finish the final data preprocessing.

# make a final data preprocessor
preprocessor = ColumnTransformer([
    # apply the earlier ColumnTransformer to the categorical/numeric columns
    ('col_trans', ColumnTransformation, ['meal_time', 'star', 'tip_%']),
    # apply the FeatureUnion to the review column (passed as a string, not a list,
    # so the transformers receive a 1-D Series, which TfidfVectorizer expects)
    ('feature_union', FeatureUnionTransformation, 'review')
])

# fit and transform the data
preprocessor.fit_transform(X_train)
Dataframe after combining ColumnTransformer & FeatureUnion

Great! We built the final preprocessor by combining these two parts with another ColumnTransformer. The original sample dataset with the shape (3, 4) was successfully transformed into the desired numeric format with the shape (3, 12).

The final pipeline workflow of the combination of ColumnTransformer, FeatureUnion, and Pipeline is shown below.

Workflow of the ColumnTransformer (left) and FeatureUnion (right)

After getting familiar with the two transformers, we can add other transformations (and pipelines) to further extend the workflow.
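For example, the preprocessor could become the first step of a full modeling pipeline. The estimator and the target y below are illustrative placeholders, not part of the original example:

# a sketch: plug the preprocessor into a full modeling pipeline
# (LogisticRegression and the target y are illustrative placeholders)
from sklearn.linear_model import LogisticRegression

model_pipe = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression())
])
# model_pipe.fit(X_train, y)    # y would be a label column, e.g. review sentiment
# model_pipe.predict(X_new)     # X_new: new records in the same 4-column format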


Xinqian Zhai

Graduate student at the University of Michigan and a new learner on the road to data science.