There is a similar question asked here on SO many years back, but it was never answered. I have the same question: I would like to add new column(s) of data, in my case 3 columns of dummy variables, to a sparse matrix (from TfidfVectorizer) before building a Pipeline and running a GridSearch to find the best hyperparameters.
Currently, I can do this model by model, without GridSearch and Pipeline, using the code below.
# this is an NLP project
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

X = df["text"]    # column of text
y = df["target"]  # continuous target variable
X_train, X_unseen, y_train, y_unseen = train_test_split(
    X, y, test_size=0.5, stratify=df["platform"], random_state=42
)

# vectorize
tvec = TfidfVectorizer(stop_words="english")
X_train_tvec = tvec.fit_transform(X_train)

# get dummies for the training rows only, so the row count matches X_train_tvec
dummies = pd.get_dummies(df.loc[X_train.index, "dummies"]).values

# add dummies to the tvec sparse matrix
X_train_tvec_dumm = hstack([X_train_tvec, dummies]).toarray()
From here, I can fit my model on the X_train_tvec_dumm training data, which contains the matrix of word vectors from TfidfVectorizer (shape: (n_rows, n_columns)) plus the 3 dummy columns, giving a final shape of (n_rows, n_columns + 3).
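For reference, fitting one model directly on the combined matrix looks roughly like this (a minimal sketch, assuming the same RidgeCV estimator I use in the pipeline below):

from sklearn.linear_model import RidgeCV

# fit a single model on the (n_rows, n_columns + 3) combined matrix
ridge = RidgeCV()
ridge.fit(X_train_tvec_dumm, y_train)
print(ridge.score(X_train_tvec_dumm, y_train))  # R^2 on the training data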
I tried to build the Pipeline as follows.
# get dummies (aligned to the training rows, as above)
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import GridSearchCV

dummies = pd.get_dummies(df.loc[X_train.index, "dummies"]).values

def add_dummies(matrix):
    return hstack([matrix, dummies]).toarray()

pipe = Pipeline([
    ("features", FeatureUnion([
        ("tvec", TfidfVectorizer(stop_words="english")),
        ("dummies", add_dummies(??))  # <-- how do I add this step into the pipeline?
    ])),
    ("ridge", RidgeCV())
])
pipe_params = {
    "features__tvec__max_features": [200, 500],
    "features__tvec__ngram_range": [(1, 1), (1, 2)]
}
gs = GridSearchCV(pipe, param_grid=pipe_params, cv=4)
gs.fit(X_train, y_train)
print(gs.best_score_)
There is this tutorial that describes how to build a custom transformer for the Pipeline, but its custom function adds a new feature that is engineered by transforming X_train itself. My dummy variables are, unfortunately, external to the X_train set.
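For context, the tutorial-style transformer looks roughly like this (my own sketch; TextLengthExtractor and the text-length feature are illustrative, not the tutorial's actual code). It only ever sees the X that flows through the pipeline, which is why it doesn't obviously cover columns that live outside X_train:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    # tutorial-style transformer: the new feature is derived from X itself,
    # not from an external DataFrame column
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one engineered feature per document, computed from the text in X
        return np.array([len(text) for text in X]).reshape(-1, 1)

A transformer like this can sit next to TfidfVectorizer inside the FeatureUnion because it only needs the text column that GridSearchCV passes in; my dummy columns cannot be derived from that text at all.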