CountVectorizer not working in ColumnTransformer

Question

Combining CountVectorizer() with ColumnTransformer() gives me an error. Here is a minimal reproducible example:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create a sample data frame
df = pd.DataFrame({
    'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
               'Is this the first document?', 'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26]
})

text_feature = ["corpus"]
count_transformer = CountVectorizer()

# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
    ("count", count_transformer, text_feature)],
    remainder='passthrough')

ct.fit_transform(df)

The output says:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5

I tried the code below, which does the job but doesn't scale as easily as ColumnTransformer().

import numpy as np

np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values]

The result is the NumPy array below:

array([[ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  2,  0,  0,  0,  1,  0,  1,  1,  0,  1, 37],
       [ 1,  0,  0,  0,  0,  1,  1,  0,  1,  1,  1, 26],
       [ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  1,  0,  1,  1,  0,  0,  0,  1,  0,  0, 26]], dtype=int64)

Does https://stackoverflow.com/questions/70550018/sklearn-custom-transformers-with-pipeline-all-the-input-array-dimensions-for-th/70550548#70550548 help? — amiola, Jan 04 '23 at 17:28
Thanks, @amiola, yes, I already saw that. It does not answer my question. — trazoM, Jan 04 '23 at 17:36
Thanks again @amiola, I just realised. Frustration did not let me understand the details. Now that I have, I appreciate your answer a lot more. Thanks. — trazoM, Jan 05 '23 at 10:19
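
For readers who land here, the likely cause is the column selector. CountVectorizer expects a 1-D iterable of strings, but a list selector such as ["corpus"] makes ColumnTransformer pass a 2-D DataFrame. Iterating over a DataFrame yields its column names, so the vectorizer appears to see a single "document" and returns one row, which cannot be stacked with the five passthrough rows. A minimal sketch of the fix, assuming the same sample data as in the question, is to pass the column name as a plain string:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Same sample data as in the question
df = pd.DataFrame({
    "corpus": [
        "This is the first document.",
        "This document is the second document.",
        "And this is the third one.",
        "Is this the first document?",
        "I have the fourth document",
    ],
    "word_length": [27, 37, 26, 27, 26],
})

# A scalar column name hands CountVectorizer a 1-D Series of strings;
# a list such as ["corpus"] would hand it a 2-D DataFrame instead.
ct = ColumnTransformer(
    transformers=[("count", CountVectorizer(), "corpus")],
    remainder="passthrough",
)

result = ct.fit_transform(df)

# The stacked output can be sparse or dense depending on sparse_threshold,
# so convert to a dense array only if needed.
dense = result.toarray() if hasattr(result, "toarray") else result
print(dense.shape)
```

With this data the result has 5 rows and 12 columns (11 vocabulary counts followed by word_length), the same layout as the np.c_ workaround in the question.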
