Combining CountVectorizer() with ColumnTransformer() raises an error. Here is a minimal reproducible example:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create a sample data frame
df = pd.DataFrame({
    'corpus': ['This is the first document.', 'This document is the second document.', 'And this is the third one.',
               'Is this the first document?', 'I have the fourth document'],
    'word_length': [27, 37, 26, 27, 26]
})

text_feature = ["corpus"]
count_transformer = CountVectorizer()

# Create the ColumnTransformer
ct = ColumnTransformer(transformers=[
    ("count", count_transformer, text_feature)],
    remainder='passthrough')

ct.fit_transform(df)

The output says:

ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 0, the array at index 0 has size 1 and the array at index 1 has size 5
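A quick check that may explain the sizes 1 and 5 in the message: a list selector such as ["corpus"] passes a 2-D DataFrame to the transformer, and iterating over a DataFrame yields its column labels rather than its rows. The sketch below uses a smaller, hypothetical two-row frame (not the data above) only to compare the two input shapes.

```python
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Hypothetical two-row frame, used only to compare input shapes.
df_small = pd.DataFrame({"corpus": ["first document", "second document"]})

# 2-D input: the vectorizer iterates over column labels, so it sees a single "document".
shape_2d = CountVectorizer().fit_transform(df_small[["corpus"]]).shape

# 1-D input: the vectorizer iterates over the strings, one document per row.
shape_1d = CountVectorizer().fit_transform(df_small["corpus"]).shape

print(shape_2d, shape_1d)
```

If the first shape has one row while the second has one row per document, that would match the mismatch reported along dimension 0.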

I tried the code below, which does the job but doesn't scale as easily as ColumnTransformer().

import numpy as np

np.c_[count_transformer.fit_transform(df["corpus"]).toarray(), df["word_length"].values]

The result is the NumPy array below:

array([[ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  2,  0,  0,  0,  1,  0,  1,  1,  0,  1, 37],
       [ 1,  0,  0,  0,  0,  1,  1,  0,  1,  1,  1, 26],
       [ 0,  1,  1,  0,  0,  1,  0,  0,  1,  0,  1, 27],
       [ 0,  1,  0,  1,  1,  0,  0,  0,  1,  0,  0, 26]], dtype=int64)
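For comparison, here is a minimal sketch of what appears to be the ColumnTransformer equivalent of that workaround. It assumes the only change needed is the column selector: a plain string ("corpus") instead of a list (["corpus"]), so that CountVectorizer receives a 1-D Series of documents. The names full_df and fixed_ct are made up for this sketch, and the densifying step is a convenience for display, not something the original code requires.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

# Same sample data as in the question.
full_df = pd.DataFrame({
    "corpus": ["This is the first document.", "This document is the second document.",
               "And this is the third one.", "Is this the first document?",
               "I have the fourth document"],
    "word_length": [27, 37, 26, 27, 26],
})

# Assumption: a scalar string selector passes a 1-D column to the transformer,
# whereas a list selector passes a 2-D frame.
fixed_ct = ColumnTransformer(
    transformers=[("count", CountVectorizer(), "corpus")],
    remainder="passthrough",
)

result = fixed_ct.fit_transform(full_df)

# The result may be sparse or dense depending on its density; densify for inspection.
dense = result.toarray() if hasattr(result, "toarray") else np.asarray(result)
print(dense.shape)
```

With this selector the token counts come first and the passthrough word_length column is appended last, the same layout as the np.c_ result.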
trazoM