
I am trying to fit a logistic regression model on text data.

I vectorized my data with TF-IDF:

vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df['text_column'])

vectorizer_df = pd.DataFrame(x.toarray(), columns=vectorizer.get_feature_names())
df.drop('text_column', axis=1, inplace=True)  
result = pd.concat([df, vectorizer_df], axis=1)

I split my data:

x = result.drop('target', axis=1)
y = result['target']

and finally:

x_raw_train, x_raw_test, y_train, y_test = train_test_split(x, y, test_size=0.3,  random_state=0)

I build a classifier:

classifier = Pipeline([('clf', LogisticRegression(solver="liblinear"))])
classifier.fit(x_raw_train, y_train)

And I get this error:

ValueError: y should be a 1d array, got an array of shape (74216, 2) instead.

This is strange, because everything works with max_features=1000, but with max_features=1500 I get the error.

Can someone help me, please?

NivB

1 Answer


Basically, the text_column column in df contains at least one occurrence of the word target. That word becomes a column name when you convert the TF-IDF feature matrix to a dataframe with columns=vectorizer.get_feature_names(). Then, when you concatenate df with vectorizer_df, the final dataframe ends up with two columns named target: your original label column and the TF-IDF feature for the word target.

Therefore, result['target'] will return two columns instead of one as there are effectively two target columns in the result dataframe. This will naturally lead to a ValueError, because, as specified in the error description, you need a 1d target array to fit your estimator, whereas your target array has two columns.
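The effect is easy to reproduce on a toy frame. The sketch below uses made-up data (the values and the `other` and `price` columns are hypothetical stand-ins) and only shows that selecting a duplicated label returns a two-column DataFrame rather than a Series:

```python
import pandas as pd

# Original data with a label column called "target" (toy values)
df = pd.DataFrame({"other": [10, 20, 30], "target": [0, 1, 0]})

# Stand-in for the TF-IDF frame: one of the vocabulary words is "target"
vectorizer_df = pd.DataFrame({"price": [0.1, 0.0, 0.7], "target": [0.0, 0.5, 0.0]})

result = pd.concat([df, vectorizer_df], axis=1)

# Selecting a duplicated label returns every matching column
y = result["target"]
print(type(y).__name__)
print(y.shape)

# A quick check for colliding column names before fitting
duplicated = result.columns[result.columns.duplicated()].tolist()
print(duplicated)
```

Running a check like the last one on your own result dataframe should confirm whether target (or any other column name) is duplicated.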

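You only see this with the higher max_features value because max_features keeps only the most frequent terms in the corpus. With max_features=1000 the word target is not frequent enough to make the cut, so no duplicate column is created and the process runs as it should. With max_features=1500 it enters the vocabulary and the collision appears.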
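As an illustration of that cutoff, here is a small sketch on a toy corpus (the documents are invented, and the word counts are chosen to be distinct so there are no ties). It shows the word target entering the vocabulary only once max_features is large enough:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: "stock" occurs 3 times, "price" 2 times, "target" once
docs = ["stock price", "stock price target", "stock"]

# max_features keeps the top terms ordered by term frequency across the corpus
small = TfidfVectorizer(max_features=2).fit(docs)
large = TfidfVectorizer(max_features=3).fit(docs)

small_vocab = set(small.vocabulary_)
large_vocab = set(large.vocabulary_)

print(sorted(small_vocab))
print(sorted(large_vocab))
print("target" in small_vocab, "target" in large_vocab)
```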

Unless you have a reason to vectorize separately, the best solution for this is to combine all your steps in a pipeline. It's as simple as:

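from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=1500)),
    ('clf', LogisticRegression(solver="liblinear")),
])

# x_train and y_train are splits of the original df (raw text, no TF-IDF columns added)
pipeline.fit(x_train['text_column'], y_train['target'])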
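If the TF-IDF columns do have to live in the same dataframe as your other features, one way to avoid the collision is to prefix the generated column names. A minimal sketch, assuming toy data and a hypothetical `tfidf_` prefix (neither comes from the question):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data; the word "target" appears in the text on purpose
df = pd.DataFrame({
    "text_column": ["price target raised", "stock price falls", "target missed"],
    "target": [1, 0, 0],
})

vectorizer = TfidfVectorizer(max_features=1500)
x = vectorizer.fit_transform(df["text_column"])

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
vectorizer_df = pd.DataFrame(
    x.toarray(),
    columns=["tfidf_" + t for t in terms],  # prefix prevents name clashes
    index=df.index,
)

result = pd.concat([df.drop(columns="text_column"), vectorizer_df], axis=1)

y = result["target"]  # now a single column again
print(y.ndim)
print(result.columns.is_unique)
```

Note that fitting the vectorizer before splitting lets the test rows influence the vocabulary and IDF weights, which the pipeline version avoids.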
A.T.B