
I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier

What I have tried now is the following:

def test_tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels

(Please refer to the link above for the full code.) When I try to apply these two functions to my dataset:

train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)

I get this error:

---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True) 
---> 14     y_pred = clf.predict(X_test_naive)

ValueError: dimension mismatch

The function mentioned in the error is:

def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() 
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
        
    return 

Any help in understanding what is wrong in my new definition, and/or in applying the tf-idf to my dataset, would be appreciated (please refer here for the relevant parts: Encoding text in ML classifier).

Update: I think this question/answer might be useful as well in helping me figure out the issue: scikit-learn ValueError: dimension mismatch

If I replace `test_tfidf(testing_set, ngrams = 1)` with `test_tfidf(undersample_train, ngrams = 1)`, no error is raised. However, I do not think that is right, as the resulting metrics are suspiciously high (99% on all statistics).
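To see where the `ValueError: dimension mismatch` comes from, here is a minimal standalone sketch (with made-up sentences, independent of the code above): two independently fitted vectorizers learn different vocabularies, so the train and test matrices end up with different numbers of columns.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["one bird chirped loudly"]

# Fitting a separate vectorizer per split yields different vocabularies,
# and therefore feature matrices with different widths
X_train = TfidfVectorizer().fit_transform(train_docs)  # shape (2, 5)
X_test = TfidfVectorizer().fit_transform(test_docs)    # shape (1, 4)

clf = MultinomialNB().fit(X_train, [0, 1])
clf.predict(X_test)  # raises ValueError (e.g. "dimension mismatch"): 4 features != 5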

LdM
  • too much code. please try to reduce the amount you specify to a small subset that can be easily copied and tested ([minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)) – Marc Dec 12 '20 at 23:30
  • I deleted some code. But you need to refer to the other answer to have some reproducible code. There is no way to keep small code and have a reproducible one in this case – LdM Dec 12 '20 at 23:42

1 Answer


When using transformers (TfidfVectorizer in this case), you must use the same object to transform both the train and test data. The transformer is typically fitted on the training data only, and then re-used to transform the test data.

The correct way to do this in your case:

def tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    # Fit the vectorizer on the *training* texts only
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)

    # Return the fitted vectorizer so it can be re-used on the test set
    return X, list_labels, tfidf_vectorizer


def test_tfidf(data, vectorizer, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    # No need to create a new TfidfVectorizer here!

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = vectorizer.transform(list_corpus)
    
    return X, list_labels

# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    
    y_pred = clf.predict(X_test_naive)
    
    # sklearn metrics expect y_true first, then y_pred
    f1 = f1_score(y_test_naive, y_pred, average = 'weighted')
    pres = precision_score(y_test_naive, y_pred, average = 'weighted')
    rec = recall_score(y_test_naive, y_pred, average = 'weighted')
    acc = accuracy_score(y_test_naive, y_pred)
    
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres, 
                     'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)

    return res 

train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)
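As a side note, scikit-learn's Pipeline makes the fit-on-train / transform-on-test discipline automatic, since predict() re-uses the vocabulary fitted during fit(). A minimal sketch, assuming `undersample_train` and `testing_set` have already been through `basic_preprocessing` as in the code above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The pipeline fits the vectorizer on the training texts only;
# predict() then transforms the test texts with that same fitted vocabulary
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), MultinomialNB())
model.fit(undersample_train["Text"], undersample_train["Label"])
y_pred = model.predict(testing_set["Text"])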

Qusai Alothman
  • Also, as a side note: The code can be greatly improved, but that's outside the scope of this question. I tried to keep the original code intact as much as possible. – Qusai Alothman Dec 13 '20 at 12:26
  • @LdM I took a look and fixed the `training_naive` issue. As for the `ngrams` parameter error, please make sure to run this version of the code first. You are probably still using the old code which doesn't expect the `vectorizer` parameter. – Qusai Alothman Dec 13 '20 at 12:35
  • @LdM Ooops! That was a copy-paste error. Fixed now! – Qusai Alothman Dec 13 '20 at 12:41
  • Thanks a lot, Qusai! Very appreciated your help :) – LdM Dec 13 '20 at 12:42