Suppose I have a dataframe with different rows of text, and I want to cluster those rows to find out underlying themes in the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id_num": np.random.randint(low=0, high=50, size=10),
    "text": [
        "hello these are words i would like to cluster",
        "hello i would like to go home",
        "home i would like to go please thank you",
        "thank you please apple banana",
        "orange banana apple fruit corn",
        "orange orange orange banana banana banana banana",
        "can you take me home i have had enough of this place",
        "i am bored can we go home",
        "i would like to leave now to go home",
        "apple apple banana",
    ],
})
I will first separate this dataframe into train and test sets:
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.40)
>>> train, test = train["text"], test["text"]
Then start the clustering process:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> vectorizer = TfidfVectorizer()
>>> train_X = vectorizer.fit_transform(train)
>>> test_X = vectorizer.fit_transform(test)
>>> model = KMeans(n_clusters = 2)
>>> model.fit(train_X)
>>> model.predict(test_X)
ValueError: Incorrect number of features. Got 22 features, expected 18.
Of course, if you run this code on your own machine, you might get different results; the feature counts might even happen to match by chance. But in most cases, the dimensions of train_X and test_X will not match up.
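The mismatch arises because each call to fit_transform learns its own vocabulary from the corpus it is given. A minimal demonstration (with made-up toy documents, not the data above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

a = ["apple banana"]
b = ["apple banana cherry"]

# Each fit learns a vocabulary from its own corpus only
va = TfidfVectorizer().fit(a)
vb = TfidfVectorizer().fit(b)

print(len(va.vocabulary_))  # 2 features
print(len(vb.vocabulary_))  # 3 features
```

Since the two vectorizers disagree on the number (and meaning) of columns, a model fit on one matrix cannot score the other.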
Has anyone else dealt with this issue? I suppose one approach to making the dimensions equal would be to employ some sort of dimensionality reduction, taking only the features (read: words) that are present in both train and test. The other solution, which would make larger matrices, would be to fill in zeros in both matrices wherever a given document doesn't have the word from the other corpus.
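For what it's worth, the first idea can be sketched by fixing the vectorizer's vocabulary to the intersection of the two corpora's vocabularies, so both matrices come out with the same columns. This is only an illustration on small toy documents, not the split above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the train/test text Series
train = ["hello i would like to go home", "apple apple banana"]
test = ["home i would like to go please thank you"]

# Learn each corpus's vocabulary separately, then intersect them
train_vocab = set(TfidfVectorizer().fit(train).vocabulary_)
test_vocab = set(TfidfVectorizer().fit(test).vocabulary_)
shared = sorted(train_vocab & test_vocab)

# Vectorize both corpora against the fixed shared vocabulary
vectorizer = TfidfVectorizer(vocabulary=shared)
train_X = vectorizer.fit_transform(train)
test_X = vectorizer.transform(test)

print(train_X.shape[1] == test_X.shape[1])  # True: same feature count
```

This throws away any word unique to one side, which is exactly the trade-off described above.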
Is there another way I should be approaching this?