Suppose I have a dataframe with different rows of text, and I want to cluster those rows to find out underlying themes in the data:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id_num": np.random.randint(low=0, high=50, size=10),
    "text": [
        "hello these are words i would like to cluster",
        "hello i would like to go home",
        "home i would like to go please thank you",
        "thank you please apple banana",
        "orange banana apple fruit corn",
        "orange orange orange banana banana banana banana",
        "can you take me home i have had enough of this place",
        "i am bored can we go home",
        "i would like to leave now to go home",
        "apple apple banana",
    ],
})
I will first separate this dataframe into train and test sets:
>>> from sklearn.model_selection import train_test_split
>>> train, test = train_test_split(df, test_size=0.40)
>>> train, test = train["text"], test["text"]
Then start the clustering process:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.cluster import KMeans
>>> vectorizer = TfidfVectorizer()
>>> train_X = vectorizer.fit_transform(train)
>>> test_X = vectorizer.fit_transform(test)
>>> model = KMeans(n_clusters = 2)
>>> model.fit(train_X)
>>> model.predict(test_X)
ValueError: Incorrect number of features. Got 22 features, expected 18.
Of course, if you run this code on your own machine, you might get different results; the feature counts might even happen to match by chance. But in most cases, the dimensions of train_X and test_X will not match up.
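The mismatch arises because each call to fit_transform learns its own vocabulary from the corpus it is given. A minimal demonstration (with made-up toy documents, not the data above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

a = ["apple banana"]
b = ["apple banana cherry"]

# Each fit learns a vocabulary from its own corpus only
va = TfidfVectorizer().fit(a)
vb = TfidfVectorizer().fit(b)

print(len(va.vocabulary_))  # 2 features
print(len(vb.vocabulary_))  # 3 features
```

Since the two vectorizers disagree on the number (and meaning) of columns, a model fit on one matrix cannot score the other.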
Has anyone else dealt with this issue? I suppose one approach to making the dimensions equal would be to employ some sort of dimensionality reduction, taking only the features (read: words) that are present in both train and test. The other solution, which would make larger matrices, would be to fill in zeros in both matrices wherever a given document doesn't have the word from the other corpus.
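For what it's worth, the first idea can be sketched by fixing the vectorizer's vocabulary to the intersection of the two corpora's vocabularies, so both matrices come out with the same columns. This is only an illustration on small toy documents, not the split above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the train/test text Series
train = ["hello i would like to go home", "apple apple banana"]
test = ["home i would like to go please thank you"]

# Learn each corpus's vocabulary separately, then intersect them
train_vocab = set(TfidfVectorizer().fit(train).vocabulary_)
test_vocab = set(TfidfVectorizer().fit(test).vocabulary_)
shared = sorted(train_vocab & test_vocab)

# Vectorize both corpora against the fixed shared vocabulary
vectorizer = TfidfVectorizer(vocabulary=shared)
train_X = vectorizer.fit_transform(train)
test_X = vectorizer.transform(test)

print(train_X.shape[1] == test_X.shape[1])  # True: same feature count
```

This throws away any word unique to one side, which is exactly the trade-off described above.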
Is there another way I should be approaching this?