
I have a list of documents, and I want to find out how similar each of them is to a single target document. I just figured out how to cluster the tokenized documents, but I do not know how to measure their distance from a target document.

To implement the clustering, I first took the list of documents...

text = [
    "This is a test",
    "This is something else",
    "This is also a test"
]

I then tokenized them using the following function...

from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

def word_tokenizer(sentences):
    # Tokenize, drop English stopwords, and stem each remaining token
    tokens = word_tokenize(sentences)
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(t) for t in tokens if t not in stopwords.words('english')]
    return tokens

I passed this function to TfidfVectorizer...

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(
        tokenizer=word_tokenizer,
        max_df=0.9,
        min_df=0.1,
        lowercase=True
    )

tfidf_matrix = tfidf_vect.fit_transform(text)

I then used KMeans to cluster the TF-IDF matrix...

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(tfidf_matrix)

I then saved each cluster and printed out the results...

from collections import defaultdict

clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)
res = dict(clusters)

for cluster in range(3):
    print("cluster ", cluster, ":")
    for i, sentence in enumerate(res[cluster]):
        print("\tsentence ", i, ": ", text[sentence])

The results are as follows...

cluster  0 :
    sentence  0 :  This is also a test
cluster  1 :
    sentence  0 :  This is something else
cluster  2 :
    sentence  0 :  This is a test

This is useful information, but suppose I have a target document and I want to see how similar these documents are to it. How do I do so?

For example, suppose I have the following target...

target = ["This is target"]

How can I check to see how similar each document in text is to this target?

buydadip

3 Answers


For your question, clustering isn't really of use. Clusters can give you a general idea of which groups the data falls into, but you can't use them to compare two individual data points.

At this point you'd have to implement a distance (loss) function. I'd suggest something simple like Euclidean distance or mean squared error.

Vectorize your target document and iterate through your tfidf_matrix. For each row of the matrix, calculate its distance to the target vector. From there you can find which document is most similar to, or most different from, the target.
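A minimal sketch of that approach, assuming the tfidf_vect and tfidf_matrix from the question (euclidean_distances comes from scikit-learn's pairwise metrics):

from sklearn.metrics.pairwise import euclidean_distances

# Vectorize the target with the already-fitted vectorizer (transform, not fit_transform)
target_vec = tfidf_vect.transform(target)

# One distance per document; a smaller value means the document is closer to the target
distances = euclidean_distances(tfidf_matrix, target_vec).ravel()
for i, d in enumerate(distances):
    print("document", i, "distance from target:", d)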

Primusa

You want similarity search, not clustering.

Wrong tool for the problem: you don't need to buy an entire supermarket just to get a beer.

In fact, you are now back at the same problem you had in the first place... You put every document into a cluster, and now you need to find the nearest cluster. Just find the nearest document right away... Or, back to the supermarket metaphor: you bought the entire supermarket, but you still need to go there to actually get the beer.
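A minimal sketch of that nearest-document search, reusing the tfidf_vect and tfidf_matrix from the question and assuming cosine similarity as the measure (a common choice for TF-IDF vectors):

from sklearn.metrics.pairwise import cosine_similarity

# Transform the target with the fitted vectorizer and compare it to every document
target_vec = tfidf_vect.transform(target)
similarities = cosine_similarity(tfidf_matrix, target_vec).ravel()

# The highest cosine similarity identifies the nearest document
nearest = similarities.argmax()
print("most similar document:", text[nearest], "score:", similarities[nearest])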

Has QUIT--Anony-Mousse
  • For this particular challenge I'm supposed to implement some sort of unsupervised approach to check for similarity. I have already calculated `cosine similarity` between documents if that's what you mean by similarity search. – buydadip Apr 12 '18 at 21:31
  • Which is unsupervised, isn't it? – Has QUIT--Anony-Mousse Apr 12 '18 at 23:28
  • I wasn't sure, I thought it was just a simple mathematical computation between two `tfidf` matrices. But I guess it is unsupervised when you think about it. – buydadip Apr 12 '18 at 23:39

You can simply use KMeans.predict()

Predict the closest cluster each sample in X belongs to.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

This will return the index of the cluster to which the new sentence belongs.

Apply the same preprocessing to the target sentence and call predict(). Make sure to use the same TfidfVectorizer instance to transform the sentence.

Something like:

target_tfidf_matrix = tfidf_vect.transform(target)
results = kmeans.predict(target_tfidf_matrix)
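If you also want a notion of distance rather than just a cluster index, KMeans.transform returns the distance from each sample to every cluster centre; a short sketch building on the code above:

# Shape (1, 3): distance from the target to each cluster centre; smaller = closer
center_distances = kmeans.transform(target_tfidf_matrix)
print(center_distances)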
Vivek Kumar
  • I tried implementing this. Your answer seems promising, however I am getting the following error: `ValueError: After pruning, no terms remain. Try a lower min_df or a higher max_df.` Any idea why? – buydadip Apr 11 '18 at 19:04
  • @Bolboa. That error seems related to TfidfVectorizer. Make sure your `word_tokenizer` is returning some data for the supplied test data. Please add the `word_tokenizer` function to the question so that I can duplicate the behaviour. – Vivek Kumar Apr 12 '18 at 06:12
  • The error was simple, I accidentally applied `fit_transform` to the target instead of `transform`. Thanks for the help though. – buydadip Apr 12 '18 at 06:21