I have a list of documents, and I want to find out how similar each of them is to a single target document. I have figured out how to cluster the tokenized documents, but I don't know how to measure their distance from a target document.
Here is how I implemented the clustering. First, I took the list of documents...
text = [
"This is a test",
"This is something else",
"This is also a test"
]
I then tokenized them using the following function...
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def word_tokenizer(sentences):
    tokens = word_tokenize(sentences)
    stemmer = PorterStemmer()
    # Stem each token, dropping English stop words
    tokens = [stemmer.stem(t) for t in tokens if t not in stopwords.words('english')]
    return tokens
I passed this function to TfidfVectorizer
...
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(
    tokenizer=word_tokenizer,
    max_df=0.9,
    min_df=0.1,
    lowercase=True
)
tfidf_matrix = tfidf_vect.fit_transform(text)
I then used KMeans to cluster the tf-idf matrix...
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(tfidf_matrix)
I then saved the document indices for each cluster and printed out the results...
from collections import defaultdict

clusters = defaultdict(list)
for i, label in enumerate(kmeans.labels_):
    clusters[label].append(i)
res = dict(clusters)

for cluster in range(3):
    print("cluster ", cluster, ":")
    for i, sentence in enumerate(res[cluster]):
        print("\tsentence ", i, ": ", text[sentence])
The results are as follows...
cluster 0 :
sentence 0 : This is also a test
cluster 1 :
sentence 0 : This is something else
cluster 2 :
sentence 0 : This is a test
This is useful information, but suppose I have a target document and I want to see how similar each of my documents is to that target. How do I do that?
For example, suppose I have the following target...
target = ["This is target"]
How can I check how similar each document in text
is to this target?
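From what I've read, one option might be to transform the target with the same fitted vectorizer and then compute cosine similarity against the document matrix. Here is a minimal sketch of that idea; it uses scikit-learn's default tokenizer instead of the nltk-based word_tokenizer above so it runs standalone. Is this the right approach?

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = [
    "This is a test",
    "This is something else",
    "This is also a test",
]
target = ["This is target"]

# Fit the vectorizer on the documents, then transform (not fit) the target,
# so both end up in the same tf-idf feature space.
vect = TfidfVectorizer(lowercase=True)
tfidf_matrix = vect.fit_transform(text)
target_vec = vect.transform(target)

# One similarity score per document, in [0, 1]; higher means more similar.
similarities = cosine_similarity(target_vec, tfidf_matrix)[0]
for doc, score in zip(text, similarities):
    print(doc, "->", score)
```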