I have a problem with evaluating my clustering results.
I have 3 lists:
# 10 objects in my corpus
TOT = [1,2,3,4,5,6,7,8,9,10]
# .... clustering into k=5 clusters
# For each automatic cluster:
# Objects with IDs 2 and 8 were assigned to this cluster
predicted = [2,8]
# For each cluster in the ground truth:
true = [2,4,9]
# compute TP, FP, TN, FN
A = set(predicted)
B = set(true)
T = set(TOT)                 # TOT is a list, so convert it before set operations
TP = list(A & B)
FP = list(A - B)             # equivalent to A - (A & B)
TN = list((T - A) & (T - B))
FN = list(B - A)
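For concreteness, here is a self-contained version of the toy example, with precision and recall derived from the four sets at the end (those two lines are my own addition, just to illustrate what the counts are typically used for):

import builtins  # no external dependencies needed

TOT = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # all 10 objects
predicted = [2, 8]                       # one automatic cluster
true = [2, 4, 9]                         # one ground-truth cluster

A, B, T = set(predicted), set(true), set(TOT)
TP = A & B                # in both the cluster and the GT label
FP = A - B                # in the cluster but not in the GT label
TN = (T - A) & (T - B)    # in neither
FN = B - A                # in the GT label but missed by the cluster

print(sorted(TP), sorted(FP), sorted(TN), sorted(FN))
# [2] [8] [1, 3, 5, 6, 7, 10] [4, 9]

precision = len(TP) / (len(TP) + len(FP))   # 0.5
recall = len(TP) / (len(TP) + len(FN))      # 0.333...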
My question is: Can I compute TP, FP, TN, FN for each cluster? Does it make sense?
EDIT: Reproducible code
Short story:
I'm doing NLP. I have a corpus of 9k documents that I processed with Gensim's Word2Vec: I extracted the word vectors and computed a "document vector" for each document. After that, I computed the cosine similarity between the document vectors, obtaining a 9k x 9k matrix.
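To make that step concrete, a minimal sketch of how the document vectors and the similarity matrix could be built; averaging the word vectors is an assumption on my part, and docs is a hypothetical tokenized corpus:

import numpy as np
from gensim.models import Word2Vec
from sklearn.metrics.pairwise import cosine_similarity

# docs: hypothetical tokenized corpus, one list of tokens per document
docs = [["cat", "sits", "mat"], ["dog", "barks", "loud"], ["cat", "dog", "play"]]

model = Word2Vec(sentences=docs, vector_size=100, min_count=1, seed=42)

# one "document vector" per document: here, the mean of its word vectors
doc_vectors = np.array([np.mean([model.wv[w] for w in doc], axis=0) for doc in docs])

# n_docs x n_docs cosine-similarity matrix
sim_matrix = cosine_similarity(doc_vectors)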
Finally, using this matrix, I ran KMeans and hierarchical agglomerative clustering (HAC).
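For the HAC part, a sketch under the assumption that the similarity matrix is first turned into a distance matrix; scikit-learn's AgglomerativeClustering accepts precomputed distances (on versions before 1.2 the parameter is called affinity instead of metric), and average linkage is my choice here, not necessarily the original setup:

import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# sim_matrix: the full 9k x 9k cosine-similarity matrix from the previous step
dist_matrix = 1.0 - sim_matrix   # cosine distance = 1 - cosine similarity

hac = AgglomerativeClustering(n_clusters=14, metric="precomputed", linkage="average")
labels = hac.fit_predict(dist_matrix)

# dataframe in the same shape as the output shown below
clusters = pd.DataFrame({"id": range(len(labels)), "label": labels})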
Let's consider the outputs from HAC with 14 clusters:
id label
0 1
1 8
....
9k 12
Now the problem is: How can I evaluate the quality of my clusters?
My professor has read 100 of these 9k documents and has created some 'clusters', saying: "OK, this document talks about label1", or "OK, this other one talks about both label2 and label3".
Notice that the labels provided by my professor are completely unrelated to the clustering process; they are just summaries of the topics. The number of labels, however, matches the number of clusters (in this example, 14).
The code
I have two dataframes: the one above, from HAC clustering, and the one from my professor covering the 100 read documents, which looks like this (continuing the example above):
GT
id  label1  label2  label3  ...  label14
5   1       0       0       ...  0
34  0       1       1       ...  0
...
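In case it helps to see that structure built, a hypothetical way to construct the indicator dataframe from the professor's notes (the notes dict and its contents are made up; only the shape matters):

import pandas as pd

# hypothetical raw annotations: document id -> topics mentioned
notes = {5: ["label1"], 34: ["label2", "label3"]}

all_labels = [f"label{i}" for i in range(1, 15)]
GT = pd.DataFrame(
    [[doc] + [1 if lab in topics else 0 for lab in all_labels]
     for doc, topics in notes.items()],
    columns=["id"] + all_labels,
)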
And finally, my code does this:
# since I have labels only for 100 of my 9k documents
indexes = list(map(int, ground_truth['id'].values.tolist()))
reduced_df = clusters.loc[clusters['id'].isin(indexes), :]
# now reduced_df contains only the documents that have been read by my prof
TOT = set(reduced_df['id'].values.tolist())
for each cluster from HAC:
    doc_in_this_cluster = [ .... ]
    for each label from GT:
        doc_in_this_label = [ ... ]
        A = set(doc_in_this_cluster)
        B = set(doc_in_this_label)
        TP = list(A & B)
        FP = list(A - B)
        TN = list((TOT - A) & (TOT - B))
        FN = list(B - A)
And the code:
indexes = list(map(int, self.ground_truth['id'].values.tolist()))
# reduce the clusters file to the manually analyzed documents only --------> TOT
reduced_df = self.clusters.loc[self.clusters['id'].isin(indexes), :]
TOT = set(reduced_df['id'].values.tolist())
clusters_groups = reduced_df.groupby('label')
for label, df_group in clusters_groups:
    docs_in_cluster = df_group['id'].values.tolist()
    for col in self.ground_truth.columns[1:]:
        # ids of the documents the professor tagged with this GT label
        constraints = list(
            map(int, self.ground_truth.loc[self.ground_truth[col] == 1, 'id'].values.tolist())
        )
        A = set(docs_in_cluster)
        B = set(constraints)
        TP = list(A & B)
        FP = list(A - B)   # equivalent to A - (A & B)
        TN = list((TOT - A) & (TOT - B))
        FN = list(B - A)
        print(f"HAC Cluster: {label} - GT Label: {col}")
        print(TP, FP, TN, FN)
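To make this runnable end to end, here is the same loop outside the class, fed with hypothetical toy dataframes (the values are made up; only the shapes match my real data):

import pandas as pd

clusters = pd.DataFrame({"id": [5, 34, 40], "label": [0, 1, 0]})
ground_truth = pd.DataFrame({"id": [5, 34], "label1": [1, 0], "label2": [0, 1]})

indexes = list(map(int, ground_truth['id'].values.tolist()))
reduced_df = clusters.loc[clusters['id'].isin(indexes), :]
TOT = set(reduced_df['id'].values.tolist())

for label, df_group in reduced_df.groupby('label'):
    A = set(df_group['id'].values.tolist())
    for col in ground_truth.columns[1:]:
        B = set(map(int, ground_truth.loc[ground_truth[col] == 1, 'id']))
        TP, FP = list(A & B), list(A - B)
        TN, FN = list((TOT - A) & (TOT - B)), list(B - A)
        print(f"HAC Cluster: {label} - GT Label: {col}")
        print(TP, FP, TN, FN)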