How can I calculate the correlation between the classes of texts? For example, I have 3 texts:
texts = [
    "Chennai Super Kings won the final 2018 IPL",
    "Chennai Super Kings Crowned IPL 2018 Champions",
    "Chennai super kings returns",
]
subjects = ["final", "Crowned", "returns"]
Each text has a label (class), so this is close to a text classification problem. What I need, though, is a measure of the "difference" between classes.
I can compute TF-IDF and get the matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = ["Chennai Super Kings won the final 2018 IPL", "Chennai Super Kings Crowned IPL 2018 Champions",
         "Chennai super kings returns"]
subjects = ["final", "Crowned", "returns"]

tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
res = pd.DataFrame(features.toarray(), index=subjects, columns=tfidf.get_feature_names_out())
               2018  champions   chennai   crowned     final       ipl     kings   returns     super       the       won
"final"    0.333407   0.000000  0.258921  0.000000  0.438391  0.333407  0.258921  0.000000  0.258921  0.438391  0.438391
"Crowned"  0.370954   0.487760  0.288079  0.487760  0.000000  0.370954  0.288079  0.000000  0.288079  0.000000  0.000000
"returns"  0.000000   0.000000  0.412859  0.000000  0.000000  0.000000  0.412859  0.699030  0.412859  0.000000  0.000000
I need a score that tells me how close the subject "final" is to "Crowned".
What metric should I use?
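One candidate is cosine similarity between the TF-IDF rows. Below is a minimal sketch, assuming each row of the matrix above stands for one class; the values are copied from that matrix and rounded to 4 decimals, and the helper name `cosine` is only an illustration, not a library function.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# TF-IDF rows from the matrix above, rounded to 4 decimals.
# Column order: 2018 champions chennai crowned final ipl kings returns super the won
rows = {
    "final":   [0.3334, 0.0, 0.2589, 0.0, 0.4384, 0.3334, 0.2589, 0.0, 0.2589, 0.4384, 0.4384],
    "Crowned": [0.3710, 0.4878, 0.2881, 0.4878, 0.0, 0.3710, 0.2881, 0.0, 0.2881, 0.0, 0.0],
    "returns": [0.0, 0.0, 0.4129, 0.0, 0.0, 0.0, 0.4129, 0.6990, 0.4129, 0.0, 0.0],
}

sim_final_crowned = cosine(rows["final"], rows["Crowned"])
sim_final_returns = cosine(rows["final"], rows["returns"])
print(round(sim_final_crowned, 2), round(sim_final_returns, 2))
```

A score near 1 means the two rows use the same words with similar weights, and 0 means they share no words. With scikit-learn installed, `sklearn.metrics.pairwise.cosine_similarity(features)` should give the full pairwise matrix directly from the sparse TF-IDF output.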
////////////////////////////////////////////////////////////////
Suppose you have 5 texts:
(1) After school, Kamal took the girls to the old house. It was very old and very dirty too. There was rubbish everywhere. The windows were broken and the walls were damp. It was scary.
(2) Amy didn’t like it. There were paintings of zombies and skeletons on the walls. “We’re going to take photos for the school art competition,” said Kamal. Amy didn’t like it but she didn’t say anything.
(3) “Where’s Grant?” asked Tara. “Er, he’s buying more paint.” Kamal looked away quickly. Tara thought he looked suspicious. “It’s getting dark, can we go now?” said Amy. She didn’t like zombies.
(4) Then, they heard a loud noise coming from a cupboard in the corner of the room. “What’s that?” Amy was frightened. “I didn’t hear anything,” said Kamal. Something was making strange noises.
(5) “What do you mean? There’s nothing there!” Kamal was trying not to smile. Suddenly the door opened with a bang and a zombie appeared, shouting and moving its arms. Amy screamed and covered her eyes.
Each text has labels:
1st text - school, house, scary
2nd text - zombies, paint
3rd text - zombies, dark, paint
4th text - noise, frightened
5th text - zombie, screamed
The 1st task is to find the correlation between texts. It seems @MarkH has already pointed me in the right direction (cosine similarity).
The 2nd task is to find the correlation between labels. You can see that "zombie" appears in most of the label sets. Also, the 2nd and 3rd texts share 2 labels: "zombies, paint". Suppose we have 10000 texts. What is the chance that these labels describe the same thing, so that we can delete one of them (paint) and keep only the other (zombie)? It is like a contribution to the variation: does removing some labels change the result much? Can we remove or merge some labels?
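To make the 2nd task concrete, here is a rough sketch of one way to score label pairs by co-occurrence. It is only an assumption about what "correlation between labels" could mean here: Jaccard overlap of the sets of texts carrying each label, plus a conditional share. The names `labels_per_text`, `texts_with`, `jaccard` and `conditional` are hypothetical, and "zombies" is normalized to "zombie" by hand.

```python
from collections import defaultdict
from itertools import combinations

# Labels of the 5 texts above; "zombies" is written as "zombie" (manual normalization).
labels_per_text = [
    {"school", "house", "scary"},
    {"zombie", "paint"},
    {"zombie", "dark", "paint"},
    {"noise", "frightened"},
    {"zombie", "screamed"},
]

# Map each label to the set of text indices that carry it.
texts_with = defaultdict(set)
for i, labels in enumerate(labels_per_text):
    for label in labels:
        texts_with[label].add(i)


def jaccard(a, b):
    """Shared texts divided by all texts carrying either label."""
    return len(texts_with[a] & texts_with[b]) / len(texts_with[a] | texts_with[b])


def conditional(a, b):
    """Share of the texts labelled b that are also labelled a."""
    return len(texts_with[a] & texts_with[b]) / len(texts_with[b])


# Rank label pairs by overlap; high-scoring pairs are candidates for merging.
pairs = sorted(combinations(sorted(texts_with), 2), key=lambda p: jaccard(*p), reverse=True)
for a, b in pairs[:5]:
    print(a, b, round(jaccard(a, b), 2))
```

In this toy data every text labelled "paint" is also labelled "zombie", so `conditional("zombie", "paint")` is 1.0 while the reverse direction is lower. On 10000 texts, a pair with a conditional share close to 1 would be a candidate for dropping or merging the rarer label, though whether that is safe depends on what the labels are used for.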