How can I calculate correlation between subjects?

Question

How can I calculate the correlation between the classes of texts? E.g., I have 3 texts:

texts = ["Chennai Super Kings won the final 2018 IPL",
         "Chennai Super Kings Crowned IPL 2018 Champions",
         "Chennai super kings returns"]

subjects = ["final", "Crowned", "returns"]

Each text has a label (class), so this is close to a text classification problem. But I need to calculate a measure of "difference".

I can compute TF-IDF and get the matrix:

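from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

texts = ["Chennai Super Kings won the final 2018 IPL",
         "Chennai Super Kings Crowned IPL 2018 Champions",
         "Chennai super kings returns"]
subjects = ["final", "Crowned", "returns"]

# Fit TF-IDF on the texts; each row of `features` is one text
tfidf = TfidfVectorizer()
features = tfidf.fit_transform(texts)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
res = pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out(), index=subjects)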

           2018    champions  chennai  crowned  final   ipl     kings   returns  super   the     won
"final"    0.3334  0.0        0.2589   0.0      0.4384  0.3334  0.2589  0.0      0.2589  0.4384  0.4384
"Crowned"  0.3710  0.4878     0.2881   0.4878   0.0     0.3710  0.2881  0.0      0.2881  0.0     0.0
"returns"  0.0     0.0        0.4129   0.0      0.0     0.0     0.4129  0.6990   0.4129  0.0     0.0

I need a score that tells me how close the subject "final" is to "Crowned".

What metric should I use?

Suppose you have 5 texts:

After school, Kamal took the girls to the old house. It was very old and very dirty too. There was rubbish everywhere. The windows were broken and the walls were damp. It was scary. (1)

Amy didn’t like it. There were paintings of zombies and skeletons on the walls. “We’re going to take photos for the school art competition,” said Kamal. Amy didn’t like it but she didn’t say anything. (2)

“Where’s Grant?” asked Tara. “Er, he’s buying more paint.” Kamal looked away quickly. Tara thought he looked suspicious. “It’s getting dark, can we go now?” said Amy. She didn’t like zombies. (3)

Then, they heard a loud noise coming from a cupboard in the corner of the room. “What’s that?” Amy was frightened. “I didn’t hear anything,” said Kamal. Something was making strange noises. (4)

“What do you mean? There’s nothing there!” Kamal was trying not to smile. Suddenly the door opened with a bang and a zombie appeared, shouting and moving its arms. Amy screamed and covered her eyes. (5)

Each text has labels:

1st text - school, house, scary
2nd text - zombies, paint
3rd text - zombies, dark, paint
4th text - noise, frightened
5th text - zombie, screamed

The 1st task is to find the correlation between texts. It seems @MarkH has already pointed me in the right direction (cosine similarity).

The 2nd task is to find the correlation between labels. You can see that almost all texts are labeled "zombie". Also, the 2nd and 3rd texts share 2 labels: "zombies" and "paint". Suppose we have 10000 texts. How likely is it that these labels describe the same thing, so that we can delete one of them ("paint") and keep only one ("zombie")? It is like a contribution to the variation: does removing some labels affect the result much? Can we remove or merge some labels?

score 1 · Answer 1 · answered Sep 20 '19 at 14:48

I think you can use cosine similarity, which is quite common for this kind of task.

from sklearn.metrics.pairwise import cosine_similarity

# `features` is the TF-IDF matrix from the question; entry [i, j] compares text i with text j
msgs_CosSim = pd.DataFrame(cosine_similarity(features, features))
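
To see what the score looks like for the three texts in the question, here is a minimal standard-library sketch of the same computation. It assumes the TF-IDF values from the question's table, rounded to four decimals; the names tfidf_rows, cosine and similarity_table are illustrative and not part of scikit-learn.

```python
import math

# TF-IDF rows copied from the question's table (rounded to four decimals).
# Column order: 2018, champions, chennai, crowned, final, ipl, kings,
# returns, super, the, won
tfidf_rows = {
    "final":   [0.3334, 0.0, 0.2589, 0.0, 0.4384, 0.3334,
                0.2589, 0.0, 0.2589, 0.4384, 0.4384],
    "Crowned": [0.3710, 0.4878, 0.2881, 0.4878, 0.0, 0.3710,
                0.2881, 0.0, 0.2881, 0.0, 0.0],
    "returns": [0.0, 0.0, 0.4129, 0.0, 0.0, 0.0,
                0.4129, 0.6990, 0.4129, 0.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_table(rows):
    """Return {(subject_a, subject_b): cosine similarity} for every pair."""
    names = list(rows)
    return {(a, b): cosine(rows[a], rows[b]) for a in names for b in names}


scores = similarity_table(tfidf_rows)
for (a, b), score in scores.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {score:.3f}")
```

With these values, "final" vs "Crowned" comes out around 0.47, higher than either pair involving "returns", and each text scores 1.0 against itself.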

answered Sep 20 '19 at 14:48

MarkH

score 0 · Answer 2 · answered Sep 20 '19 at 15:06

The concept of correlation measures the closeness between features, but you are saying you want to do it for the class labels. That doesn't make sense, because if the features are the same then they must have the same class label. Please share the ultimate problem you are trying to solve.

answered Sep 20 '19 at 15:06

vBrail

I have added a new explanation in the question. Could you please comment? – user565447 Sep 21 '19 at 15:52
Hi, from your explanation it seems you are dealing with a multilabel rather than a multiclass classification problem. You are at the preprocessing step and, instead of millions of labels, you want to reduce the number of labels so that you can train your model with less headache. If I'm right, reply and I'll suggest some ways and references to tackle the problem. Thank you. – vBrail Sep 22 '19 at 04:40
You are right, but it is both a multilabel and a multiclass problem. Multilabel: we have more than 1 label for each class. Multiclass: we have more than 1 class (not binary classification). But the rest of your message is correct. Could you please advise how to tackle the problem? – user565447 Sep 22 '19 at 11:19
Hi, I'm assuming that you are handling tens of thousands of labels or more. One solution is to convert your text features into 300-dimensional word2vec (w2v) vectors and convert your labels into binary values. Then treat the binary labels as the features and each of the 300 w2v dimensions as the output, and train a regression model for each one separately. You will get 300 weight vectors, each with as many components as there are labels. Now create a new vector from them using a technique such as the mean or median, and finally compare the component values with each other: if the values are close enough, the labels corresponding to them are more correlated. – vBrail Sep 22 '19 at 13:35
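
The regression approach in the last comment needs word embeddings. As a simpler baseline for the label-reduction question, one could measure how often two labels appear on the same texts, for example with the Jaccard score over the sets of texts that carry each label. This is a swapped-in technique, not the method from the comment. The sketch below assumes labels are stored as one set per text and that "zombie" and "zombies" are treated as the same label; label_jaccard is a hypothetical helper name.

```python
from itertools import combinations

# Labels per text, taken from the five-text example in the question.
# Assumption: "zombie" is normalized to "zombies".
labels_per_text = [
    {"school", "house", "scary"},
    {"zombies", "paint"},
    {"zombies", "dark", "paint"},
    {"noise", "frightened"},
    {"zombies", "screamed"},
]


def label_jaccard(labels_per_text):
    """Return {(label_a, label_b): Jaccard score} for co-occurring labels.

    The score is |texts with both labels| / |texts with either label|.
    """
    # Map each label to the set of text indices that carry it.
    occurrences = {}
    for text_id, labels in enumerate(labels_per_text):
        for label in labels:
            occurrences.setdefault(label, set()).add(text_id)

    scores = {}
    for a, b in combinations(sorted(occurrences), 2):
        shared = occurrences[a] & occurrences[b]
        if shared:  # keep only pairs that co-occur at least once
            either = occurrences[a] | occurrences[b]
            scores[(a, b)] = len(shared) / len(either)
    return scores


scores = label_jaccard(labels_per_text)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

On a large corpus, a pair whose score stays close to 1 almost always appears together, which makes it a candidate for merging. Here "paint" and "zombies" score 2/3, because "zombies" also appears on a text without "paint". The score ignores the text content entirely, so it only complements the cosine similarity between texts.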
