
Does it make sense to calculate Pearson correlation coefficients based on a tf-idf matrix to see which terms occur in combination with other terms? Is it mathematically correct?

My output is a correlation matrix with a correlation coefficient in each cell, one for each pair of terms:

|       | term1 | term2 | term3 |
|-------|-------|-------|-------|
| term1 | …     | …     | …     |
| term2 | …     | …     | …     |
| term3 | …     | …     | …     |
user1341610

1 Answer


It depends on your definition of 'occurs in combination with other terms'. Some points to clarify this:

idf is irrelevant when computing a Pearson product-moment correlation (PMC). All tf values for the same term are multiplied by the same idf value to yield the final tf-idf, and the PMC is invariant under positive scaling of its inputs, so the idf cancels out here. Hence all that matters in your proposed idea is the tf. You might save some calculations by not computing the idf at all, but it won't hurt much if you do.
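A minimal sketch of this point, with made-up tf counts and arbitrary idf weights (none of these numbers come from real data):

```python
import numpy as np
from scipy.stats import pearsonr

tf_a = np.array([3.0, 0.0, 5.0, 2.0])   # invented tf of termA across 4 documents
tf_b = np.array([1.0, 4.0, 2.0, 0.0])   # invented tf of termB across 4 documents
idf_a, idf_b = 1.7, 0.4                 # arbitrary positive idf weights

r_tf, _ = pearsonr(tf_a, tf_b)
r_tfidf, _ = pearsonr(tf_a * idf_a, tf_b * idf_b)
print(r_tf, r_tfidf)                    # identical up to floating point
```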

Now, about the use of the tf. Let's work through an example to figure out what you might need:

Let's say TermA appears in Document1 very often and only a little in Document2, while TermB appears in Document1 a little and in Document2 very often. Would you say these two terms appear together or not? They occur in the same documents, but at very different frequencies. If you use the PMC of the tf-idf values, the result will be that they do not co-occur; in fact their frequencies are anti-correlated, so the PMC will be negative.

At this point you should also note that the PMC ranges from -1 to 1. That is, you could have words which co-occur (PMC=1), words which are independent (PMC=0), and words which behave in opposite ways (PMC=-1). Does this fit the domain you are modelling? If not, just add 1 to the PMC.
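To make the example concrete, here is a small sketch with invented counts (a third document is added so the result is not trivially ±1):

```python
import numpy as np
from scipy.stats import pearsonr

# Invented tf values across three documents
termA = np.array([90.0, 50.0, 2.0])    # very often early on, rare later
termB = np.array([10.0, 50.0, 98.0])   # the reverse pattern (100 - termA)

r, _ = pearsonr(termA, termB)
print(r)        # -1.0: anti-correlated, so "not co-occurring" under the PMC
print(r + 1.0)  #  0.0: the shifted value if negative similarities do not fit your domain
```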

Another alternative would be to use cosine similarity, which is very similar to the PMC but has somewhat different characteristics. In other cases again, you might only be interested in actual co-occurrence and not care about frequency at all.

All these methods are 'correct', so to speak. The more important question is which of them best fits the problem you are modelling. In many cases this cannot be determined theoretically, but only by trying out the alternatives and testing which one fits your problem domain best.

EDIT (some remarks about the comments below):

Cosine similarity does actually help, but you have to think about it differently in that case. You can of course produce term-frequency vectors for the documents and then calculate the cosine similarity between these document vectors. You pointed out correctly that this would give you the similarity of posts to each other. But that is not what I meant. If you have the complete term-frequency matrix, you can also produce vectors which describe, for a single term, how often that term appeared in each document. You can calculate the cosine similarity of these vectors as well, and that gives you the similarity of terms based on document co-occurrence.

Think about it this way (but first we will need some notation):

Let f_{i,j} denote the number of times term j appears in document i (note that I am ignoring idf here, since it just cancels out when handling terms instead of documents). Also let F = (f_{i,j})_{i=1...M, j=1...N} denote the whole document-term matrix (documents go in rows and terms in columns, so F is M×N for M documents and N terms). Finally, let |F|_c be the matrix F with each column normalized to unit l^2 norm, and |F|_r the matrix F with each row normalized to unit l^2 norm. As usual, A^T denotes the transpose of A. With this notation, the ordinary cosine similarity between all documents, based on terms, is

(|F|_r)*(|F|_r)^T

This gives you an M×M matrix describing the similarity of the documents to each other.

If you want to calculate term similarity instead, you would simply calculate

(|F|_c)^T*(|F|_c)

which gives you an N×N matrix describing term similarity based on co-occurrence in documents.

Note that the calculation of the PMC is basically the same; it differs only in the normalisation applied to the rows and columns in each matrix multiplication (for the PMC, each term vector is mean-centred before the l^2 normalisation).
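Here is a sketch of these matrix products in plain numpy, using a small made-up 4×3 document-term count matrix (all numbers are invented for illustration):

```python
import numpy as np

# Invented document-term count matrix F: 4 documents (rows) x 3 terms (columns)
F = np.array([[5.0, 1.0, 0.0],
              [1.0, 4.0, 2.0],
              [0.0, 2.0, 3.0],
              [2.0, 0.0, 4.0]])

def l2_normalize(A, axis):
    """Divide rows (axis=1) or columns (axis=0) by their l^2 norm."""
    norms = np.linalg.norm(A, axis=axis, keepdims=True)
    norms[norms == 0.0] = 1.0          # guard against all-zero rows/columns
    return A / norms

F_r = l2_normalize(F, axis=1)          # |F|_r: each row (document) normalized
F_c = l2_normalize(F, axis=0)          # |F|_c: each column (term) normalized

doc_sim  = F_r @ F_r.T                 # M x M: document similarity based on terms
term_sim = F_c.T @ F_c                 # N x N: term similarity based on documents

# PMC between terms: the same product, but centre each term column first
F_centred = F - F.mean(axis=0, keepdims=True)
term_pmc  = l2_normalize(F_centred, axis=0).T @ l2_normalize(F_centred, axis=0)
```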

Now, to your other post: you say you would like to find out how likely it is that if termA appears in a document, termB also appears in the same document. Formally speaking, that is p(termB | termA), where p(termX) denotes the probability of termX appearing in a document. That is a different beast altogether, but again very simple to calculate:

1. Count the number of documents in which `termA` appears (call it num_termA)
2. Count the number of documents in which both `termA` and `termB` appear (call it num_termA_termB)

then `p(termB | termA) = num_termA_termB / num_termA`.
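A sketch of this count, assuming a document-term count matrix `F` as in the earlier sketch (the column indices `a` and `b` stand for hypothetical term positions):

```python
import numpy as np

def p_b_given_a(F, a, b):
    """Fraction of documents containing term a that also contain term b."""
    has_a = F[:, a] > 0                      # documents in which term a appears
    has_b = F[:, b] > 0                      # documents in which term b appears
    num_termA = has_a.sum()
    num_termA_termB = (has_a & has_b).sum()  # documents containing both
    return num_termA_termB / num_termA if num_termA else 0.0
```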

This is an actual statistical measure of the likelihood of co-occurrence. However, be aware that most likely p(termB | termA) == p(termA | termB) will not hold, i.e. the measure is asymmetric, while MDS expects a symmetric (dis)similarity. So this measure of co-occurrence is most likely (no pun intended) not usable for clustering via MDS.

My suggestion is to try both the PMC and cosine similarity (as you can see above, they differ only in normalisation, so both should be quick to implement) and then check which one looks better after clustering.
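If you go the MDS route from your comment, a sketch like the following would turn a symmetric term-similarity matrix (continuing from `term_sim` in the earlier sketch, or a PMC matrix shifted by +1 and rescaled to [0, 1]) into 2D coordinates; it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.manifold import MDS

# term_sim: symmetric N x N similarity in [0, 1] (e.g. from the sketch above)
dissim = 1.0 - term_sim              # MDS wants dissimilarities, not similarities
np.fill_diagonal(dissim, 0.0)        # a term has zero distance to itself

coords = MDS(n_components=2, dissimilarity='precomputed',
             random_state=0).fit_transform(dissim)
# coords[i] is the (x, y) position of term i on the topic map
```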

There are also more advanced techniques for clustering topics based on a set of documents. Principal component analysis (PCA) or non-negative matrix factorisation of the term-document matrix is frequently used as well (see latent semantic analysis, LSA, for more info). However, this might be overkill for your use case, and these techniques are much harder to get right. The PMC and cosine similarity have the benefit of being dead simple to implement (cosine similarity is a bit simpler, because the normalisation is easier) and are thus hard to get wrong.
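For completeness, a minimal LSA-style sketch via truncated SVD in plain numpy (again continuing from the made-up matrix `F` above; `k` is a hypothetical number of latent topics):

```python
import numpy as np

k = 2                                        # number of latent topics to keep
U, s, Vt = np.linalg.svd(F, full_matrices=False)

doc_topics  = U[:, :k] * s[:k]               # documents in the latent topic space
term_topics = Vt[:k, :].T * s[:k]            # terms in the latent topic space

# Cosine similarity between rows of term_topics gives term similarity in the
# reduced space, which often smooths over sparse co-occurrence counts.
```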

LiKao
  • I'm trying to create a two-dimensional map of the hottest topics and their relations (do they occur together?) about a particular brand. I have about 2500 social media/network posts (from Facebook, Twitter, boards, etc.). I tokenized, filtered (stopwords) and stemmed the posts, then calculated tf-idf values for all documents (posts) and terms. I used these values to calculate a correlation matrix, which was then used for multidimensional scaling. The output is a "map" of all topics (terms): topics which occur together are closer than topics that do not. – user1341610 Apr 18 '12 at 17:05
  • To my mind, cosine similarity would not fit my approach, because it calculates the similarity of whole posts/documents. What I need is to measure whether the terms are related to each other, i.e. if someone wrote XY in their post, it is likely that they wrote XZ too. – user1341610 Apr 18 '12 at 17:38
  • @user1341610: see my edits, I hope I could clear this up a little. – LiKao Apr 18 '12 at 22:10