Systematic threshold for cosine similarity with TF-IDF weights

Question

I am running an analysis of several thousand (e.g., 10,000) text documents. I have computed TF-IDF weights and have a matrix with pairwise cosine similarities. I want to treat the documents as a graph to analyze various properties (e.g., the path length separating groups of documents) and to visualize the connections as a network.

The problem is that there are too many similarities. Most are too small to be meaningful. I see many people dealing with this problem by dropping all similarities below a particular threshold, e.g., similarities below 0.5.

However, 0.5 (or 0.6, or 0.7, etc.) is an arbitrary threshold, and I'm looking for techniques that are more objective or systematic to get rid of tiny similarities.

I'm open to many different strategies. For example, is there a different alternative to tf-idf that would make most of the small similarities 0? Other methods to keep only significant similarities?

Andrew Scott Evans · Answer 1 · 2015-07-10T15:29:34.033

In short, take the average cosine value of an initial clustering or even all of the initial sentences and accept or reject clusters based on something akin to the following.

One way to look at the problem is to try and develop a score based on a distance from the mean similarity (1.5 standard deviations (86th percentile if the data were normal) tends to mark an outlier with 3 (99.9th percentile) being an extreme outlier), taking the high end for good measure. I cannot remember where, but this idea has had traction in other forums and formed the basis for my similarity.

Keep in mind that the data is not likely to be normally distributed.

average(cosine_similarities)+alpha*standard_deviation(cosine_similarities)

In order to obtain alpha, you could use the Wu Palmer score or another score as described by NLTK. Strong similarities with Wu Palmer should lead to a larger range of acceptance while lower Wu Palmer scores should lead to a more strict acceptance. Therefore, taking 1-Wu Palmer score would be adviseable. You can even use this method for LSA or LDA groups. To be even more strict and take things close to 1.5 or more standard deviations, you could even try 1+Wu Palmer (the cream of the crop), re-find the ultimate K,find the new score, cluster, and repeat.

Beware though, this would mean finding the Wu Palmer of all relevant words and is quite a large computational problem. Also, 10000 documents is peanuts compared to most algorithms. The smallest I have seen for tweets was 15,000 and the 20 news groups set was 20,000 documents. I am pretty sure Alchemy API uses something akin to the 20 news groups set. They definitely use senti-wordnet.

The basic equation is not really mine so feel free to dig around for it.

Another thing to keep in mind is that the calculation is time intensive. It may be a good idea to use a student t value for estimating the expected value/mean wu-palmer score of SOV pairings and especially good if you try to take the entire sentence. Commons Math3 for java/scala includes the distribution as does scipy for python and R should already have something as well.

Xbar +/- tsub(alpha/2)*sample_std/sqrt(sample_size)

Note: There is another option with this weight. You could use an algorithm that adds or subtracts from this threshold until achieving the best result. This would likely not be related solely to the cosine importance but possibly to an inflection point or gap as with Tibshirani's gap statistic.

I would also recommend looking into silhouettes to determine if your centers actually are an issue. — Andrew Scott Evans, Aug 24 '16 at 23:27

Systematic threshold for cosine similarity with TF-IDF weights

1 Answers1

Linked