13

How do Google News and Techmeme cluster similar news items? Is there a well-known algorithm used to achieve this?

Appreciate your help.

Thanks in advance.

niraj

3 Answers

9

One fairly common way to cluster text based on content is to run Principal Component Analysis on the word vectors, followed by a simple clustering algorithm such as K-Means. Each article is represented as a vector of n dimensions, where each possible word is one dimension and the magnitude along that dimension is the number of occurrences of the word in that article.
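To make the vector representation concrete, here is a minimal sketch of word-count vectors followed by plain K-Means. It uses only the standard library, and it leaves out the PCA step for brevity, so it clusters the raw count vectors directly. The function names, the sample headlines, and the choice of seeding the centroids with the first k articles are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_vectors(docs):
    """Return (vocab, vectors): one dimension per word, value = occurrences."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([float(counts[w]) for w in vocab])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain Lloyd's algorithm. Seeding with the first k vectors is an
    assumption made to keep the example deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample headlines.
articles = [
    "Apple unveils new iPhone at launch event",
    "Senate passes budget bill after long debate",
    "New iPhone launch event draws crowds for Apple",
    "Budget bill debate ends as Senate votes",
]
vocab, vectors = count_vectors(articles)
labels = kmeans(vectors, k=2)
print(labels)
```

With real articles the vocabulary runs to tens of thousands of dimensions, which is where a dimensionality reduction such as PCA earns its place before clustering.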

maxaposteriori
  • Thanks Andy, I appreciate your help. While researching this topic from your answer I found some useful links. I am posting them here as a comment so that anyone interested in this topic has a starting point. Hierarchical agglomerative clustering: http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html A Tutorial on Clustering Algorithms: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/kmeans.html Introduction to Information Retrieval: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html – niraj Apr 29 '09 at 15:24
  • @niraj: Thanks for the link to the tutorial which is very informative. – mins Jul 30 '14 at 14:13
5

The algorithmic basis is agglomerative clustering or something similar, but there are a number of heuristics on top of that. For example, the vector space is surely composed of both words and phrases (word n-grams). Limiting the search to a strict time window is also very important. Identifying names, and weighting the title and the paragraph headings more heavily, are also key parts.
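As a rough illustration of how those heuristics can sit on top of agglomerative clustering, here is a minimal sketch using average-link merging over word and bigram features, with title terms weighted more heavily and pairs outside a time window treated as dissimilar. Nothing here is known to match what Google News or Techmeme actually do: the constants `TITLE_WEIGHT`, `MAX_GAP`, and `MIN_SIMILARITY`, the function names, and the sample stories are all assumptions, and name identification is not attempted.

```python
import math
import re
from collections import Counter
from datetime import datetime, timedelta

TITLE_WEIGHT = 3.0              # assumed boost for terms in the title
MAX_GAP = timedelta(hours=48)   # assumed time window
MIN_SIMILARITY = 0.3            # assumed merge threshold


def ngrams(text):
    """Words plus word bigrams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return words + [" ".join(pair) for pair in zip(words, words[1:])]


def features(title, body):
    feats = Counter()
    for term in ngrams(title):
        feats[term] += TITLE_WEIGHT
    for term in ngrams(body):
        feats[term] += 1.0
    return feats


def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def agglomerate(items):
    """items: list of (timestamp, feature Counter). Returns index clusters."""
    clusters = [[i] for i in range(len(items))]

    def link(c1, c2):
        # Average-link similarity; pairs outside the window contribute 0.
        total = 0.0
        for i in c1:
            for j in c2:
                if abs(items[i][0] - items[j][0]) <= MAX_GAP:
                    total += cosine(items[i][1], items[j][1])
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = MIN_SIMILARITY, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = link(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if pair is None:
            break  # no remaining pair is similar enough to merge
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical stories: (timestamp, title, body).
stories = [
    (datetime(2009, 4, 29, 9), "Apple unveils new iPhone",
     "Apple showed the new iPhone at an event"),
    (datetime(2009, 4, 29, 15), "New iPhone unveiled by Apple",
     "The new iPhone was shown at an Apple event"),
    (datetime(2009, 6, 15, 10), "Apple unveils new iPhone",
     "Apple showed the new iPhone at an event"),
    (datetime(2009, 4, 29, 10), "Senate passes budget bill",
     "The budget bill passed after a long debate"),
]
items = [(ts, features(title, body)) for ts, title, body in stories]
clusters = agglomerate(items)
print(clusters)
```

Note that the third story has exactly the same text as the first but stays in its own cluster, because it falls outside the time window.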

On a tangentially related note, if you are interested in finding near-duplicate articles, there are a number of easier-to-implement approaches, such as the one described here
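The linked approach is not reproduced in this thread, so the following is only one common near-duplicate technique, swapped in for illustration: compare sets of overlapping word shingles using Jaccard similarity. The shingle size, the threshold, the function names, and the sample sentences are assumptions.

```python
import re


def shingles(text, size=3):
    """Set of overlapping word n-grams ("shingles") from the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(docs, threshold=0.5):
    """Return index pairs whose shingle sets overlap at least `threshold`."""
    sets = [shingles(d) for d in docs]
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


# Hypothetical sample texts.
docs = [
    "The senate passed the budget bill on Tuesday after a long debate",
    "The senate passed the budget bill on Tuesday after a lengthy debate",
    "Apple unveiled a new phone at its annual launch event",
]
pairs = near_duplicates(docs)
print(pairs)
```

Comparing every pair is quadratic in the number of articles, so this direct form only suits small collections.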

Costas Boulis
1

There are a few different ways to do it. The standard one is a "bag of words" analysis with TF-IDF weighting, followed by cosine similarity and k-means.
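For anyone unfamiliar with the terms, here is a minimal sketch of TF-IDF weighting and cosine similarity using only the standard library. It uses one common TF-IDF variant (raw term count times log(N / document frequency)); other variants exist. The k-means step is left out, and the function names and sample headlines are assumptions.

```python
import math
import re
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors as dicts: weight = count * log(N / doc frequency)."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical sample headlines.
docs = [
    "Apple unveils the new iPhone",
    "The new iPhone is unveiled by Apple",
    "The Senate passes the budget bill",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])
sim_02 = cosine(vecs[0], vecs[2])
print(sim_01 > sim_02)
```

A word such as "the" that appears in every document gets a weight of log(1) = 0, which is why the first and third headlines end up with zero similarity even though they share it.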

I've had success with this paper: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851

There are two great things about it:

1) It's incremental, which suits news. Standard k-means needs the entire data set up front, but news articles arrive over time. Incremental algorithms solve that.
2) It's phrase-based, so it relies on phrases rather than just individual words.
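To show what "incremental" means in practice, here is a minimal single-pass sketch: each arriving article joins the most similar existing cluster if the similarity clears a threshold, and otherwise starts a new cluster. This is not the algorithm from the paper above, just a generic illustration. Word bigrams stand in for "phrases", and the class name, the threshold, and the sample headlines are assumptions.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


class IncrementalClusterer:
    """Single-pass clustering: articles are processed one at a time."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed minimum similarity to join
        self.centroids = []         # summed term counts, one Counter per cluster

    @staticmethod
    def _terms(text):
        # Words plus word bigrams, the bigrams standing in for phrases.
        words = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(words + [" ".join(p) for p in zip(words, words[1:])])

    def add(self, text):
        """Assign one new article and return its cluster id."""
        vec = self._terms(text)
        best_id, best_sim = None, self.threshold
        for cid, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best_id, best_sim = cid, sim
        if best_id is None:
            self.centroids.append(vec)       # start a new cluster
            return len(self.centroids) - 1
        self.centroids[best_id].update(vec)  # fold the article into the cluster
        return best_id


# Hypothetical stream of headlines arriving over time.
stream = [
    "Apple unveils new iPhone at launch event",
    "Senate passes budget bill after long debate",
    "New iPhone launch event draws crowds for Apple",
    "Budget bill debate ends as Senate votes",
]
clusterer = IncrementalClusterer()
assignments = [clusterer.add(text) for text in stream]
print(assignments)
```

Unlike k-means, the number of clusters is not fixed in advance, and earlier assignments are never revisited, so the result depends on arrival order.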

More recently, there have been techniques that use semantic meaning instead of words, for instance by extracting Wikipedia or DBpedia concepts from each article and clustering on those concepts rather than on the raw words.

Octodone