How to tell if two web contents are similar?

Question

Given two HTML sources, I want to first extract the main content from each using something like this. Are there any better libraries? I am specifically looking for Python/JavaScript ones.

Once I have the two extracted contents, I want to return a score between 0 and 1 denoting how similar they are. For example, news articles on the same topic from CNN and BBC should get a high similarity score, and so should webpages for the same product on Amazon.com and Walmart.com. How can I do this? Are there existing libraries that do this already? What are some good libraries I can use? Basically I am looking for a combination of automatic summarization, keyword extraction, named-entity recognition and sentiment-analysis.

score 5 · Accepted Answer · answered Apr 05 '12 at 20:36

There are several questions embedded in yours. For each one I will either point you to a library or suggest algorithms that can solve the task (search for them and you will find many Python implementations).

Point 1. To extract the main content from HTML (http://nltk.googlecode.com/svn/trunk/doc/book/ch03.html) and for other NLP-related tasks, you can check out NLTK. It's written in Python. You can also check out a library called BeautifulSoup; it's awesome (http://www.crummy.com/software/BeautifulSoup/).
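To make the extraction step concrete, here is a minimal sketch that pulls the visible text out of an HTML string. It deliberately uses only the standard library's `html.parser` as a stand-in so it runs without installing anything; it is not the NLTK or BeautifulSoup API. The names `TextExtractor` and `html_to_text` are hypothetical. Note that it returns all visible text, so stripping navigation, ads and other page boilerplate to get the "main content" would still be a separate step.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling `html_to_text(page_source)` on each of the two pages gives the plain strings that the similarity step in Point 2 would work on.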

Point 2. When you say:

Once I have the two extracted contents, I want to return a score between 0 and 1 denoting how similar they are....

For this I suggest clustering your document set with any unsupervised clustering technique. Since your problem falls under distance-metric-based clustering, it should be easy to cluster similar documents and then assign each one a score based on its similarity to the cluster centroid. Try either K-Means or Adaptive Resonance Theory; with the latter you don't need to define the number of clusters in advance. Or, as larsman points out in his comments, you can simply use TF-IDF (http://www.miislita.com/term-vector/term-vector-3.html) and compare the resulting term vectors, e.g. by cosine similarity.
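As a rough illustration of the TF-IDF route, the sketch below weights each document's terms and scores a pair by cosine similarity, which falls between 0 and 1 for non-negative weights. It is standard-library only; the function names, the simple regex tokenizer and the smoothed IDF formula are assumptions made for this example, not something prescribed by the linked tutorial. The IDF is computed over whatever list of documents you pass in, so with only two pages it mostly reduces to term-frequency cosine similarity.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF (an assumption): log((1 + N) / (1 + df)) + 1,
        # which keeps every weight strictly positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors, clamped to [0, 1]."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return min(1.0, dot / (norm_a * norm_b))
```

For two extracted texts `a` and `b`, `va, vb = tfidf_vectors([a, b])` followed by `cosine_similarity(va, vb)` gives the score. Passing a larger background corpus to `tfidf_vectors` would make the IDF weights more meaningful.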

Point 3. When you say:

Basically I am looking for a combination of automatic summarization, keyword extraction, named-entity recognition and sentiment-analysis

For Automatic Summarization use Non Negative Matrix Factorization

For Keyword extraction use NLTK

For Named-Entity Recognition use NLTK

For Sentiment Analysis use NLTK
