Given two HTML sources, I want to first extract the main content from each of them using something like this. Are there better libraries for this? I am specifically looking for Python or JavaScript ones.
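To make the extraction step concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is a crude baseline, not a real main-content extractor: it assumes that dropping `script`, `style`, `nav`, `header`, `footer`, and `aside` elements is a good enough approximation of "main content". The names `TextExtractor` and `extract_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping elements that rarely hold main content."""

    # Assumption: these tags are treated as boilerplate and ignored entirely.
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside any skipped element.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string with boilerplate tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

A dedicated extraction library would presumably handle ads, comment sections, and sidebars far better than this tag blacklist does.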
Once I have the two extracted contents, I want to return a score between 0 and 1 denoting how similar they are. For example, news articles from CNN and BBC covering the same topic should get a high similarity score, and so should pages for the same product on Amazon.com and Walmart.com. How can I do this? Are there existing libraries that already do this, and which ones are good? Basically, I am looking for a combination of automatic summarization, keyword extraction, named-entity recognition, and sentiment analysis.
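As an illustration of the kind of score described above, here is a minimal sketch that uses plain bag-of-words cosine similarity over word counts. This is only an assumed baseline: it uses none of summarization, keyword extraction, named-entity recognition, or sentiment analysis, and the function names `tokenize` and `cosine_similarity` are made up for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Return a score in [0, 1]: the cosine of the two word-count vectors."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    if not counts_a or not counts_b:
        return 0.0  # an empty document is treated as similar to nothing

    # Dot product over the words the two documents share.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # Clamp to guard against floating-point results slightly above 1.
    return min(1.0, dot / (norm_a * norm_b))
```

Because word counts are never negative, the result stays between 0 and 1. Raw counts let common words such as "the" dominate, so some form of stop-word removal or TF-IDF weighting would likely be needed before the score is meaningful across sites.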