I'm writing a program that reads a general HTML "article" page (Wikipedia, NY Times, Yahoo News, etc.). From that page I want to strip away all of the "noise": ads, header bars, anything that isn't part of the article content. To put it another way, I want to keep only the most important parts (main content, title, author).
I'm trying to come up with a clever way to find the main content of an article. I have a few ideas, but they aren't exactly what I want. I do not want to parse every node in the DOM. My current idea involves using the size of the elements.
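To make the size idea concrete, here is a rough sketch of what I have in mind. It assumes "size" means the amount of text directly inside a container element, and it uses Python's standard `html.parser` to stream through the tags rather than build and walk a DOM tree. The names `LargestBlockFinder` and `largest_text_block`, and the tag lists, are my own placeholders, not an established approach.

```python
from html.parser import HTMLParser

# Assumed tag lists for illustration; real pages would need tuning.
CONTAINERS = {"div", "article", "section", "main", "td"}
SKIP = {"script", "style", "nav", "header", "footer", "aside"}


class LargestBlockFinder(HTMLParser):
    """Collects text per container element without building a DOM."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one list of text chunks per container seen
        self.stack = []       # indices into self.blocks for open containers
        self.skip_depth = 0   # > 0 while inside a SKIP element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self.blocks.append([])
            self.stack.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in CONTAINERS and self.stack:
            # Assumes reasonably well-nested markup; no tag-name matching.
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Credit text only to the nearest open container, so <body>-level
        # wrappers do not automatically win.
        if text and not self.skip_depth and self.stack:
            self.blocks[self.stack[-1]].append(text)


def largest_text_block(html):
    """Return the text of the container holding the most text."""
    finder = LargestBlockFinder()
    finder.feed(html)
    finder.close()
    texts = [" ".join(chunks) for chunks in finder.blocks]
    return max(texts, key=len, default="")
```

This still scans every tag once, but it never builds or traverses a tree. Its obvious weakness is that an article split across many small sibling containers would lose to one large block of noise, which is part of why I'm not satisfied with it.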
Any ideas are appreciated. At its core, this is a design question.
Thanks.