I'm writing a program that reads a general HTML "article" page (Wikipedia, NY Times, Yahoo News, etc.). From that page I want to strip away all of the "noise" (ads, header bars, anything that isn't part of the article content). Put another way, I want to keep only the most important parts: the main content, title, and author.

I'm trying to come up with a clever way to find the main content of an article. I have a few ideas, but they aren't exactly what I want. I do not want to parse every node in the DOM. My current idea involves using the size of the elements.
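
To make the size idea concrete, here is a minimal sketch in Python, using only the standard library's `html.parser`. It assumes that article text lives in `<p>` tags and that the container holding the most paragraph text is the main content. The names `ParagraphScorer`, `main_content`, `CONTAINERS`, and `SAMPLE` are hypothetical, and the heuristic itself is an assumption, not a proven method.

```python
from html.parser import HTMLParser

# Elements treated as candidate "content containers" (an assumption).
CONTAINERS = {"div", "article", "section", "main", "td", "body"}


class ParagraphScorer(HTMLParser):
    """Credits <p> text to the nearest enclosing container element."""

    def __init__(self):
        super().__init__()
        self.stack = []    # indices into self.blocks for currently open containers
        self.blocks = []   # one record per container seen
        self.in_p = False  # True while inside a <p>

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.in_p = False  # a new container implicitly ends an open <p>
            self.blocks.append({"tag": tag, "text": []})
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False
        elif tag in CONTAINERS:
            self.in_p = False
            # Pop back to the most recent open container with this tag name.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.blocks[self.stack[i]]["tag"] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        text = data.strip()
        if text and self.in_p and self.stack:
            self.blocks[self.stack[-1]]["text"].append(text)


def main_content(html):
    """Return the paragraph text of the container with the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    best = max(
        scorer.blocks,
        key=lambda b: sum(len(t) for t in b["text"]),
        default=None,
    )
    if best is None or not best["text"]:
        return ""
    return " ".join(best["text"])


SAMPLE = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="story">
  <h1>Title</h1>
  <p>First paragraph of the article body, long enough to matter.</p>
  <p>Second paragraph with a <a href="#">link</a> inside it.</p>
</div>
<div id="footer"><p>Copyright</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(main_content(SAMPLE))
```

This streams over the tags without building a DOM tree, and only paragraph text is scored, so navigation links and headers contribute nothing. It would likely misfire on pages that put body text in bare `<div>` or `<br>`-separated blocks.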

Any ideas are appreciated. At its core, this is a design question.

Thanks.

  • What is your platform? Browser? node.js? And, what does the content look like (we need to see the HTML)? Scraping content from HTML does not have a generic solution. All solutions are dependent upon understanding how the particular target site organizes its content. – jfriend00 May 29 '15 at 03:27
  • You might be interested in reading more about HTML5's semantic tags. More info: http://diveintohtml5.info/semantics.html – johnnyRose May 29 '15 at 03:27
  • Also, why don't you use an HTML parser and then analyze the parsed document? You're making the job hundreds of times harder if you're not going to parse the HTML. – jfriend00 May 29 '15 at 03:28
  • Thanks @johnnyRose. Your comment was the most helpful for HTML5 pages. Generic HTML4 scraping is pretty tough, but I'm still working on it. – Damon Williams Jul 10 '15 at 17:54
  • Glad to hear! Hope you got it all figured out. – johnnyRose Jul 10 '15 at 18:37
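
The parse-then-analyze route suggested in the comments can be sketched as follows for HTML5 pages. This is a minimal sketch in Python with the standard library's `html.parser`, assuming the page wraps its main content in an HTML5 `<article>` element (an assumption; many HTML4 pages will not, in which case it returns an empty string). `ArticleText` and `extract_article` are hypothetical names.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collects text found inside <article> elements, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while inside an <article>
        self.skip_depth = 0     # > 0 while inside <script> or <style>
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth and not self.skip_depth:
            self.parts.append(text)


def extract_article(html):
    """Return the visible text inside <article>, or "" if there is none."""
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For example, `extract_article("<nav>Menu</nav><article><p>Body text.</p></article>")` keeps the paragraph text and drops the navigation text.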

1 Answer

I think writing a parser yourself is probably too complicated. Pages often have poor markup with no semantic elements, among other problems.

What you could do is use the Parser API from Readability. If you are using Node.js, you can make an http.get request; if you are using JavaScript in the browser, you can make an Ajax request to the API.

  • "Give up" is not a valid option. I'm doing this; it's only a matter of how. – Damon Williams Jun 30 '15 at 21:30
  • Of course you can do it. The same way I can build my own house, my own car or whatever. Just because I can do it in theory doesn't mean I have to do it in practice. That's why the Readability API might be of great use because you can use their expertise and instead concentrate on your actual product. –  Jul 01 '15 at 02:49