Libraries/Tools for Website Parsing

Question

I would like to start parsing large numbers of raw HTML pages into semantic data structures.

I'm interested in the community's opinion on the tools available for this task, particularly useful libraries in any language.

So far I'm planning to use Hadoop to manage much of the processing, but I'm curious about alternatives.

What do you mean by 'parse HTML into semantic data structures?' — bmargulies, Sep 12 '10 at 00:25
Write programs that read a particular HTML page and pick out particular elements for storage in some native data structures. — Kevin, Sep 12 '10 at 19:04

score 0 · Answer 1 · answered Jan 16 '12 at 07:32

First you need to download the page source and then build a DOM tree from it. If you are coding in C#, you can use the following tools to build the DOM tree:

1) http://htmlagilitypack.codeplex.com/
2) http://www.majestic12.co.uk/projects/html_parser.php

The first one is easy to use, but the second one is much faster and more memory-friendly, so I suggest using the second one if you want to build a robust application.
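To make the "build a DOM tree, then pick out elements" step concrete, here is a rough sketch in Python using only the standard library's html.parser module. It does not use either of the C# libraries listed above, and the names Node, TreeBuilder and parse_html are made up for this illustration. Downloading is left out; the page source is assumed to already be in a string.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal tree node (illustrative, not a full DOM implementation)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []  # child Node objects, in document order
        self.text = []      # stripped text chunks directly inside this node

    def find_all(self, tag):
        """Yield every descendant with the given tag name, in document order."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag name;
        # a closing tag with no matching open element is ignored.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse_html(source):
    """Parse an HTML string and return the root Node of the tree."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


page = ("<html><body><h1>Title</h1>"
        "<p>See <a href='/a'>A</a> and <a href='/b'>B</a>.<br></p>"
        "</body></html>")
tree = parse_html(page)
links = [a.attrs.get("href") for a in tree.find_all("a")]
print(links)
```

This is deliberately forgiving rather than standards-compliant: it does not apply the HTML5 error-recovery rules, so for messy real-world pages a dedicated parser library is the safer choice.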

Then you can extract useful content from the web page using:

http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html

You can find many other articles on extracting content from a web page by Googling "extract main content from web page".
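As a rough idea of what main-content extraction can look like, here is a crude heuristic in Python (standard library only). It is not the method from the article linked above: it simply splits the page into text blocks at block-level tags and keeps blocks that are reasonably long and not mostly link text. The names BlockCollector and extract_main_text and the thresholds min_chars and max_link_density are arbitrary choices for this sketch.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries (an assumed, incomplete list).
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "body",
              "h1", "h2", "h3", "nav", "header", "footer"}
SKIP_TAGS = {"script", "style"}  # their content is never visible text


class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []      # (text, link_chars) per finished block
        self._text = []       # raw text chunks of the block being built
        self._link_chars = 0  # characters of that text inside <a> tags
        self._in_link = 0
        self._skip = 0

    def flush(self):
        """Close the current block and start a new one."""
        text = " ".join("".join(self._text).split())  # normalise whitespace
        if text:
            self.blocks.append((text, self._link_chars))
        self._text, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def extract_main_text(source, min_chars=40, max_link_density=0.5):
    """Return the blocks that look like running text, joined by blank lines."""
    collector = BlockCollector()
    collector.feed(source)
    collector.close()
    collector.flush()
    kept = [text for text, link_chars in collector.blocks
            if len(text) >= min_chars
            and link_chars / len(text) <= max_link_density]
    return "\n\n".join(kept)


page = """<html><head><style>p {color: red}</style></head><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a> <a href="/contact">Contact us today</a></div>
<p>This is the first paragraph of the article, long enough to pass the length threshold.</p>
<p>A second paragraph with a <a href="/ref">reference</a> that is mostly plain running text.</p>
<div id="footer">Copyright</div>
</body></html>"""
main_text = extract_main_text(page)
print(main_text)
```

On the sample page the navigation links and the short footer are dropped and the two paragraphs are kept. The thresholds will need tuning per site, and short but meaningful blocks (headlines, captions) are lost with this approach.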

Hope it helps
