
I would like to start parsing large numbers of raw HTML pages into semantic data structures.

I'm interested in the community's opinion on the tools available for this task, particularly useful libraries in any language.

So far I'm planning to use Hadoop to manage much of the processing, but I'm curious about alternatives.

Kevin

1 Answer


First you need to download the page source and then build a DOM tree from it. If you are coding in C#, you can use either of the following tools to build the DOM tree:

1) http://htmlagilitypack.codeplex.com/
2) http://www.majestic12.co.uk/projects/html_parser.php

The first one is easier to use, but the second one is much faster and more memory-friendly, so I suggest the second one if you want to build a robust application.
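If you just want to see what "build a DOM tree" means before picking a library, here is a minimal sketch. It uses Python's standard `html.parser` as a stand-in, not either of the C# libraries above, and the `Node`, `TreeBuilder` and `parse_html` names are made up for illustration. It parses a string that is assumed to be already downloaded, and it does far less error recovery than a real parser.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal DOM node: tag name, attributes, children, and text chunks."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = []

    def find_all(self, tag):
        """Yield every descendant with the given tag name, in document order."""
        for child in self.children:
            if child.tag == tag:
                yield child
            yield from child.find_all(tag)


class TreeBuilder(HTMLParser):
    """Turn the parser's start/end/data events into a tree of Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def parse_html(source):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root


page = "<html><body><h1>Title</h1><p>Hello <a href='/x'>link</a><br>world</p></body></html>"
dom = parse_html(page)
links = [a.attrs["href"] for a in dom.find_all("a")]
```

Once you have the tree, extracting things like links or headings is a matter of walking it, which is the same thing you would do with the node objects that a real DOM library gives you.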

Then you can extract the useful content from the web page using:

http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html

You can find many other articles on extracting content from a web page by Googling "extract main content from web page".
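To give an idea of what main-content extraction can look like, here is a deliberately crude sketch in Python. It is my own simple heuristic, not the method from the linked article: it groups `<p>` text by the nearest enclosing container and keeps the container with the most paragraph text, skipping obvious page chrome. The tag sets and the `MainContentFinder` and `extract_main_text` names are assumptions for illustration, and real pages will need something more careful.

```python
from html.parser import HTMLParser

CONTAINERS = {"article", "section", "div", "td", "body"}
SKIP = {"script", "style", "nav", "header", "footer", "aside"}


class MainContentFinder(HTMLParser):
    """Collect <p> text per enclosing container so the largest can be picked."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open containers as (tag, bucket), innermost last
        self.buckets = []      # one list of paragraphs per container seen
        self.skip_depth = 0    # > 0 while inside a SKIP element
        self.in_p = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in CONTAINERS:
            self._flush()
            bucket = []
            self.buckets.append(bucket)
            self.stack.append((tag, bucket))
        elif tag == "p" and self.stack and not self.skip_depth:
            self._flush()
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "p":
            self._flush()
        elif tag in CONTAINERS and self.stack and self.stack[-1][0] == tag:
            self._flush()
            self.stack.pop()

    def handle_data(self, data):
        if self.in_p and not self.skip_depth:
            self.parts.append(data)

    def _flush(self):
        """Close the current paragraph and file it under the innermost container."""
        text = " ".join("".join(self.parts).split())
        if self.in_p and text and self.stack:
            self.stack[-1][1].append(text)
        self.in_p = False
        self.parts = []


def extract_main_text(source):
    finder = MainContentFinder()
    finder.feed(source)
    finder.close()
    finder._flush()  # keep a trailing paragraph that was never closed
    best = max(finder.buckets, key=lambda b: sum(len(p) for p in b), default=[])
    return "\n\n".join(best)


sample = """
<html><body>
  <nav><p>Home | About | Contact</p></nav>
  <div id="sidebar"><p>Related links</p></div>
  <div id="post">
    <p>Parsing HTML at scale starts with a reliable DOM.</p>
    <p>After that, the <b>main text</b> can be separated from the page chrome.</p>
  </div>
  <footer><p>Copyright notice</p></footer>
</body></html>
"""
main_text = extract_main_text(sample)
```

On the sample page this keeps the two paragraphs in the `post` div and drops the navigation, sidebar and footer text.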

Hope it helps

Ehsan