I am planning to use web crawling in an application I am currently working on. I did some research on Nutch and ran some preliminary tests with it, but then I came across Scrapy. From my preliminary research and a read through the Scrapy documentation, I got the impression that it can only capture structured data (you have to specify the element, e.g. the div, from which you want to extract data). The backend of the application I am developing is Python based, and I understand Scrapy is also written in Python; some people have suggested that Scrapy is better than Nutch.
My requirement is to capture the data from more than 1,000 different web pages and then search that information for relevant keywords (a rough sketch of what I have in mind is below). Is there any way Scrapy can satisfy this requirement?
1) If yes, can you point me to an example of how it can be done?
2) Or is Nutch + Solr better suited to my requirement?
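To make the requirement concrete, here is a rough sketch of the kind of spider I am hoping Scrapy supports. The spider name, start URLs, and keyword list are placeholders I made up, and pulling all body text via XPath is just one way I imagine doing it:

```python
import scrapy


class KeywordSpider(scrapy.Spider):
    # Placeholder name and URLs; in reality start_urls would hold my ~1000 pages
    name = "keyword_spider"
    start_urls = ["https://example.com/page1", "https://example.com/page2"]
    keywords = ["keyword1", "keyword2"]  # the terms I want to search for

    def parse(self, response):
        # Grab all visible text from the page instead of one specific div
        page_text = " ".join(response.xpath("//body//text()").getall()).lower()
        # Record which of my keywords appear on this page
        matched = [kw for kw in self.keywords if kw.lower() in page_text]
        if matched:
            yield {"url": response.url, "keywords_found": matched}
```

I imagine running it with something like `scrapy runspider keyword_spider.py -o results.json`. If this is roughly the right approach, a pointer to a fuller example would help.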