
We are trying to use StormCrawler to crawl data. We have been able to find the sub-links from a URL, but we want to get the contents of those sub-links. I have not been able to find many resources that explain how to do this. Any useful links/websites in this regard would be helpful. Thanks.

Ravi Ranjan

1 Answer


The Getting Started page, the presentations and talks, as well as the various blog posts should be useful.

If the sublinks are fetched and parsed (which you can check in the logs), then their content will be available for indexing or storing, e.g. as WARC. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
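
As a rough illustration of where the dummy indexer fits, here is a minimal topology sketch modelled on what the archetype generates (StormCrawler 1.x class names assumed; the seed URL and topology name are placeholders):

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    import com.digitalpebble.stormcrawler.ConfigurableTopology;
    import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
    import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
    import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
    import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
    import com.digitalpebble.stormcrawler.spout.MemorySpout;

    public class CrawlTopology extends ConfigurableTopology {

        public static void main(String[] args) throws Exception {
            ConfigurableTopology.start(new CrawlTopology(), args);
        }

        @Override
        protected int run(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();

            // In-memory spout holding the seed URL(s); placeholder seed
            builder.setSpout("spout", new MemorySpout("http://example.com/"));

            // Groups URLs (e.g. by host) so the fetcher can enforce politeness
            builder.setBolt("partitioner", new URLPartitionerBolt())
                   .shuffleGrouping("spout");

            builder.setBolt("fetch", new FetcherBolt())
                   .fieldsGrouping("partitioner", new Fields("key"));

            // Extracts text and outlinks from the fetched pages
            builder.setBolt("parse", new JSoupParserBolt())
                   .localOrShuffleGrouping("fetch");

            // Dummy indexer: dumps the parsed content to the console
            builder.setBolt("index", new StdOutIndexer())
                   .localOrShuffleGrouping("parse");

            return submit("crawl", conf, builder);
        }
    }

The archetype-generated topology also wires a status updater bolt that consumes the status stream, so that outlinks discovered by the parser are fed back into the crawl; it is omitted here for brevity.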

Julien Nioche
  • Hi Julien, as directed I have added the snippet provided on the WARC module page to my CrawlTopology.java file, but when I run mvn clean package I get the following error: "cannot find symbol, symbol: class FileNameFormat, location: class crawler.CrawlTopology", along with many other similar lines. Do I have to add some dependency to pom.xml? – Ravi Ranjan Jan 09 '17 at 10:17
  • Hi. You should add the WARC module to the dependencies: com.digitalpebble.stormcrawler : storm-crawler-warc : ${storm-crawler.version} (see the snippet below). Maybe keep it simple for now and use the dummy indexer; it is already in the core module and does not require additional dependencies. Also, use the code generated by the archetype as a starting point, as this will save you loads of trouble. – Julien Nioche Jan 09 '17 at 10:28
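
For reference, the flattened coordinates in the comment above map to a pom.xml dependency along these lines (assuming the ${storm-crawler.version} property is defined in the pom, as it is in the archetype-generated project):

    <dependency>
        <groupId>com.digitalpebble.stormcrawler</groupId>
        <artifactId>storm-crawler-warc</artifactId>
        <version>${storm-crawler.version}</version>
    </dependency>

Adding this dependency brings in the classes the WARC snippet refers to, such as FileNameFormat, which is why the "cannot find symbol" errors appear without it.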