We are trying to use StormCrawler to crawl data. We have been able to extract the sub-links from a URL, but now we want to get the content of those sub-links. I have not been able to find many resources explaining how to do this. Any useful links/websites in this regard would be helpful. Thanks.
1 Answer
The Getting Started page, the presentations and talks, as well as the various blog posts should be useful.
If the sub-links are fetched and parsed, which you can check in the logs, then their content will be available for indexing or storing, e.g. as WARC files. There is a dummy indexer which dumps the content to the console and can be taken as a starting point; alternatively, there are resources for indexing the documents in Elasticsearch or SOLR. The WARC module can be used to store the content of the pages as well.
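For illustration, here is a minimal sketch of a topology along the lines of the example generated by the StormCrawler Maven archetype, assuming the 1.x core classes (MemorySpout, FetcherBolt, JSoupParserBolt, StdOutIndexer, StdOutStatusUpdater); the seed URL and component names are placeholders:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

import com.digitalpebble.stormcrawler.ConfigurableTopology;
import com.digitalpebble.stormcrawler.Constants;
import com.digitalpebble.stormcrawler.bolt.FetcherBolt;
import com.digitalpebble.stormcrawler.bolt.JSoupParserBolt;
import com.digitalpebble.stormcrawler.bolt.URLPartitionerBolt;
import com.digitalpebble.stormcrawler.indexing.StdOutIndexer;
import com.digitalpebble.stormcrawler.persistence.StdOutStatusUpdater;
import com.digitalpebble.stormcrawler.spout.MemorySpout;

public class CrawlTopology extends ConfigurableTopology {

    public static void main(String[] args) throws Exception {
        ConfigurableTopology.start(new CrawlTopology(), args);
    }

    @Override
    protected int run(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();

        // Seed URL is a placeholder; a real crawl would use a persistent spout
        builder.setSpout("spout", new MemorySpout(
                new String[] { "http://example.com/" }));

        builder.setBolt("partitioner", new URLPartitionerBolt())
                .shuffleGrouping("spout");

        builder.setBolt("fetch", new FetcherBolt())
                .fieldsGrouping("partitioner", new Fields("key"));

        builder.setBolt("parse", new JSoupParserBolt())
                .localOrShuffleGrouping("fetch");

        // The dummy indexer dumps the parsed text and metadata to the
        // console; swap it for an Elasticsearch or SOLR indexer later
        builder.setBolt("index", new StdOutIndexer())
                .localOrShuffleGrouping("parse");

        // Status updates (discovered sub-links, fetch errors, redirections)
        // are printed to the console as well
        builder.setBolt("status", new StdOutStatusUpdater())
                .localOrShuffleGrouping("fetch", Constants.StatusStreamName)
                .localOrShuffleGrouping("parse", Constants.StatusStreamName)
                .localOrShuffleGrouping("index", Constants.StatusStreamName);

        return submit("crawl", conf, builder);
    }
}
```

Note that with MemorySpout the sub-links discovered by the parser are only printed by the status updater, not fed back into the crawl; a persistent status backend (e.g. Elasticsearch) is needed for a truly recursive crawl.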

– Julien Nioche
-
Hi Julien, as directed I have added the snippet provided on the WARC module page to my CrawlTopology.java file, but when I run mvn clean package I get the following error: cannot find symbol symbol: class FileNameFormat location: class crawler.CrawlTopology, along with many other similar lines. Do I have to add some dependency to pom.xml? – Ravi Ranjan Jan 09 '17 at 10:17
-
Hi. You should add the WARC module to the dependencies: com.digitalpebble.stormcrawler : storm-crawler-warc : ${storm-crawler.version}
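In pom.xml that dependency would look something like this, assuming the storm-crawler.version property is already defined for the other StormCrawler artifacts:

```xml
<dependency>
    <groupId>com.digitalpebble.stormcrawler</groupId>
    <artifactId>storm-crawler-warc</artifactId>
    <version>${storm-crawler.version}</version>
</dependency>
```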