Fetch Common Crawl data using Apache Nutch

Asked Jan 17 '17 at 07:44

Active Jan 17 '17 at 07:44

Viewed 180 times

I found my data on the Common Crawl website and downloaded it from there.

Now I have to fetch that data using Apache Nutch, but I don't know how.

The file is in WARC format.

Sahil Rohila

Could you elaborate a little more on what you are trying to accomplish? Are you trying to import the downloaded WARC data into Apache Nutch, or to crawl some sites using Apache Nutch and then store the crawled data as a WARC file? – Jorge Luis Jan 17 '17 at 20:46
I am trying to import the downloaded WARC data into Apache Nutch. – Sahil Rohila Jan 18 '17 at 09:43
At the moment no such feature is implemented. https://issues.apache.org/jira/browse/NUTCH-2102 mentions the desire to add it later on, but if my memory is not playing tricks on me, it has not been implemented yet. – Jorge Luis Jan 18 '17 at 21:02
Can you please tell me how I can fetch Common Crawl data, which is a WARC file (about 850 MB gzipped), using Apache Nutch? Is it possible? – Sahil Rohila Jan 19 '17 at 06:39
This is not officially supported by Nutch at the moment; some initial work has been done here (https://github.com/DigitalPebble/nutch/find/warcjn). Perhaps you can obtain a patch of the WARCImporter class (and its dependencies) and apply it to the Nutch source code to get this feature. – Jorge Luis Jan 21 '17 at 21:29
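
Before attempting any import, it can help to check what the downloaded file actually contains. The following is only a minimal sketch in plain Python (standard library only); it is not Nutch code and does not use the WARCImporter class mentioned above. It assumes the file is a gzipped WARC whose records each carry a Content-Length header, and the names iter_warc_records and summarize are hypothetical helpers introduced here for illustration.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Simplification: header names are matched exactly as written here.
        length = int(headers.get("Content-Length", 0))
        payload = stream.read(length)
        yield headers, payload


def summarize(path, limit=5):
    """Return (WARC-Type, WARC-Target-URI, payload size) for the first few records."""
    rows = []
    # gzip.open also reads files made of several concatenated gzip members.
    with gzip.open(path, "rb") as f:
        for headers, payload in iter_warc_records(f):
            rows.append(
                (headers.get("WARC-Type"), headers.get("WARC-Target-URI"), len(payload))
            )
            if len(rows) >= limit:
                break
    return rows
```

Calling summarize("crawl-segment.warc.gz") (a placeholder file name) returns a short list of record types, target URIs, and payload sizes, which is enough to confirm that the download is a readable WARC file before trying to feed it to any importer.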

0 Answers