
I have crawled a list of websites using Nutch 1.12. I can dump the crawl data into separate HTML files by using:

./bin/nutch dump -segment crawl/segments/ -outputDir nameOfDir

And into a single WARC file by using:

./bin/nutch warc crawl/warcs crawl/segments/nameOfSegment

But how can I dump the collected data into multiple WARC files, one for each webpage crawled?


2 Answers


After quite a few attempts, I found that

./bin/nutch commoncrawldump -outputDir nameOfOutputDir -segment crawl/segments/segmentDir -warc

does exactly what I needed: a full dump of the segment into individual WARC files!
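If the crawl produced more than one segment under crawl/segments/, a small shell loop runs the same dump once per segment. This is just a sketch, assuming one subdirectory per segment and writing each dump into its own output directory:

# Dump every segment into its own set of WARC files,
# one output directory per segment.
for seg in crawl/segments/*; do
  ./bin/nutch commoncrawldump -outputDir "nameOfOutputDir/$(basename "$seg")" -segment "$seg" -warc
done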


It sounds a bit wasteful to have one WARC file per document, but here you go: you could set a low value for 'warc.output.segment.size' so that the files get rotated every time a new document is written. WarcExporter uses https://github.com/ept/warc-hadoop under the bonnet; the config is applied there.
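For example, the property can be overridden in conf/nutch-site.xml. This is an untested sketch; as far as I recall, the warc-hadoop writer treats the value as a byte threshold (default around 1 GB), so a value of 1 should force a new WARC file for each record written:

<!-- conf/nutch-site.xml: assumes warc.output.segment.size is a byte threshold -->
<property>
  <name>warc.output.segment.size</name>
  <value>1</value>
  <description>Rotate the output WARC file once it exceeds this many
  bytes; a tiny value yields roughly one WARC file per document.</description>
</property>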
