
I have successfully crawled a website using Nutch and now I want to create a WARC file from the results. However, both the warc and commoncrawldump commands fail, while running bin/nutch dump -segment .... on the same segment folder works successfully.

I am using Nutch v1.17 and running:

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments

The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/, despite having just run a crawl there.
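For context, a segments directory normally holds one timestamped subdirectory per fetch cycle, and a complete segment contains six part directories. A quick way to check, assuming the crawl/ layout from the command above (the timestamp shown is illustrative):

ls crawl/segments/
# 20200915102800        <- one timestamped directory per fetch cycle

ls crawl/segments/20200915102800/
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text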

  • I have found test segment data that I am able to run commoncrawldump on successfully; however, I am still unsure of the difference between the two folders of segment data. – cc100 Sep 15 '20 at 10:28

1 Answer


Inside the segments folder were segments from a previous crawl, and these were causing the error. They did not contain all of the segment data; I believe that crawl was cancelled or finished early. This caused the entire process to fail. Deleting those incomplete segments and starting anew fixed the issue.
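A minimal sketch of how one might find the incomplete segments before deleting anything, assuming the crawl/ layout from the question; it simply reports any segment missing one of the six standard subdirectories:

for seg in crawl/segments/*/; do
  for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    # A segment left behind by a cancelled crawl is typically missing
    # one or more of these part directories.
    if [ ! -d "$seg$part" ]; then
      echo "incomplete segment: $seg (missing $part)"
    fi
  done
done

Any segment flagged here can be removed (or moved aside) so that commoncrawldump only sees complete segments.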
