
I have successfully crawled a website using Nutch and now I want to create a WARC file from the results. However, both the warc and commoncrawldump commands fail, while running bin/nutch dump -segment .... on the same segment folder works successfully.

I am using Nutch v1.17 and running:

bin/nutch commoncrawldump -outputDir output/ -segment crawl/segments

The error from hadoop.log is ERROR tools.CommonCrawlDataDumper - No segment directories found in my/path/, despite having just run a crawl there.
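For context, a segments directory normally holds one timestamped subdirectory per fetch cycle, and a complete segment contains six part directories. A quick way to check, assuming the crawl/ layout from the command above (the timestamp shown is illustrative):

ls crawl/segments/
# 20200915102800        <- one timestamped directory per fetch cycle

ls crawl/segments/20200915102800/
# content  crawl_fetch  crawl_generate  crawl_parse  parse_data  parse_text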

  • I have found test segment data that I am able to run commoncrawldump on successfully; however, I am still unsure of the difference between the two folders of segment data. – cc100 Sep 15 '20 at 10:28

1 Answer


Inside the segments folder were segments from a previous crawl, and these were causing the error. They did not contain all of the segment data; I believe that crawl was cancelled or finished early. This caused the entire process to fail. Deleting those incomplete segments and starting anew fixed the issue.
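A minimal sketch of how one might find the incomplete segments before deleting anything, assuming the crawl/ layout from the question; it simply reports any segment missing one of the six standard subdirectories:

for seg in crawl/segments/*/; do
  for part in content crawl_fetch crawl_generate crawl_parse parse_data parse_text; do
    # A segment left behind by a cancelled crawl is typically missing
    # one or more of these part directories.
    if [ ! -d "$seg$part" ]; then
      echo "incomplete segment: $seg (missing $part)"
    fi
  done
done

Any segment flagged here can be removed (or moved aside) so that commoncrawldump only sees complete segments.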
