Error in generating Behemoth corpus

Question

I am new to Hadoop and Behemoth, and I followed the tutorial at https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a Behemoth corpus for a text document, using the following command:

    sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /home/madhumita/Documents/testFile -o /home/madhumita/behemoth/testGateOpCorpus

I am getting the error:

    ERROR util.CorpusGenerator: Input does not exist : /home/madhumita/Documents/testFile

every time I run the command, even though I have checked with gedit that the path is correct. I searched online for similar issues but could not find any. Any ideas why this is happening? If the .txt file format is not acceptable, what format is required?

score 1 · Answer 1 · answered Mar 18 '13 at 14:54

Okay, I managed to solve the problem. The input path has to be the path to the file on the Hadoop Distributed File System (HDFS), not on the local machine.

So first I copied the local file to /docs/test.txt on HDFS and then gave this path as the input parameter. The commands are as follows:

    sudo bin/hadoop fs -copyFromLocal /home/madhumita/Documents/testFile/test.txt /docs/test.txt

    sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /docs/test.txt -o /docs/behemoth/test

This solves the issue. Thanks to everyone who tried to solve the problem.
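
One way to confirm this diagnosis is to ask HDFS whether the input path exists before launching the job. The following is only a minimal sketch: it assumes the `hadoop` launcher is on the PATH (the commands above call it as `bin/hadoop` from the install directory), and `hdfs_input_exists` and `check_input` are hypothetical helper names, not part of Hadoop or Behemoth.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given path exists on the default
# Hadoop filesystem (HDFS in a typical cluster setup).
# Assumption: the `hadoop` launcher is on the PATH.
hdfs_input_exists() {
    # `hadoop fs -test -e PATH` exits with status 0 when PATH exists.
    hadoop fs -test -e "$1"
}

# Hypothetical wrapper: reports whether the input is there and returns
# a non-zero status when it is missing.
check_input() {
    if hdfs_input_exists "$1"; then
        echo "found on HDFS: $1"
    else
        echo "not on HDFS: $1 (copy it with 'hadoop fs -copyFromLocal' first)"
        return 1
    fi
}
```

Calling `check_input /docs/test.txt` before running `CorpusGenerator` should show whether the copy step worked. A path that exists only on the local disk would be reported as missing.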

score 0 · Answer 2 · answered Dec 04 '14 at 11:02

To generate a Behemoth corpus directly from the local filesystem, refer to the input using the file protocol (file:///):

    hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "file:///home/madhumita/Documents/testFile/test.txt" -o "/docs/behemoth/test"
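
If you would rather not type the URI by hand, a small shell function can build it from a local path. This is just a sketch: `to_file_uri` is a hypothetical helper name, and it does not percent-encode spaces or other special characters.

```shell
#!/bin/sh
# Hypothetical helper: turn a local path into a file:// URI so that Hadoop
# resolves it against the local filesystem instead of the default one.
# Limitation: spaces and other special characters are not percent-encoded.
to_file_uri() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;            # already absolute
        *)  printf 'file://%s/%s\n' "$PWD" "$1" ;;  # resolve against current dir
    esac
}
```

The input argument could then be written as `-i "$(to_file_uri /home/madhumita/Documents/testFile/test.txt)"`.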
