I am new to Hadoop and Behemoth. I followed the tutorial at https://github.com/DigitalPebble/behemoth/wiki/tutorial to generate a Behemoth corpus from a text document, using the following command:

    sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /home/madhumita/Documents/testFile -o /home/madhumita/behemoth/testGateOpCorpus

I am getting the error:

    ERROR util.CorpusGenerator: Input does not exist : /home/madhumita/Documents/testFile

This happens every time I run the command, even though I have checked with gedit that the path is correct. I searched online for similar issues but could not find any. Any ideas why this might be happening? If the .txt file format is not acceptable, what is the required file format?

madzie

2 Answers

Okay, I managed to solve the problem. The input path has to be a path on the Hadoop Distributed File System (HDFS), not a path on the local machine.

So I first copied the local file to /docs/test.txt on HDFS and passed that path as the input parameter. The commands are as follows:

    sudo bin/hadoop fs -copyFromLocal /home/madhumita/Documents/testFile/test.txt /docs/test.txt

    sudo bin/hadoop jar /home/madhumita/behemoth/core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i /docs/test.txt -o /docs/behemoth/test

This solves the issue. Thanks to everyone who tried to solve the problem.
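
To catch this kind of mistake before submitting the job, the input path can be checked against HDFS first. The following is only a minimal sketch, not something from the Behemoth tutorial: the function names `hdfs_input_exists` and `check_corpus_input` are hypothetical, and it assumes the `hadoop` launcher is on the `PATH` (the commands above call it as `bin/hadoop`). It relies on `hadoop fs -test -e`, which exits with status 0 when the path exists on the default Hadoop filesystem.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given path exists on the
# default Hadoop filesystem (HDFS in a typical cluster setup).
hdfs_input_exists() {
    # `hadoop fs -test -e PATH` exits with status 0 when PATH exists.
    hadoop fs -test -e "$1"
}

# Hypothetical helper: report where the input was looked up and
# return non-zero when it is missing, so the job is not submitted.
check_corpus_input() {
    if hdfs_input_exists "$1"; then
        echo "found on HDFS: $1"
    else
        echo "not on HDFS: $1 (copy it with 'hadoop fs -copyFromLocal' first)"
        return 1
    fi
}
```

For example, `check_corpus_input /docs/test.txt && echo "ok to run CorpusGenerator"` would only print the second message once the file has been copied to HDFS.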

madzie

To generate a Behemoth corpus directly from the local filesystem, refer to the input using the file protocol (file:///):

    hadoop jar core/target/behemoth-core-*-job.jar com.digitalpebble.behemoth.util.CorpusGenerator -i "file:///home/madhumita/Documents/testFile/test.txt" -o "/docs/behemoth/test"
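
If the local path is held in a variable, the `file://` prefix can be added by a small helper. This is a sketch under stated assumptions: `to_file_uri` is a hypothetical name, it only accepts absolute paths, and it does not percent-encode spaces or other special characters.

```shell
#!/bin/sh
# Hypothetical helper: turn an absolute local path into a file:// URI.
# No escaping is done, so paths with spaces or special characters
# are outside the scope of this sketch.
to_file_uri() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "expected an absolute path: $1" >&2
            return 1 ;;
    esac
}

# Possible usage with the command above (commented out here because it
# needs a working Hadoop and Behemoth installation):
# hadoop jar core/target/behemoth-core-*-job.jar \
#     com.digitalpebble.behemoth.util.CorpusGenerator \
#     -i "$(to_file_uri /home/madhumita/Documents/testFile/test.txt)" \
#     -o /docs/behemoth/test
```

Because an absolute path already starts with `/`, prefixing it with `file://` yields the three slashes seen in `file:///home/...`.
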
Ramanan