I am trying to run my Hadoop program on Amazon Elastic MapReduce (EMR). The program takes an input file from the local filesystem that contains the parameters it needs to run. Because this file is normally read from the local filesystem with a FileInputStream, the task fails in the AWS environment with an error saying the parameter file was not found. Note that I have already uploaded the file to Amazon S3. How can I fix this problem? Thanks. Below is the code I use to read the parameter file and then read the parameters from it.

// Read the parameter file line by line from the local filesystem
FileInputStream fstream = new FileInputStream(path);
DataInputStream datain = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));

String[] args = new String[7];

int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    args[i++] = strLine;
}
Ahmadov
  • Please don't use DataInputStream to read text http://vanillajava.blogspot.co.uk/2012/08/java-memes-which-refuse-to-die.html – Peter Lawrey Jan 30 '13 at 20:50

3 Answers

If you must read the file from the local filesystem, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local path using s3cmd or similar.
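If s3cmd is not an option, the same copy can also be done from Java with Hadoop's FileSystem API. This is only a sketch; the bucket name and paths below are placeholders, not taken from the question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchParameterFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open the S3-backed filesystem (bucket name is a placeholder)
        FileSystem s3 = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
        // Copy the parameter file to the local filesystem before the job starts
        s3.copyToLocalFile(new Path("/my/parameter/file"), new Path("/home/hadoop/params.txt"));
    }
}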

You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
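From there the stream can be read line by line just like the local file in the question; a minimal sketch:

// Wrap the S3-backed stream and read the parameters line by line
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String[] params = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    params[i++] = strLine;
}
br.close();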
Joe K

I have not tried Amazon Elastic MapReduce yet, but this looks like a classic application of the distributed cache. You add the file to the cache using the -files option (if you implement Tool/ToolRunner) or the job.addCacheFile(URI uri) method, and then access it as if it existed on the local filesystem.
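A minimal sketch of the job.addCacheFile approach, assuming the newer org.apache.hadoop.mapreduce API; the S3 URI, the #params symlink name, and the class names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamCacheSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "param-cache-sketch");
        // Ship the parameter file with the job; "#params" names the local symlink
        job.addCacheFile(new URI("s3://my.bucket.name/my/parameter/file#params"));
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }

    public static class MyMapper extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is available in the task's working directory as ./params
            try (BufferedReader br = new BufferedReader(new FileReader("params"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // parse a parameter from each line
                }
            }
        }
    }
}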

Yevgen Yampolskiy

You can add this file to the distributed cache as follows:

...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...

Later, in configure() of your mapper/reducer, you can do the following:

...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    try {
        // getLocalCacheFiles returns the local paths of the cached files
        s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
        FileInputStream fstream = new FileInputStream(s3FilePath.toString());
        ...
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Amar
  • Thanks for the answer. But I don't need to use DistributedCache. I just need to read the parameters off of a file and then start executing my MapReduce job. – Ahmadov Dec 25 '12 at 14:34