I am trying to run my Hadoop program on Amazon Elastic MapReduce (EMR). The program takes an input file from the local filesystem that contains the parameters it needs to run. Because this file is normally read from the local filesystem with a FileInputStream, the task fails in the AWS environment with an error saying the parameter file was not found. Note that I have already uploaded the file to Amazon S3. How can I fix this problem? Thanks. Below is the code I use to read the parameter file and then read the parameters from it.

// Read the parameter file line by line from the local filesystem
FileInputStream fstream = new FileInputStream(path);
DataInputStream datain = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));

String[] args = new String[7];

int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    args[i++] = strLine;
}
Ahmadov
  • Please don't use DataInputStream to read text http://vanillajava.blogspot.co.uk/2012/08/java-memes-which-refuse-to-die.html – Peter Lawrey Jan 30 '13 at 20:50

3 Answers

If you must read the file from the local filesystem, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local path using s3cmd or similar.
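If s3cmd is not an option, the same copy can also be done from Java with Hadoop's FileSystem API. This is only a sketch; the bucket name and paths below are placeholders, not taken from the question:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FetchParameterFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Open the S3-backed filesystem (bucket name is a placeholder)
        FileSystem s3 = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
        // Copy the parameter file to the local filesystem before the job starts
        s3.copyToLocalFile(new Path("/my/parameter/file"), new Path("/home/hadoop/params.txt"));
    }
}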

You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
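From there the stream can be read line by line just like the local file in the question; a minimal sketch:

// Wrap the S3-backed stream and read the parameters line by line
BufferedReader br = new BufferedReader(new InputStreamReader(in));
String[] params = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    params[i++] = strLine;
}
br.close();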
Joe K

I have not tried Amazon Elastic MapReduce yet, but this looks like a classic application of the distributed cache. You add the file to the cache using the -files option (if you implement Tool/ToolRunner) or the job.addCacheFile(URI uri) method, and then access it as if it existed on the local filesystem.
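A minimal sketch of the job.addCacheFile approach, assuming the newer org.apache.hadoop.mapreduce API; the S3 URI, the #params symlink name, and the class names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class ParamCacheSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "param-cache-sketch");
        // Ship the parameter file with the job; "#params" names the local symlink
        job.addCacheFile(new URI("s3://my.bucket.name/my/parameter/file#params"));
        // ... set mapper/reducer, input/output paths, then job.waitForCompletion(true)
    }

    public static class MyMapper extends Mapper<Object, Object, Object, Object> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // The cached file is available in the task's working directory as ./params
            try (BufferedReader br = new BufferedReader(new FileReader("params"))) {
                String line;
                while ((line = br.readLine()) != null) {
                    // parse a parameter from each line
                }
            }
        }
    }
}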

Yevgen Yampolskiy

You can add this file to the distributed cache as follows:

...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...

Later, in configure() of your mapper/reducer, you can do the following:

...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    try {
        // getLocalCacheFiles returns the local paths of the cached files
        s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
        FileInputStream fstream = new FileInputStream(s3FilePath.toString());
        ...
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
}
Amar
  • Thanks for the answer. But I don't need to use DistributedCache. I just need to read the parameters off of a file and then start executing my MapReduce job. – Ahmadov Dec 25 '12 at 14:34