
I want to run a custom jar whose main class runs a chain of MapReduce jobs, with the output of the first job going in as the input of the second job, and so on.

What do I set in FileOutputFormat.setOutputPath()? What path should go there for each job in the chain?

If I specify -outputdir in the arguments, I get a FileAlreadyExistsException. If I don't specify it, then I do not know where the output will land. I want to be able to see the output from every job of the chained MapReduce jobs.

Thanks in advance!

2 Answers


You are likely getting the FileAlreadyExistsException because the output directory already exists before the job runs. Hadoop refuses to overwrite an existing output directory, so make sure to delete the directories that you specify as output for your Hadoop jobs before launching them; otherwise the jobs will fail to start.
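
A minimal sketch of a driver that chains two jobs, assuming the Hadoop 2.x mapreduce API; the class name, stage paths, and job names are placeholders, and the mapper/reducer setup is elided. Each job writes to its own output directory, which is deleted up front if it already exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainDriver {

    // Delete an output directory if it already exists, so a re-run
    // does not fail with FileAlreadyExistsException.
    private static void deleteIfExists(Configuration conf, Path dir) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(dir)) {
            fs.delete(dir, true); // true = recursive
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input  = new Path(args[0]);           // original input
        Path stage1 = new Path(args[1], "stage1"); // first job's output
        Path stage2 = new Path(args[1], "stage2"); // second job's output
        deleteIfExists(conf, stage1);
        deleteIfExists(conf, stage2);

        // Job 1: reads the original input, writes to stage1.
        Job job1 = Job.getInstance(conf, "stage 1");
        job1.setJarByClass(ChainDriver.class);
        // set your mapper/reducer/key/value classes for job 1 here
        FileInputFormat.addInputPath(job1, input);
        FileOutputFormat.setOutputPath(job1, stage1);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: reads job 1's output, writes to stage2.
        Job job2 = Job.getInstance(conf, "stage 2");
        job2.setJarByClass(ChainDriver.class);
        // set your mapper/reducer/key/value classes for job 2 here
        FileInputFormat.addInputPath(job2, stage1);
        FileOutputFormat.setOutputPath(job2, stage2);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

Because stage1 and stage2 are distinct directories, the output of every job in the chain stays on HDFS after the run, so you can inspect each step.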

Suman

Good practice is to take the input and output paths from the command line, as it increases the flexibility of your code: you compile your jar only once, as long as the only changes are to your paths. The same applies on EMR if you launch your cluster and compile your jar there.

For example:

# placeholder HDFS input/output directories
dfs_ip_folder=HDFS_IP_DIR
dfs_op_folder=HDFS_OP_DIR
# pass both paths to the job on the command line
hadoop jar hadoop-examples-*.jar wordcount ${dfs_ip_folder} ${dfs_op_folder}

Note: you have to create dfs_ip_folder and store the input data inside it. dfs_op_folder will be created automatically on HDFS, not on the local file system. To access the HDFS output folder, either copy it to the local file system or cat it, e.g.:

hadoop fs -cat ${dfs_op_folder}/<file_name>
hadoop fs -copyToLocal ${dfs_op_folder} ${your_local_input_dir_path}
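
On the Java side, here is a minimal sketch (the class name is a placeholder) of how those two command-line paths reach the driver; GenericOptionsParser strips the generic Hadoop options first so that only your own arguments remain:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.GenericOptionsParser;

public class PathsFromArgs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Strip generic Hadoop options (-D, -files, ...) so that only
        // the application's own arguments are left over.
        String[] rest = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (rest.length != 2) {
            System.err.println("Usage: <jar> <input dir> <output dir>");
            System.exit(2);
        }
        Path inputDir  = new Path(rest[0]); // e.g. ${dfs_ip_folder}
        Path outputDir = new Path(rest[1]); // e.g. ${dfs_op_folder}
        // ... configure FileInputFormat/FileOutputFormat with these paths ...
    }
}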
mat_vee