
All,

I am working on creating an interface for dealing with some massive data and generating ARFF files for machine learning. I can currently collect the features, but I have no way of associating them with the files they were derived from. I am currently using Dumbo:

def mapper(key, value):
    # do stuff to generate features

Is there any convenient way to determine the name of the file whose contents were passed to the mapper function?

Thanks again. -Sam


2 Answers


If you're able to access the job configuration properties, then the mapreduce.map.input.file property should contain the file name of the current input file.

I'm not sure how you get at these properties in Dumbo/mrjob though. The docs specify that periods (in the conf names) are replaced with underscores, and looking through the source for PipeMapRed.java, it appears that every single job conf property is set as an environment variable - so try accessing an env variable named mapreduce_map_input_file.

http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#Configured+Parameters
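For example, in a Dumbo mapper you could read that environment variable directly. A minimal sketch, assuming the streaming runtime exports the property as described above; older Hadoop releases use the pre-0.21 name map.input.file, hence the fallback:

import os

def mapper(key, value):
    # Hadoop Streaming exports job conf properties as environment
    # variables, with dots replaced by underscores. Fall back to the
    # pre-0.21 property name for older clusters.
    filename = (os.environ.get('mapreduce_map_input_file')
                or os.environ.get('map_input_file'))
    # do stuff to generate features, tagging each with its source file
    yield filename, value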

Chris White

As described in the Dumbo documentation, you can use the -addpath yes option.

-addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
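With that option enabled, the mapper's key arrives as a (path, original_key) tuple, so the source file can be carried through alongside the features. A minimal sketch (the script name and paths below are illustrative):

def mapper(key, value):
    # With -addpath yes, Dumbo replaces each input key with a tuple of
    # (path of the input file, original key).
    path, original_key = key
    # do stuff to generate features from value
    yield path, value

You would then launch the job with something like:

dumbo start features.py -input data/ -output features_out -addpath yes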

Mikhail Shevelev