
All,

I am working on creating an interface for dealing with some massive data and generating ARFF files for machine learning. I can currently collect the features, but I have no way of associating them with the files they were derived from. I am currently using Dumbo:

def mapper(key, value):
    # do stuff to generate features

Is there any convenient way to determine the name of the file whose contents were passed to the mapper function?

Thanks again. -Sam


2 Answers


If you're able to access the job configuration properties, then the mapreduce.map.input.file property should contain the file name of the current input file.

I'm not sure how you get at these properties in Dumbo/mrjob though. The docs specify that periods (in the conf names) are replaced with underscores, and looking through the source for PipeMapRed.java, it appears that every single job conf property is set as an environment variable - so try accessing an env variable named mapreduce_map_input_file.

http://hadoop.apache.org/mapreduce/docs/r0.21.0/mapred_tutorial.html#Configured+Parameters
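For example, in a Dumbo mapper you could read that environment variable directly. A minimal sketch, assuming the streaming runtime exports the property as described above; older Hadoop releases use the pre-0.21 name map.input.file, hence the fallback:

import os

def mapper(key, value):
    # Hadoop Streaming exports job conf properties as environment
    # variables, with dots replaced by underscores. Fall back to the
    # pre-0.21 property name for older clusters.
    filename = (os.environ.get('mapreduce_map_input_file')
                or os.environ.get('map_input_file'))
    # do stuff to generate features, tagging each with its source file
    yield filename, value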

Chris White

As described in the Dumbo documentation, you can use the -addpath yes option.

-addpath yes (replace each input key by a tuple consisting of the path of the corresponding input file and the original key)
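With that option enabled, the mapper's key arrives as a (path, original_key) tuple, so the source file can be carried through alongside the features. A minimal sketch (the script name and paths below are illustrative):

def mapper(key, value):
    # With -addpath yes, Dumbo replaces each input key with a tuple of
    # (path of the input file, original key).
    path, original_key = key
    # do stuff to generate features from value
    yield path, value

You would then launch the job with something like:

dumbo start features.py -input data/ -output features_out -addpath yes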

Mikhail Shevelev