
I have a use-case with Hadoop where I would like my output files to be split by key. At the moment I have the reducer simply output each value in the iterator. For example, here's some Python streaming code:

import sys

# Hadoop streaming delivers reducer input sorted by key, so just emit the value
for line in sys.stdin:
    data = line.rstrip("\n").split("\t")
    print data[1]

This method works for a small dataset (around 4GB). Each output file of the job only contains the values for one key.

However, if I increase the size of the dataset (over 40GB) then each file contains a mixture of keys, in sorted order.

Is there an easier way to solve this? I know that the output will be in sorted order and I could simply do a sequential scan and add to files. But it seems that this shouldn't be necessary since Hadoop sorts and splits the keys for you.

Question may not be the clearest, so I'll clarify if anyone has any comments. Thanks

Shane
  • I am open to it. Could you please tell me some of the solutions? – Shane Feb 20 '13 at 10:12
  • Am I right in thinking you want to create a unique output file for each key? What happens if you have 100,000s of unique keys (and hence 100,000s of output files)? – Chris White Feb 20 '13 at 10:13
  • That would be a problem, but it won't occur. I have control over the input dataset and I'll know roughly the number of keys beforehand. – Shane Feb 20 '13 at 10:28

1 Answer


OK, then create a custom jar implementation of your MapReduce job and use MultipleTextOutputFormat as the OutputFormat, as explained here. You just have to emit the filename (in your case the key) as the key from your reducer and the entire payload as the value, and your data will be written to a file named after your key.
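A minimal sketch of what that can look like with the old org.apache.hadoop.mapred API (the class name KeyBasedOutputFormat and the driver lines below are illustrative assumptions, not taken from the linked example):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    // Routes each record to an output file named after its key.
    public class KeyBasedOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            // all values that share a key end up in the same file, named after the key
            return key.toString();
        }
    }

In the job driver you would then register it as the output format, roughly like this (MyJob is a placeholder class):

    JobConf conf = new JobConf(MyJob.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setOutputFormat(KeyBasedOutputFormat.class);

If you don't want the key repeated inside each file, MultipleOutputFormat also lets you override generateActualKey to strip it from the written record.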

Amar