I am extracting data from a Freebase dump (titles, aliases, type names) into Avro (not yet in this job), using MapReduce streaming with Python.
This job's reducer expects a type's title (which is generally just an object title) and references from the type id to objects. Each record has the form:
id%relation\tvalue
For example:
common.topic%title Topic
common.topic%used_by m.01dyhm
common.topic%used_by m.03x5qm
common.topic%used_by m.04pm6
The reducer emits:
m.01dyhm%type Topic
m.03x5qm%type Topic
m.04pm6%type Topic
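In Python terms, the streaming framework passes everything before the first tab as the key, so a record splits like this:

key, value = "common.topic%used_by\tm.01dyhm".split("\t", 1)
obj_id, relation = key.split("%", 1)
# obj_id == "common.topic", relation == "used_by", value == "m.01dyhm"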
The title precedes the references, so the reducer remembers it and emits the dereferenced records (a simplified sketch of that logic follows below). All records related to one type must therefore be partitioned to the same reducer, and the ordering itself is ensured by key sorting.
Since I am using a composite key, I need to partition the records correctly. I am using KeyFieldBasedPartitioner with the option "-k1,1", and I set the key field separator to "%". This should partition the data on the object identifier, e.g. "common.topic" or "m.01dyhm". But I think my configuration is wrong: the job works with a single reducer (Hortonworks VM), but emits blank files on a 32-node cluster, to which I have no direct access, so I cannot experiment effectively. My guess is that the partitioning is wrong and no matching records meet on a single reducer to be joined.
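To make that concrete, the reducer logic boils down to roughly this (a simplified sketch, stripped down to the title/used_by case from the example above, assuming well-formed records):

#!/usr/bin/env python
# Simplified sketch of job3reducer.py; input lines arrive sorted
# by the full "%"-composite key.
import sys

current_id = None      # object id of the record group being processed
current_title = None   # title remembered for that object id

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    key, value = line.split("\t", 1)
    obj_id, relation = key.split("%", 1)

    if obj_id != current_id:
        current_id = obj_id
        current_title = None   # new group: forget the previous title

    if relation == "title":
        current_title = value  # "title" sorts before "used_by"
    elif relation == "used_by" and current_title is not None:
        # Dereference: emit "<referencing id>%type\t<title>"
        print("%s%%type\t%s" % (value, current_title))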
This is my hadoop command:
hadoop \
jar $streaming \
-D mapred.job.name='Freebase extract - phase 3' \
-D mapreduce.map.output.key.field.separator='%' \
-D mapreduce.partition.keypartitioner.options=-k1,1 \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-input freebase/job1output \
-input freebase/job2output \
-output freebase/job3output \
-mapper "python job3mapper.py" \
-reducer "python job3reducer.py" \
-file job3mapper.py \
-file job3reducer.py
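Since I cannot experiment on the cluster, I tried to reason about the partitioning locally. As far as I understand, KeyFieldBasedPartitioner hashes the selected key fields with a Java-style string hashCode and takes the result modulo the number of reducers. The following emulation (my own approximation, not actual Hadoop code) illustrates my suspicion: if the separator option is not picked up and the whole composite key is hashed instead of just field 1, the %title record and the %used_by records of the same type can land on different reducers:

def java_string_hashcode(s):
    # Approximates java.lang.String.hashCode with 32-bit overflow.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def partition(key_part, num_reducers=32):
    # KeyFieldBasedPartitioner-style: (hash & Integer.MAX_VALUE) % reducers
    return (java_string_hashcode(key_part) & 0x7FFFFFFF) % num_reducers

for key in ("common.topic%title", "common.topic%used_by"):
    obj_id = key.split("%", 1)[0]
    print("%-22s field 1 -> reducer %2d, whole key -> reducer %2d"
          % (key, partition(obj_id), partition(key)))

If the whole key were used, no reducer would ever see both a title and its references, so every reducer would emit nothing, which would match the blank files I am seeing.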
Is my partitioner configuration right? Thanks for any help.