
I know that all the values associated with a key are sent to a single reducer. Could a reducer receive multiple keys at once via its standard input?

My use case is that I split lines into key-value pairs, then send all lines associated with a key to an API. I'm seeing, though, that multiple keys get sent to the API at once.

Here is some example code that my job runs:

Mapper

import sys

def main():
    for line in sys.stdin:
        line = line.rstrip('\n')  # strip the trailing newline before re-emitting
        part1 = get_part1(line)
        part2 = get_part2(line)
        key = '%s - %s' % (part1, part2)
        print '%s\t%s' % (key, line)

Reducer

import sys
import my_module

def main():
    my_module.sent_to_api(sys.stdin)
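Since a streaming reducer reads plain `key\tvalue` lines from its stdin, a quick way to confirm how many keys one reducer actually received is to scan the stream. A small diagnostic sketch (hypothetical, separate from the job above):

```python
import sys

def count_distinct_keys(stream):
    """Count the distinct keys in sorted 'key<TAB>value' reducer input."""
    keys = set()
    for line in stream:
        key, _, _ = line.rstrip('\n').partition('\t')
        keys.add(key)
    return len(keys)

# Run in place of the real reducer to see how many keys it was handed:
#   print(count_distinct_keys(sys.stdin))
```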
  • I presume you're using multiple reducers which are able to run concurrently across a number of machines/cores, so I'd imagine it to be entirely possible for you to be sending multiple keys from the various reducers to the API. – Quetzalcoatl Apr 08 '13 at 14:34
  • Actually, what is happening is that each reducer sends its entire sys.stdin to the API. When I then open the resulting file via the API, it contains multiple keys. Two reducers couldn't have inserted into the same file, so I can only assume that one reducer received multiple keys on its sys.stdin – Shane Apr 08 '13 at 14:39
  • Bear in mind that while all values associated with a single key are sent to a single reducer, that reducer may be getting more than just that one key. Would that explain the situation you are seeing? (Note that if you use anything other than the default partitioner that may not necessarily be the case.) – Quetzalcoatl Apr 08 '13 at 14:45
  • Thanks Quetzalcoatl, this would explain the behavior. I assumed a new process would be started on each reducer per key. Thank you – Shane Apr 08 '13 at 15:14
  • Glad to know that explains it; can you mark my answer as correct so the question is wrapped up nicely? – Quetzalcoatl Apr 08 '13 at 15:16

1 Answer


While all values associated with a single key are sent to a single reducer, that reducer may be getting more than just that one key, hence the appearance of multiple keys in each of the output files.
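Because Hadoop Streaming delivers the reducer's input sorted by key, the reducer can split its stdin on key boundaries itself and make one API call per key. A minimal sketch, assuming `key\tvalue` lines and a hypothetical `send` callable standing in for the question's `my_module.sent_to_api`:

```python
import sys
from itertools import groupby

def dispatch_per_key(stream, send):
    """Split sorted 'key<TAB>value' reducer input on key boundaries and
    invoke send(key, lines) once per key with that key's lines."""
    for key, lines in groupby(stream, key=lambda line: line.split('\t', 1)[0]):
        send(key, list(lines))  # one call per key, not one per reducer

# Usage inside the reducer's main():
#   dispatch_per_key(sys.stdin, lambda key, lines: my_module.sent_to_api(lines))
```

This relies on the input being sorted by key, which the shuffle phase guarantees within a single reducer; `groupby` only merges adjacent runs, so it works without buffering the whole stream.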

  • Is there a setting to have one key per reducer process? – Shane Apr 08 '13 at 15:20
  • There probably is, but it would be better to rework how you write out to this API, as restricting your reducers in this way doesn't seem particularly hadoop-ish. I'm afraid I don't know off the top of my head, but by all means search to see if anyone has asked such a question before, and if not, ask it yourself as a separate question. – Quetzalcoatl Apr 08 '13 at 15:25