
I am trying to partition an input file using AWS EMR. I use a streaming step to read from stdin.
I want to split this file into 2 files based on the values of specific fields from each line of stdin and store the resulting outputs into S3 to be used later. I cannot find any documentation on how to achieve this using python. Can you point me in the right direction? I'd greatly appreciate it.

Thank you

Zihs

1 Answer


I'm not exactly sure what trouble you are having. Here is a good article: http://aws.amazon.com/articles/2294

For your specific question: you want a mapper that reads your file from stdin and splits each line into a key/value pair (the key determining which output file the record lands in), and a reducer that simply passes those pairs through unchanged, a no-op.

Mapper

#!/usr/bin/python
import sys

def main():
    for line in sys.stdin:
        line = line.rstrip('\n')
        key = get_my_key(line)  # your function: pick the field(s) that decide the output file
        value = line
        print('{}\t{}'.format(key, value))

if __name__ == "__main__":
    main()
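For illustration, `get_my_key` (left undefined above, since it depends on your data) might look like the following sketch. It assumes tab-separated input where the second field decides which of the two files a record belongs to; the field index and the `"A"` value are placeholders for your own condition:

```python
def get_my_key(line):
    # Hypothetical example: split the tab-separated line and key on the
    # second field. Records whose second field is "A" get one key, all
    # other records get the other, yielding exactly two partitions.
    fields = line.split('\t')
    return 'match' if fields[1] == 'A' else 'other'
```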

Reducer

#!/usr/bin/python
import sys

def main():
    # No-op: pass each key/value line straight through to the output.
    # (sys.stdout.write avoids the extra newline that print would add,
    # since each line read from stdin still ends in one.)
    for line in sys.stdin:
        sys.stdout.write(line)

if __name__ == "__main__":
    main()

When you add this step, you specify your input, your output (some S3 bucket), and these files as the mapper and reducer.

Note: there is also a configuration option to run with no reducer at all, just the mapper task. I've included both above because you seem to be a beginner.
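As a rough sketch of what that step looks like when submitted as a Hadoop Streaming job (bucket names and paths here are placeholders, and the streaming jar location can vary by EMR version):

```shell
hadoop jar /usr/lib/hadoop/hadoop-streaming.jar \
    -D mapreduce.job.reduces=2 \
    -files s3://my-bucket/scripts/mapper.py,s3://my-bucket/scripts/reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input s3://my-bucket/input/ \
    -output s3://my-bucket/output/
# For a map-only job, pass "-reducer NONE" instead of a reducer script.
```

`-D mapreduce.job.reduces=2` caps the number of reduce tasks, and hence the number of `part-0000N` output files, at two.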

Shane
  • Thank you for the swift reply. If I understand correctly, the mapper determines how the input file is to be split (this is where I write the conditions under which this is true). Where I am confused is how this will create the 2 files I want. The reducer's job is to print out the different key-value pairs to an output file. Will there be one reducer per type of key-value pair, and hence 2 output files? – Zihs Apr 30 '13 at 17:53
  • Hello again, I was able to run your code after some modification. The EMR job returns 3 files (part-00000, part-00001, part-00002). They all contain data that matches the key and value fields I specified in the code. However, I'd like to save all the records matching a specific field value into one file and all the other records into another file. This technique isolates the records I want but ignores the others. Any ideas? – Zihs Apr 30 '13 at 21:07
  • Your `get_key` function should return one key for the field values you want in the first file, and a separate key for all other records – Shane May 01 '13 at 13:54
  • I wrote a simple case that sets the key to one field under some condition and another field otherwise. However, the output still returns 3 files because of the 3 reducers allocated to run the job. I fail to understand how I can specify, within EMR or my Python code, that it should create 2 disjoint files and populate them according to my specifications. – Zihs May 01 '13 at 15:24
  • You have too many reducers then. A reducer gets 1 key, so if you know that you only want 2 keys then you should set it to 2 reducer tasks – Shane May 01 '13 at 17:32
  • Ok that reduced the number of output files to 2 from 3. However my records are still mixed throughout these 2. I keep reading about using the java class MultipleOutputFormat in order to partition key-value pairs throughout multiple files. Almost all the documentation is in Java and involves overriding methods. I cannot seem to find anything in Python that will accomplish the same result – Zihs May 01 '13 at 19:20
  • Hey Zihs, hopefully you have found a solution for this; I am looking for a solution to a similar problem. I'd be grateful if you could share sample code. Thanks – Pooja Mar 26 '16 at 11:33