MRJob and python - .csv file output for Reducer?

Question

I'm using the MRJob module for Python 2.7. I have created a class that inherits from MRJob, and have correctly mapped everything using the inherited mapper function.

The problem is that I would like the reducer function to output a .csv file. Here is the code for the reducer:

def reducer(self, geo_key, info_list):
        info_list.insert(0, ['Name,Age,Gender,Height'])
        for set in info_list:
            yield set

Then I run it from the command line: python -m map_csv <inputfile.txt> outputfile.csv

I keep getting this error, and I don't really understand why:

Counters from step 1:
  Unencodable output:
    TypeError: 785

The info_list parameter in the reducer is simply a list of lists whose values match the types in the header, for example:

[
['Bill', 28, 'Male',75],
['Emily', 16, 'Female',56],
['Jason', 21, 'Male',63]]

Any idea what the problem is here? Thanks!

You shouldn't use `set` as a variable name, but that's not the problem. — ChrisP, Jun 24 '15 at 17:20

msharp · Accepted Answer · 2016-02-22T23:34:04.817


To manage input and output formats in mrjob, you need to use protocols.
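As a rough illustration of what a protocol is: mrjob expects an object with a `read(line)` method that returns a `(key, value)` pair and a `write(key, value)` method that returns one encoded line without a trailing newline. The class below is a hypothetical, simplified sketch (the name `SimpleCsvProtocol` is made up, it assumes Python 3, and it is not the code of any real package); it only shows the shape of the interface.

```python
import csv
import io


class SimpleCsvProtocol(object):
    """Hypothetical stand-in for a CSV protocol, for illustration only."""

    def read(self, line):
        # read() receives one raw line (bytes) and returns (key, value).
        # There is no key in a CSV line, so None is used.
        text = line.decode('utf-8')
        row = next(csv.reader([text]))
        return None, row

    def write(self, key, value):
        # write() receives a key and a value and returns one encoded line
        # without a trailing newline. The key is ignored here.
        buf = io.StringIO()
        csv.writer(buf).writerow(value)
        return buf.getvalue().rstrip('\r\n').encode('utf-8')


protocol = SimpleCsvProtocol()
encoded = protocol.write(None, ['Bill', 28, 'Male', 75])
decoded = protocol.read(encoded)
```

Note that values read back from CSV are always strings, so the numeric fields come back as `'28'` and `'75'` in this sketch.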

Luckily, there is an existing package that implements a CSV protocol you can use: https://pypi.python.org/pypi/mr3px

Import the package in your job script

from mr3px.csvprotocol import CsvProtocol

Specify the protocol in your job class

class CsvOutputJob(MRJob):
    ...
    OUTPUT_PROTOCOL = CsvProtocol  # write output as CSV

And then just yield your list (or tuple) of fields

def reducer(self, geo_key, info_list):
    for row in info_list:
        yield (None, row)

Note that you cannot reliably add a header row to this output because Hadoop will use several reducers to generate the output in parallel.
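If a header is needed, one possible workaround is to add it after the job has finished, while merging the output files. The sketch below assumes the job's output has been copied to a local directory and that the files follow Hadoop's usual `part-*` naming; the function name `merge_parts_with_header` and all paths are hypothetical.

```python
import csv
import glob
import os


def merge_parts_with_header(parts_dir, out_path, header):
    """Write `header` once, then every row from the part-* files in parts_dir.

    Returns the number of part files that were merged.
    """
    part_paths = sorted(glob.glob(os.path.join(parts_dir, 'part-*')))
    with open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(header)  # the header is written exactly once
        for part_path in part_paths:
            with open(part_path, newline='') as part_file:
                for row in csv.reader(part_file):
                    writer.writerow(row)
    return len(part_paths)
```

It could be called as `merge_parts_with_header('output_dir', 'outputfile.csv', ['Name', 'Age', 'Gender', 'Height'])`. Row order across part files is whatever the reducers produced, so only the header position is guaranteed.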

To use this package on EMR, you'll need to install it during the instance bootstrap phase by adding an item to the bootstrap section of your config.

runners:
  emr:
    ...
    bootstrap:
      - sudo apt-get install -y python-setuptools
      - sudo easy_install pip
      - sudo pip install mr3px

Disclaimer: I am the maintainer of the mr3px package, which is forked from mr3po.

edited Feb 22 '16 at 23:34

answered Jul 10 '15 at 03:16

msharp


Thank you for your response. I had the same problem and your solution worked perfectly. One thing: if I run the file on my local machine, it returns the result in the required format, but when I run it on EMR it fails with `ImportError: No module named mr3px.csvprotocol`. Do I need to make some changes to the config file? Thanks – Fahad Sarfraz Feb 21 '16 at 10:18
Yes, you'll need to install the package from pypi in your EMR bootstrap step. In the `bootstrap` section of your config, you'll need something like `- sudo pip install mr3px` – msharp Feb 22 '16 at 22:58
@FahadSarfraz Updated the answer to show how to configure your EMR jobs. – msharp Feb 22 '16 at 23:35
mr3px was very nice. One thing you might want to look at: in my case, CSV cells contain an additional "" at the end. For example, a cell that should contain abc contains abc"" instead. – aghd Mar 03 '19 at 20:49
I was able to install it using: bootstrap: - sudo yum install -y python-setuptools - sudo easy_install pip - sudo pip install mr3px – aghd Mar 04 '19 at 06:45
Glad you found it useful @AminGhaderi - the csv parsing just uses the Python stdlib package under the hood. You may need to tell the protocol class that you expect a different quoting character. Try `OUTPUT_PROTOCOL = CsvProtocol(quotechar="'")` – msharp Mar 05 '19 at 01:40
