I'm trying to implement a very basic word count example with MRJob. Everything works fine with ASCII input, but when I mix Cyrillic words into the input, I get output like this:

"\u043c\u0438\u0440"    1
"again!"    1
"hello" 2
"world" 1

As far as I understand, the first row above is the escaped form of the single occurrence of the Cyrillic word "мир", which is the correct result for my sample input text. Here is the MapReduce code:

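from mrjob.job import MRJob


class MRWordCount(MRJob):

    def mapper(self, key, line):
        # Input lines arrive as cp1251-encoded bytes; decode them to unicode.
        line = line.decode('cp1251').strip()
        words = line.split()
        for term in words:
            yield term, 1

    def reducer(self, term, howmany):
        # Sum the per-occurrence counts emitted by the mapper.
        yield term, sum(howmany)


if __name__ == '__main__':
    MRWordCount.run()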

I'm using Python 2.7 and mrjob 0.4.2 on Windows. My questions are:

a) How do I produce readable Cyrillic output from Cyrillic input?

b) What is the root cause of this behavior? Is it due to the Python/mrjob version, or is it expected to work differently on non-Windows systems? Any clues?

Here is the output of python -c "print u'мир'":

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\encodings\cp866.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to <undefined>
Anton
  • What is the output for the command on your machine: `python -c "print 'мир', u'мир'"`? – jfs Feb 22 '14 at 15:23
  • UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-2: character maps to `<undefined>` – Anton Feb 22 '14 at 15:38
  • try to write the Unicode literal using unicode escapes: `python -c "print 'мир', u'\u043c\u0438\u0440'"`. If it also produces the error then set `PYTHONIOENCODING` envvar to the character encoding that your console uses e.g., `cp1251` or `cp866`. Probably, to avoid issues with printing non-ascii characters to Windows console from Python, `mrjob` calls `.encode('unicode-escape')` or it might use JSON text as input/output (that also uses similar escapes). – jfs Feb 22 '14 at 15:54
  • this python print command works as expected. I also tried `OUTPUT_PROTOCOL = RawValueProtocol` and this explicitly prepended my output with u'' (i.e. u'\u043c\u0438\u0440'). Also the MR script output is not sent to console (i.e. `--no-output`), it's sent to file. – Anton Feb 22 '14 at 18:55
  • it seems everything works as it should. Look at [`mrjob`'s input/output protocols](http://pythonhosted.org/mrjob/protocols.html). The output from your question might be produced by [`JSONProtocol`](http://pythonhosted.org/mrjob/guides/writing-mrjobs.html#writing-protocols). You could pass `ensure_ascii=False` to `json.dumps()` to avoid escaping non-ascii characters. – jfs Feb 22 '14 at 21:24
  • Yes, the output in my question was produced by JSONProtocol, though RawValueProtocol works much in the same way. If I can change this with `ensure_ascii=False` as you wrote, it would be great. Since I'm new to Python, I would appreciate a more detailed instruction on how to accomplish that. – Anton Feb 23 '14 at 11:26
  • follow [the link from my previous comment](http://pythonhosted.org/mrjob/guides/writing-mrjobs.html#writing-protocols), It contains a simplified example implementation of `JSONProtocol`. Just add `ensure_ascii=False` to `json.dumps()` calls and set `OUTPUT_PROTOCOL` to the class you've created. If it is not clear; update your question or ask a new one. – jfs Feb 23 '14 at 11:49
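
What the last comments describe could look roughly like the sketch below. It assumes, as in the simplified `JSONProtocol` from the linked guide, that a protocol is just a class with `read(line)` and `write(key, value)` methods. `UnicodeJSONProtocol` is a made-up name, and the UTF-8 encoding of the output line is an assumption, not something mrjob prescribes.

```python
# -*- coding: utf-8 -*-
import json


class UnicodeJSONProtocol(object):
    """JSON key/value protocol that leaves non-ASCII characters unescaped.

    Hypothetical class modelled on the simplified JSONProtocol in the
    mrjob guide; lines are written as UTF-8 bytes (an assumption).
    """

    def read(self, line):
        # A line is "<json key>\t<json value>", encoded as UTF-8.
        key_json, value_json = line.decode('utf-8').split(u'\t', 1)
        return json.loads(key_json), json.loads(value_json)

    def write(self, key, value):
        # ensure_ascii=False keeps Cyrillic text as-is instead of \uXXXX.
        line = u'%s\t%s' % (json.dumps(key, ensure_ascii=False),
                            json.dumps(value, ensure_ascii=False))
        return line.encode('utf-8')


# In the job class from the question this would be selected with:
#     OUTPUT_PROTOCOL = UnicodeJSONProtocol
```

The output file then has to be opened as UTF-8 (not cp1251 or cp866) for the words to display correctly.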

2 Answers


To print this more readably in Python 2.x, you need to explicitly tell the interpreter that it is a unicode string:

>>> print(u"\u043c\u0438\u0440") # note leading u
мир

To convert your strings into unicode strings, use unicode:

>>> print(unicode("\u043c\u0438\u0440", "unicode_escape"))
мир
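
If the job has already written its output to a file, the same unescaping can be applied line by line afterwards. This is only a sketch: it assumes every output line is a tab-separated JSON key and value, as in the question, and `readable_line`, `convert` and the file paths are hypothetical names.

```python
# -*- coding: utf-8 -*-
import io
import json


def readable_line(line):
    """Turn '"\\u043c\\u0438\\u0440"\\t1' into the same line with real Cyrillic."""
    key_json, value_json = line.rstrip(u'\r\n').split(u'\t', 1)
    key = json.loads(key_json)      # decodes the \uXXXX escapes
    value = json.loads(value_json)
    return u'%s\t%s' % (json.dumps(key, ensure_ascii=False),
                        json.dumps(value, ensure_ascii=False))


def convert(src_path, dst_path):
    """Rewrite an escaped output file as readable UTF-8 text."""
    with io.open(src_path, encoding='utf-8') as src, \
            io.open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(readable_line(line) + u'\n')
```

For example, `convert('wordcount.out', 'wordcount.utf8.txt')` would produce a copy that shows the words correctly in any editor that reads UTF-8.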
jonrsharpe

To print to your console, you need to encode the characters to an encoding your terminal understands. Most of the time that'll be UTF-8: `print u"\u043c\u0438\u0440".encode("utf-8")`, but on Windows you might need to use another one (`cp1251`, maybe?).

Max Noel
  • Thanks, but neither of these produces readable Cyrillic output, whether I print to console or to a file (with the `--no-output` option) – Anton Feb 22 '14 at 19:21