I use a regular expression to match accented vowels and «ñ» in Spanish texts, as follows:

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

Although it works fine on an ordinary string, when I run the MapReduce program it does not handle accented Spanish words such as «acción» properly: the word appears cut off in the output file. There is a line like

acci: 6

instead of:

acción: 6

Here is the Python code. Any suggestions? Thank you.

# -*- coding: utf-8 -*-
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
John Vandenberg

1 Answer

It seems like an encoding problem.
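As a rough illustration of how this kind of mismatch can cut a word, here is a minimal sketch. It assumes Python 3 and only the standard library, and the names BYTE_PATTERN and find_words are hypothetical. It shows one way the symptom can arise (a byte-level pattern built from UTF-8 meeting text in a different encoding); it is not a confirmed diagnosis of the job above.

```python
# -*- coding: utf-8 -*-
import re

# A byte-level pattern: each accented letter contributes its two UTF-8 bytes
# to the character class, rather than one character.
BYTE_PATTERN = re.compile(u"[a-zA-Záéíóúñ]+".encode("utf-8"))


def find_words(raw_bytes):
    """Return the byte runs that the byte-level pattern matches."""
    return BYTE_PATTERN.findall(raw_bytes)


# UTF-8 input: both bytes of "ó" are in the class, so the word stays whole.
utf8_words = find_words(u"acción".encode("utf-8"))

# Latin-1 input: "ó" is the single byte 0xF3, which is not in the class,
# so the match stops after "acci" and restarts at "n".
latin1_words = find_words(u"acción".encode("latin-1"))
```

In the second case the word is split into "acci" and "n", which matches the shape of the truncated key in the question.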

The mrjob documentation suggests using BytesValueProtocol as the input protocol, so that the mapper receives raw bytes and decodes them explicitly:

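from mrjob.job import MRJob
from mrjob.protocol import BytesValueProtocol

class MREncodingEnforcer(MRJob):

    # Pass each input line to the mapper as raw bytes
    INPUT_PROTOCOL = BytesValueProtocol

    def mapper(self, _, value):
        # Decode the bytes explicitly as UTF-8
        value = value.decode('utf_8')
        ...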
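To show what the decode step buys you once the mapper has text, here is a small standalone sketch that leaves mrjob out entirely. It assumes Python 3 and UTF-8 input, and the helper name count_words is hypothetical. It is only meant to show the decode-then-match order, not a drop-in replacement for the job.

```python
# -*- coding: utf-8 -*-
import re

# Text (Unicode) pattern: each accented letter is one character in the class.
WORD_REGEXP = re.compile(u"[a-zA-Záéíóúñ]+")


def count_words(raw_line):
    """Decode a raw UTF-8 line, then count lower-cased words in it."""
    text = raw_line.decode("utf_8")  # bytes -> text, as in the mapper above
    counts = {}
    for word in WORD_REGEXP.findall(text):
        key = word.lower()
        counts[key] = counts.get(key, 0) + 1
    return counts


# Example: the accented word is kept whole as a key.
example = count_words(u"La acción, la acción".encode("utf-8"))
```

Here example maps "la" and "acción" to 2 each. Inside the job, the same order would apply: decode value first, then run findall on the decoded text.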
Arthur Gouveia