Regular expressions in Python MapReduce: counting words with «ñ» and accented vowels

Question

I use a regular expression to match accented vowels and «ñ» in Spanish texts, as follows:

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

Although it works fine on an ordinary string, when I run the MapReduce program it does not handle accented Spanish words such as «acción» properly, and the word appears truncated in the output file. There is a line like

acci: 6

instead of:

acción: 6

Here is the Python code. Any suggestions? Thank you.

# -*- coding: utf-8 -*-
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()

Hmm... `WORD_REGEXP.findall(line)` gives me `['acci', 'instead', 'of', 'acción']`. Isn't that correct? What's the expected output? — Remi Guan, Dec 06 '15 at 09:10
The expected output would have the full key: «acción» instead of «acci». — Alvaro Fierro Clavero, Dec 06 '15 at 18:46

score 0 · Answer 1 · answered Sep 28 '17 at 12:38

This looks like an encoding problem.

The mrjob documentation suggests using BytesValueProtocol as the input protocol, so that the mapper receives raw bytes and decodes them explicitly:

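from mrjob.job import MRJob
from mrjob.protocol import BytesValueProtocol

class MREncodingEnforcer(MRJob):

    # Hand each input line to the mapper as undecoded bytes.
    INPUT_PROTOCOL = BytesValueProtocol

    def mapper(self, _, value):
        # Decode explicitly so the regex sees text, not raw bytes.
        value = value.decode('utf_8')
        ...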
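
To illustrate why the explicit decode matters, here is a minimal sketch in plain Python 3, without mrjob. The helper `extract_words` and the sample sentence are hypothetical, and it is only an assumption that a mismatch of this kind is what truncates the key in the asker's job. It shows that the same character class keeps «acción» whole when the bytes are decoded as UTF-8, and cuts it to «acci» when they are decoded with the wrong encoding.

```python
import re

# Same character class as in the question; in Python 3 this is a text (str) pattern.
WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")


def extract_words(raw_line: bytes, encoding: str = "utf-8") -> list:
    """Decode a raw input line, then return its lowercased words.

    Hypothetical helper for illustration; not part of mrjob.
    """
    text = raw_line.decode(encoding)
    return [word.lower() for word in WORD_REGEXP.findall(text)]


raw = "La acción del niño".encode("utf-8")

# Decoding with the encoding the bytes were written in keeps words whole.
print(extract_words(raw))
# ['la', 'acción', 'del', 'niño']

# Decoding with the wrong encoding turns each accented letter into two
# unrelated characters that are not in the character class, so words split.
print(extract_words(raw, encoding="latin-1"))
# ['la', 'acci', 'n', 'del', 'ni', 'o']
```

Inside a mapper, the same idea would apply to the decoded `value` before calling `WORD_REGEXP.findall`.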
