I use a regular expression to match accented vowels and «ñ» in Spanish text, like this:
WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")
It works fine on ordinary strings, but when I run the MapReduce program, Spanish words with accents such as «acción» are not handled properly: the word appears cut off in the output file. There is a line like
acci: 6
instead of:
acción: 6
Here is the Python code. Any suggestions? Thank you.
# -*- coding: utf-8 -*-
from mrjob.job import MRJob
import re

WORD_REGEXP = re.compile(r"[a-zA-Záéíóúñ]+")


class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        words = WORD_REGEXP.findall(line)
        for word in words:
            yield word.lower(), 1

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == '__main__':
    MRWordFrequencyCount.run()
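
For what it's worth, the same symptom can be reproduced outside mrjob under one assumption: that the pattern and the input line end up in different representations (UTF-8 bytes on one side, decoded text on the other). This is only a guess at the cause, not something confirmed for mrjob itself. The names PATTERN_TEXT, TEXT_REGEXP and MISMATCHED_REGEXP are made up for this sketch, and the mismatch is simulated by reinterpreting the pattern's UTF-8 bytes as Latin-1.

```python
# -*- coding: utf-8 -*-
import re

# The same character class as in the job, written as text.
PATTERN_TEXT = u"[a-zA-Záéíóúñ]+"

# Pattern compiled from text: each accented letter is one character in the class.
TEXT_REGEXP = re.compile(PATTERN_TEXT)

# Simulated mismatch (an assumption, not mrjob's documented behaviour):
# the pattern's UTF-8 bytes are read one byte per character, so the class
# contains the individual byte values rather than the letter "ó".
MISMATCHED_REGEXP = re.compile(PATTERN_TEXT.encode("utf-8").decode("latin-1"))

line = u"la acción"

text_matches = TEXT_REGEXP.findall(line)
mismatched_matches = MISMATCHED_REGEXP.findall(line)

print(text_matches)
print(mismatched_matches)
```

With the mismatched pattern, «acción» is split into «acci» and «n», which looks like the truncated key in the output above. If that is what happens in the job, making sure the pattern and the line are both decoded text (or both bytes in the same encoding) would be the thing to check.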