I am new to MapReduce and MRJob (and, to be honest, to Python as well). I am trying to use MRJob to count how often each pair of letters, from "A" to "E", occurs across two columns of a text file, e.g. ("A", "A") = 10 occurrences, ("A", "B") = 13 occurrences, ("C", "E") = 6 occurrences, and so on. When I run the job I get a "list index out of range" error, and for the life of me I can't figure out why.
Here is a sample of the text file that I feed to the Python MapReduce script containing the mapper and reducer functions. Each line holds a date, a time, the duration of a phone call, the customer ID of the person making the call (it begins with a letter from "A" to "E" that designates a country), the customer ID of the person receiving the call, and key words from the conversation. In my mapper I split the line into a list and pick out the indices I am interested in, but I am not sure whether this approach is correct:
Details
2020-03-05 # 19:28 # 5:10 # A-466 # C-563 # tendremos lindo ahi fuimos derecho carajo junto acabar
2020-03-10 # 05:08 # 5:14 # C-954 # D-353 # carajo calle película acaso voz creía irá san montón ambos hablas empieza estaremos parecía mitad estén vuelto música anoche tendremos tenían dormir habitación encuentra ésa
2020-01-15 # 09:47 # 4:46 # C-413 # B-881 # pudiera dejes querido maestro hacerle llamada paz estados estuviera hablo decirle bonito linda blanco negro querida hacerte dormir empieza mayoría
2020-01-10 # 20:54 # 4:58 # E-027 # A-549 # estuviera tuviste vieja volvió solía alrededor decía maestro estaremos línea sigues
2020-03-17 # 21:38 # 5:21 # C-917 # D-138 # encima música barco tuvimos dejes damas boca
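To make the index choice easier to check, here is a minimal sketch that applies the same substitution as the mapper shown further down (replace "#", "-" and ":" with ";", then split) to the first sample line and prints every token with its index. It runs outside MRJob; the names `sample` and `tokens` are only for illustration.

```python
# Same substitution as in the mapper: '#', '-' and ':' all become ';'.
table = str.maketrans("#-:", ";;;")

sample = ("2020-03-05 # 19:28 # 5:10 # A-466 # C-563 # "
          "tendremos lindo ahi fuimos derecho carajo junto acabar")

tokens = [column.strip() for column in sample.translate(table).split(";")]

# Print each token next to its index to see where the fields end up.
for index, token in enumerate(tokens):
    print(index, repr(token))

# A line that does not follow the layout (for example an empty line or a
# one-word header) produces a single token, so index 7 would not exist.
print(len("".translate(table).split(";")))
```

For a well-formed line this gives 12 tokens, with the minutes at index 5, the seconds at index 6, the caller's country letter at index 7 and the receiver's country letter at index 9. Any line with fewer than 10 tokens would raise "list index out of range" at those lookups.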
Here is the entire code of the Python file:
import sys

from mrjob.job import MRJob


class MRduracion_llamadas(MRJob):

    def mapper(self, _, line):
        """
        First we need to convert the string from the text file into a list and
        eliminate the separator characters "#", "-" and ":", which I have
        replaced with ";" to make the split easier.
        """
        table = str.maketrans("#-:", ";;;")
        llamadas2020_text_line = [column.strip() for column in
                                  line.translate(table).split(";")]
        # Debug output goes to stderr so it does not mix with the job's output.
        print(line, file=sys.stderr)
        # Now we can assign values to the key and the value.
        pais_emisor = llamadas2020_text_line[7]
        pais_receptor = llamadas2020_text_line[9]
        # If a call is "x" minutes and "y" seconds long, where y > 0, we round
        # the minutes up by 1.
        if int(llamadas2020_text_line[6]) > 0:
            minutos = int(llamadas2020_text_line[5]) + 1
        else:
            minutos = int(llamadas2020_text_line[5])
        yield (pais_emisor, pais_receptor), minutos

    def reducer(self, key, values):
        # Yield a (key, value) pair; print() returns None, so it must not be yielded.
        yield key, sum(values)


if __name__ == "__main__":
    MRduracion_llamadas.run()
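One possible way to make the parsing more defensive, sketched under the assumption that the error comes from input lines that do not have all six "#"-separated fields (a blank line or a header line, for instance). The helper name `parse_call` is hypothetical and not part of MRJob; it splits on "#" only, so each field keeps a fixed position, and returns None for anything it cannot parse.

```python
def parse_call(line):
    """Return (caller_country, receiver_country, minutes), or None if malformed."""
    fields = [field.strip() for field in line.split("#")]
    # Expected layout: date, time, duration, caller ID, receiver ID, key words.
    if len(fields) < 5:
        return None
    duration, caller_id, receiver_id = fields[2], fields[3], fields[4]
    if not caller_id or not receiver_id:
        return None
    try:
        minutes_text, seconds_text = duration.split(":")
        # Round up to the next whole minute when there are leftover seconds.
        minutes = int(minutes_text) + (1 if int(seconds_text) > 0 else 0)
    except ValueError:
        # Duration was not of the form "m:ss".
        return None
    # The first character of each customer ID is the country letter.
    return caller_id[0], receiver_id[0], minutes


print(parse_call("2020-03-05 # 19:28 # 5:10 # A-466 # C-563 # tendremos lindo"))
print(parse_call(""))
```

Inside the mapper, the result could be checked before yielding, so that malformed lines are skipped instead of raising. To count pair occurrences rather than sum minutes, the mapper would yield 1 as the value for each pair.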