I am new to map-reduce and coding, I am trying to write a code in python that would calculate the average number of characters and "#" in a tweet
Sample data:
1469453965000;757570956625870854;RT @lasteven04: La jeune Rebecca #Kpossi, nageuse, 18 ans à peine devrait être la porte-drapeau du #Togo à #Rio2016 hyperlink;Twitter for Android 1469453965000;757570957502394369;Over 30 million women footballers in the world. Most of us would trade places with this lot for #Rio2016 ⚽️ hyperlink;Twitter for iPhone
fields/columns details:
0: epoch_time 1: tweetId 2: tweet 3: device
Here is the code that I've written, I need help to calculate the average in the reducer function, any help/guidance would be really appreciated :- updated as per the answer provided by @oneCricketeer
import re
from mrjob.job import MRJob
class Lab3(MRJob):
def mapper(self,_,line):
try:
fields=line.split(";")
if(len(fields)==4):
tweet=fields[2]
tweet_id=fields[0]
yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
except:
pass
def reduce(self,tweet_id,tweet_info):
total_tweet_length=0
total_tweet_hash=0
count=0
for v in tweet_info:
tweet_length,hashes = map(int,v.split())
tweet_length_sum+= tweet_length
total_tweet_hash+=hashes
count+=1
yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))
if __name__=="__main__":
Lab3.run()