I am new to MapReduce and to coding. I am trying to write Python code that calculates the average number of characters and the average number of "#" symbols per tweet.

Sample data:

1469453965000;757570956625870854;RT @lasteven04: La jeune Rebecca #Kpossi, nageuse, 18 ans à peine devrait être la porte-drapeau du #Togo à #Rio2016 hyperlink;Twitter for Android
1469453965000;757570957502394369;Over 30 million women footballers in the world. Most of us would trade places with this lot for #Rio2016 ⚽️ hyperlink;Twitter for iPhone

fields/columns details:

 0: epoch_time  1: tweetId  2: tweet  3: device

Here is the code that I've written. I need help calculating the average in the reducer function; any help or guidance would be really appreciated. (Updated as per the answer provided by @oneCricketeer.)

import re
from mrjob.job import MRJob

class Lab3(MRJob):

def mapper(self,_,line):

    try:
        fields=line.split(";")
        if(len(fields)==4):
            tweet=fields[2]
            tweet_id=fields[0]
            yield(None,tweet_id,("{},{}".format(len(tweet),tweet.count('#')))
    except:
        pass

def reduce(self,tweet_id,tweet_info):
    total_tweet_length=0
    total_tweet_hash=0
    count=0
    for v in tweet_info:
        tweet_length,hashes = map(int,v.split())
        tweet_length_sum+= tweet_length
        total_tweet_hash+=hashes
        count+=1

    yield(total_tweet_length/(1.0*count),total_tweet_hash/(1.0*count))


if __name__=="__main__":
    Lab3.run()
OneCricketeer
horasaab

1 Answer

Your mapper needs to yield exactly two elements, a key and a value, not three. Computing the average length and the average hashtag count would ideally be separate MapReduce jobs, but in this case you can combine them into a single value, because you're processing the entire line rather than separate words:

# you could use the tweetId as the key, too, but would only help if tweets shared ids 
yield (None, "{} {}".format(len(tweet), tweet.count('#'))) 

Note: len(tweet) includes spaces and emojis, which you may want to exclude as "characters"
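To illustrate the note above, here is a minimal sketch of counting only non-whitespace characters. The helper name count_visible_chars and the sample string are made up for illustration; the sketch does not try to filter emojis, which can span more than one code point and so may add more than 1 to len(tweet).

```python
def count_visible_chars(tweet):
    """Count the characters in a tweet, skipping whitespace (hypothetical helper)."""
    return sum(1 for ch in tweet if not ch.isspace())


# made-up sample text, not taken from the data set
sample = "Go #Togo at #Rio2016"

print(len(sample))                  # counts every character, spaces included
print(count_visible_chars(sample))  # counts only non-whitespace characters
print(sample.count("#"))            # number of '#' symbols
```

If you go this way, the mapper would yield count_visible_chars(tweet) in place of len(tweet).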

Using _ as a parameter name in the mapper definition is valid Python (it is a conventional name for an argument you don't use), so that part can stay as it is.


Your reduce function is incorrect. You cannot use a string literal as a function parameter (a syntax error), nor use += on a variable that wasn't already defined (a runtime error). An average also requires you to divide only after you've totalled and counted, so yield one result per reducer call, after the loop, not one per value inside it.

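def reducer(self, key, tweet_info):  # mrjob looks for a method named reducer, not reduce
    total_tweet_length = 0
    total_tweet_hash = 0
    count = 0
    for v in tweet_info:
        # each value is "<length> <hash count>", separated by a space
        tweet_length, hashes = map(int, v.split())
        total_tweet_length += tweet_length
        total_tweet_hash += hashes
        count += 1
    # divide once, after the loop; multiplying by 1.0 forces a floating point output
    yield (total_tweet_length / (1.0 * count), total_tweet_hash / (1.0 * count))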
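If you want to check the logic without running a job, the same mapper and reducer can be sketched as plain generator functions. This is only an illustration under a few assumptions: run_locally is a hypothetical stand-in for the framework, the input records are made up in the same four-field layout as the sample data, and everything is sent to a single key as in the mapper above.

```python
def mapper(line):
    """Emit (None, "<length> <hash count>") for each well-formed line."""
    fields = line.split(";")
    if len(fields) == 4:  # epoch_time; tweetId; tweet; device
        tweet = fields[2]
        yield None, "{} {}".format(len(tweet), tweet.count("#"))


def reducer(key, values):
    """Average the lengths and hashtag counts over all values for one key."""
    total_length = 0
    total_hashes = 0
    count = 0
    for v in values:
        length, hashes = map(int, v.split())
        total_length += length
        total_hashes += hashes
        count += 1
    # divide once, after everything has been totalled and counted
    yield total_length / float(count), total_hashes / float(count)


def run_locally(lines):
    """Hypothetical stand-in for the framework: map every line, then reduce once."""
    values = [value for line in lines for _key, value in mapper(line)]
    if not values:
        return None  # no well-formed lines, so there is nothing to average
    return next(reducer(None, values))


# made-up records in the same layout as the sample data
lines = [
    "1;100;ab #x;Twitter for Android",
    "2;101;#a #b cd;Twitter for iPhone",
    "malformed line without enough fields",
]
print(run_locally(lines))
```

Lines that don't split into exactly four fields are skipped, so a tweet that itself contains a ";" would be dropped by this check.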
OneCricketeer
  • Many thanks @OneCricketeer for your help, much appreciated :) I've updated the code but I'm still getting a syntax error on "except:". Any suggestions as to why I am getting this error? – horasaab Mar 26 '21 at 00:18
  • My answer doesn't use try-except statements, but I think you're missing a closing parenthesis on the mapper's yield line, and you're still yielding 3 things, not 2. Plus, if you're using a comma in the mapper, you need to split on a comma in the reducer – OneCricketeer Mar 26 '21 at 22:33
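
The delimiter point in the last comment can be seen in a small sketch. The helper parse_pair is hypothetical and only mirrors the split-and-convert step from the reducer; the values are made up.

```python
def parse_pair(value, sep=None):
    """Split a "<length><sep><hashes>" string into two ints (hypothetical helper)."""
    length, hashes = map(int, value.split(sep))
    return length, hashes


print(parse_pair("12 3"))       # space-joined value, default whitespace split
print(parse_pair("12,3", ","))  # comma-joined value, split on the comma

try:
    parse_pair("12,3")          # comma-joined value, but split on whitespace
except ValueError as err:
    print("mismatched delimiter:", err)
```

With a mismatched delimiter the whole string stays in one piece, so the conversion to int fails before any total is updated.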