MapReduce job that calculates chi-square values in Python

Question

I am writing a MapReduce job in Python using MrJob. Here is a sample of my dataset, which is in JSON with one review per line:

{"reviewerID": "A2VNYWOPJ13AFP", "asin": "0981850006", "reviewerName": "Amazon Customer \"carringt0n\"", "helpful": [6, 7], "reviewText": "This was a gift for my other husband.  He's making us things from it all the time and we love the food.  Directions are simple, easy to read and interpret, and fun to make.  We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page.  Have at it.  You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons.  Yum!!", "overall": 5.0, "summary": "Delish", "unixReviewTime": 1259798400, "reviewTime": "12 3, 2009", "category": "Patio_Lawn_and_Garde"}
{"reviewerID": "A2E5XXXC07AGA7", "asin": "B00002N66D", "reviewerName": "James", "helpful": [1, 1], "reviewText": "This is a very nice spreader.  It feels very solid and the pneumatic tires give it great maneuverability and handling over bumps.  The control arm is solid metal, not a cable, which gives you precise control and will last a long time.  The settings take some experimentation with your various products to get it right, but that is true of any spreader.  It has good distribution... probably flings material a little farther on the right side than the left, but it is far more even than my crappy Edgeguard ever was.", "overall": 5.0, "summary": "Nice spreader", "unixReviewTime": 1354492800, "reviewTime": "12 3, 2012", "category": "Patio_Lawn_and_Garde"}

I have to preprocess the data and calculate the Chi-Square value for each unigram in each category, order them by their value and then merge the categories.
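
For reference, here is a minimal sketch of the chi-square statistic for a single (unigram, category) pair. It assumes the 2x2 contingency formulation commonly used for feature selection in text classification; the question does not state which formulation is required. The function name `chi_square` and the counts `a`, `b`, `c`, `d` are hypothetical, and the numbers in the call are made up for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one (term, category) pair.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # A marginal total is zero, so the statistic is undefined; report 0.0.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Hypothetical counts, for illustration only.
print(chi_square(a=30, b=10, c=20, d=40))
```

Under this formulation, the job needs, per unigram and category, the number of reviews containing the unigram, plus the number of reviews per category and the total number of reviews; `b`, `c` and `d` follow from those.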

So far I have only done the preprocessing:

from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

# Load the filter lists once, at import time.
with open('stopwords.txt') as f:
    stopwords = set(f.read().splitlines())

with open('delimiters.txt') as f:
    delimiters = f.read().splitlines()


class MRChiSquareTest(MRJob):

    def mapper(self, _, line):
        # Emit every token in the raw line, including JSON keys.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # Keep words that are not stopwords and contain no delimiter.
        if word and word not in stopwords and not any(x in word for x in delimiters):
            yield (word, sum(counts))


if __name__ == '__main__':
    MRChiSquareTest.run()

After running my code I get the following output (shortened):

"reviews"       1
"reviewtext"    22
"reviewtime"    22
"ripe"  1
"roldan"        1
"roller"        1
"room"  4
"rust"  1
"sand"  2

The output also counts the JSON keys and the category names as words, and the word counts are not separated by category.

How would I iterate through the dataset to get the values needed to calculate the chi-square value for each unigram?
How can I read the categories and attach them to the values? How can I keep the categories separate? Should one of the steps yield the category together with an array that contains each word and its summed count?
How can I merge the categories after the calculation?

Is there a particular reason you are not using the json library to parse each line, rather than applying a regex over the entire JSON object string? For example, `reviewtime` and `reviewtext` probably should not be in your output... Where is your method for calculating any chi-square value? All you're doing is running a word count, where words are already "merged", and you are discarding the category completely. – OneCricketeer, Apr 20 '22 at 21:50
No particular reason; that is how I saw it in the MrJob documentation. I haven't implemented the method yet because I don't understand how to iterate over the dataset. And yes, I know it also includes the categories; that was in the question. – Ario, Apr 21 '22 at 10:35
I'm saying your output ignores the category, so your question is unclear. The dataset is read line by line, and each line is passed as the `line` argument of the mapper function. So you can do `json.loads(line).get("category")`, for example. However, I'd recommend using Spark rather than MapReduce anyway: https://spark.apache.org/docs/latest/sql-data-sources-json.html – OneCricketeer, Apr 21 '22 at 13:09
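
A minimal sketch of the `json.loads` suggestion from the comments, written as a plain function so it runs without MrJob. The name `map_review` is hypothetical; in the job from the question, the same logic would presumably sit in the body of `mapper(self, _, line)`. The field names `reviewText` and `category` come from the sample data, and counting each token once per review is an assumption made here so the counts can feed a document-based chi-square calculation.

```python
import json
import re

WORD_RE = re.compile(r"[\w']+")


def map_review(line):
    """Yield ((category, token), 1) once per distinct token in a review."""
    review = json.loads(line)
    category = review.get("category")
    text = review.get("reviewText", "")
    # A set gives document frequency: each token counts once per review.
    for token in sorted({w.lower() for w in WORD_RE.findall(text)}):
        yield (category, token), 1


# Made-up one-line input in the same shape as the dataset.
sample = '{"reviewText": "Nice spreader. Very nice.", "category": "Patio_Lawn_and_Garde"}'
for key, count in map_review(sample):
    print(key, count)
```

Because only `reviewText` is tokenized, JSON keys such as `reviewtext` and `reviewtime` no longer appear as words, and keying on the (category, token) pair keeps the counts separated by category when they are summed in a reducer.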
