I am writing a MapReduce job in Python using mrjob. Here is a sample of my dataset, which is in JSON with one review per line:
{"reviewerID": "A2VNYWOPJ13AFP", "asin": "0981850006", "reviewerName": "Amazon Customer \"carringt0n\"", "helpful": [6, 7], "reviewText": "This was a gift for my other husband. He's making us things from it all the time and we love the food. Directions are simple, easy to read and interpret, and fun to make. We all love different kinds of cuisine and Raichlen provides recipes from everywhere along the barbecue trail as he calls it. Get it and just open a page. Have at it. You'll love the food and it has provided us with an insight into the culture that produced it. It's all about broadening horizons. Yum!!", "overall": 5.0, "summary": "Delish", "unixReviewTime": 1259798400, "reviewTime": "12 3, 2009", "category": "Patio_Lawn_and_Garde"}
{"reviewerID": "A2E5XXXC07AGA7", "asin": "B00002N66D", "reviewerName": "James", "helpful": [1, 1], "reviewText": "This is a very nice spreader. It feels very solid and the pneumatic tires give it great maneuverability and handling over bumps. The control arm is solid metal, not a cable, which gives you precise control and will last a long time. The settings take some experimentation with your various products to get it right, but that is true of any spreader. It has good distribution... probably flings material a little farther on the right side than the left, but it is far more even than my crappy Edgeguard ever was.", "overall": 5.0, "summary": "Nice spreader", "unixReviewTime": 1354492800, "reviewTime": "12 3, 2012", "category": "Patio_Lawn_and_Garde"}
I have to preprocess the data, calculate the chi-square value for each unigram in each category, order the unigrams by that value, and then merge the categories.
So far I have only done the preprocessing:
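To make clear which statistic I mean: a minimal sketch of the per-unigram calculation, assuming the usual 2x2 contingency formulation used for term/category feature selection. The function name `chi_square` and the counts `a`, `b`, `c`, `d` are my own illustrative names, not something from mrjob.

```python
def chi_square(a, b, c, d):
    """Chi-square for one (term, category) pair from a 2x2 contingency table.

    a: reviews in the category that contain the term
    b: reviews outside the category that contain the term
    c: reviews in the category that do not contain the term
    d: reviews outside the category that do not contain the term
    """
    n = a + b + c + d
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # A row or column of the table is empty, so the statistic is undefined;
        # returning 0.0 here is a convention chosen for this sketch.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator
```

So for every unigram and category I would need these four counts, which means the job has to know the number of reviews per category, the total number of reviews, and how many reviews per category contain each unigram.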
from mrjob.job import MRJob
from mrjob.step import MRStep
import re

WORD_RE = re.compile(r"[\w']+")

# Load the stopword and delimiter lists once, at import time.
with open('stopwords.txt') as f:
    stopwords = f.read().splitlines()

with open('delimiters.txt') as f:
    delimiters = f.read().splitlines()


class MRChiSquareTest(MRJob):

    def mapper(self, _, line):
        # Tokenizes the whole raw JSON line, so field names are counted too.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def reducer(self, word, counts):
        # Keep non-empty words that are not stopwords and contain no delimiter.
        if len(word) != 0 and word not in stopwords and not any(x in word for x in delimiters):
            yield (word, sum(counts))


if __name__ == '__main__':
    MRChiSquareTest.run()
After running my code I get the following output (shortened):
"reviews" 1
"reviewtext" 22
"reviewtime" 22
"ripe" 1
"roldan" 1
"roller" 1
"room" 4
"rust" 1
"sand" 2
The output also counts the JSON field names (for example "reviewtext" and "reviewtime") and the category names as words, and the counts are not separated by category.
- How would I iterate through the dataset to get the values needed to calculate the chi-square value for each unigram?
- How can I read the categories and attach them to the counts so that the categories stay separate? Should one of the steps yield the category together with an array containing each word and its summed count?
- How can I merge the categories after the calculation?
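To illustrate the kind of separation I am asking about, here is a rough sketch of what the map step might look like if each line were parsed as JSON instead of tokenized raw. It is plain Python so it can be run without mrjob; the function name `map_review`, the `(category, word)` key layout, and the extra `(category, None)` record are my own assumptions, as is counting each word at most once per review.

```python
import json
import re

WORD_RE = re.compile(r"[\w']+")


def map_review(line, stopwords=frozenset()):
    """Yield ((category, word), 1) once per distinct word in one review line."""
    review = json.loads(line)
    category = review["category"]
    # Tokenize only the review text, so field names never become words.
    words = {w.lower() for w in WORD_RE.findall(review["reviewText"])}
    for word in sorted(words):
        if word not in stopwords:
            yield (category, word), 1
    # One extra record per review, so a later step can count reviews per category.
    yield (category, None), 1
```

Inside the job, the body of this function would presumably go into `mapper(self, _, line)`, with a reducer that sums the counts per `(category, word)` key. I am unsure whether this key layout is the right one for the later chi-square and merge steps.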