
A friend and I are working with a rather large JSON file. We want to run MapReduce over parts of this file, as quickly as possible. Since it appears to be hard to feed a JSON file directly into an mrjob job, we tried writing the needed data to a text file first (one line per array element extracted from the JSON). This intermediate step takes far too long because of the disk writes.
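
For context, the intermediate step is essentially the following (a simplified sketch; it assumes the array lives under a top-level "data" key, matching the prefixes in read_json.py further down):

import ijson
import json

# stream array elements out of the large JSON file and write
# one JSON document per line; the writes here are the bottleneck
with open('testData.json') as f, open('data.txt', 'w') as out:
    for item in ijson.items(f, 'data.item'):
        out.write(json.dumps(item) + '\n')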

Below is an example of our mrjob test file.

from mrjob.job import MRJob
import json

class ReduceData(MRJob):

    def mapper(self, _, line):
        # each input line is a JSON array; emit its third element with a count of 1
        lineJSON = json.loads(line)
        yield lineJSON[2], 1

    def reducer(self, key, values):
        # sum the counts emitted for each key
        yield key, sum(values)

if __name__ == '__main__':
    ReduceData.run()

The code above is run as follows:

$ python reducedata.py data.txt
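
(As far as we can tell, mrjob also reads from stdin when no input file is given, i.e. $ python reducedata.py < data.txt works too, which makes us suspect a pipe-based approach should be possible.)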

read_json.py is shown below:

import ijson

f = open('testData.json')
parser = ijson.parse(f)

if __name__ == '__main__':
    for prefix, event, value in parser:
        if (prefix, event) == ('data.item', 'start_array'):
            # a new inner array begins
            item = []
        elif prefix == 'data.item.item' and value is not None:
            # collect the values of the current inner array
            item.append(value)
        elif (prefix, event) == ('data.item', 'end_array'):
            # the inner array is complete
            # yield data as output, or something?
            item = []

With the above in mind, I have two questions:

1) Is there a way to feed the output of read_json.py into reducedata.py as input, without any write-to-disk operations?

2) If 1) is possible, how do I specify the output? mrjob expects a file and invokes the mapper line by line; each yield (see the bottom comment) in read_json.py is supposed to become one such "line".
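
To make 2) concrete, here is a sketch of what we have in mind (untested, and it assumes mrjob will happily consume stdin): read_json.py would print one JSON document per line, and its output would be piped straight into the job.

import ijson
import json
import sys

def read_items(path):
    # yield each completed inner array from the top-level 'data' array
    with open(path) as f:
        item = []
        for prefix, event, value in ijson.parse(f):
            if (prefix, event) == ('data.item', 'start_array'):
                item = []
            elif prefix == 'data.item.item' and value is not None:
                item.append(value)
            elif (prefix, event) == ('data.item', 'end_array'):
                yield item

if __name__ == '__main__':
    for item in read_items('testData.json'):
        # one JSON document per line on stdout; each becomes one "line" for the mapper
        sys.stdout.write(json.dumps(item) + '\n')

That would be run as:

$ python read_json.py | python reducedata.py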

Thanks in advance!

-Superdids
