I'm new to MapReduce and MRjob, and I am trying to read a CSV file that I want to process using MRjob in Python. About 5 of its columns contain JSON strings (e.g. {}) or arrays of JSON strings (e.g. [{},{}]), some of which are nested.
My mapper so far looks as follows:
from mrjob.job import MRJob
import csv
from io import StringIO

class MRWordCount(MRJob):
    def mapper(self, _, line):
        l = StringIO(line)
        reader = csv.reader(l)  # returns an iterator over the parsed rows
        for cols in reader:
            columns = cols
        yield None, columns
I get the following error:
_csv.Error: field larger than field limit (131072)
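For context, 131072 is the default per-field size limit of Python's csv module. Below is a minimal sketch of inspecting and raising it with csv.field_size_limit. The new value of 10,000,000 is an arbitrary assumption, and raising the limit would only hide the symptom if the fields are being parsed incorrectly in the first place.

```python
import csv

# With no argument, csv.field_size_limit() returns the current limit.
default_limit = csv.field_size_limit()
print(default_limit)

# With an argument, it sets a new limit and returns the previous one.
# 10_000_000 is an arbitrary value chosen for illustration only.
previous_limit = csv.field_size_limit(10_000_000)
print(csv.field_size_limit())
```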
The error seems to occur because my code also splits the JSON strings into separate columns (because of the commas inside them).
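To illustrate the splitting, here is a small sketch using only the standard library. The two sample rows and the parse_row helper are made up for illustration and are not taken from my real file. As far as I understand, csv.reader keeps a JSON string in one field only when that field is wrapped in double quotes with the inner quotes doubled; otherwise it splits at every comma.

```python
import csv
import json
from io import StringIO


def parse_row(line):
    """Parse a single CSV line and return its fields as a list of strings."""
    return next(csv.reader(StringIO(line)))


# Hypothetical sample rows, not taken from the real file.
# JSON field wrapped in double quotes, with inner quotes doubled (CSV escaping).
quoted = '1,"{""a"": 1, ""b"": [2, 3]}",x'
# The same JSON written without any CSV quoting.
unquoted = '1,{"a": 1, "b": [2, 3]},x'

quoted_fields = parse_row(quoted)
unquoted_fields = parse_row(unquoted)

print(len(quoted_fields))    # the JSON stays in a single field
print(len(unquoted_fields))  # the JSON is split at every comma

# A field that survived intact can be decoded as JSON afterwards.
print(json.loads(quoted_fields[1]))
```

So whether the JSON is split may depend on how the file was quoted when it was written.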
How do I make sure that the JSON strings are not split? Maybe I'm overlooking something?
Alternatively, are there any other ways to read this file with MRjob that would make this process simpler or cleaner?