I'm new to MapReduce and mrjob. I am trying to read a CSV file that I want to process using mrjob in Python. About 5 of its columns contain JSON strings (e.g. {}) or arrays of JSON strings (e.g. [{},{}]), and some of them are nested.

My mapper so far looks as follows:

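from mrjob.job import MRJob
import csv
from io import StringIO

class MRWordCount(MRJob):
    def mapper(self, _, line):
        # mrjob passes one raw line of the input file at a time.
        reader = csv.reader(StringIO(line))  # iterator over parsed rows

        for columns in reader:
            # Yield inside the loop so an empty line does not leave
            # `columns` undefined.
            yield None, columns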

I get this error:

_csv.Error: field larger than field limit (131072)

That seems to happen because my code also splits the JSON strings into separate columns (because of the commas inside them).

How do I make sure the JSON strings are not split? Maybe I'm overlooking something?

Alternatively, are there any other ways I could read this file with mrjob that would make this process simpler or cleaner?

Rabbir
  • What about converting your CSV to a PSV (pipe-separated) file? – Evhz Mar 24 '19 at 11:14
  • @Evhz could you elaborate please? – Rabbir Mar 24 '19 at 11:33
  • Replace the commas in the CSV file with a pipe `|` character, except for commas that sit inside the JSON `{ }` delimiters. Then you could use a notation like `load_csv(sep='|')` to load your content. – Evhz Mar 24 '19 at 12:26
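
One way to sketch the pipe-separated conversion suggested in the comments is to let Python's standard `csv` module do the re-delimiting, since it already leaves commas inside quoted fields alone. The helper name `csv_to_psv` and the sample data are hypothetical, and this assumes the JSON columns are quoted in the source file:

```python
import csv
from io import StringIO

def csv_to_psv(csv_text):
    """Re-write comma-separated text as pipe-separated text (hypothetical helper)."""
    reader = csv.reader(StringIO(csv_text))
    out = StringIO()
    # The writer re-quotes any field that contains a quote or a pipe.
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()

# Hypothetical two-line sample: a header and one row with a quoted JSON column.
sample = 'id,device\n1,"{""browser"": ""Firefox"", ""os"": ""Linux""}"\n'
psv = csv_to_psv(sample)

# Reading the result back with the new delimiter recovers the JSON string whole.
rows = list(csv.reader(StringIO(psv), delimiter="|"))
print(rows[1][1])
```

If the JSON columns were not quoted at all, the reader would still split them on commas, so this only helps when the quoting is already in place.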

1 Answer

Your JSON string is not surrounded by quote characters, so every comma in that field makes the csv engine think it's a new column. Take a look at the csv reader's `quotechar` option, which is what you are looking for: change your data so that your JSON is surrounded by a special character (the default is `"`) and adjust your csv reader accordingly.
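
As a minimal sketch of what `quotechar` does, the snippet below parses one line whose JSON column is wrapped in double quotes, with inner quotes doubled (the form shown in the comments further down). The sample row and the helper name `parse_line` are made up for illustration; only `csv.reader` and `json.loads` come from the standard library:

```python
import csv
import json
from io import StringIO

# Hypothetical sample row: an id, a quoted JSON column, and a region.
line = '42,"{""browser"": ""Firefox"", ""deviceCategory"": ""desktop""}",EMEA'

def parse_line(line, quotechar='"'):
    """Parse a single CSV line into a list of column strings."""
    reader = csv.reader(StringIO(line), quotechar=quotechar)
    return next(reader)

columns = parse_line(line)
# The doubled quotes are collapsed by the reader, leaving valid JSON.
device = json.loads(columns[1])

print(len(columns))
print(device["browser"])
```

With the quoting in place, the commas inside the JSON stay within one column and the line splits into three fields.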

Nullman
  • I just took a sample of the data and looked at it in Notepad; the JSON strings are surrounded by double quotation marks ( " ). – Rabbir Mar 24 '19 at 11:31
  • And `"` doesn't appear inside the JSON? – Nullman Mar 24 '19 at 11:34
  • In some cases they don't, like this: "[{'index': '4', 'value': 'EMEA'}]". But in some cases they do, like this: "{""browser"": ""Firefox"", ""browserVersion"": ""not available in demo dataset"", ""deviceCategory"": ""desktop""}" – Rabbir Mar 24 '19 at 11:34
  • Interesting, that should parse correctly. Try adding `quotechar='"'` to your `csv.reader` call; perhaps your default is something different. – Nullman Mar 24 '19 at 11:36
  • If even some of them have `"` inside the JSON, it will ruin everything. You should wrap your JSON in a different character, one that isn't in the JSON string. – Nullman Mar 24 '19 at 11:37
  • Any suggestions as to how I could change the double quotation marks on the inside? – Rabbir Mar 24 '19 at 11:40
  • Since you know how many columns you have, you can find the indices of the JSON substring by counting commas from the beginning and from the end. Then just replace the characters. – Nullman Mar 24 '19 at 11:46
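
The comma-counting idea from the last comment could look roughly like the sketch below. It assumes each row has a known number of columns, exactly one of which holds JSON, and that no other column contains a comma; with several JSON columns per row (as in the question) it would not be enough on its own. The function name `split_around_json` and the sample row are hypothetical:

```python
def split_around_json(line, n_columns, json_index):
    """Split `line` into n_columns fields, treating column `json_index`
    as free text that may itself contain commas."""
    n_after = n_columns - json_index - 1
    # Peel off the columns before the JSON field, counting commas from the left.
    parts = line.split(",", json_index)
    head, rest = parts[:json_index], parts[json_index]
    # Peel off the columns after it, counting commas from the right.
    if n_after:
        tail_parts = rest.rsplit(",", n_after)
        middle, tail = tail_parts[0], tail_parts[1:]
    else:
        middle, tail = rest, []
    return head + [middle] + tail

# Hypothetical row: 4 columns, with unquoted JSON-like text in column 1.
row = "42,[{'index': '4', 'value': 'EMEA'}],desktop,2019"
fields = split_around_json(row, n_columns=4, json_index=1)
print(fields[1])
```

Once the JSON substring is isolated this way, its inner quote characters can be replaced without touching the rest of the row.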