I'm trying to load a very large jsonl file (>50 GB) using chunks in pandas

import pandas as pd

reader = pd.read_json("January.jsonl", lines=True, chunksize=10000)

for chunk in reader:
    df = chunk  # each chunk is a DataFrame of up to 10000 rows

This code starts, runs for a while, and then raises this error:

 self._parse_no_numpy()

  File "C:\Users\anaconda3\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None

ValueError: Expected object or value

Is there a problem with my file, or is something else going wrong? (Screenshot: sample from my file.)

  • What does your JSON data look like? – forgetso Dec 09 '20 at 19:55
  • It's a database of tweets. I added a screenshot of the first lines, since they are quite long and not very useful to copy-paste. This jsonl was generated with [Hydrator](https://github.com/DocNow/hydrator). Before opening an issue on their GitHub I wanted to be sure that my code was OK. Is it OK, or am I doing something wrong in these two lines? Note that the file is a .txt only because the original .jsonl was too large and I had to sample the first lines with R – Leonardo Sanna Dec 09 '20 at 23:56

1 Answer

You seem to have malformed JSON data in your file. For example, try loading the following "JSON" data. Note that the line with id 77 is malformed: a stray i precedes "created_at".

{"created_at": "2019-01-01 23:45:01", "id":1}
{"created_at": "2019-01-01 23:45:01", "id":2}
{"created_at": "2019-01-01 23:45:01", "id":3}
{"created_at": "2019-01-01 23:45:01", "id":4}
{"created_at": "2019-01-01 23:45:01", "id":5}
{"created_at": "2019-01-01 23:45:01", "id":6}
{"created_at": "2019-01-01 23:45:01", "id":7}
{"created_at": "2019-01-01 23:45:01", "id":8}
{"created_at": "2019-01-01 23:45:01", "id":11}
{"created_at": "2019-01-01 23:45:01", "id":22}
{"created_at": "2019-01-01 23:45:01", "id":33}
{"created_at": "2019-01-01 23:45:01", "id":44}
{"created_at": "2019-01-01 23:45:01", "id":55}
{"created_at": "2019-01-01 23:45:01", "id":66}
{i"created_at": "2019-01-01 23:45:01", "id":77}

{"created_at": "2019-01-01 23:45:01", "id":88}
{"created_at": "2019-01-01 23:45:01", "id":99}

Then run this code.

>>> import pandas as pd
>>> reader = pd.read_json("January.jsonl", lines=True, chunksize=1)
>>> for r in reader:
...     print(r)

And view the output:

12 2019-01-01 23:45:01  55
            created_at  id
13 2019-01-01 23:45:01  66
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 779, in __next__
    obj = self._get_object_parser(lines_json)
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 753, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 857, in parse
    self._parse_no_numpy()
  File "/home/user/anaconda3/envs/project/lib/python3.7/site-packages/pandas/io/json/_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

The error is the same as the one you received, so you will need to find the malformed data and fix it. One way is to read the file line by line, try to parse each line as JSON, and print the number and content of every line that fails so you can inspect it.

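import json

# Iterate over the file lazily; readlines() would load all 50+ GB into memory.
with open("January.jsonl", encoding="utf-8") as f:  # assumes UTF-8 text
    for line_no, line in enumerate(f):  # line_no is zero-based
        if not line.strip():
            continue  # blank lines are not JSON records
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            print(line_no)
            print(line)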
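
If there turn out to be many bad lines, printing them may not be practical. A possible follow-up, sketched here under some assumptions, is to copy the parseable lines into a new file and set the rest aside for inspection. The function name `split_valid_lines` and the output file names are hypothetical, the file is assumed to be UTF-8, and the check uses Python's standard `json` module, which may not accept or reject exactly the same inputs as the parser pandas uses internally.

```python
import json


def split_valid_lines(src_path, clean_path, rejects_path, encoding="utf-8"):
    """Copy parseable JSON lines to clean_path and the rest to rejects_path.

    Hypothetical helper: returns (n_valid, n_rejected). Blank lines are
    dropped and not counted. Rejected lines are written as
    "<1-based line number><TAB><line>" so they can be located in the source.
    """
    n_valid = n_rejected = 0
    with open(src_path, encoding=encoding) as src, \
         open(clean_path, "w", encoding=encoding) as clean, \
         open(rejects_path, "w", encoding=encoding) as rejects:
        # Stream line by line so memory use stays flat for very large files.
        for line_no, line in enumerate(src, start=1):
            stripped = line.strip()
            if not stripped:
                continue
            try:
                json.loads(stripped)
            except ValueError:  # json.JSONDecodeError subclasses ValueError
                rejects.write(f"{line_no}\t{stripped}\n")
                n_rejected += 1
            else:
                clean.write(stripped + "\n")
                n_valid += 1
    return n_valid, n_rejected
```

For example, `split_valid_lines("January.jsonl", "January.clean.jsonl", "January.rejects.tsv")` would leave a cleaned copy that can be passed to `pd.read_json(..., lines=True, chunksize=10000)` as in the question, while the rejects file shows which lines need fixing at the source.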
forgetso