
I am working with huge JSON files, ranging from 100 to 300 MB in size. To save disk space (and perhaps computation time?), I converted each JSON file into a .json.gz file and proceeded like this:

import gzip, json

with gzip.GzipFile(json_file, 'r') as f:
    return json.loads(f.read().decode('utf-8'))

json.loads didn't cause any issues with memory usage, but I would like to increase the speed, so I tried py-yajl (not to be confused with yajl-py, which I tried as well, but that took much longer since I was parsing the streamed JSON), like this:

yajl.loads(f.read().decode('utf-8'))

Several sites I have seen claim that yajl is faster than the json and simplejson libraries, but I couldn't see any improvement in execution time; on the contrary, it took slightly longer than json. Am I missing something here? In what cases is yajl supposed to be faster than json/simplejson? Does the speed depend on the structure of the JSON file as well?
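
For what it's worth, here is a minimal timing sketch of how the two parsers can be compared on identical input (my own illustration; `sample.json.gz` is a placeholder path). It isolates the parse step from the decompression and UTF-8 decoding:

import gzip
import json
import timeit

import yajl  # py-yajl

with gzip.GzipFile('sample.json.gz', 'r') as f:
    raw = f.read().decode('utf-8')

# Time only the parsing; decompression and decoding happen once, above.
print('json:', timeit.timeit(lambda: json.loads(raw), number=3))
print('yajl:', timeit.timeit(lambda: yajl.loads(raw), number=3))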

My JSON file looks like this:

[
    {
        "bytes_sent": XXX,
        "forwardedfor": "-",
        "hostip": "XXX",
        "hostname": "XXX",
        "https": "on",
        "landscapeName": "XXX",
    },
    ...
]

I am aware that this is a subjective question and is likely to be closed, but I couldn't clear up my doubts anywhere else, and at the same time I would like to learn about the differences between these libraries in more detail, hence I am asking here.

Jarvis
  • `yajl.loads(f.read().decode('utf-8'))` != `gzip.GzipFile(json_file, 'r').read().decode('utf-8')` – dsgdfg Feb 01 '19 at 06:35

1 Answer


If you are reading the entire structure into memory at once anyway, an external library offers no benefit. The motivation for a tool like that is that it lets you process the data piecemeal, without having to load the whole document into memory first, or at all. If your JSON is a list of things, process one thing at a time, via the callbacks the library offers.
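
For illustration, here is a minimal sketch using ijson, a streaming JSON parser that can also sit on top of yajl (not the library from your question, and `sample.json.gz` and `process` are placeholders). It consumes a gzipped top-level array one object at a time:

import gzip

import ijson

with gzip.open('sample.json.gz', 'rb') as f:  # placeholder path
    # 'item' addresses each element of the top-level JSON array,
    # so only one object is materialized at a time.
    for record in ijson.items(f, 'item'):
        process(record)  # hypothetical per-record handler

Peak memory then stays proportional to one object rather than the whole file; whether this is faster depends on what you do per record, but it avoids building one huge Python list up front.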

tripleee
  • My JSON is a list of objects actually, like I posted in the question above. I tried streaming this JSON using the `yajl-py` library and writing appropriate callback methods to handle the data inside the objects, but it didn't speed up the process. I am looking to actually speed things up, with memory usage being a second priority. – Jarvis Feb 01 '19 at 06:31
  • The `f.read()` already loads the entire thing into memory. You want to give the library a file handle and let it consume it as it goes. This is all in the documentation (which I read for the first time some three minutes ago). – tripleee Feb 01 '19 at 06:34
  • Thanks, I will try that. It slipped my mind, that one. But the thing is, it's a gzip file. How do I proceed in that case? – Jarvis Feb 01 '19 at 06:35
  • The `with` in your question shows exactly how to open a file handle for reading as `gz`. – tripleee Feb 01 '19 at 06:36
  • It still didn't help. I even tried [this](https://pastebin.com/vUqY1bFJ). Am I mistaken somewhere? – Jarvis Feb 01 '19 at 06:56
  • Not really in a place where I can test that. The `pass` statements certainly look superfluous, but that's inconsequential. Maybe accept this answer (or post one of your own and accept that if you prefer) and ask a new question with your actual code (including `import`s) and enough data to repro? See also the guidance for creating a [mcve]. – tripleee Feb 01 '19 at 07:08