
According to the official documentation (https://pypi.org/project/jsonslicer/), the basic configuration of Json Slicer yields 586.5K objects/sec, ijson with the pure Python back-end yields 32.2K objects/sec, and ijson with the C back-end (ijson.yajl2_cffi) yields 75.7K objects/sec.

When I ran the two libraries on a deeply nested 5 GB JSON file, I expected Json Slicer, being a wrapper around YAJL (https://lloyd.github.io/yajl/), to run faster than ijson with its pure Python implementation. However, Json Slicer took 607.8014905452728 sec, while ijson took 308.19292974472046 sec.

According to various sources (https://lpetr.org/2016/05/30/faster-json-parsing-python-ijson/ , http://explique.me/Ijson/), ijson with the C back-end should be faster than ijson with the pure Python back-end. However, ijson with the C back-end took 2016.68796280378929 sec.

This behaviour was observed over multiple runs, on different occasions, and on different sets of JSON files of various sizes.

My system configuration is Intel i7 with 20GB RAM. Multiprocessing was not used during the execution of the script.

Can somebody please explain the root cause of this strange behaviour, and suggest a solution?

2 Answers


All of these libraries and implementations perform some kind of pattern match to seek out the parts of the file you want to inspect. Then the selected elements must be acted on, typically by promoting the content into a Python data structure.

By default, ijson copies a sub-document into a Python dict/list once the initial pattern match has occurred. This incurs a cost. If you want to manipulate part of the document in Python (rather than extract a scalar/string), you still need to convert it. With a C implementation doing the parsing, this means turning C values into Python objects.

When working with a deep data structure of which you must load a large part, you might find that the benefits of parsing the file in C are dwarfed by the cost of converting the result into Python, whereas the pure Python implementation bypasses some of the conversion costs.

jsonslicer is (for me) incredibly fast for simple extraction tasks, but it too will hit similar limitations as the complexity of the extraction grows. I have a 970 MB uncompressed JSON file consisting of a list of documents. In my case, a trivial string selection from a 3rd level property looks like this:

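import gzip

import ijson
from jsonslicer import JsonSlicer

# jsonslicer: yield only the 'key' value of each element in the top-level list
with gzip.open('big_docs.json.gz') as file:
    for thing in JsonSlicer(file, (None, 'key')):
        # print(thing)
        if thing == 'special_value':
            pass

# ijson: build each top-level element as a Python object, then look up 'key'
with gzip.open('big_docs.json.gz') as file:
    for thing in ijson.items(file, 'item'):
        if thing['key'] == 'special_value':
            pass

Timings:

  • ijson (using default C binding) - 363s
  • ijson (pure Python) - 686s
  • jsonslicer - 15s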

In this unfair comparison jsonslicer is able to avoid converting anything but a string into Python at almost zero cost, so the result is startlingly fast. My naive ijson usage must do a lot more work.
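A minimal sketch of a less naive ijson call, assuming the value of interest sits at the prefix 'item.key' (each element of the top-level list has a 'key' member). Pointing ijson.items at the scalar itself means only that value is turned into a Python object, rather than the whole element. The helper names count_matches and make_sample and the sample data are hypothetical, and I have not timed this variant against the figures above.

```python
import io


def count_matches(stream, wanted, prefix="item.key"):
    """Count values at `prefix` equal to `wanted` without building whole documents."""
    import ijson  # third-party; imported here so the sketch loads without it

    # A narrower prefix makes ijson materialise only the matched scalar,
    # not the entire dict for every element of the top-level list.
    return sum(1 for value in ijson.items(stream, prefix) if value == wanted)


def make_sample():
    """Return a small in-memory stand-in for the real file (hypothetical data)."""
    return io.BytesIO(
        b'[{"key": "special_value", "nested": {"deep": [1, 2, 3]}},'
        b' {"key": "other", "nested": {}}]'
    )
```

With the real file, the stream would be the object returned by gzip.open('big_docs.json.gz'), for example count_matches(file, 'special_value').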

In this instance, if I dig two layers deeper without changing the interaction, I get:

  • ijson - 350s
  • ijson (pure Python) - 667s
  • jsonslicer - 16s

jsonslicer slows down a bit because it has more to check. ijson has more to check as well, but it gains time overall because it builds smaller document sections. More complex uses change the balance. Proving that converting from C to Python is why your unseen code is slower would be difficult, but it might be contributing!
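One way to gather evidence is to time nothing but the consumption of each iterator, keeping the input file and the prefix fixed and changing only the library or backend. The helper below is a sketch using only the standard library; time_iteration is a hypothetical name, and the caller supplies a zero-argument function that opens the file and returns the iterator to measure.

```python
import time


def time_iteration(make_iterable, repeats=3):
    """Return the best wall-clock time in seconds to fully consume make_iterable()."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _item in make_iterable():
            pass  # consume only; any per-item work added here would be timed too
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the helper can be tried without any JSON library.
elapsed = time_iteration(lambda: range(100_000))
```

For the comparison above, make_iterable would be something like lambda: ijson.items(gzip.open('big_docs.json.gz'), 'item'). Taking the best of several runs reduces noise from disk caching, though on multi-gigabyte files a single run may be all that is practical.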

nerdstrike

Be aware that since 2.4 ijson has a new yajl2_c backend written entirely in C (i.e., not using cffi or ctypes), which is ~10x faster than the other ones. Since 2.5 this is the backend picked by default if it is present in your installation (probably what happened in your case, hence the faster execution you saw). And since 2.6 there is a new kvitems method that avoids constructing a full object and instead iterates over its members, which is useful in some situations.
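As a small illustration of kvitems, here is a sketch that walks the members of one object without first building that object as a single dict. The wrapper iter_members and the sample document are hypothetical; ijson.kvitems(file, prefix) is the library call, and each member's value is still built as a regular Python object.

```python
import io


def iter_members(stream, prefix):
    """Yield (key, value) pairs of the object found at `prefix`, one member at a time."""
    import ijson  # third-party; kvitems needs ijson >= 2.6

    yield from ijson.kvitems(stream, prefix)


def make_sample():
    """Return a tiny in-memory document (hypothetical data)."""
    return io.BytesIO(b'{"earth": {"europe": 1, "america": 2}}')
```

Calling iter_members(make_sample(), 'earth') yields the pairs for 'europe' and 'america' in document order, so a very large object can be processed member by member.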

Historically there was no yajl2_c backend, so when people referred to "the C backend" they actually meant the yajl2_cffi backend. On top of that, people might still be under the impression that ijson defaults to the python backend.

So to answer your question: you are probably running ijson with the yajl2_c backend, which is why it runs faster than JsonSlicer.
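To check this on your own machine, a rough sketch like the following lists which backends can be imported in your installation. It assumes the backend modules live under ijson.backends with the names given in the ijson README; available_backends is a hypothetical helper, not part of ijson.

```python
import importlib

# Backend names as listed in the ijson README, fastest first.
BACKEND_NAMES = ("yajl2_c", "yajl2_cffi", "yajl2", "yajl", "python")


def available_backends(names=BACKEND_NAMES):
    """Return the names of the ijson backends that can be imported here."""
    found = []
    for name in names:
        try:
            importlib.import_module("ijson.backends." + name)
        except Exception:  # ijson missing, yajl library missing, cffi problems, ...
            continue
        found.append(name)
    return found
```

If yajl2_c appears in the result, that is most likely the backend a plain import ijson is using. To benchmark a specific backend, import it explicitly, for example import ijson.backends.python as ijson.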

Rodrigo Tobar