I have a very large JSON file (~30 GB, 65e6 lines) that I would like to process using some dataframe structure. The dataset does not, of course, fit into memory, so I ultimately want to use an out-of-memory solution like dask or vaex. I am aware that to do this I would first have to convert it into a memory-mappable format such as HDF5 (if you have suggestions for the format, I'll happily take them; among other things, the dataset includes categorical features).
Two important facts about the dataset:
- The data is structured as a list, with each dict-style JSON object on its own line. This means I can easily convert it to line-delimited JSON by stripping the enclosing square brackets and the trailing commas, which is good.
- The JSON objects are deeply nested and not every key is present in every object. This means that if I use a line-delimited JSON reader that processes chunks sequentially (like pandas.read_json() with lines=True and a chunksize set), the resulting dataframes after flattening (pd.json_normalize) might not all have the same columns (see the sketch after this list), which is bad for streaming them into an HDF5 file.
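To make the second point concrete, this is roughly the kind of chunked reading I mean (the file name and chunk size are just placeholders):

```python
import pandas as pd

# Read the line-delimited JSON in chunks; each chunk arrives as a DataFrame.
reader = pd.read_json("data.ndjson", lines=True, chunksize=100_000)

for chunk in reader:
    # Flatten the nested objects. Because not every record has the same
    # keys, the flattened columns can differ from one chunk to the next.
    flat = pd.json_normalize(chunk.to_dict(orient="records"))
    print(flat.shape, sorted(flat.columns)[:5])
```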
Before I spend an awful lot of time writing a script that extracts all possible keys and streams each chunk to the HDF5 file column by column, inserting NaNs wherever needed: does anyone know a more elegant solution to this problem? Your help would be greatly appreciated.
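For reference, the brute-force fallback I have in mind would look roughly like this two-pass sketch (file names, the chunk size and the store key "data" are placeholders; string/categorical columns would probably still need explicit dtype handling, e.g. min_itemsize for the HDF5 table):

```python
import pandas as pd

SRC = "data.ndjson"   # placeholder paths
DST = "data.h5"
CHUNKSIZE = 100_000

def flatten(chunk):
    # json_normalize flattens nested dicts into dotted column names.
    return pd.json_normalize(chunk.to_dict(orient="records"))

# Pass 1: collect the union of all flattened column names.
all_columns = set()
for chunk in pd.read_json(SRC, lines=True, chunksize=CHUNKSIZE):
    all_columns.update(flatten(chunk).columns)
all_columns = sorted(all_columns)

# Pass 2: re-read, align every chunk to the full column set
# (missing keys become NaN) and append to a single HDF5 table.
with pd.HDFStore(DST, mode="w") as store:
    for chunk in pd.read_json(SRC, lines=True, chunksize=CHUNKSIZE):
        flat = flatten(chunk).reindex(columns=all_columns)
        store.append("data", flat, data_columns=True)
```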
P.S. Unfortunately I can't really share any data, but I hope the explanation above describes the structure well enough. If not, I will try to provide similar examples.