I have a very large JSON file (~30 GB, 65e6 lines) that I would like to process using some dataframe structure. The dataset does not, of course, fit into memory, so I ultimately want to use an out-of-memory solution like dask or vaex. I am aware that to do this I would first have to convert it into a memory-mappable format such as HDF5 (if you have suggestions for the format, I'll happily take them; among other things, the dataset includes categorical features).
Two important facts about the dataset:
- The data is structured as a list, with each dict-style JSON object on its own line. This means I can easily convert it to line-delimited JSON by stripping the enclosing square brackets and the trailing commas, which is good.
- The JSON objects are deeply nested and not every key is present in every object. This means that if I use a line-delimited JSON reader that processes chunks sequentially (like pandas.read_json() with lines=True and a chunksize set), the resulting dataframes after flattening (pd.json_normalize) might not all have the same columns (see the sketch after this list), which is bad for streaming them into an HDF5 file.
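To make the second point concrete, this is roughly the kind of chunked reading I mean (the file name and chunk size are just placeholders):

```python
import pandas as pd

# Read the line-delimited JSON in chunks; each chunk arrives as a DataFrame.
reader = pd.read_json("data.ndjson", lines=True, chunksize=100_000)

for chunk in reader:
    # Flatten the nested objects. Because not every record has the same
    # keys, the flattened columns can differ from one chunk to the next.
    flat = pd.json_normalize(chunk.to_dict(orient="records"))
    print(flat.shape, sorted(flat.columns)[:5])
```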
Before I spend an awful lot of time writing a script that extracts all possible keys and streams each chunk to the HDF5 file column by column, inserting NaNs wherever needed: does anyone know a more elegant solution to this problem? Your help would be greatly appreciated.
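For reference, the brute-force fallback I have in mind would look roughly like this two-pass sketch (file names, the chunk size and the store key "data" are placeholders; string/categorical columns would probably still need explicit dtype handling, e.g. min_itemsize for the HDF5 table):

```python
import pandas as pd

SRC = "data.ndjson"   # placeholder paths
DST = "data.h5"
CHUNKSIZE = 100_000

def flatten(chunk):
    # json_normalize flattens nested dicts into dotted column names.
    return pd.json_normalize(chunk.to_dict(orient="records"))

# Pass 1: collect the union of all flattened column names.
all_columns = set()
for chunk in pd.read_json(SRC, lines=True, chunksize=CHUNKSIZE):
    all_columns.update(flatten(chunk).columns)
all_columns = sorted(all_columns)

# Pass 2: re-read, align every chunk to the full column set
# (missing keys become NaN) and append to a single HDF5 table.
with pd.HDFStore(DST, mode="w") as store:
    for chunk in pd.read_json(SRC, lines=True, chunksize=CHUNKSIZE):
        flat = flatten(chunk).reindex(columns=all_columns)
        store.append("data", flat, data_columns=True)
```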
P.S. Unfortunately I can't really share any data, but I hope the explanation above describes the structure well enough. If not, I will try to provide similar examples.