I have a JSON file that is 100 GB in size. Its schema looks like this:
json_f = {"main_index":{"0":3,"1":7},"lemmas":{"0":["test0", "test0"],"1":["test1","test1"]}}
*"lemmas" elements contain large lists with words. Len of "lemmas" elements about 2kk.
In the end I need the whole thing in memory, either as:
- a list of the "lemmas" values:
[["test0", "test0"], ["test1","test1"]]
- or a pd.DataFrame of json_f, which I will then process further into option 1 (roughly as in the sketch after this list).
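For reference, this is the kind of conversion I have in mind from option 2 to option 1 (a minimal sketch only, assuming the DataFrame ends up with a "lemmas" column holding one list of words per row):

import pandas as pd

# toy DataFrame with the same shape as json_f
df = pd.DataFrame({"main_index": [3, 7], "lemmas": [["test0", "test0"], ["test1", "test1"]]})
texts = df["lemmas"].tolist()  # -> [["test0", "test0"], ["test1", "test1"]]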
What I have tried:
- pd.read_json: gives a memory error (which does not seem to be about RAM, since I have 256 GB);
- ijson, loading it iteratively. But something goes wrong with the real file (on example data it works fine): the kernel stays busy, but the iterator never yields anything.
f = open("json_f.json", 'rb')
texts = []
for j in ijson.items(f, 'lemmas.0'):
texts.append(j)
One of my thoughts is to split the file into several smaller files, load them separately, and then merge the results. But to split it I have to load it first, so I run into the same problem. I would be very grateful for tips on how to deal with this.
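For context, this is roughly how I imagined the splitting step (a rough sketch only, assuming ijson.kvitems can stream the "lemmas" object one key/value pair at a time; CHUNK_SIZE and the output file names are made up):

import json
import ijson

CHUNK_SIZE = 1000  # made-up number of lemma lists per output file

with open("json_f.json", "rb") as f:
    chunk, part = [], 0
    # kvitems should yield ("0", [...]), ("1", [...]), ... one pair at a time
    for key, lemma_list in ijson.kvitems(f, "lemmas"):
        chunk.append(lemma_list)
        if len(chunk) >= CHUNK_SIZE:
            with open(f"lemmas_part_{part}.json", "w") as out:
                json.dump(chunk, out)
            chunk, part = [], part + 1
    if chunk:  # write whatever is left over
        with open(f"lemmas_part_{part}.json", "w") as out:
            json.dump(chunk, out)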