
I have a JSON file that is 100 GB in size. Its schema looks like:

json_f = {"main_index":{"0":3,"1":7},"lemmas":{"0":["test0", "test0"],"1":["test1","test1"]}}

*"lemmas" elements contain large lists with words. Len of "lemmas" elements about 2kk.

As a result I need the whole thing in memory, either as:

  1. List of "lemmas"
[["test0", "test0"], ["test1","test1"]]
  2. Or a pd.DataFrame of json_f, which I'll process further into 1.

What I have tried:

  1. pd.read_json - gives a memory error (which does not seem to be about RAM, as far as I can see, since I have 256 GB);
  2. ijson, loading it iteratively. But something goes wrong with the real file (on example data it's OK) - the kernel is busy, but the accumulated list does not grow.
f = open("json_f.json", 'rb')

texts = []

for j in ijson.items(f, 'lemmas.0'):
    texts.append(j)

One of my thoughts is to split it into several smaller files, load those, and merge them afterwards. But I ran into the same problem loading the file in the first place. I would be very grateful for tips on how to deal with this.

SollPicher
  • Maybe load your JSON into a memory-backed store, like Redis, and use the Python API to query. Start with a subset, measure memory usage, and scale up if your system allows. – Alex Reynolds May 08 '23 at 21:24
    _gives a memory error (which is not about RAM, as I can see, cause I have 256 GB)_ That's no guarantee. First, other things running on your computer are taking up some of that memory; second, the memory usage of objects is generally much more than their textual size in a file; and third, it literally says "memory error"! What ELSE do you suppose the problem would be, if not RAM?! – John Gordon May 08 '23 at 22:15
  • @JohnGordon I use cloud computing for this and track memory usage - the memory error comes up well before the whole RAM is filled (~80 GB used, 150 GB+ still free). After some research I found that it may be a limit on the Python process size (but I am not sure). – SollPicher May 08 '23 at 23:17
  • @AlexReynolds I'll look into it, thanks. But I don't think I can take the file out of my environment. – SollPicher May 08 '23 at 23:21

1 Answer


Your usage of ijson doesn't populate the list because you are using an inappropriate function.

ijson.items yields multiple objects only if you give it a prefix that matches multiple objects. Such prefixes usually traverse a list of elements, so you'll see the word item somewhere in the prefix.
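For example, a minimal sketch (with a hypothetical file rows.json whose top-level "rows" key holds an array; neither is part of your data) would look like:

import ijson

# rows.json is assumed to look like {"rows": [[1, 2], [3, 4], ...]}
# the 'item' component of the prefix makes ijson.items yield each array element in turn
with open("rows.json", "rb") as f:
    for row in ijson.items(f, "rows.item"):
        print(row)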

OTOH, what you want to traverse iteratively is the lemmas object, which has many keys and values -- and you want to accumulate only the values. Using ijson.kvitems should do the trick:

texts = []
for key, lemmas in ijson.kvitems(f, 'lemmas'):
    # key is "0", "1", ...
    # lemmas is ["test0", "test0"], ["test1", "test1"], ...
    texts.append(lemmas)

This should allow you to traverse the entire file and do something sensible with it. Note that trying to load all these lemmas into memory might still not be possible if there are too many of them. In that case you could, as you were suggesting, use ijson to break the file down into smaller ones that can be processed separately (a rough sketch follows), which may or may not be possible depending on what you're trying to do.
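A minimal sketch of that splitting idea, assuming a chunk size of 100,000 entries and output files named lemmas_chunk_N.json (both arbitrary choices, not anything prescribed by ijson):

import json
import ijson

CHUNK_SIZE = 100_000  # arbitrary; pick whatever fits comfortably in memory

with open("json_f.json", "rb") as f:
    buffer, chunk_no = [], 0
    for _key, lemmas in ijson.kvitems(f, "lemmas"):
        buffer.append(lemmas)
        if len(buffer) >= CHUNK_SIZE:
            # flush the current chunk to its own, much smaller JSON file
            with open(f"lemmas_chunk_{chunk_no}.json", "w") as out:
                json.dump(buffer, out)
            buffer, chunk_no = [], chunk_no + 1
    if buffer:  # write the last, partially filled chunk
        with open(f"lemmas_chunk_{chunk_no}.json", "w") as out:
            json.dump(buffer, out)

Each chunk file can then be loaded on its own with json.load (or pd.read_json) and, if needed, concatenated into a single list or DataFrame.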

Rodrigo Tobar