I am attempting to load a dataset from a file containing approximately 3 million JSON-serialized objects. Each object is a large nested dictionary containing a variety of types: ints, floats, lists, and other dictionaries.

The file is approximately 60GB on disk. I have over 128GB of RAM, so I should be able to fit the whole set in memory. However, when I load the data into a large dictionary using the code below, memory usage grows to at least 110GB (it might climb even higher, but I stopped the script before it grew any further).

What could explain the memory overhead I am seeing when I try to load this data? Why would 60GB on disk translate to 110GB or more in memory? As far as I can tell, the only overhead here should come from creating the list containers for the objects and from assigning each list a key in the results dictionary. Surely that can't account for almost twice as much memory as the data itself -- can it?

import json
import os
from collections import defaultdict

def load_by_geohash(file, specificity=7):
    results = defaultdict(list)
    filename = os.path.join(DATADIR, file)  # DATADIR is a module-level constant defined elsewhere

    with open(filename, 'r') as f:
        updates = (json.loads(line) for line in f)
        for update in updates:
            geo_hash = update['geohash'][:specificity]
            results[geo_hash].append(update)

    return results
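
For reference, here is a rough way to compare the in-memory footprint of a single parsed line against its size on disk (the deep_getsizeof helper and the sample line are just illustrative, not my actual data):

import json
import sys

def deep_getsizeof(obj, seen=None):
    """Very rough recursive getsizeof: sums the sizes of an object and everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # avoid double-counting shared objects
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

# Hypothetical single line from the file, just to show the comparison
line = '{"geohash": "9q8yyk8yt", "speed": 12.5, "tags": ["a", "b"], "meta": {"id": 1}}'
update = json.loads(line)
print(len(line.encode()), 'bytes serialized vs', deep_getsizeof(update), 'bytes in memory')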
  • What sort of things are in `update['geohash'][:specificity]`? `int` objects? `float` objects? `str` objects? – juanpa.arrivillaga Feb 06 '18 at 21:41
  • 1
    If your data is a tiny object, then the size of the container matters. For example, I am pretty sure that the data in a Python `int` is significantly larger than the size of the data, especially for small numbers. – Mad Physicist Feb 06 '18 at 21:41
  • Can you load subsets of your data with increasing sizes and make a graph of memory consumption? Each single new object has a certain minimum memory footprint. – Jongware Feb 06 '18 at 21:42
  • The `geo_hash` keys are strings. Each update object is a large nested dictionary. – vaer-k Feb 06 '18 at 21:42
  • Without seeing your input, this is impossible to reproduce or answer definitively. – Mad Physicist Feb 06 '18 at 21:43

1 Answer

Yes, it easily could. Consider a simple case of a list of strings:

>>> import json
>>> from sys import getsizeof
>>> x = ['a string', 'another string', 'yet another']
>>> sum(map(getsizeof, x)) + getsizeof(x)
268
>>> len(json.dumps(x).encode())
45

In Python, everything is an object, so every individual object (well, nearly every one) carries at least sys.getsizeof(object()) bytes of overhead. For example, an empty string on my system:

>>> getsizeof('')
49
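
For comparison, even a bare object and a small int come with a fixed cost (these particular figures are from a 64-bit CPython 3 build; they vary slightly across versions):

>>> getsizeof(object()), getsizeof(1)
(16, 28)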

Note, this discrepancy is even greater for dict objects. Consider:

>>> d = {'a': 'a string', 'b': 'another string', 'c': 'yet another'}
>>> sum(map(getsizeof, d)) + sum(map(getsizeof, d.values())) + getsizeof(d)
570
>>> len(json.dumps(d).encode())
60

It is downright enormous for the case of an empty dict:

>>> getsizeof({}), len(json.dumps({}).encode())
(240, 2)

Now, there are various options for storing your data more compactly, but which one is right depends on your use case.

Here is a related question about the memory usage of many dictionaries, which also includes an example of using numpy arrays and namedtuple objects to store data more compactly. Using namedtuple objects might be exactly what you need: the memory savings are potentially huge, because you no longer store the actual key strings in every instance. If your sub-dictionary structure is regular, I suggest replacing those nested update dicts with nested namedtuple objects, along the lines of the sketch below.
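
A rough illustration of the difference (the Update fields here are invented; substitute the keys your real updates actually have):

>>> from collections import namedtuple
>>> Update = namedtuple('Update', ['geohash', 'speed', 'heading'])
>>> u_dict = {'geohash': '9q8yyk8yt', 'speed': 12.5, 'heading': 90}
>>> u_nt = Update(geohash='9q8yyk8yt', speed=12.5, heading=90)
>>> getsizeof(u_nt) < getsizeof(u_dict)  # the namedtuple instance is just a plain tuple under the hood
True

Each dict carries its own hash table, while a namedtuple instance stores only the values and keeps the field names once on the class, so the savings compound across millions of updates.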
