I am attempting to load a dataset from a file containing approximately 3 million JSON-serialized objects. Each object is a large nested dictionary containing a variety of types -- ints, floats, lists, and other dictionaries.
The size of the file on disk is approximately 60GB. I have over 128GB of RAM, so I should be able to fit the whole set in memory. However, when I load the data into a large dictionary using the following code, memory usage increases to at least 110GB (it might grow even larger, but I stopped the script before it did).
What could explain the memory overhead I am seeing when I try to load this data? Why would 60GB on disk translate to 110GB or more in memory? As far as I can tell, the only overhead here should come from creating the list containers for the objects and from assigning each list a key in the results dictionary. That can't possibly account for almost twice as much memory as the data itself -- can it?
import json
import os
from collections import defaultdict

def load_by_geohash(file, specificity=7):
    # Group updates by the first `specificity` characters of their geohash.
    results = defaultdict(list)
    filename = os.path.join(DATADIR, file)  # DATADIR is defined elsewhere in the module
    with open(filename, 'r') as f:
        # Parse one JSON object per line, lazily.
        updates = (json.loads(line) for line in f)
        for update in updates:
            geo_hash = update['geohash'][:specificity]
            results[geo_hash].append(update)
    return results
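
For reference, here is a rough sketch of how I could compare the size of a single line on disk with its size after json.loads, to see whether per-object overhead accounts for the growth. The deep_sizeof helper is just an approximation of my own: it recursively sums sys.getsizeof over containers and ignores anything not reachable through dicts, lists, tuples, or sets, and `filename` stands in for the actual data file path.

    import json
    import sys

    def deep_sizeof(obj, seen=None):
        # Roughly estimate the memory footprint of a nested object by
        # recursively summing sys.getsizeof, tracking ids so shared
        # references are not double-counted.
        if seen is None:
            seen = set()
        obj_id = id(obj)
        if obj_id in seen:
            return 0
        seen.add(obj_id)
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                        for k, v in obj.items())
        elif isinstance(obj, (list, tuple, set)):
            size += sum(deep_sizeof(item, seen) for item in obj)
        return size

    # Compare one representative line's size on disk with its parsed form.
    with open(filename, 'r') as f:
        line = f.readline()
    parsed = json.loads(line)
    print('bytes on disk:   ', len(line.encode('utf-8')))
    print('bytes in memory: ', deep_sizeof(parsed))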