
While turning a DEAP Logbook (essentially, a list of dictionaries) with around 10 million entries into a DataFrame for further processing, I got a message about RAM overflow in Google Colab.

I'm using the DEAP package for some experiments. Since my machine is slow and old, I've been helping myself with Google's Colab service. The result of a simulation is a DEAP Logbook, which is a list of dictionaries; each dictionary is a summary of the important values at one snapshot of the simulation. I've been turning this list of dictionaries into DataFrames for analysis, but for the biggest simulations the process crashes because it exceeds the allotted RAM.
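
For context, the logbook is filled with one record per generation, roughly along these lines (a simplified sketch; the statistic names mirror the record shown below, and the remaining run parameters are merged into the same record call):

import numpy as np
from deap import tools

# Register the per-generation fitness statistics that end up in each record.
stats = tools.Statistics(key=lambda ind: ind.fitness.values[0])
stats.register('avg', np.mean)
stats.register('std', np.std)
stats.register('med', np.median)
stats.register('best', np.min)
stats.register('worst', np.max)

logbook = tools.Logbook()
# Inside the evolutionary loop, once per generation:
# logbook.record(births=births, seed=seed, **stats.compile(population))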

The dictionaries store values like these:

logbook[-1]
{'avg': 16.72180244532359,
 'b_ratio': 5,
 'best': 0.006420736818512296,
 'births': 80160,
 'cx_pb': 0.9,
 'exp': 128,
 'k_par': 6,
 'k_sur': 6,
 'med': 2.6377157552245727,
 'mut_pb': 0.9,
 'mut_sig': 7.5,
 'pop': 160,
 'rep': 40,
 'seed': 112,
 'std': 20.059567935625164,
 'worst': 55.23488779660829}

The logbooks that I'm interested in storing as pandas DataFrames have between 10 and 12 million entries. Later, I'll reduce that count to around a fifth.

After pickling and unpickling the logbook I see that I'm using around 7.7 GB of the allotted 12.7 GB.
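
The pickling and unpickling itself is nothing special; it is along these lines (the file path is a placeholder):

import pickle

# Dump the logbook to disk and load it back with the standard pickle module.
with open('logbook.pkl', 'wb') as f:
    pickle.dump(logbook, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('logbook.pkl', 'rb') as f:
    logbook = pickle.load(f)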

I've tried:

import pandas as pd
from itertools import chain
fitness_res = pd.DataFrame(list(chain.from_iterable(logbook)))

and

pop_records = [record for record in logbook]
fitness_res = pd.DataFrame(pop_records)

without success.

The error I got is:

Your session crashed after using all available RAM. View runtime logs

I expect to end up with a DataFrame containing all of the data in the DEAP Logbook.

1 Answer


A pandas DataFrame loads all of the data into memory. The approaches you tried use additional memory to materialize the data before passing it to pandas to store in the DataFrame; e.g.

from itertools import chain
fitness_res = pd.DataFrame(list(chain.from_iterable(logbook)))

means that before you pass your data into pd.DataFrame you are building an intermediate list of all of the read values in memory.

Whereas with the second approach:

pop_records = [record for record in logbook]
fitness_res = pd.DataFrame(pop_records)

You are creating a list with a list comprehension that, again, loads all of the data into memory before passing it to pandas.

My suggestion is to use pandas' data-loading functionality directly on the pickled file, via pandas.read_pickle:

fitness_res = pd.read_pickle(pickle_file_path)
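
If the unpickled object does come back as a DataFrame, a quick follow-up like this (a sketch; pickle_file_path is the same placeholder as above) lets you check its footprint and roughly halve the numeric memory by downcasting the 64-bit float columns:

import pandas as pd

fitness_res = pd.read_pickle(pickle_file_path)  # as above

# Report the in-memory size of the loaded frame.
print(fitness_res.memory_usage(deep=True).sum() / 1e9, 'GB')

# Downcast 64-bit floats to 32-bit to roughly halve the numeric footprint.
float_cols = fitness_res.select_dtypes(include='float64').columns
fitness_res[float_cols] = fitness_res[float_cols].astype('float32')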
  • I'm testing it right away. Thanks for pointing out my main mistake! – EloyRD Jul 02 '19 at 12:04
  • I'm assuming that if that doesn't work I need to rerun my simulation with a sparser data collector. – EloyRD Jul 02 '19 at 12:04
  • Yes, or use something other than pickle storage and process the data in chunks to limit the memory consumption at any time (see the sketch after these comments). – sophros Jul 02 '19 at 12:05
  • Hi. It worked! I didn't know that one can pass a pickle of a list of dictionaries directly to pandas. Thanks! – EloyRD Jul 03 '19 at 07:07
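
For completeness, a rough sketch of the chunked approach mentioned in the comments: build a small DataFrame per slice of the logbook and append it to a CSV on disk, so only one chunk is materialized at a time (the chunk size and file name below are placeholders):

import pandas as pd

CHUNK = 1_000_000              # placeholder chunk size
out_path = 'fitness_res.csv'   # placeholder output file

# Convert and write the logbook one slice at a time so only a single
# chunk of records is held as a DataFrame at any moment.
for start in range(0, len(logbook), CHUNK):
    pd.DataFrame(logbook[start:start + CHUNK]).to_csv(
        out_path, mode='a', header=(start == 0), index=False)

# The file can later be re-read (optionally with chunksize=...) for analysis.
fitness_res = pd.read_csv(out_path)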