
I have a simulation that generates data at every step. I store the data in memory and, after a certain number of steps, create a pandas DataFrame from the stored data and append it to a file with df.to_csv. I can create the DataFrame either from a list of dictionaries or from a dictionary of lists (where each value is a list of column values). Which of these would give me better performance and memory management?

Option A:

import pandas as pd

data = []
d = {'a': 1, 'b': 2}  # Data from one step
data.append(d)
d = {'a': 2, 'b': 3}  # Data from another step
data.append(d)
# data = [{'a': 1, 'b': 2}, {'a': 2, 'b': 3}]
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell() == 0)
    # Header added only on the first write

OR Option B:

data = {'a': [], 'b': []}
data['a'].append(1)  # Data from one step
data['b'].append(2)
data['a'].append(2)  # Data from another step
data['b'].append(3)
# data = {'a': [1, 2], 'b': [2, 3]}
df = pd.DataFrame(data)
with open(output_file, 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell() == 0)
    # Header added only on the first write

I have 10^5 to 10^8 steps of data. I want to break down the writing into parts. That is,

  1. Store data for n steps in memory
  2. Create dataframe and write data from n steps
  3. Clear past n steps data from memory and repeat 1 and 2 for next n steps and so on...

So you can essentially imagine the above snippets as being inside loops. I want the memory to be freed after every n steps of data. I am assuming that re-binding the data variable to a fresh empty container will achieve this in both cases. If not, please advise how I can do this so that memory usage does not grow cumulatively.
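
Concretely, something like the sketch below is what I have in mind for Option A; simulate_step, n, total_steps and output_file are just placeholders for my actual simulation:

import pandas as pd

# Sketch of the chunked-write loop (list-of-dicts variant); values are illustrative.
def simulate_step(step):
    return {'a': step, 'b': step + 1}  # placeholder for one step's real output

n = 10_000               # steps buffered in memory per chunk
total_steps = 1_000_000
output_file = 'out.csv'

data = []
for step in range(total_steps):
    data.append(simulate_step(step))
    if len(data) == n:
        df = pd.DataFrame(data)
        with open(output_file, 'a') as f:
            df.to_csv(f, sep=",", index=False, header=f.tell() == 0)
        data = []  # re-bind to a fresh list; the old one is garbage-collected
                   # once nothing else references it

if data:  # flush any steps left over after the last full chunk
    df = pd.DataFrame(data)
    with open(output_file, 'a') as f:
        df.to_csv(f, sep=",", index=False, header=f.tell() == 0)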

Rithwik
  • Are you doing computation with the dataframe before writing it to csv? – Henry Ecker Jun 18 '21 at 16:29
  • @HenryEcker No, I simply create the dataframe and write to file. After the whole simulation finishes, then I do read from the file and process the data using dataframes. That is why I started using dataframes for writing as well. – Rithwik Jun 18 '21 at 16:31
  • 1
    If you have so many data, You can considere using 2D numpy array, and pass the column names separately while creating the dataframe, instead of using dictionary/list. That way, performing other operations might also come handy. – ThePyGuy Jun 18 '21 at 16:34
  • @Don'tAccept True, that is also a possibility. I am unsure which is most efficient currently. – Rithwik Jun 18 '21 at 16:39
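
A rough sketch of what ThePyGuy's comment suggests, assuming every column is numeric and the chunk size is known up front (n, the column names and the values below are illustrative):

import numpy as np
import pandas as pd

n = 10_000
columns = ['a', 'b']
buf = np.empty((n, len(columns)))  # preallocate one chunk of steps

for i in range(n):
    buf[i, :] = (i, i + 1)  # one step's values, in column order

df = pd.DataFrame(buf, columns=columns)  # column names passed separately
with open('out.csv', 'a') as f:
    df.to_csv(f, sep=",", index=False, header=f.tell() == 0)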

1 Answer


This seems to come down to row-oriented vs. column-oriented data layouts. Here is a really good answer covering the differences between the two: List with many dictionaries VS dictionary with few lists?

In summary, the conversion is more expensive for the row-oriented structure (the list of dictionaries), since every individual dictionary has to be read. However, any improvement from switching would be marginal at best.
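
If you want to see the difference on your own chunk sizes, a quick (illustrative) timing comparison of the two construction paths could look like this; the exact numbers will depend on your data:

import timeit
import pandas as pd

n = 100_000  # illustrative chunk size; use your real one
rows = [{'a': i, 'b': i + 1} for i in range(n)]               # row oriented
cols = {'a': list(range(n)), 'b': [i + 1 for i in range(n)]}  # column oriented

t_rows = timeit.timeit(lambda: pd.DataFrame(rows), number=10)
t_cols = timeit.timeit(lambda: pd.DataFrame(cols), number=10)
print(f"list of dicts: {t_rows:.3f}s, dict of lists: {t_cols:.3f}s")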

fthomson
  • That is useful information. Thanks for sharing. It does seem to suggest row operations are slightly more expensive. But it doesn't really solve the question of which is better for my use case, where the main concern is the repeated pattern of holding a huge amount of data in memory and writing it to file. – Rithwik Jun 19 '21 at 07:47