
I'm doing a procedure where some values are created on each iteration (not many values at all: only ~50 values per iteration, of which some are short 4-5 character strings, but most are 2-3 digit integers). There are roughly 3 thousand iterations.

Right now, I use a pandas dataframe to store those ~50 values for a given iteration, then append the df to a list of dataframes (dflist), and once all 3K iterations are done, I concatenate the 3K dataframes (since they all have the same column names) using something like:

df_final = pd.concat(dflist, axis=0)

Is there a better way to do this, e.g. just use a NumPy array, append the values along axis 0, and at the end turn the full array into a pandas DataFrame with the given set of column names?
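To illustrate what I mean (this is not code from my actual script; the column names just mirror the example df further down, and the row of values is a stand-in for the ~50 values computed each iteration), the alternative I'm imagining is roughly:

import numpy as np
import pandas as pd

cols = ['CHFAC', 'Bygoper', 'Change', 'MinB', 'NumB', 'NumCombos', 'Total']
vals = np.empty((0, len(cols)), dtype=object)   # start with zero rows
for i in range(3000):
    # stand-in for the values computed on each iteration
    new_rows = np.array([['abc3', 574936022, '+', 1, 1, 1, 11]], dtype=object)
    vals = np.vstack((vals, new_rows))          # append along axis 0
df_final = pd.DataFrame(vals, columns=cols)     # convert once at the end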

I ask because after a couple hundred iterations (~200 out of the 3,000), the code slows down substantially and system memory usage slowly creeps up. As far as I can tell, all of my values are overwritten on each iteration except for this list of pandas DataFrames, which is the only thing that grows after each iteration. I'm using Python 2.7, and the behavior is the same whether I run the script in the Spyder GUI or from the command line.

One other thing: even though the values I actually save out are relatively small (the ~50 values per iteration), the data I go through to extract those summary values is very large. The original CSV is ~10 GB with ~200 million rows, and I read it with pd.read_csv using a chunksize of roughly 50K lines; from each 50K-line chunk I get about 50 values, roughly like the sketch below. But I would have thought that each chunk is independent, and since the values are overwritten, memory usage shouldn't grow the way it does.
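For context, the loop is structured roughly like this (placeholder names, not my actual code: the file name is made up and summarize_chunk stands in for the code that reduces 50K rows to ~50 values):

import pandas as pd

def summarize_chunk(chunk):
    # stub standing in for the real per-chunk computation (~50 summary values)
    return {'CHFAC': ['abc3'], 'Total': [11]}

dflist = []
# the real file is ~10 GB / ~200 million rows, read 50K lines at a time
for chunk in pd.read_csv('big_file.csv', chunksize=50000):
    dflist.append(pd.DataFrame(summarize_chunk(chunk)))
    # chunk and the per-iteration values are rebound on each pass;
    # dflist is the only object I expect to grow

df_final = pd.concat(dflist, axis=0)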

Example df:

  CHFAC    Bygoper Change  MinB  NumB  NumCombos  Total
0  abc3  574936022      +     1     1          1     11
1  abc3  574936022      -     1     0          0      0
2  abc3  574936022      +     2     1          1     11
3  abc3  574936022      -     2     0          0      0
4  abc3  574936022      +     5     1          1     11
5  abc3  574936022      -     5     0          0      0
6  abc3  574936022      +    10     1          1     11
7  abc3  574936022      -    10     0          0      0
sambajetson
  • Show us one or two sample dfs you are concatenating; it's hard to tell how many columns you have, the index, and such based on your post. – Steven G Oct 21 '16 at 17:06

1 Answer


You can be creative and use plain lists to store your data, then create the final DataFrame once at the end of the loop. It's hard to work from your example since we don't know your creation process, so here is a generic answer showing the creation of a 2-column DataFrame from a loop of 10 iterations, where each iteration produces output of a different length.

import pandas as pd
from random import randint

col1_val, col2_val = [], []
for i in range(10):
    random_len = range(randint(0, 9))   # this iteration's output, of random length
    col1 = random_len
    col2 = random_len
    col1_val.extend(col1)               # accumulate into plain Python lists
    col2_val.extend(col2)
pd.DataFrame({'col1': col1_val, 'col2': col2_val})   # build the DataFrame once at the end

Out[110]:

   col1  col2
0     0     0
1     1     1
2     2     2
3     3     3
4     4     4
5     0     0
6     1     1
7     2     2
8     3     3

Now let's look at speed. Using the list method:

import time
st = time.time()
col1_val, col2_val = [], []
for i in range(10000):
    random_len = range(randint(0, 9))
    col1 = random_len
    col2 = random_len
    col1_val.extend(col1)
    col2_val.extend(col2)
pd.DataFrame({'col1':col1_val, 'col2':col2_val})
print time.time()-st
0.0499999523163

Using your method:

st = time.time()
dflist = []
for i in range(10000):
    random_len = range(randint(0, 9))
    col1 = random_len
    col2 = random_len
    dflist.append(pd.DataFrame({'col1':col1, 'col2':col2}))
pd.concat(dflist)
print time.time()-st
7.21199989319

So for 10,000 iterations, the list method comes out roughly 145x faster (7.21 s vs 0.05 s).
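Applied to your case, a rough sketch would be to keep one plain list per column in a dict and build the frame once after the loop (the column names here are taken from your example df; the summary values are just dummies standing in for your chunk loop):

import pandas as pd

cols = ['CHFAC', 'Bygoper', 'Change', 'MinB', 'NumB', 'NumCombos', 'Total']
data = {c: [] for c in cols}                      # one plain Python list per column

# dummy stand-in for your chunk loop: two "iterations" of summary values
dummy_summaries = [
    {'CHFAC': ['abc3'], 'Bygoper': [574936022], 'Change': ['+'],
     'MinB': [1], 'NumB': [1], 'NumCombos': [1], 'Total': [11]},
    {'CHFAC': ['abc3'], 'Bygoper': [574936022], 'Change': ['-'],
     'MinB': [1], 'NumB': [0], 'NumCombos': [0], 'Total': [0]},
]
for summary in dummy_summaries:
    for c in cols:
        data[c].extend(summary[c])                # extend lists, no DataFrame per iteration

df_final = pd.DataFrame(data, columns=cols)       # build the frame once at the end

The idea is the same as above: no per-iteration DataFrames, just list extends and a single DataFrame construction at the end.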

Steven G
  • Ok so that should be an improvement over the current method of appending to a list of dataframes. So what are the performance diffs of your proposed method of extending a column: col1_val.extend(col1) vs. appending all those values at once to a numpy array? E.g. vals = np.vstack((vals,[1,2])) for the 2 column example? – sambajetson Oct 21 '16 at 18:46
  • @sambajetson Extending or appending a numpy array is not very efficient memory-wise and I would not recommend it. You can [read this answer](http://stackoverflow.com/questions/13215525/how-to-extend-an-array-in-place-in-numpy) as an example of why you should not approach it that way. – Steven G Oct 21 '16 at 18:54
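To illustrate the point in that last comment (a minimal sketch with made-up sizes, not a benchmark of the real data): each np.vstack call allocates a new array and copies every existing row, so the total copying grows quadratically with the number of appends, whereas appending to a Python list and converting once at the end avoids that:

import time
import numpy as np

# every vstack allocates a new array and copies all existing rows
st = time.time()
arr = np.empty((0, 2))
for i in range(10000):
    arr = np.vstack((arr, [i, i]))
print time.time() - st        # total copying grows quadratically with the appends

# list appends are amortized O(1); convert to an array once at the end
st = time.time()
rows = []
for i in range(10000):
    rows.append([i, i])
arr = np.array(rows)
print time.time() - st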