I have a procedure that creates a handful of values on each iteration (only ~50 values per iteration: a few are short 4-5 character strings, but most are 2-3 digit integers). There are roughly 3,000 iterations.
Right now, I store those ~50 values for a given iteration in a pandas DataFrame, append that DataFrame to a list (dflist), and once all 3K iterations are done, concatenate the 3K DataFrames (they all have the same column names) using something like:
df_final = pd.concat(dflist, axis=0)
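In other words, the loop looks roughly like this (a simplified sketch, with dummy values standing in for the ones I actually compute):

import pandas as pd

columns = ['CHFAC', 'Bygoper', 'Change', 'MinB', 'NumB', 'NumCombos', 'Total']

dflist = []
for i in range(3000):
    # the ~50 values produced by this iteration, stored as one small DataFrame
    df_iter = pd.DataFrame([('abc3', 574936022, '+', 1, 1, 1, 11),
                            ('abc3', 574936022, '-', 1, 0, 0, 0)],
                           columns=columns)
    dflist.append(df_iter)

df_final = pd.concat(dflist, axis=0)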
Is there a better way to do this, e.g. append the values to a NumPy array along axis 0 and, once everything is done, turn the full array into a pandas DataFrame with the given set of column names?
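For example, something along these lines (just a sketch of what I have in mind, again with dummy values):

import numpy as np
import pandas as pd

columns = ['CHFAC', 'Bygoper', 'Change', 'MinB', 'NumB', 'NumCombos', 'Total']

blocks = []
for i in range(3000):
    # the ~50 values for this iteration as one small 2-D array;
    # mixed strings and ints force dtype=object
    block = np.array([['abc3', 574936022, '+', 1, 1, 1, 11],
                      ['abc3', 574936022, '-', 1, 0, 0, 0]], dtype=object)
    blocks.append(block)

# stack along axis 0 once at the very end, then convert to a DataFrame
full = np.concatenate(blocks, axis=0)
df_final = pd.DataFrame(full, columns=columns)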
I ask because after about 200 of the 3,000 iterations, the code slows down substantially and system memory usage slowly creeps up. As far as I can tell, all of my values are overwritten on each iteration; the list of pandas DataFrames is the only thing that grows from one iteration to the next. I'm using Python 2.7, and the behavior is the same whether I run the script from the Spyder GUI or from the command line.
One other thing: even though the values I actually save out are small (the ~50 values per iteration), the data I go through to extract those summary values is very large. The original CSV is ~10 GB with ~200 million rows, and I read it with pd.read_csv using a chunksize of roughly 50K lines; from each 50K-line chunk I extract about 50 values. I would have thought each chunk was independent, and since the working values are overwritten, memory usage shouldn't grow the way it does.
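The reading loop is roughly this (the per-chunk summary below is just a placeholder for my real extraction logic, and the file name is made up):

import pandas as pd

dflist = []
# the ~10 GB source file, read in ~50K-row chunks
for chunk in pd.read_csv('original.csv', chunksize=50000):
    # stand-in for the real summarization: each 50K-row chunk is reduced
    # to a small DataFrame of summary values before being stored
    dflist.append(pd.DataFrame({'NumRows': [len(chunk)]}))
    # nothing else from the chunk is kept, so I'd expect it to be freed here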
Example df:
  CHFAC   Bygoper Change MinB NumB NumCombos Total
0  abc3 574936022      +    1    1         1    11
1  abc3 574936022      -    1    0         0     0
2  abc3 574936022      +    2    1         1    11
3  abc3 574936022      -    2    0         0     0
4  abc3 574936022      +    5    1         1    11
5  abc3 574936022      -    5    0         0     0
6  abc3 574936022      +   10    1         1    11
7  abc3 574936022      -   10    0         0     0