I am experimenting with pandas to load a CSV file of around 30GB, with 40+ million rows and 150+ columns, into HDFStore. The majority of the columns are strings, followed by numerical columns and dates.
I have never really used numpy, pandas, or PyTables before, but I have played around with data frames in R.
For now I am storing a sample file of around 20,000 rows in HDFStore. When I read the table back from HDFStore, it is loaded into memory and memory usage goes up by ~100MB:
from pandas import HDFStore
f = HDFStore('myfile.h5')
g = f['df']
Then I delete the variable containing the DataFrame:
del g
At that point the memory usage decreases by only about 5MB.
If I load the data into g again using g = f['df'], memory usage shoots up by another ~100MB.
The memory is only released when I actually close the window.
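For reference, this is a minimal sketch of the load/delete/reload cycle I am testing. The psutil calls are just one assumed way of checking the process memory (the measurement method itself is not important), and the key 'df' matches my test store:

import gc
import psutil
from pandas import HDFStore

def rss_mb():
    # Resident memory of the current Python process, in MB
    return psutil.Process().memory_info().rss / (1024.0 ** 2)

store = HDFStore('myfile.h5')

print('before load:', rss_mb())
g = store['df']              # memory jumps by ~100MB
print('after load:', rss_mb())

del g
gc.collect()                 # explicit collection, in case it matters
print('after del:', rss_mb())    # only ~5MB is released

g = store['df']              # reload: another ~100MB on top
print('after reload:', rss_mb())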
Given the way the data is organized, I am probably going to divide it into individual tables, each with a maximum size of around 1GB so that it can fit into memory, and then use them one at a time, roughly as sketched below. However, this approach will not work if I am not able to clear memory between tables.
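Roughly what I have in mind for the partitioning (a sketch only; the file names, the chunk size, and 'group_col' are placeholders I made up, not real names from my data):

import pandas as pd

store = pd.HDFStore('myfile.h5')

# Read the big CSV in chunks and append rows to per-group tables,
# keeping each table small enough (~1GB) to fit in memory on its own.
# 'group_col' stands in for whatever column I end up partitioning on.
for chunk in pd.read_csv('mydata.csv', chunksize=500000):
    for key, grp in chunk.groupby('group_col'):
        store.append('df_{}'.format(key), grp, data_columns=True)

store.close()

# Later: open the store and work with one table at a time
store = pd.HDFStore('myfile.h5')
part = store['df_someGroup']   # load a single ~1GB table
# ... process 'part' ...
del part                       # only helps if the memory actually gets released
store.close()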
Any ideas on how I can achieve this, i.e. actually free the memory after deleting the DataFrame?