I find that the sparse version of a DataFrame is actually much larger when saved to disk than the dense version. What am I doing wrong?
import numpy as np
import pandas as pd

# build a 4 x 4000 frame that is all NaN except a single cell
test = pd.DataFrame(np.ones((4, 4000)))
test.iloc[:, :] = np.nan
test.iloc[0, 0] = 47

test.to_hdf('test3', 'df')
test.to_sparse(fill_value=np.nan).to_hdf('test4', 'df')
test.to_pickle('test5')
test.to_sparse(fill_value=np.nan).to_pickle('test6')
ls -sh test*
200K test3    16M test4   164K test5   516K test6
Using pandas version 0.12.0.
I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into Pandas dataframes and play with them.
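One approach I've considered (not sure if it's sensible) is to skip the pandas sparse format for storage entirely and just keep the nonzero coordinates on disk, rebuilding the frame on load. A rough sketch, assuming the fill value is 0 (like the 0/1 activity data below) and with made-up file names:

import numpy as np
import pandas as pd
from scipy import sparse

def save_coords(df, path):
    # keep only the nonzero entries as (row, col, value) triplets
    coo = sparse.coo_matrix(df.values)
    np.savez(path, row=coo.row, col=coo.col, data=coo.data, shape=coo.shape)

def load_coords(path):
    f = np.load(path)
    coo = sparse.coo_matrix((f['data'], (f['row'], f['col'])),
                            shape=tuple(f['shape']))
    return pd.DataFrame(coo.toarray())

np.savez_compressed would presumably shrink it further; whether something like this beats the built-in sparse pickling is part of what I'm asking.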
Edit: Thanks to Jeff for answering the original question. Follow-up question: the sparse conversion only seems to give savings when pickling, not with other formats like HDF5. Is pickling my best route? (I've also put a sketch of the compressed-HDF5 alternative I'm wondering about after the listing below.)
print(array_activity.shape)   # this is just 0s and 1s
(1020000, 60)

test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print(test_sparse.density)
0.0832333496732

test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')

!ls -sh 1 2 3 4
477M 1   544M 2   477M 3    83M 4
This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/PyTables format so that I could grab just specific indices (other files are much larger and take much longer to load into memory) and then readily do Pandasy things to them. Perhaps I am not going about this the right way?
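For concreteness, this is the access pattern I had in mind with a queryable (table-format) store; the file name and row range are made up, and I'm not certain the where-string below works as-is on 0.12 (it may need Term objects there):

store = pd.HDFStore('activity.h5', complevel=9, complib='blosc')
store.append('df', test)          # append() writes a queryable table
subset = store.select('df', where='index >= 1000 & index < 2000')
store.close()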