Pandas sparse dataframe larger on disk than dense version

Question

I find that the sparse versions of a dataframe are actually much larger when saved to disk than dense versions. What am I doing wrong?

test = pd.DataFrame(ones((4,4000)))
test.ix[:,:] = nan
test.ix[0,0] = 47

test.to_hdf('test3', 'df')
test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

test.to_pickle('test5')
test.to_sparse(fill_value=nan).to_pickle('test6')

....
ls -sh test*
200K test3   16M test4  164K test5  516K test6

Using version 0.12.0

I would ultimately like to efficiently store 10^7 by 60 arrays, with about 10% density, then pull them into Pandas dataframes and play with them.

Edit: Thanks to Jeff for answering the original question. Follow-up question: This appears to only give savings for pickling, and not when using other formats like HDF5. Is pickling my best route?

print shape(array_activity) #This is just 0s and 1s
(1020000, 60)

test = pd.DataFrame(array_activity)
test_sparse = test.to_sparse()
print test_sparse.density
0.0832333496732

test.to_hdf('1', 'df')
test_sparse.to_hdf('2', 'df')
test.to_pickle('3')
test_sparse.to_pickle('4')
!ls -sh 1 2 3 4
477M 1  544M 2  477M 3   83M 4

This is data that, as a list of indices in a Matlab .mat file, is less than 12M. I was eager to get it into an HDF5/Pytables format so that I could grab just specific indices (other files are much larger, and take much longer to load into memory), and then readily do Pandasy things to them. Perhaps I am not going about this the right way?

add a compression filter, see here: http://pandas.pydata.org/pandas-docs/dev/io.html#compression — Jeff, Feb 06 '14 at 21:02
With a dense dataframe and complevel=9 and complib='blosc', that drops us from 544M to 26M. Far better, but still not keeping up with 12M. Trying compression with the sparse dataframe throws a TypeError: `TypeError: cannot properly create the storer for: [_TABLE_MAP] [group->/test_sparse (Group) '',value->,table->True,append->True,kwargs->{'encoding': None}]` — jeffalstott, Feb 06 '14 at 21:15
hmm....that's not the correct format;it should save it with table=False; but that's the default too. let me take a look. — Jeff, Feb 06 '14 at 22:00
can you post the frame that you saved (in dense format is ok), compressed pls! on say a dropbox link? — Jeff, Feb 06 '14 at 22:01
Here you go: https://www.dropbox.com/s/d374qoxeua5n3gn/store_compressed.h5 — jeffalstott, Feb 06 '14 at 22:49

Jeff · Accepted Answer · 2014-02-07T00:10:05.180

you are creating a frame that has 4000 columns, and only 4 rows; sparse is dealt with rows-wise, so reverse the dimensions.

In [2]: from numpy import *

In [3]: test = pd.DataFrame(ones((4000,4)))

In [4]: test.ix[:,:] = nan

In [5]: test.ix[0,0] = 47

In [6]: test.to_hdf('test3', 'df')

In [7]: test.to_sparse(fill_value=nan).to_hdf('test4', 'df')

In [8]: test.to_pickle('test5')

In [9]: test.to_sparse(fill_value=nan).to_pickle('test6')

In [11]: !ls -sh test3 test4 test5 test6
164K test3  148K test4  160K test5   36K test6

Followup. Your store that you supplied was written in table format, and as a result saved the dense version (Sparse is not supported for table format which are very flexible and queryable, see docs.

Furthermore, you may want to experiment with saving your file using 2 different representations of the sparse format.

so, here's a sample session:

df = 
In [1]: df = pd.read_hdf('store_compressed.h5','test')

In [2]: type(df)
Out[2]: pandas.core.frame.DataFrame

In [3]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [4]: df.to_sparse(kind='integer').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [5]: df.to_sparse(kind='block').to_hdf('test_block.h5','test',mode='w',complib='blosc',complevel=9)

In [6]: df.to_sparse(kind='integer').to_hdf('test_integer.h5','test',mode='w',complib='blosc',complevel=9)

In [7]: df.to_hdf('test_dense_fixed.h5','test',mode='w',complib='blosc',complevel=9)

In [8]: df.to_hdf('test_dense_table.h5','test',mode='w',format='table',complib='blosc',complevel=9)

In [9]: !ls -ltr *.h5
-rwxrwxr-x 1 jreback users 57015522 Feb  6 18:19 store_compressed.h5
-rw-rw-r-- 1 jreback users 30335044 Feb  6 19:01 test_block.h5
-rw-rw-r-- 1 jreback users 28547220 Feb  6 19:02 test_integer.h5
-rw-rw-r-- 1 jreback users 44540381 Feb  6 19:02 test_dense_fixed.h5
-rw-rw-r-- 1 jreback users 57744418 Feb  6 19:03 test_dense_table.h5

IIRC their is a bug in 0.12 in that to_hdf doesn't pass all the arguments thru, so you prob want to use:

with get_store('test.h5',mode='w',complib='blosc',complevel=9) as store:
    store.put('test',df)

These are stored basically as a collection of SparseSeries so if the density is low and non-contiguous then it will not be as minimal as far as size goes. Pandas sparse suite deals better with a smaller number of contiguous blocks, though YMMV. scipy provides some sparse handling tools as well.

Though IMHO, these are pretty trivial sizes for HDF5 files anyhow, you can handle gigantic number of rows; and files sizes into the 10's and 100's of gigabytes can easily be handled (though recommend).

Furthermore you might to consider using a table format if this is indeed a lookup table as you can query.

Pandas sparse dataframe larger on disk than dense version

1 Answers1