Why are CSV files smaller than HDF5 files when writing with Pandas?

Question

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.zeros((1000000,1)))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
11M test.csv  16M test.h5

If I use an even larger dataset then the effect is even bigger. Using an HDFStore like below changes nothing.

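store = pd.HDFStore('test.h5')
# HDFStore holds pandas objects, so wrap the array in a DataFrame;
# format='table' replaces the older table=True keyword
store.put('df', pd.DataFrame(np.zeros((1000000, 1))), format='table')
store.close()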

Edit: Never mind. The example is bad! Using some non-trivial numbers instead of zeros changes the story.

from numpy.random import rand
import pandas as pd

df = pd.DataFrame(data=rand(10000000,1))
df.to_csv('test.csv')
df.to_hdf('test.h5', 'df')

ls -sh test*
260M test.csv  153M test.h5

Expressing numbers as binary floats should take fewer bytes than expressing them as strings with one character per digit. This is generally true, except in my first example, where every number was '0.0'. So few characters were needed per number that the string representation was smaller than the float representation.
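A rough way to check this per value, as a sketch only: the helper names text_size and binary_size are made up for illustration, and the text form is assumed to be Python's shortest round-trip repr rather than whatever a particular CSV writer emits.

```python
import random
import struct


def text_size(x: float) -> int:
    """Bytes needed to write x as decimal text (shortest round-trip repr)."""
    return len(repr(x).encode("ascii"))


def binary_size(x: float) -> int:
    """Bytes needed to store x as an IEEE 754 double."""
    return len(struct.pack("<d", x))


random.seed(0)  # arbitrary seed, only so repeated runs agree
samples = [random.random() for _ in range(10_000)]
mean_text = sum(text_size(x) for x in samples) / len(samples)

# '0.0' needs 3 characters, while a double always needs 8 bytes
print(text_size(0.0), binary_size(0.0))
# random values in [0, 1) need far more characters than 8
print(round(mean_text, 1), binary_size(samples[0]))
```

The fixed-width binary form only wins once the text form needs more than eight characters, which is the case for almost any non-trivial float.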

score 5 · Answer 1 · answered Mar 09 '15 at 04:17

Briefly:

CSV files are 'dumb': they store one character at a time, so if you print the (say, four-byte) float 1.0 to ten digits, you really use that many bytes. The good news is that CSV compresses well, so consider .csv.gz.
HDF5 is a meta-format, and the No Free Lunch theorem still holds: the entries and values need to be stored somewhere, which may make HDF5 larger.
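To get a feel for how much gzip helps, here is a minimal in-memory sketch. It assumes an index,value row layout like the one produced in the question and uses only the standard library; the row count is reduced to keep it quick, and the variable names are illustrative.

```python
import gzip
import random

random.seed(0)  # arbitrary seed
n_rows = 100_000  # smaller than the question's data, to keep this quick

# Mimic an index,value CSV layout with one row per line.
lines = [f"{i},{random.random()!r}" for i in range(n_rows)]
raw = ("\n".join(lines) + "\n").encode("ascii")
packed = gzip.compress(raw)

# For comparison: the float64 values alone, with no index or metadata.
binary_payload = 8 * n_rows

print(len(raw), len(packed), binary_payload)
```

Random digits are close to a worst case for compression; repetitive data such as a column of zeros shrinks far more. With pandas, the same effect should be available directly through df.to_csv('test.csv.gz', compression='gzip').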

But you are overlooking a larger issue: CSV is just text, which in practice has limited precision, whereas HDF5 is one of several binary (serialization) formats that store data to higher precision. It really is apples to oranges in that regard too.

Dirk Eddelbuettel

In what sense does a CSV have limited precision? You can always write out a CSV that contains the exact same information as a binary file. Generally it's less compact (at least before zipping) and almost always slower, but you shouldn't lose any info unless you intentionally round or truncate before writing out the values. – JohnE Mar 09 '15 at 20:55
True in theory; in practice I have never seen CSV files with sixteen decimals. – Dirk Eddelbuettel Mar 09 '15 at 21:00
Yes, I agree with that. Just clarifying that CSV (and text in general) is not inherently less precise than binary. – JohnE Mar 09 '15 at 21:13
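The point made in these comments can be checked with a short sketch in plain Python. The six-decimal format is just an assumed example of a writer that rounds; it is not what any particular tool does by default.

```python
x = 1.0 / 3.0

full_text = repr(x)      # shortest string that parses back to the same double
short_text = f"{x:.6f}"  # what a writer that rounds to six decimals would emit

print(full_text, float(full_text) == x)    # 0.3333333333333333 True
print(short_text, float(short_text) == x)  # 0.333333 False
```

Text loses information only when the writer rounds; written at full length, the value survives the round trip.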

score 2 · Accepted Answer · answered Mar 09 '15 at 04:34

For .csv, your method stores characters like this:

999999,0.0<CR>

That's up to 11 characters per value. At 1 million values, this comes to close to 11MB.

HDF5 seems to store each row in 16 bytes, never mind that it's the same value over and over: most likely an 8-byte float64 value plus an 8-byte int64 index. So this is 16 bytes * 1,000,000, which is roughly 16 MB.

Store not 0.0 but some random data, and the .csv quickly blows up to 25 MB and more, while the HDF5 file stays the same size. And while the CSV file can lose accuracy if values are rounded when written, the HDF5 file retains it.
