I am working with an enormous dataset (hundreds of GBs) that has ~40 million identifiers stored as 32-character strings, and for each identifier hundreds or thousands of rows of numeric data.
Both to save space and to make reading the data from disk more efficient, it seems preferable not to repeat the identifiers over and over in the dataset. For example, a data table that looks like
verylongstringidentifier1, 1.2
verylongstringidentifier1, 2.3
verylongstringidentifier1, 3.4
.
.
verylongstringidentifier2, 2.1
verylongstringidentifier2, 1.0
.
.
could be stored more efficiently if the string identifiers weren't repeated. One option is to save separate files for each identifier, and I may go this route, but it's somewhat annoying to have millions of separate small files, and possibly inefficient from a disk I/O standpoint.
I am totally new to HDF5, but what I've read about it suggests it should work well for this situation, since the datasets can be stored with the identifiers as the keys. However, when I save to an HDF5 file, the resulting file is roughly 40 times larger than what I would get if I simply wrote to a flat CSV file. Am I missing something about how HDF5 files are stored, or am I just doing something wrong? The test code below is what I used to verify (and try to diagnose) the problem.
# trying to figure out why hdf5 file sizes are so huge
import time
import string
import random
import numpy as np
import pandas as pd
from pandas import HDFStore
# generate 1000 random 32-character strings
strings = [''.join(random.choices(string.ascii_lowercase, k=32)) for _ in range(1000)]
# for each of these random strings, create 200 rows of three random floats
# concatenate into one big dataframe
df = pd.DataFrame()
for s in strings:
    vars = np.random.rand(200, 3)
    ss = np.full((200, 1), s)
    s_data = np.concatenate((ss, vars), axis=1)
    df = pd.concat([df, pd.DataFrame(s_data)], axis=0)
df.columns = ['string', 'v1', 'v2', 'v3']
# write to one big csv file
df.to_csv('/tmp/test.csv', index=False)
# write to compressed bzip2 file
df.to_csv('/tmp/test.csv.bz2', index=False, compression='bz2')
# write to separate csv files for each string
unique_strings = df.string.unique()
for s in unique_strings:
    s_chunk = df[df.string == s]
    fname = '/tmp/test_' + s + '.csv.bz2'
    # don't need to store the string, since it can be recovered from the filename
    s_chunk[['v1', 'v2', 'v3']].to_csv(fname, index=False, compression='bz2')
# write to hdf5 file with strings as keys
# what I'm trying to do here is *not* save the strings in the datasets, but instead
# use the strings as the names (keys) for the datasets
# My understanding is this would enable me to retrieve the data for a given string
# with pd.read_hdf(h5data, key=<string for which I want data>)
h5data = HDFStore('/tmp/test.h5')
for s in unique_strings:
    s_chunk = df[df.string == s]
    # don't need to store the string, because we'll use it as the key
    s_chunk[['v1', 'v2', 'v3']].to_hdf(h5data, key=s, format='table', complib='bzip2')
h5data.close()
The resulting file sizes:
18M /tmp/test.csv
4.7M /tmp/test.csv.bz2
80M /tmp/test.h5
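In case it helps, here is a quick sketch of the access pattern I'm after (the keyed read mentioned in the comments above) plus a check of what actually ended up in the store; it reuses /tmp/test.h5 and the unique_strings list from the script:
# sketch of the intended retrieval and a quick look inside the store
import pandas as pd
key = unique_strings[0]  # any one of the 32-character identifiers
chunk = pd.read_hdf('/tmp/test.h5', key=key)  # should come back as a 200 x 3 frame of v1, v2, v3
print(chunk.head())
with pd.HDFStore('/tmp/test.h5', mode='r') as store:
    print(len(store.keys()))  # expect 1000 nodes, one per identifier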