4

Suppose I have a table with many columns, where only a few columns are float type and the others are small integers, for example:

col1, col2, col3, col4
1.31   1      2     3
2.33   3      5     4
...

How can I store this efficiently? If I use np.float32 for the whole dataset, storage is wasted, because the other columns only hold small integers and don't need that much space. If I use np.int16, the float column loses precision, which is also not what I want. How do I deal with a situation like this?

Suppose I also have a string column, which confuses me even more. How should I store the data?

col1, col2, col3, col4, col5
1.31   1      2     3    "a"
2.33   3      5     4    "b"
...

Edit:

To make things simpler, let's suppose the string column contains only fixed-length strings, for example of length 3.

an offer can't refuse
  • I think you can take a look at https://stackoverflow.com/questions/11309739/store-different-datatypes-in-one-numpy-array – Agile_Eagle Aug 07 '18 at 15:01
  • 1
    You can also use a pandas DataFrame: http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe – Agile_Eagle Aug 07 '18 at 15:03
  • As @Agile_Eagle noted, `pandas.DataFrame` is likely your best bet. – Rushabh Mehta Aug 07 '18 at 15:13
  • @Agile_Eagle do I have to make the array tuple like? – an offer can't refuse Aug 07 '18 at 15:14
  • @RushabhMehta However, I want to store the data in an HDF5 file, which supports compression etc. BTW, how do I assign the dtype of a DataFrame column by column? The default for numeric columns is 64-bit float, which is too large. – an offer can't refuse Aug 07 '18 at 15:16
  • See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) for converting types, and I don't know why compressing is an issue. – Rushabh Mehta Aug 07 '18 at 15:17
  • @RushabhMehta I want to take advantage of the compression of HDF5, which makes my file smaller. – an offer can't refuse Aug 07 '18 at 15:21
  • You can save a structured array with `h5py`: https://stackoverflow.com/questions/44049838/h5py-write-object-dynamically-to-file. But you might find it easier to save your data in different datasets, one for the floats, another for the ints, etc. – hpaulj Aug 07 '18 at 17:08
  • When you say you have a `table`, what exactly do you mean? Python doesn't have a `table` data structure. It has lists and dictionaries. `numpy` doesn't either; it has arrays. `pandas` is more table-like, and has its own HDF5 saving methods. – hpaulj Aug 07 '18 at 17:10
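A minimal sketch of the pandas route suggested in the comments above (the file name, column names, and chosen dtypes are assumptions based on the sample; to_hdf requires PyTables):

import pandas as pd

# hypothetical: load the sample table, then downcast column by column
df = pd.read_csv('table.csv')
df = df.astype({'col1': 'float32', 'col2': 'int16',
                'col3': 'int16', 'col4': 'int16'})

# pandas' own HDF5 writer supports compression options
df.to_hdf('table.h5', key='data', complevel=9, complib='zlib')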

1 Answer

3

I'm going to demonstrate the structured array approach:

I'm guessing you are starting with a csv file 'table'. If not, genfromtxt is still the easiest way to turn your sample into an array:

In [40]: txt = '''col1, col2, col3, col4, col5
    ...: 1.31   1      2     3    "a"
    ...: 2.33   3      5     4    "b"
    ...: '''


In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)

In [43]: data
Out[43]: 
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

With these parameters, genfromtxt takes care of creating a structured array. Note that it is a 1d array with 5 fields; the field dtypes are determined from the data.
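For orientation (not part of the original transcript), the fields of a structured array are accessed by name, and each field keeps its own dtype:

# illustrative sketch, using the `data` array created above
data['col1']    # float64 values: array([1.31, 2.33])
data['col5']    # string values:  array(['"a"', '"b"'], dtype='<U3')
data[0]         # one record:     (1.31, 1, 2, 3, '"a"')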

In [44]: import h5py
...

In [46]: f = h5py.File('struct.h5', 'w')

In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')

But h5py has problems saving unicode strings (the default string type in Python 3). There may be ways around that, but here it is simpler to convert the string field to bytestrings. Besides, that will be more compact.

To convert that, I'll make a new dtype and use astype. Alternatively, I could specify the dtypes in the genfromtxt call.
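As a rough sketch of that alternative (the particular field sizes here are my assumption of what "small" means; this isn't run in the transcript above):

# hypothetical: give genfromtxt compact dtypes up front instead of converting later
dt = np.dtype([('col1', 'f4'), ('col2', 'i2'), ('col3', 'i2'),
               ('col4', 'i2'), ('col5', 'S3')])
compact = np.genfromtxt(txt.splitlines(), dtype=dt, skip_header=1, encoding=None)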

In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

In [50]: data.dtype.descr
Out[50]: 
[('col1', '<f8'),
 ('col2', '<i8'),
 ('col3', '<i8'),
 ('col4', '<i8'),
 ('col5', '<U3')]

In [51]: dt1 = data.dtype.descr

In [52]: dt1[-1] = ('col5', 'S3')

In [53]: data.astype(dt1)
Out[53]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

Now it saves the array without problem:

In [54]: data1 = data.astype(dt1)

In [55]: data1
Out[55]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

In [56]: ds = f.create_dataset('data',data=data1)

In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">

In [58]: ds[:]
Out[58]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

I could make further modifications, shortening one or more of the int fields:

In [60]: dt1[1] = ('col2','i2')    
In [61]: dt1[2] = ('col3','i2')

In [62]: dt1
Out[62]: 
[('col1', '<f8'),
 ('col2', 'i2'),
 ('col3', 'i2'),
 ('col4', '<i8'),
 ('col5', 'S3')]

In [63]: data1 = data.astype(dt1)

In [64]: data1
Out[64]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])

In [65]: ds1 = f.create_dataset('data1',data=data1)

ds1 has more compact storage: 'V23' vs 'V35' (8+2+2+8+3 = 23 bytes per record instead of 8+8+8+8+3 = 35).

In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">

In [68]: ds1[:]
Out[68]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
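Since compression came up in the comments: create_dataset also takes compression parameters, e.g. (a sketch; the gzip filter is assumed to be available in your h5py/HDF5 build):

# same structured array, stored with gzip compression
ds2 = f.create_dataset('data2', data=data1,
                       compression='gzip', compression_opts=4)
f.close()   # flush and close the file when done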
hpaulj
  • I have tried using np.records to store into hdf5; it's bigger than storing directly with pandas.to_hdf, even though the column types are the same. Any explanation? – an offer can't refuse Aug 08 '18 at 03:25
  • Do you mean the file size? Pandas stores a more complex structure, but it may use some degree of compression. HDF5 is a complex file structure, so it is hard to match your array's byte size with the overall file size. – hpaulj Aug 08 '18 at 03:45
  • After testing with a larger file, np.records is much more space-saving than pd.DataFrame. This is a great answer and helps me a lot, thanks! – an offer can't refuse Aug 08 '18 at 16:01
  • I have a question very closely related to this one; please see if you know the answer, thank you. https://stackoverflow.com/q/51789821/2794152 – an offer can't refuse Aug 10 '18 at 15:35