I have a dataset with 100,000 entries, each of the form:

{
attr1 float[300]
attr2 float[300]
attr3 float[300]
attr4 float
attr5 float
attr6 float
}

What is the most efficient way to store this in an .hdf5 file?

kcw78
user7867665
  • You have (at least) 2 choices for writing/reading HDF5 files with Python: PyTables (aka tables) or h5py. Your example is fairly simple, and either will get the job done. The bigger question is what you plan to do with the data, because that affects how you should organize it. Do you want to keep all the data in a single dataset? With PyTables, you could create a single table with 100,000 rows. Each row would have 6 columns/fields, each with its own data type. – kcw78 Nov 02 '19 at 23:01
  • Yes, a single dataset with 100,000 rows would be ideal. – user7867665 Nov 05 '19 at 10:12

1 Answer

Without your data (and the data structure) or a code example, it's hard to provide an example specific to your problem, so I created a PyTables example that shows the basic operation. There are many ways to define the table structure and load the data. I like to create a np.dtype and reference it with the description= parameter. In this example, I create the table and then add the data row by row, using a list that contains one tuple per row. However, if you already have all the data, you can create a NumPy structured array and reference it with the obj= parameter. This creates the table and populates it in one shot.

Here is a PyTables example with 100 rows and the attr1/2/3 arrays sized to 10 elements. It shows the logic; you can increase the number of rows and array elements to match your data.

All of the PyTables table methods are explained in the PyTables documentation (see "PyTables table methods").

import tables as tb
import numpy as np

attr1  = np.arange(10.)
attr2  = 2.0*np.arange(10.)
attr3  = 3.0*np.arange(10.)
attr4  = 4.0
attr5  = 5.0
attr6  = 6.0

ds_dt = np.dtype({'names':['attr1', 'attr2', 'attr3',
                           'attr4', 'attr5', 'attr6'],
                  'formats':[(float,10), (float,10), (float,10),
                              float, float, float ] }) 

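# Open a new file for writing; the context manager closes it on exit.
with tb.open_file('SO_58674120_tb.h5','w') as h5f:

     # Create an empty table whose columns follow the ds_dt dtype.
     tb1 = h5f.create_table('/','my_ds', description=ds_dt)
     # range(1,101) yields the 100 rows described above.
     for rcnt in range(1,101):
         # Each row is a one-element list holding one tuple of field values.
         data_list = [ (rcnt*attr1, rcnt*attr2, rcnt*attr3,
                        rcnt*attr4, rcnt*attr5, rcnt*attr6), ]
         tb1.append(data_list)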
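
If all the data is already in memory, the one-shot approach with obj= might look like the sketch below. This is a minimal illustration, not a tuned solution: the file name, the temporary directory, and the placeholder values are assumptions made up for the example, and the sizes are kept at 100 rows by 10 elements (the question's data would use 100,000 and 300).

```python
import os
import tempfile

import numpy as np
import tables as tb

n_rows, n_elem = 100, 10  # illustrative sizes only

# Same layout as ds_dt above: three float arrays and three float scalars per row.
ds_dt = np.dtype([('attr1', float, (n_elem,)),
                  ('attr2', float, (n_elem,)),
                  ('attr3', float, (n_elem,)),
                  ('attr4', float),
                  ('attr5', float),
                  ('attr6', float)])

# Build one structured array holding every row (placeholder values).
scale = np.arange(1, n_rows + 1, dtype=float)
recarr = np.empty(n_rows, dtype=ds_dt)
recarr['attr1'] = np.outer(scale, np.arange(n_elem))
recarr['attr2'] = 2.0 * recarr['attr1']
recarr['attr3'] = 3.0 * recarr['attr1']
recarr['attr4'] = 4.0 * scale
recarr['attr5'] = 5.0 * scale
recarr['attr6'] = 6.0 * scale

# Hypothetical output location, chosen so the sketch does not clutter the cwd.
h5_path = os.path.join(tempfile.mkdtemp(), 'one_shot_tb.h5')

with tb.open_file(h5_path, 'w') as h5f:
    # obj= creates the table and writes all rows in one call;
    # the column layout is taken from recarr's dtype.
    h5f.create_table('/', 'my_ds', obj=recarr)

# Read the file back to check what was written.
with tb.open_file(h5_path, 'r') as h5f:
    table = h5f.root.my_ds
    n_written = table.nrows
    attr4_col = table.col('attr4')  # one column as a NumPy array
    first_row = table[0]            # one full row
```

Writing everything in a single call avoids the per-row append overhead, which is likely to matter at 100,000 rows.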

You can do the same with h5py. The process is similar, but there are differences. For example, you have to size the dataset with shape=, and add maxshape= if you want to extend the dataset in the future. Also, I only know how to add data by referencing NumPy arrays (not lists, as with PyTables), so I created recarr to hold the intermediate data for each row. Again, if you already have all your data, you don't have to load it row by row.

See code below:

import h5py
import numpy as np

attr1  = np.arange(10.)
attr2  = 2.0*np.arange(10.)
attr3  = 3.0*np.arange(10.)
attr4  = 4.0
attr5  = 5.0
attr6  = 6.0

ds_dt = np.dtype({'names':['attr1', 'attr2', 'attr3',
                           'attr4', 'attr5', 'attr6'],
                  'formats':[(float,10), (float,10), (float,10),
                              float, float, float ] }) 
recarr = np.empty((1,), dtype=ds_dt)

with h5py.File('SO_58674120_h5.h5','w') as h5f:

     # maxshape must be a tuple; (None,) makes the dataset resizable along axis 0.
     ds = h5f.create_dataset('my_ds', dtype=ds_dt, shape=(100,), maxshape=(None,) )
     # Rows are 0-based, so rcnt = 1..100 is written to rows 0..99.
     for rcnt in range(1,101):
         recarr['attr1']= rcnt*attr1
         recarr['attr2']= rcnt*attr2
         recarr['attr3']= rcnt*attr3
         recarr['attr4']= rcnt*attr4
         recarr['attr5']= rcnt*attr5
         recarr['attr6']= rcnt*attr6
         ds[rcnt-1] = recarr[0]
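
For comparison, here is a rough h5py sketch of the one-shot load with data=, followed by extending the dataset later with resize(). The file name, the temporary directory, the placeholder values, and the 20 extra rows are assumptions made up for the illustration; the sizes are again 100 rows by 10 elements.

```python
import os
import tempfile

import h5py
import numpy as np

n_rows, n_elem = 100, 10  # illustrative sizes only

ds_dt = np.dtype([('attr1', float, (n_elem,)),
                  ('attr2', float, (n_elem,)),
                  ('attr3', float, (n_elem,)),
                  ('attr4', float),
                  ('attr5', float),
                  ('attr6', float)])

# Build one structured array holding every row (placeholder values).
scale = np.arange(1, n_rows + 1, dtype=float)
recarr = np.empty(n_rows, dtype=ds_dt)
recarr['attr1'] = np.outer(scale, np.arange(n_elem))
recarr['attr2'] = 2.0 * recarr['attr1']
recarr['attr3'] = 3.0 * recarr['attr1']
recarr['attr4'] = 4.0 * scale
recarr['attr5'] = 5.0 * scale
recarr['attr6'] = 6.0 * scale

# Hypothetical output location, chosen so the sketch does not clutter the cwd.
h5_path = os.path.join(tempfile.mkdtemp(), 'one_shot_h5.h5')

with h5py.File(h5_path, 'w') as h5f:
    # data= sets shape and dtype from recarr and writes all rows at once;
    # maxshape=(None,) keeps the dataset extendable along axis 0.
    ds = h5f.create_dataset('my_ds', data=recarr, maxshape=(None,))

    # Later: grow the dataset and write additional rows into the new space.
    more = recarr[:20]  # stand-in for newly arrived rows
    ds.resize((n_rows + len(more),))
    ds[n_rows:] = more

# Read the file back to check what was written.
with h5py.File(h5_path, 'r') as h5f:
    ds = h5f['my_ds']
    final_shape = ds.shape
    attr4_col = ds['attr4']             # read a single field for all rows
    last_row = ds[final_shape[0] - 1]   # read one full row
```

If the dataset will never grow, maxshape= can be dropped, and h5py then stores it with a fixed size.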
kcw78
  • Thanks a lot for the solution. So the trick is to use `np.dtype()`, very handy. I'll compare both methods and then hopefully mark as correct answer – user7867665 Nov 05 '19 at 10:14
  • I used `np.dtype()` when I created the table because I added the data after I created it. If you have all of your data in 1 record array, you can reference it as `data=` with h5py, or as `obj=` with PyTables. The dtype will be inferred from the record array. – kcw78 Nov 05 '19 at 18:36