In PyTables, how to create nested array of variable length?

Question

I'm using PyTables 2.2.1 w/ Python 2.6, and I would like to create a table which contains nested arrays of variable length.

I have searched the PyTables documentation, and the tutorial example (PyTables Tutorial 3.8) shows how to create a nested array of length = 1. But for this example, how would I add a variable number of rows to data 'info2/info3/x' and 'info2/info3/y'?

For perhaps an easier to understand table structure, here's my homegrown example:

"""Desired Pytable output:

DIEM    TEMPUS  Temperature             Data
5       0       100         Category1 <--||-->  Category2
                         x <--| |--> y          z <--|
                        0           0           0
                        2           1           1
                        4           1.33        2.67
                        6           1.5         4.5
                        8           1.6         6.4
5       1       99
                        2           2           0   
                        4           2           2
                        6           2           4
                        8           2           6
5       2       96
                        4           4           0
                        6           3           3
                        8           2.67        5.33


Note that nested arrays have variable length.
"""

import tables as ts

tableDef =      {'DIEM': ts.Int32Col(pos=0),
                'TEMPUS': ts.Int32Col(pos=1), 
                'Temperature' : ts.Float32Col(pos=2),
                'Data': 
                    {'Category1': 
                        {
                        'x': ts.Float32Col(), 
                        'y': ts.Float32Col()
                        }, 
                    'Category2': 
                        {
                        'z': ts.Float32Col(), 
                        }
                    }
                }

# create output file
fpath = 'TestDb.h5'
fh = ts.openFile(fpath, 'w')
# define my table
tableName = 'MyData'
fh.createTable('/', tableName, tableDef)
tablePath = '/'+tableName
table = fh.getNode(tablePath)

# get row iterator
row = table.row
for i in xrange(3):
    print '\ni=', i
    # calc some fake data
    row['DIEM'] = 5
    row['TEMPUS'] = i
    row['Temperature'] = 100-i**2

    for j in xrange(5-i):
        # Note that nested array has variable number of rows
        print 'j=', j,
        # calc some fake nested data
        val1 = 2.0*(i+j)
        val2 = val1/(j+1.0)
        val3 = val1 - val2

        ''' Magic happens here...
        How do I write 'j' rows of data to the elements of 
        Category1 and/or Category2?

        In bastardized pseudo-code, I want to do:

        row['Data/Category1/x'][j] = val1
        row['Data/Category1/y'][j] = val2
        row['Data/Category2/z'][j] = val3
        '''

    row.append()
table.flush()

fh.close()

I have not found any indication in the PyTables docs that such a structure is not possible... but in case such a structure is in fact not possible, what are my alternatives to variable length nested columns?

EArray? VLArray? If so, how to integrate these data types into the above described structure?
some other idea?

Any assistance is greatly appreciated!

EDIT w/ additional info: It appears that the PyTables gurus have already addressed the "is such a structure possible" question:

PyTables Mail Forum - Hierachical Datasets

So has anyone figured out a way to create an analogous PyTable data structure?

Thanks again!

Zinovy Nis · Answer 1 · 2012-03-26T17:15:27.547

I have a similar task: to dump fixed size data with arrays of a variable length.

I first tried using fixed size StringCol(64*1024) fields to store my variable length data (they are always < 64K). But it was rather slow and wasted a lot of disk space, despite blosc compression.

After days of investigation I ended up with the following solution:

(spoiler: we store array fields in separate EArray instances, one EArray per one array field)

I store fixed size data in a regular pytables table.

I added 2 additional fields to these tables: arrFieldName_Offset and arrFieldName_Length:

class Particle(IsDescription):
   idnumber  = Int64Col()
   ADCcount  = UInt16Col()
   TDCcount  = UInt8Col()
   grid_i    = Int32Col()
   grid_j    = Int32Col()
   pressure  = Float32Col()
   energy    = FloatCol()
   buffer_Offset = UInt32Col() # note this field!
   buffer_Length = UInt32Col() # and this one too!

I also create one EArray instance per array field:

datatype = StringAtom(1)
buffer = h5file.createEArray('/detector', 'arr', datatype, (0,), "")

Then I add the rows that hold the fixed-size data:

row['idnumber'] = ...
...
row['energy'] = ...
row['buffer_Offset'] = buffer.nrows
# my_buf is a string (I get it from a stream)
row['buffer_Length'] = len(my_buf)
row.append()  # Row.append() takes no arguments; table.append() expects a sequence of rows

Ta-dah! Add the buffer into the array.

buffer.append(np.ndarray((len(my_buf),), buffer=my_buf, dtype=datatype))

That's the trick. In my experiments this approach is 2-10x faster than storing variable-length data in padded fixed-size fields (like StringAtom(HUGE_NUMBER)), and the resulting DB is a few times smaller (2-5x).

Getting the buffer data is easy. Suppose that row is a single row you read from your DB:

# Open the existing array for reading (createEArray would try to create a new node)
buffer = h5file.getNode('/detector', 'arr')
...
row = ...
...
bufferDataYouNeed = buffer[ row['buffer_Offset'] : row['buffer_Offset'] + row['buffer_Length']]
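
The offset/length bookkeeping itself can be sketched without PyTables. In the minimal example below, a bytearray stands in for the EArray and a list of dicts stands in for the table; append_record and read_payload are hypothetical helper names, not part of any library, and the real code would call row.append() and buffer.append() instead.

```python
# Plain-Python stand-in for the table + EArray pair (assumption: no PyTables here).
flat_buffer = bytearray()   # plays the role of the EArray
rows = []                   # plays the role of the fixed-size table


def append_record(idnumber, payload):
    """Store fixed-size fields in `rows` and the variable-length payload in `flat_buffer`."""
    rows.append({
        'idnumber': idnumber,
        'buffer_Offset': len(flat_buffer),  # where this payload starts (like buffer.nrows)
        'buffer_Length': len(payload),      # how many elements belong to this row
    })
    flat_buffer.extend(payload)


def read_payload(row):
    """Slice the payload that belongs to `row` back out of the flat buffer."""
    start = row['buffer_Offset']
    return bytes(flat_buffer[start:start + row['buffer_Length']])


append_record(1, b'abc')
append_record(2, b'')             # zero-length payloads need no special case
append_record(3, b'hello world')
```

The important detail is the ordering: the offset is recorded before the payload is appended, so it always points at the first element of that row's slice.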

How would you recommend deleting arrays from the EArray? With this solution it would appear that buffer_Offset and buffer_Length would have to be updated in all subsequent rows. — ToddP, Jan 03 '19 at 00:08
@ToddP seems there's no easy way to delete arrays. I'd recommend not to modify EArray and offsets at all (yes, it's a waste of space) and sometimes, after a number of deletes/inserts, apply defragmentation (in fact just copy to a new EArray). — Zinovy Nis, Jan 04 '19 at 10:15
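
The "defragmentation by copying" idea from the comment above might look roughly like this. It is a sketch in plain Python, with a bytearray standing in for the EArray and dicts standing in for table rows; compact is a hypothetical helper name. With PyTables you would create a new EArray and rewrite the offset column rather than build Python objects.

```python
def compact(rows, flat_buffer):
    """Copy only the slices still referenced by `rows` into a fresh buffer.

    Returns (new_rows, new_buffer); the inputs are left untouched.
    """
    new_buffer = bytearray()
    new_rows = []
    for row in rows:
        start = row['buffer_Offset']
        payload = flat_buffer[start:start + row['buffer_Length']]
        new_row = dict(row)
        new_row['buffer_Offset'] = len(new_buffer)  # offset in the compacted buffer
        new_buffer.extend(payload)
        new_rows.append(new_row)
    return new_rows, new_buffer


# The row that owned b'BBBB' was deleted, leaving a hole in the old buffer.
old_buffer = bytearray(b'aaaBBBBcc')
live_rows = [
    {'idnumber': 1, 'buffer_Offset': 0, 'buffer_Length': 3},
    {'idnumber': 3, 'buffer_Offset': 7, 'buffer_Length': 2},
]
new_rows, new_buffer = compact(live_rows, old_buffer)
```

Lengths never change during compaction; only the offsets are rewritten.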

score 4 · Accepted Answer · answered Jun 23 '11 at 00:50

This is a common thing that folks starting out with PyTables want to do. Certainly, it was the first thing I tried to do. As of 2009, I don't think this functionality was supported. You can look here for one solution "I always recommend":

http://www.mail-archive.com/pytables-users@lists.sourceforge.net/msg01207.html

In short, just put each VLArray in a separate place. If you do that, maybe you don't end up needing VLArrays. If you store separate VLArrays for each trial (or whatever), you can keep metadata on those VLArrays (guaranteed to stay in sync with the array across renames, moves, etc.) or put it in a table (easier to search).

But you may also do well to pick whatever a single time-point would be for your column atom, then simply add another column for a time stamp. This would allow for a "ragged" array that still has a regular, repeated (tabular) structure in memory. For example:

Trial Data
1     0.4, 0.5, 0.45
2     0.3, 0.4, 0.45, 0.56

becomes

Trial Timepoint Data
1     1         0.4
1     2         0.5
...
2     4         0.56

Data above is a single number, but it could be, e.g. a 4x5x3 atom.
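
As a rough illustration of that reshaping, here is a sketch in plain Python. The function names to_long_format and from_long_format are made up for this example, and the tuples would become rows of an ordinary PyTables table with Trial, Timepoint and Data columns.

```python
# Ragged input: one variable-length list per trial (values from the example above).
trials = {1: [0.4, 0.5, 0.45], 2: [0.3, 0.4, 0.45, 0.56]}


def to_long_format(trials):
    """Flatten {trial: [values...]} into (trial, timepoint, value) records."""
    records = []
    for trial in sorted(trials):
        for timepoint, value in enumerate(trials[trial], start=1):
            records.append((trial, timepoint, value))
    return records


def from_long_format(records):
    """Rebuild the ragged mapping; sorting restores timepoint order within each trial."""
    rebuilt = {}
    for trial, timepoint, value in sorted(records):
        rebuilt.setdefault(trial, []).append(value)
    return rebuilt


records = to_long_format(trials)
```

Every record has the same shape, so the table stays regular even though the number of records per trial differs.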

If nested VLArrays are supported in PyTables now, I'd certainly love to know!

Alternatively, I think h5py does support the full HDF5 feature set, so if you're really committed to the nested data layout, you may have more luck there. You'll be losing out on a lot of nice features though! And in my experience, naive neuroscientists end up with quite poor performance because they don't get PyTables' intelligent choices for data layout, chunking, etc. Please report back if you go that route!

Thanks for the suggestions! Additionally, the mail-list link has several other interesting 'nuggets' of wisdom from Francesc. In the end, because I was concerned with speed and maintaining simplicity, I opted for fixed array size with padded extra space. — plmcw, Jul 05 '11 at 16:35

score 0 · Answer 3 · answered Mar 16 '12 at 15:39

I also ran into this and ended up using a fixed array size. The arrays I was trying to store were of variable length, so I created new ones from them with the correct fixed length.

I did something along the lines of

def filled_list(src_list, targ_len):
    """takes a variable len() list and creates a new one with a fixed len()"""
    for i in range(targ_len):
        try:
            yield src_list[i]
        except IndexError:
            yield 0

src_list = [1,2,3,4,5,6,7,8,9,10,11]
new_list = [x for x in filled_list(src_list, 100)]

That did the trick for me.
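
One caveat with padding: if 0 can also be a real value (as in the question's sample data), the padding cannot be told apart from the data afterwards. A possible variation, sketched here with hypothetical helper names pad_with_length and unpad, is to keep the true length next to the padded array, for example in an extra integer column.

```python
def pad_with_length(src_list, targ_len, fill=0):
    """Return (padded_list, true_length) so the padding can be stripped later."""
    if len(src_list) > targ_len:
        raise ValueError("source is longer than the fixed target length")
    padded = list(src_list) + [fill] * (targ_len - len(src_list))
    return padded, len(src_list)


def unpad(padded, true_len):
    """Recover the original values using the stored length."""
    return padded[:true_len]


padded, true_len = pad_with_length([0, 2, 0], 5)
original = unpad(padded, true_len)
```

Unlike the generator version, this one raises instead of silently truncating when the source is longer than the target length; whether that is what you want depends on your data.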
