pytables and pandas string padding question

Question

I've created a dataset using the hdf5cpp library with a fixed-size string type (a requirement). However, when I load it with PyTables or pandas, the strings are always represented like this:

b'test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff

That is the string value 'test' followed by the padding. Does anyone know a way to suppress or hide this padding data? I really just want 'test' shown. I realise this may be correct behaviour.

My hdf5cpp setup for strings:

strType = H5Tcopy(H5T_C_S1);
status = H5Tset_size(strType, 36);
H5Tset_strpad(strType, H5T_STR_NULLTERM);

kcw78 · Answer 1 · 2020-08-14T16:41:20.963

I can't help with your C code. It is possible to work with padded strings in PyTables. I can read data written by a C application that creates a struct array of mixed types, including padded strings. (Note: there was an issue related to copying a NumPy struct array with padding. It was fixed in 3.5.0. Read this for details: PyTables GitHub Pull 720.)

Here is an example that shows proper string handling with a file created by PyTables. Maybe it will help you investigate your problem. Checking the dataset's properties would be a good start.

import tables as tb
import numpy as np

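# Fixed-width byte strings, 10 bytes each; zeros() starts every element as NUL bytes
arr = np.zeros(10, dtype='S10')
arr[0] = 'test'
arr[1] = 'one'
arr[2] = 'two'
arr[3] = 'three'

with tb.open_file('SO_63184571.h5', 'w') as h5f:
    ds = h5f.create_array('/', 'testdata', obj=arr)
    print(ds.atom)  # the string atom, including its item size

    for i in range(4):
        print(ds[i])                  # raw bytes value
        print(ds[i].decode('utf-8'))  # decoded to str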
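
If the stored bytes look like the value in the question, with non-zero bytes after the NUL terminator, .decode() alone will not drop them (and b'\xff' is not valid UTF-8, so decoding the whole value would raise an error). Below is a minimal sketch of one workaround, assuming the real text always ends at the first NUL byte. strip_nullterm is a hypothetical helper name, not part of PyTables or pandas, and the sample values are made up to mimic the 36-byte field from the question.

```python
def strip_nullterm(raw: bytes, encoding: str = "utf-8") -> str:
    """Keep only the bytes before the first NUL byte, then decode them."""
    return raw.split(b"\x00", 1)[0].decode(encoding)


# Made-up 36-byte values: text, a NUL terminator, then 0xff filler bytes
padded = [
    b"test\x00" + b"\xff" * 31,
    b"one\x00" + b"\xff" * 32,
]

cleaned = [strip_nullterm(value) for value in padded]
print(cleaned)
```

The same function could be applied element by element to the values read from a PyTables array or a pandas column. A value with no NUL byte at all is decoded in full.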

The example below was added to demonstrate a compound dataset with an int and a fixed-length string. This is called a Table in PyTables (Arrays always contain homogeneous values). This can be done in a number of ways. I show the 2 methods I prefer:

1. Create a record array and reference it with the description= or obj= parameter. This is useful when you already have all of your data AND it will fit in memory.
2. Create a record array dtype and reference it with the description= parameter. Then add the data with the .append() method. This is useful when all of your data will NOT fit in memory, OR you need to add data to an existing table.

Code below:

import tables as tb
import numpy as np

# Compound dtype: one integer field and one 10-byte fixed-width string field
recarr_dtype = np.dtype(
                { 'names':   ['ints', 'strs'],
                  'formats': [int, 'S10'] } )
a = np.arange(5)
b = np.array(['a', 'b', 'c', 'd', 'e'])
recarr = np.rec.fromarrays((a, b), dtype=recarr_dtype)

with tb.open_file('SO_63184571.h5', 'w') as h5f:
    # Method 1: create the table directly from the record array
    ds1 = h5f.create_table('/', 'compound_data1', description=recarr)

    for i in range(5):
        print(ds1[i]['ints'], ds1[i]['strs'].decode('utf-8'))

    # Method 2: create an empty table from the dtype, then append the data
    ds2 = h5f.create_table('/', 'compound_data2', description=recarr_dtype)
    ds2.append(recarr)

    for i in range(5):
        print(ds2[i]['ints'], ds2[i]['strs'].decode('utf-8'))

Thanks, would it be possible to post a compound dataset containing an int and a fixed string? I notice PyTables adds a lot of structural additions. - Moet, Aug 14 '20 at 10:56
