I'm attempting to read a variable length string from an HDF5 dataset using the C API. The original C++ code that worked uses v1.8.15.1 of the HDF5 library. I decided to drop to C for debugging, as I have more control.

Unfortunately, I'm in a tough predicament: I'm locked into GCC 4.8.5 for my library (due to dependencies), but I'm trying to include this native component in a Python 3.7 Anaconda package. In this environment, I'm forced to use a version of HDF5 (v1.10.6-hb1b8bf9_0) that was compiled with a modern version of GCC, so I'm hitting the GCC 5 ABI break boundary. I'm dealing with conflicting requirements and I'd rather find a solution to this problem, as this is the only issue between me and success. The uglier, more difficult solution involves custom library builds and lugging around custom-built conda packages, which I would like to avoid.

Note: I'm only calling HDF5 functions that use primitive C types (i.e., no STL types), so in theory this should be possible, and all the other APIs seem to work fine. Maybe my assumption is flawed. Anyway, I'll pose the question to see if anyone can offer some insight before I dig deeper into the HDF5 library.

When viewing the dataset in HDFView, it has the following properties:

DATATYPE  H5T_STRING{
    STRSIZE H5T_VARIABLE;
    STRPAD H5T_STR_NULLTERM;
    CSET H5T_CSET_UTF8;
    CTYPE H5T_C_S1;}

The code I sandboxed to read the dataset is:

#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char const *argv[])
{
    hid_t hfile, dset, space_id, memtype;
    int storage_size;
    herr_t status;
    char* s;
    int i;

    // open file and dataset
    hfile = H5Fopen("input.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    dset = H5Dopen(hfile, "/path/to/my_dataset", H5P_DEFAULT);

    // create memtype
    // Note: I removed the status checks for readability, they are all zero
    memtype = H5Tcopy(H5T_C_S1);
    status = H5Tset_size(memtype, H5T_VARIABLE);
    status = H5Tset_strpad(memtype, H5T_STR_NULLTERM);
    status = H5Tset_cset(memtype, H5T_CSET_UTF8);

    // get the storage size and space_id
    storage_size = H5Dget_storage_size(dset);
    space_id = H5Dget_space(dset);

    // allocate string buffer
    s = (char*)malloc(storage_size * sizeof(char));
    memset(s, 0, storage_size);

    // read string from dataset
    status = H5Dread(dset, memtype, space_id, H5S_ALL, H5P_DEFAULT, s);

    // printing the string buffer is problematic because it was not populated / null-terminated properly
    //printf("val: %s", s);

    // convert to integers to see what was returned
    for (i = 0; i < storage_size; i++)
        printf("s[%d]: %d\n", i, s[i]);

    free(s); s = 0;
    status = H5Sclose(space_id);
    status = H5Tclose(memtype);
    status = H5Dclose(dset);
    status = H5Fclose(hfile);

    return 0;
}

The output is:

s[0]: -128
s[1]: 38
s[2]: -24
s[3]: 0
s[4]: 0
s[5]: 0
s[6]: 0
s[7]: 0
s[8]: 0
s[9]: 0
s[10]: 0
s[11]: 0
s[12]: 0
s[13]: 0
s[14]: 0
s[15]: 0

To run the code, please create and activate a conda virtual environment with the HDF5 library installed using these commands:

conda create python=3.7 hdf5 -n hdf5TestVenv
conda activate hdf5TestVenv

As you can see, I'm getting junk filled in the provided char buffer. I've tried many variations on the memory type, size, strpad, and cset - none of them worked (most resulted in nothing being populated). Based on the properties from HDFView, I think I'm setting up the call correctly, but again, maybe my assumption that I can call this library from a GCC 4.8.5 library is flawed.
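One way to sanity-check output like this is to reassemble the leading bytes into a single integer and see whether the result looks like an address rather than character data. The following is a minimal sketch, assuming a little-endian machine with 8-byte pointers; `le_bytes_to_u64` is a hypothetical helper written for this illustration and is not part of HDF5.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an HDF5 function): reassemble the first 8 bytes
 * of a buffer into one 64-bit value, assuming little-endian byte order. */
static uint64_t le_bytes_to_u64(const signed char *bytes)
{
    uint64_t value = 0;
    size_t i;

    for (i = 0; i < 8; i++) {
        /* Convert to unsigned char first so negative values such as -128
         * contribute 0x80 instead of being sign-extended. */
        value |= (uint64_t)(unsigned char)bytes[i] << (8 * i);
    }
    return value;
}
```

For the bytes printed above (-128, 38, -24, followed by zeros) this yields 0xE82680, which could plausibly be a memory address rather than text.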

Thanks in advance for any help!

Here is my Makefile:

PROGRAM = hdfTest

INCS = -I. -I$(CONDA_PREFIX)/include
LIBDIRS = -L $(CONDA_PREFIX)/lib
EXTRALIB = -lpthread -lrt -lz -ldl -lm

LIBS = -lhdf5 $(EXTRALIB)
LDFLAGS = $(LIBDIRS) $(LIBS)

CSOURCES = main.c
COBJECTS = $(CSOURCES:.c=.o)
CFLAGS = -DESRI_UNIX $(INCS)
CC = gcc -fPIC -fsigned-char -m64 -Wall -Wextra -Wno-unused-parameter

all: clean $(PROGRAM)

.PHONY: all debug clean

debug: CC += -DDEBUG -g
debug: clean $(PROGRAM)

$(PROGRAM): $(COBJECTS)
	$(CC) -o $@ $(COBJECTS) $(LDFLAGS)

clean:
	$(RM) -f $(COBJECTS) $(PROGRAM)

Update 8/14/20 @ 14:24:

I tried to read the data as a blob directly out of the dataset with the thought that I could convert it externally in my library / external to HDF5, but I keep getting "datatype conversion" errors. If anyone knows how to brute-force the read to a raw memory buffer, this would be enough to get me over the hurdle so I can move on.

Thanks in advance for any help!

ajr
  • Please provide a [mcve]. As a new user, also take the [tour] and read [ask]. – Ulrich Eckhardt Aug 13 '20 at 19:39
  • mm.. isn't it meant to be a utf8 string? – Swift - Friday Pie Aug 13 '20 at 19:50
  • @UlrichEckhardt I took the tour and read the "How to Ask" article already. I thought I was following all the guidelines. I also thought I provided a minimal reproducible example, but maybe it was not complete enough, so I updated it to be fully complete. Please let me know if I'm still missing anything. Thanks for your help! – ajr Aug 14 '20 at 10:48
  • @Swift-FridayPie it is a UTF8 string, I set the cset accordingly: `status = H5Tset_cset(memtype, H5T_CSET_UTF8);` – ajr Aug 14 '20 at 10:48
  • @ajr but you're trying to use the ANSI versions of the C functions (the ones that work only with 7/8-bit characters). – Swift - Friday Pie Aug 14 '20 at 12:04
  • Isn't UTF8 8-bit? I tried using the C++ interface (e.g., `H5::DataSet`) and experienced the same problem. Are there other functions you think I should try? – ajr Aug 14 '20 at 12:11
  • regarding the makefile: 1) `INCS = -I. -I$(CONDA_PREFIX)/include` The macro `CONDA_PREFIX` is not defined. 2) the file to be compiled is named `main.cpp`, so it will be compiled as a C++ file, not C. 3) the `all` target should be preceded by: `.PHONY: all debug clean` 4) to avoid repeated evaluation of the macros, use `:=` rather than `=` – user3629249 Aug 14 '20 at 18:26
  • regarding: `int main(int argc, char const *argv[])` This will cause the compiler to output two warning messages about unused parameters. Suggest using the other valid signature for `main()` `int main( void )` – user3629249 Aug 14 '20 at 18:30
  • OT: regarding: `char* s = (char*)malloc(storage_size * sizeof(char));` 1) In C, the returned type is `void*` which can be assigned to any pointer. Casting just clutters the code (and is error prone). Suggest removing that cast. 2) the expression: `sizeof( char)` is defined in the C standard as 1. Multiplying anything by 1 has no effect and just clutters the code. Suggest removing that expression. 3) always check (!=NULL) the returned value to assure the operation was successful. – user3629249 Aug 14 '20 at 18:35
  • regarding the statement pair: `char* s = (char*)malloc(storage_size * sizeof(char)); memset(s, 0, storage_size);` Strongly suggest calling `calloc()` (note its parameter list) to replace both of those statements. – user3629249 Aug 14 '20 at 18:36
  • why bother to `malloc()` and `free()` when you can use the `Variable Length Array` feature of C and simply say: `char s[ storage_size ];` – user3629249 Aug 14 '20 at 18:39
  • @user3629249 sorry for the confusion, I updated the code to be pure C and updated the Makefile with your suggestions. As you could probably tell, this originated as a C++ application. `CONDA_PREFIX` is an environment variable set when activating a conda virtual environment - I added those instructions as well. Your other suggestions will not have an impact on the behavior of the application. Thanks for your feedback! – ajr Aug 14 '20 at 18:49

2 Answers

regarding:

for (int i = 0; i < storage_size; i++)
    printf("s[%d]: %d\n", i, s[i]);

an integer int is either 4 or 8 bytes long (depending on the underlying hardware and certain options to the compile statement.)

So each byte is not a separate integer.

also, this will cause the compiler to output a warning about conversion of a char to 'ptr to int' without a cast.

suggest:

for (int i = 0; i < storage_size; i += sizeof( int ) )
    printf("s[%d]: %d\n", i, &s[i]);

However, that will increment i by the sizeof( int ) for each iteration, which you probably do not want. Therefore, suggest a second index:

for (int i = 0, j = 0; i < storage_size; i += sizeof( int ), j++ )
    printf("s[%d]: %d\n", j, &s[i]);

Note: some integer references are stored Big Endian and some are stored Little Endian, you need to keep that in your considerations.

Suggest consulting the HDF5 User Guide: drill down to the code examples and read Code Example 6-30, "Set the string datatype size to H5T_VARIABLE", and Code Example 6-31, "Read variable-length strings into C strings".

user3629249
  • `i` is an integer that's being used to index into the `s` char array. Sorry, none of what you are suggesting is necessary – ajr Aug 14 '20 at 19:07
  • does this mean that you want to display the numeric value of each byte of the array? – user3629249 Aug 14 '20 at 20:12
  • Please see my code comment, I cast the `char` to an `int` so I can see what the HDF5 library populated. I have already read examples `6-30` and `6-31`, that's basically what I'm doing. – ajr Aug 17 '20 at 10:28

It turns out the string was being stored in a 2D array within the dataset. The solution was to pass the address of the allocated char pointer (`&s` rather than `s`) to H5Dread. (sigh)

// allocate string buffer
char* s = (char*)malloc(storage_size * sizeof(char));
memset(s, 0, storage_size);

// read string from dataset
status = H5Dread(dset, memtype, space_id, H5S_ALL, H5P_DEFAULT, &s);

This became apparent after querying the dimensions with the following APIs: H5Sget_simple_extent_ndims, H5Sget_simple_extent_dims
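Why `&s` works can be illustrated without HDF5. As far as I understand it, when the memory type is a variable-length string, H5Dread stores one `char*` per dataset element in the destination buffer and allocates the string data itself, so the destination has to be an array of pointers rather than an array of characters. The sketch below mimics that behavior with `mock_vlen_read`, a hypothetical stand-in written for this illustration (it is not an HDF5 function); the other two helper names are likewise made up.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a variable-length string read (NOT an HDF5 call).
 * It does not copy characters into buf: it allocates each string itself and
 * stores one char* per element in buf. */
static int mock_vlen_read(void *buf, size_t n_elems, const char *text)
{
    char **slots = (char **)buf;
    size_t i;

    for (i = 0; i < n_elems; i++) {
        slots[i] = malloc(strlen(text) + 1);
        if (slots[i] == NULL)
            return -1;
        strcpy(slots[i], text);
    }
    return 0;
}

/* Pattern from this answer: the destination is the address of a char*. */
static char *read_one_string(const char *text)
{
    char *s = NULL;

    if (mock_vlen_read(&s, 1, text) != 0)
        return NULL;
    return s; /* caller frees */
}

/* Pattern from the question: the destination is a zeroed character buffer.
 * Returns 1 if the buffer ends up holding a pointer to the text (rather
 * than the text itself), 0 if not, and -1 on allocation failure. */
static int char_buffer_holds_pointer(const char *text)
{
    char *raw = calloc(16, 1);
    char *hidden = NULL;
    int result;

    if (raw == NULL)
        return -1;
    if (mock_vlen_read(raw, 1, text) != 0) {
        free(raw);
        return -1;
    }
    /* The first sizeof(char *) bytes of raw are an address, not characters. */
    memcpy(&hidden, raw, sizeof hidden);
    result = (strcmp(hidden, text) == 0);
    free(hidden);
    free(raw);
    return result;
}
```

Under that reading, the buffer allocated with malloc in the snippet above is not needed for the read itself, since H5Dread overwrites `s` with a pointer to memory the library allocated. I believe that memory is normally released with H5Dvlen_reclaim (or H5Treclaim in newer releases), but treat that as something to confirm against the documentation for your HDF5 version.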

Thanks for your help, everyone!

ajr