I'm attempting to read a variable-length string from an HDF5 dataset using the C API. The original, working C++ code uses v1.8.15.1 of the HDF5 library. I dropped down to C for debugging, since it gives me more control.
Unfortunately, I'm in a very tough predicament: I'm locked into using GCC 4.8.5 for my library (due to dependencies), but I'm trying to include this native component in a Python 3.7 Anaconda package. In that environment, I'm forced to use a version of HDF5 (v1.10.6-hb1b8bf9_0) that was compiled with a modern version of GCC, so I'm crossing the GCC 5 ABI break boundary.
I'm dealing with conflicting requirements, and I'd rather find a solution to this problem, since it's the only issue between me and success. The uglier, more difficult alternative involves custom library builds and lugging around custom-built conda packages, which I would like to avoid.
Note: I'm only calling functions on the HDF5 library that use primitive C types (i.e., no STL types), so theoretically this should be possible; all the other APIs seem to work fine. Maybe my assumption is flawed. Anyway, I'll pose the question to see if anyone can offer some insight before I dig deeper into the HDF5 library.
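As a basic sanity check that my binary is linked against the conda-provided HDF5 (and not some other copy on the system), I query the runtime version with the standard H5get_libversion call; a minimal sketch:

#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    unsigned maj, min, rel;

    /* Ask the loaded HDF5 runtime which version it is; it should report
       1.10.6 if the conda library is the one actually linked. */
    if (H5get_libversion(&maj, &min, &rel) >= 0)
        printf("HDF5 runtime version: %u.%u.%u\n", maj, min, rel);

    return 0;
}

If this reports anything other than 1.10.6, the wrong library is being picked up at load time.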
When viewing the dataset in HDFView, it has the following properties:
DATATYPE H5T_STRING {
   STRSIZE H5T_VARIABLE;
   STRPAD H5T_STR_NULLTERM;
   CSET H5T_CSET_UTF8;
   CTYPE H5T_C_S1;
}
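For completeness, the same properties can be confirmed programmatically against the file datatype. Here's a minimal sketch using the standard type-inspection calls (the file name and dataset path are the same placeholders as in my sandbox code below):

#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    hid_t hfile, dset, ftype;

    hfile = H5Fopen("input.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    dset  = H5Dopen(hfile, "/path/to/my_dataset", H5P_DEFAULT);
    ftype = H5Dget_type(dset);  /* the datatype as stored in the file */

    /* H5Tis_variable_str returns a positive value for variable-length strings */
    printf("variable-length: %d\n", (int)H5Tis_variable_str(ftype));
    printf("strpad enum:     %d\n", (int)H5Tget_strpad(ftype)); /* H5T_STR_NULLTERM == 0 */
    printf("cset enum:       %d\n", (int)H5Tget_cset(ftype));   /* H5T_CSET_UTF8 == 1 */

    H5Tclose(ftype);
    H5Dclose(dset);
    H5Fclose(hfile);
    return 0;
}

If the HDFView display is accurate, this should report variable-length: 1, strpad 0, and cset 1.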
The code I sandboxed to read the dataset is:
#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(int argc, char const *argv[])
{
hid_t hfile, dset, space_id;
int storage_size;
herr_t status;
char* s;
int i;
// open file and dataset
hfile = H5Fopen("input.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
dset = H5Dopen(hfile, "/path/to/my_dataset", H5P_DEFAULT);
// create memtype
// Note: I removed the status checks for readability, they are all zero
memtype = H5Tcopy(H5T_C_S1);
status = H5Tset_size(memtype, H5T_VARIABLE);
status = H5Tset_strpad(memtype, H5T_STR_NULLTERM);
status = H5Tset_cset(memtype, H5T_CSET_UTF8);
// get the storage size and space_id
storage_size = H5Dget_storage_size(dset);
space_id = H5Dget_space(dset);
// allocate string buffer
char* s = (char*)malloc(storage_size * sizeof(char));
memset(s, 0, storage_size);
// read string from dataset
status = H5Dread(dset, memtype, space_id, H5S_ALL, H5P_DEFAULT, s);
// printing the string buffer is problematic because it was not populated / null-terminated properly
//printf("val: %s", s);
// convert to integers to see what was returned
for (i = 0; i < storage_size; i++)
printf("s[%d]: %d\n", i, s[i]);
free(s); s = 0;
status = H5Sclose(space_id);
status = H5Tclose(memtype);
status = H5Dclose(dset);
status = H5Fclose(hfile);
return 0;
}
The output is:
s[0]: -128
s[1]: 38
s[2]: -24
s[3]: 0
s[4]: 0
s[5]: 0
s[6]: 0
s[7]: 0
s[8]: 0
s[9]: 0
s[10]: 0
s[11]: 0
s[12]: 0
s[13]: 0
s[14]: 0
s[15]: 0
To run the code, please create and activate a conda virtual environment with the HDF5 library installed using these commands:
conda create python=3.7 hdf5 -n hdf5TestVenv
conda activate hdf5TestVenv
As you can see, I'm getting junk filled in the provided char buffer. I've tried many variations on the memory type, size, strpad, and cset; none of them worked (most resulted in nothing being populated at all). Based on the properties from HDFView, I think I'm setting up the call correctly, but again, maybe my assumption that I can call this library from a GCC 4.8.5-built binary is flawed. A sketch of the variable-length read pattern I've been working from is below.
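For reference, this is my understanding of the documented pattern for variable-length strings: with an H5T_VARIABLE memory type, H5Dread fills the buffer with one char* per element, pointing at strings the library allocates itself, and those strings are later handed back with H5Dvlen_reclaim. A minimal sketch, assuming a single-element dataset at the same placeholder path:

#include "hdf5.h"
#include <stdio.h>

int main(void)
{
    hid_t hfile, dset, memtype, space_id;
    char* rdata[1] = { NULL };  /* one char* per element; HDF5 allocates the strings */

    hfile = H5Fopen("input.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    dset  = H5Dopen(hfile, "/path/to/my_dataset", H5P_DEFAULT);

    memtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(memtype, H5T_VARIABLE);
    space_id = H5Dget_space(dset);

    /* For H5T_VARIABLE, the read produces pointers to library-allocated,
       null-terminated strings rather than inline character data. */
    if (H5Dread(dset, memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, rdata) >= 0
            && rdata[0] != NULL)
        printf("val: %s\n", rdata[0]);

    /* Hand the library-allocated strings back to HDF5. */
    H5Dvlen_reclaim(memtype, space_id, H5P_DEFAULT, rdata);

    H5Sclose(space_id);
    H5Tclose(memtype);
    H5Dclose(dset);
    H5Fclose(hfile);
    return 0;
}

If that pattern is right, the -128, 38, -24 bytes above may simply be the low bytes of a pointer rather than string data.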
Thanks in advance for any help!
Here is my Makefile:
PROGRAM = hdfTest
INCS = -I. -I$(CONDA_PREFIX)/include
LIBDIRS = -L$(CONDA_PREFIX)/lib
EXTRALIB = -lpthread -lrt -lz -ldl -lm
LIBS = -lhdf5 $(EXTRALIB)
LDFLAGS = $(LIBDIRS) $(LIBS)
CSOURCES = main.c
COBJECTS = $(CSOURCES:.c=.o)
CFLAGS = -DESRI_UNIX $(INCS)
CC = gcc -fPIC -fsigned-char -m64 -Wall -Wextra -Wno-unused-parameter

.PHONY: all debug clean

all: clean $(PROGRAM)

debug: CC += -DDEBUG -g
debug: clean $(PROGRAM)

$(PROGRAM): $(COBJECTS)
	$(CC) -o $@ $(COBJECTS) $(LDFLAGS)

clean:
	$(RM) $(COBJECTS) $(PROGRAM)
Update 8/14/20 @ 14:24:
I tried to read the data as a raw blob directly out of the dataset, with the thought that I could convert it in my library, external to HDF5, but I keep getting "datatype conversion" errors. If anyone knows how to brute-force the read into a raw memory buffer, that would be enough to get me over the hurdle so I can move on. A sketch of my current attempt at this is below.
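My latest idea for a brute-force read is to pass the dataset's own file datatype (from H5Dget_type) as the memory type, so that H5Dread has no conversion to perform. A minimal sketch, again assuming a single-element dataset; I haven't verified that this yields truly raw bytes for a variable-length string, since the library may still hand back a pointer rather than inline data:

#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hid_t hfile, dset, ftype;
    size_t type_size;
    void* buf;

    hfile = H5Fopen("input.hdf5", H5F_ACC_RDONLY, H5P_DEFAULT);
    dset  = H5Dopen(hfile, "/path/to/my_dataset", H5P_DEFAULT);

    /* Use the file datatype as the memory type so no conversion is triggered. */
    ftype = H5Dget_type(dset);
    type_size = H5Tget_size(ftype);  /* in-memory size of one element */

    buf = calloc(1, type_size);      /* assumes a single-element dataset */
    if (H5Dread(dset, ftype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf) < 0)
        fprintf(stderr, "raw read failed\n");

    free(buf);
    H5Tclose(ftype);
    H5Dclose(dset);
    H5Fclose(hfile);
    return 0;
}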
Thanks in advance for any help!