
I'm trying to save data (a 3D NumPy array) to an HDF5 file using fsspec in Python, but I'm running into issues and am unable to write the data to the file. The bigger picture is that I am trying to amend this dataset class to load/write video data. Is it possible to use h5py and fsspec simultaneously, and is it even advised?

The root issue seems to be this spec. But can I work around it, or am I doing something wrong?

My first idea was to save the data as follows:

import numpy as np
import fsspec
import h5py


# Example data (2D here for brevity; the real data is a 3D video array)
data = np.random.rand(100, 100)

with fsspec.open("./file", mode="wb") as fs_file:
    with h5py.File(fs_file, mode="w") as h5_file:
        h5_file.create_dataset("video", data=data)

---------------------------------------------------------------------------
UnsupportedOperation                      Traceback (most recent call last)
Cell In[8], line 4
      1 data = np.random.rand(100, 100)
      3 with fsspec.open("./file", mode="wb") as fs_file:
----> 4     with h5py.File(fs_file, mode="w") as h5_file:
      5         h5_file.create_dataset("video", data=data)

File h5py/_objects.pyx:54, in h5py._objects.with_phil.wrapper()

File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File ~/miniconda3/envs/kedro-environment/lib/python3.10/site-packages/h5py/_hl/files.py:604, in File.__exit__(self, *args)
    601 @with_phil
    602 def __exit__(self, *args):
    603     if self.id:
--> 604         self.close()

File ~/miniconda3/envs/kedro-environment/lib/python3.10/site-packages/h5py/_hl/files.py:586, in File.close(self)
    580 if self.id.valid:
    581     # We have to explicitly murder all open objects related to the file
    582
    583     # Close file-resident objects first, then the files.
    584     # Otherwise we get errors in MPI mode.
    585     self.id._close_open_objects(h5f.OBJ_LOCAL | ~h5f.OBJ_FILE)
...
File h5py/h5fd.pyx:185, in h5py.h5fd.H5FD_fileobj_flush()

File h5py/h5fd.pyx:180, in h5py.h5fd.H5FD_fileobj_truncate()

UnsupportedOperation: truncate

My second idea was to first get the binary representation of the HDF5 file through an in-memory file, but something is wrong and I am unsure whether I can even get a binary representation via h5py:

import h5py
import numpy as np
import fsspec

dataset = np.random.rand(100, 100)

# Create an in-memory HDF5 file
with h5py.File("in_memory_file.h5", driver="core", backing_store=False, mode='w') as h5file:
    # Create the dataset within the in-memory file
    h5file.create_dataset("video", data=dataset)

    # Save the binary representation to a file
    with fsspec.open("binary_representation.h5", "wb") as file:
        file.write(h5file.id.get_file_image())
    
    h5file.close()
    
# Now try to open the saved binary file
with h5py.File("binary_representation.h5", "r") as h5file:
    dataset = h5file["video"]

    # Perform any desired operations with the dataset

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 20
     17     h5file.close()
     19 # Now try to open the saved binary file
---> 20 with h5py.File("binary_representation.h5", "r") as h5file:
     21     # Access the dataset
     22     dataset = h5file["video"]
     24     # Perform any desired operations with the dataset

File ~/miniconda3/envs/kedro-environment/lib/python3.10/site-packages/h5py/_hl/files.py:567, in File.__init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, fs_strategy, fs_persist, fs_threshold, fs_page_size, page_buf_size, min_meta_keep, min_raw_keep, locking, alignment_threshold, alignment_interval, meta_block_size, **kwds)
    558     fapl = make_fapl(driver, libver, rdcc_nslots, rdcc_nbytes, rdcc_w0,
    559                      locking, page_buf_size, min_meta_keep, min_raw_keep,
    560                      alignment_threshold=alignment_threshold,
    561                      alignment_interval=alignment_interval,
    562                      meta_block_size=meta_block_size,
    563                      **kwds)
    564     fcpl = make_fcpl(track_order=track_order, fs_strategy=fs_strategy,
    565                      fs_persist=fs_persist, fs_threshold=fs_threshold,
    566                      fs_page_size=fs_page_size)
--> 567     fid = make_fid(name, mode, userblock_size, fapl, fcpl, swmr=swmr)
    569 if isinstance(libver, tuple):
    570     self._libver = libver

File ~/miniconda3/envs/kedro-environment/lib/python3.10/site-packages/h5py/_hl/files.py:231, in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
...
File h5py/_objects.pyx:55, in h5py._objects.with_phil.wrapper()

File h5py/h5f.pyx:106, in h5py.h5f.open()

OSError: Unable to open file (bad object header version number)
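
One thing I have not tried yet: flushing the in-memory file before grabbing the image, in case the image is incomplete without it (just a guess, reusing dataset from above):

import h5py
import numpy as np
import fsspec

dataset = np.random.rand(100, 100)

with h5py.File("in_memory_file.h5", driver="core", backing_store=False, mode="w") as h5file:
    h5file.create_dataset("video", data=dataset)
    h5file.flush()  # guess: make sure all metadata is written into the image
    image = h5file.id.get_file_image()

with fsspec.open("binary_representation.h5", "wb") as file:
    file.write(image)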

Currently I have it implemented as:

with fsspec.open(save_path, mode='wb') as fs_file:
    h5_file = h5py.File(fs_file, mode="w")
    h5_file.create_dataset("video", data=data)

which works (kind of), but emits an ignored exception every time, which I am very unsure about:

ValueError: truncate of closed file
Exception ignored in: 'h5py._objects.ObjectID.__dealloc__'
Traceback (most recent call last):
  File "h5py/_objects.pyx", line 201, in h5py._objects.ObjectID.__dealloc__
  File "h5py/h5fd.pyx", line 180, in h5py.h5fd.H5FD_fileobj_truncate
ValueError: truncate of closed file
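
One idea I have not verified: wrap the fsspec file in a thin proxy whose truncate() is a no-op, so that h5py's final truncate call cannot fail. NoTruncateWrapper is a made-up name of mine, I reuse save_path and data from above, and I do not know whether silently skipping the truncate is actually safe:

import fsspec
import h5py

class NoTruncateWrapper:
    """Hypothetical sketch: forward everything to the wrapped fsspec file,
    but make truncate() a no-op so h5py's final truncate cannot fail."""

    def __init__(self, f):
        self._f = f

    def truncate(self, size=None):
        # Pretend the truncate succeeded; unclear whether this is safe.
        return size

    def __getattr__(self, name):
        # Delegate read/write/seek/tell/flush etc. to the real file object.
        return getattr(self._f, name)

with fsspec.open(save_path, mode="wb") as fs_file:
    with h5py.File(NoTruncateWrapper(fs_file), mode="w") as h5_file:
        h5_file.create_dataset("video", data=data)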

I would appreciate any insights or suggestions on how to resolve this issue and successfully save the data. Is there an alternative approach or additional steps I need to take to ensure the data is written correctly?

Thank you in advance for your help!

twinsten
  • Error discussed in my answer below. The bigger question: Why do you want to use fsspec with h5py? – kcw78 May 30 '23 at 14:54
  • fsspec abstracts over many storage systems, so while it may not offer much for local files, it allows the same code to be used wherever the file is stored. – mdurant May 30 '23 at 17:34
  • @twinsten, can you clarify "kind of" - did it work? – mdurant May 30 '23 at 17:35
  • @twinsten -- HDF5 is a binary file. If you want an HDF5 file, Pandas has native support to create an HDF5 store (built on PyTables). You have to be careful when using Pandas HDF5 with h5py b/c they may use different HDF5 library versions. – kcw78 May 30 '23 at 21:43
  • @mdurant , with "kind of" I meant that the data is written. It is also written correctly as far as I have tested. But I am unsure if this will always be the case. In my understanding the process of writing the h5 object errors out after the writing of all (essential) data but before the final truncation steps. But I don't know what these final steps include and whether they are crucial. – twinsten May 31 '23 at 08:58
  • @kcw78, unfortunately the docs on pandas HDFStore are quite limited. My use case is 3D (video data). To my understanding I can only save pandas data (Dataframes or Series objects) with the pandas HDFStore, which doesn't fit my use case. Do you think I should consider switching to PyTables? – twinsten May 31 '23 at 09:19
  • Without knowing the complete workflow (from data input to final result) it's hard to say which packages to use. I can't comment about 3d Pandas data and HDF5. I have only used Pandas dataframes. Apparently you need a Pandas "panel" for a 3d array? I prefer Numpy arrays for 3D data. I have no idea if PyTables (or Pandas) will work w/ fsspec. – kcw78 May 31 '23 at 18:21
  • hi @twinsten, https://github.com/kedro-org/kedro-plugins/issues/240 has some discussion about HDF5 support in kedro-datasets, feel free to upvote or share your thoughts! – astrojuanlu Jun 16 '23 at 08:00

2 Answers


First, a caveat. I am very familiar with h5py, but have not used fsspec. I followed your link to the h5py requirements for Python file-like objects. It says a file-like object must have these methods: read() (or readinto()), write(), seek(), tell(), truncate() and flush().

The traceback suggests a problem when calling truncate():

File h5py/h5fd.pyx:180, in h5py.h5fd.H5FD_fileobj_truncate()
UnsupportedOperation: truncate

I ran some tests with fsspec file objects and got different results depending on how the object is created. The necessary methods are available when you use Python's context manager:

with fsspec.open("test_fsspec", mode='wb') as fs_file:
     dir(fs_file)

This is the output:

['__abstractmethods__', '__class__', '__del__', '__delattr__', '__dict__', '__dir__', 
'__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattr__', 
'__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', 
'__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__next__', 
'__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', 
'__str__', '__subclasshook__', '_abc_impl', '_checkClosed', '_checkReadable', 
'_checkSeekable', '_checkWritable', '_fetch_range', '_open', 'autocommit', 'blocksize', 
'close', 'closed', 'commit', 'compression', 'discard', 'f', 'fileno', 'flush', 'fs', 
'isatty', 'mode', 'path', 'read', 'readable', 'readline', 'readlines', 'seek', 
'seekable', 'tell', 'truncate', 'writable', 'write', 'writelines']

However, you get different results if you don't use the context manager. Not sure if that's a clue.

fs_file = fsspec.open("test_fsspec", mode='wb')
print(dir(fs_file))

This is the output:

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', 
'__exit__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', 
'__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', 
'__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', 
'__str__', '__subclasshook__', '__weakref__', 'close', 'compression',
'encoding', 'errors', 'fobjects', 'fs', 'full_name', 'mode', 'newline', 'open', 'path']
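
Rather than eyeballing dir(), you can check h5py's requirements directly. A quick sketch (the REQUIRED tuple just restates the methods from the h5py docs quoted above):

import fsspec

# Methods h5py requires of a file-like object, per its documentation.
REQUIRED = ("read", "write", "seek", "tell", "truncate", "flush")

with fsspec.open("test_fsspec", mode="wb") as fs_file:
    missing = [m for m in REQUIRED if not callable(getattr(fs_file, m, None))]
    print("missing methods:", missing)  # [] for a local file

fs_file = fsspec.open("test_fsspec", mode="wb")
missing = [m for m in REQUIRED if not callable(getattr(fs_file, m, None))]
print("missing methods:", missing)  # all six for the bare OpenFile

Note that the methods being present is apparently not sufficient either: as the comments below show, the local file's truncate() exists but raises UnsupportedOperation when called.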
kcw78
  • Ah, but it is within a context! – mdurant May 30 '23 at 17:33
  • Ahhh...you are correct. I didn't test that. I will update my answer to reflect that (which also means there isn't a clear explanation why @twinsten is having problems). – kcw78 May 30 '23 at 21:24
  • Interesting! calling fs_file.truncate() inside the context errors out with "UnsupportedOperation: truncate", while running fs_file.truncate() without the context errors out with "AttributeError: 'OpenFile' object has no attribute 'truncate'". Maybe the context manager adds the truncate method? – twinsten May 31 '23 at 09:40
  • I did some more digging and it looks like the explanation is [here](https://github.com/fsspec/filesystem_spec/blob/d69899db6b409ec0a628501e959293798e575ba0/fsspec/core.py#L30) as it says for the OpenFile class: _"These instances are safe to serialize, as the low-level file object is not created until invoked using ``with``."_. The low level-file wrapper supplies the `truncate()` method. – twinsten May 31 '23 at 10:18

Meanwhile I figured out how to save the binary representation of an h5py.File, which was described here:

import h5py
import numpy as np
import fsspec
import io

# Create a sample dataset
dataset = np.random.rand(100, 100)

bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f.create_dataset("video", data=dataset)

data = bio.getvalue() # data is a regular Python bytes object.

with fsspec.open("bytes_io.h5", "wb") as file:
    file.write(data)
    
    
# Now try to open the saved binary file
with h5py.File("bytes_io.h5", "r") as h5file:
    # Access the dataset
    dataset = h5file["video"]
twinsten