
I am working a lot with PyTables and HDF5 data, and I have a question regarding the attributes of nodes (the attributes you access via PyTables' `node._v_attrs` property).

Assume that I set such an attribute of an HDF5 node, and that I do it over and over again, setting a particular attribute

(1) always to the same value (so overall the value stored in the HDF5 file does not change qualitatively), or

(2) always to a different value.

How do these operations behave in terms of speed and memory? What I mean is the following: does setting the attribute really imply deleting the attribute in the HDF5 file and adding a new attribute with the same name as before? If so, does that mean that every time I reset an existing attribute the size of the HDF5 file increases slightly and keeps slowly growing until my hard disk is full?
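For concreteness, here is a minimal sketch of the pattern I mean (the file, group, and attribute names are made up for illustration):

```python
import tables

# Open a fresh HDF5 file and create a group node (names are just for illustration)
with tables.open_file('example.h5', mode='w') as h5file:
    node = h5file.create_group('/', 'mygroup')
    for i in range(1000):
        node._v_attrs.my_flag = 42     # case (1): the same value every time
        node._v_attrs.my_counter = i   # case (2): a different value every time
```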

If this is true, would it be more beneficial to check before I reset: in case (1) I should not store at all but only compare the new data to the attribute already written on disk, and reassign only in case (2), i.e. when the attribute value in the HDF5 file is not the one I want to write?

Thanks a lot and best regards, Robert

SmCaterpillar

1 Answer


HDF5 attribute access is notoriously slow. HDF5 is really built for and around the array data structure. Things like groups and attributes are great helpers but they are not optimized.

That said, while attribute reading is slow, attribute writing is even slower. Therefore, it is always worth the extra effort to do what you suggest: check whether the attribute exists and already has the desired value before writing it. This should give you a speed boost compared to just writing it out every time.
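A minimal sketch of such a guarded write might look like the following (the file, node, and attribute names are made up, and the equality check assumes simple scalar values rather than array attributes):

```python
import tables

def set_attr_if_changed(node, name, value):
    """Write the attribute only if it is missing or holds a different value."""
    attrs = node._v_attrs
    if name not in attrs or attrs[name] != value:
        attrs[name] = value  # only this branch actually touches the file

with tables.open_file('example.h5', mode='a') as h5file:
    node = h5file.get_node('/mygroup')
    set_attr_if_changed(node, 'my_flag', 42)
```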

Luckily, the effect of attributes on storage -- both on disk and in memory -- is minimal. This is because ALL attributes on a node must fit into 64 KB of special metadata space. If you try to write more than 64 KB worth of attributes, HDF5 and PyTables will fail.

I hope this helps.

Anthony Scopatz
  • Note that the 64 KB size limit is only the default. There are two ways of storing larger attributes - dense attribute storage and separate datasets. See [the manual](http://www.hdfgroup.org/HDF5/doc/UG/13_Attributes.html) for more information. – Yossarian Sep 09 '13 at 07:24
  • Do you know how I can turn on dense HDF5 attribute storage from within Python and PyTables? – SmCaterpillar Mar 21 '15 at 12:17