
I need help making a decision. I need to transfer some data in my application and have to choose between these three technologies. I've read a bit about each of them (tutorials, documentation), but still can't decide...

How do they compare?

I need metadata support (the capability to receive a file and read it without any additional information/files) and fast read/write operations; the capability to store dynamic data (like Python objects) would be a plus.

Things I already know:

  • NumPy is pretty fast but can't store dynamic data (like Python objects). (What about metadata?)
  • HDF5 is very fast, supports custom attributes, and is easy to use, but can't store Python objects. Also, HDF5 serializes NumPy data natively, so, IMHO, NumPy has no advantage over HDF5.
  • Google Protocol Buffers support self-describing messages too and are pretty fast (but Python support is poor at present: slow and buggy). They CAN store dynamic data. Minuses: self-description doesn't work from Python, and messages that are >= 1 MB are not very fast to serialize/deserialize (read "slow").

PS: the data I need to transfer is the "result of work" of NumPy/SciPy (arrays, arrays of complicated structs, etc.).
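
To give a concrete, made-up example of what I mean by "arrays of complicated structs" (the field names are purely illustrative):

import numpy as np

# one "struct" per element: a timestamp, a 3-vector, and a short label
dt = np.dtype([("timestamp", "f8"), ("position", "f8", (3,)), ("label", "S16")])
records = np.zeros(100, dtype=dt)
records[0] = (0.0, (1.0, 2.0, 3.0), b"start")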

UPD: cross-language access required (C/C++/Python)

illegal-immigrant
  • If you're considering HDF5 at all, use PyTables. http://www.pytables.org/moin It basically lets you build classes to easily and quickly store, recreate, and query metadata and numpy arrays to HDF5. As it's just storing things to HDF5, you should be able to easily access things in C/C++ through the usual libraries. – Joe Kington Nov 08 '10 at 17:43
  • Yes, I know about PyTables; it is easy to use and cross-language, but it doesn't allow me to store Python objects... – illegal-immigrant Nov 08 '10 at 18:05
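
To make the PyTables suggestion from the comment above concrete, here is a minimal sketch of storing a NumPy array plus custom metadata in HDF5 (file, node, and attribute names are made up for illustration):

import numpy as np
import tables

data = np.random.rand(1000, 3)

# write: one array node plus a custom attribute (the metadata)
with tables.open_file("results.h5", mode="w") as f:
    arr = f.create_array(f.root, "results", data)
    arr.attrs.units = "volts"

# read: the file is self-contained, no side information needed
with tables.open_file("results.h5", mode="r") as f:
    loaded = f.root.results.read()
    units = f.root.results.attrs.units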

2 Answers


There does seem to be a slight contradiction in your question - you want to be able to store Python objects, but you also want C/C++ access. I think that regardless of which choice you go with, you will need to convert your fancy Python data structures into more static structures such as arrays.
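
For example, a list of result dicts could be flattened into a structured array roughly like this (the field names here are made up purely for illustration):

import numpy as np

# hypothetical "result of work": a list of dicts from some SciPy routine
results = [{"freq": 1.5, "amp": 0.3}, {"freq": 2.0, "amp": 0.7}]

# flatten into a fixed-layout structured array that HDF5 and C/C++ can handle
dtype = np.dtype([("freq", "f8"), ("amp", "f8")])
table = np.array([(r["freq"], r["amp"]) for r in results], dtype=dtype)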

If you need cross-language access, I would suggest using HDF5 as it is a file format which is specifically designed to be independent of language, operating system, system architecture (e.g. on loading it can convert between big-endian and little-endian automatically) and is specifically aimed at users doing scientific/numerical computing. I don't know much about Google Protocol Buffers, so I can't really comment too much on that.

If you decide to go with HDF5, I would also recommend that you use h5py instead of pytables. This is because pytables creates HDF5 files with a whole lot of extra pythonic metadata which makes reading the data in C/C++ a bit more of a pain, whereas h5py doesn't create any of these extras. You can find a comparison here, and they also give a link to the pytables FAQ for their view on the matter so you can decide which suits your needs best.
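
As a rough sketch of what that looks like with h5py (file, dataset, and attribute names are made up), data and metadata travel together in one self-describing file:

import h5py
import numpy as np

data = np.random.rand(1000, 3)

# write: a plain HDF5 dataset plus custom attributes (the metadata)
with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("results", data=data)
    dset.attrs["units"] = "volts"

# read back without any side information
with h5py.File("results.h5", "r") as f:
    loaded = f["results"][...]
    units = f["results"].attrs["units"]

The same file can then be opened from C/C++ with the standard HDF5 library, with no Python-specific extras to work around.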

Another format which is very similar to HDF5 is NetCDF. This also has Python bindings, however I have no experience in using this format so I cannot really comment beyond pointing out that it exists and is also widely used in scientific computing.
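
Judging purely from the netCDF4 package's documentation (I have not used it myself, and the names below are made up), the attribute/metadata model looks very similar:

from netCDF4 import Dataset
import numpy as np

with Dataset("results.nc", "w") as ds:
    ds.createDimension("row", 100)
    ds.createDimension("col", 3)
    var = ds.createVariable("results", "f8", ("row", "col"))
    var[:] = np.random.rand(100, 3)
    var.units = "volts"                    # variable attribute (metadata)
    ds.description = "NumPy/SciPy output"  # global attribute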

DaveP
  • Thanks for your response. I know that probably none of these three can fully satisfy my needs; that's why I asked the question here, and the choice really is a bit hard. P.S. I read the Google Protocol Buffers documentation yesterday and found one interesting thing there: they aren't designed for transferring large amounts of data (>1 MB), so I think the decision is between NumPy and HDF5... – illegal-immigrant Nov 09 '10 at 08:07
  • 1
    The metadata PyTables adds is very unobtrusive, just some extra attributes for the datasets. You can turn it off by either settings `tables.parameters.PYTABLES_SYS_ATTRS = False` or opening a file with the named argument `PYTABLES_SYS_ATTRS=False`. – AFoglia Nov 09 '10 at 20:48
  • Also, let me add that PyTables is very easy to use, unlike the C/C++ API that h5py mirrors. The benefit of the h5py approach is that it's quicker to learn both APIs if they are similar. – AFoglia Nov 09 '10 at 20:50
  • Thanks for pointing out how to turn off all this extra metadata in PyTables; I guess I stand corrected on that point. However, h5py provides a very high-level, Pythonic interface to the HDF5 objects; it is just as easy as accessing a NumPy array. It is definitely not just a thin wrapper around the C API. – DaveP Nov 09 '10 at 22:29

I don't know about HDF5, but you can store Python objects in NumPy arrays; you just lose all the important functionality, because C-level operations can no longer be performed on the array.

In [16]: import numpy as np
In [17]: x = np.zeros(10, dtype=object)
In [18]: x[3] = {'pants', 10}
In [19]: x
Out[19]: array([0, 0, 0, set([10, 'pants']), 0, 0, 0, 0, 0, 0], dtype=object)
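
If Python-to-Python transfer is ever enough, such an object array can still be saved and loaded, but only by falling back to pickling, which is precisely what rules out C/C++ access. A rough sketch (the file name is made up):

import numpy as np

x = np.zeros(10, dtype=object)
x[3] = {'pants', 10}

np.save("objects.npy", x, allow_pickle=True)   # object dtype forces pickling
y = np.load("objects.npy", allow_pickle=True)
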
Autoplectic