15

I've been using Python pandas for the last year and I'm really impressed by its performance and functionality; however, pandas is not a database. I've been thinking lately about ways to integrate the analysis power of pandas with a flat HDF5 file database. Unfortunately, HDF5 is not designed to deal natively with concurrency.

I've been looking around for inspiration in locking systems, distributed task queues, parallel HDF5, flat-file database managers and multiprocessing, but I still don't have a clear idea of where to start.

Ultimately, I would like to have a RESTful API to interact with the HDF5 file to create, retrieve, update and delete data. A possible use case for this could be building a time series store where sensors can write data and analytical services can be implemented on top of it.

Any ideas about possible paths to follow, existing similar projects or about the convenience/inconvenience of the whole idea will be very much appreciated.

PS: I know I could use a SQL/NoSQL database to store the data instead, but I want to use HDF5 because I haven't seen anything faster when it comes to retrieving large volumes of data.

prl900
  • 4,029
  • 4
  • 33
  • 40

3 Answers

13

HDF5 works fine for concurrent read-only access.
For concurrent write access you either have to use parallel HDF5 or have a single worker process that takes care of all writing to an HDF5 store.
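The single-writer approach can be sketched with a queue: producers never touch the store directly, they only enqueue data, and one worker drains the queue, so writes are serialized by construction. The sketch below uses a thread and a plain dict as a stand-in sink to stay self-contained; in a real setup the worker would be a separate process fed by a `multiprocessing.Queue` and would hold an open pandas `HDFStore`, calling something like `store.append(key, frame)` instead of updating the dict.

```python
import queue
import threading

def writer(q, store):
    # The only code path that touches the store: drain the queue until
    # a None sentinel arrives, then shut down.
    while True:
        item = q.get()
        if item is None:
            q.task_done()
            break
        key, data = item
        # Stand-in for e.g. an HDFStore append on (key, data).
        store.setdefault(key, []).append(data)
        q.task_done()

def run_demo(n_producers=4, items_per_producer=10):
    q = queue.Queue()
    store = {}
    w = threading.Thread(target=writer, args=(q, store))
    w.start()

    def produce(pid):
        # Any number of producers may enqueue concurrently.
        for i in range(items_per_producer):
            q.put(("series", (pid, i)))

    producers = [threading.Thread(target=produce, args=(p,))
                 for p in range(n_producers)]
    for t in producers:
        t.start()
    for t in producers:
        t.join()

    q.put(None)   # sentinel: stop the writer
    w.join()
    return store
```

Since only the worker ever opens the file, no file locking is needed and the HDF5 library never sees concurrent writers.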

There are some efforts to combine HDF5 with a RESTful API from the HDF Group itself. See here and here for more details. I am not sure how mature it is.

I recommend using a hybrid approach and exposing it via a RESTful API.
You can store meta-information in a SQL/NoSQL database and keep the raw data (time series data) in one or multiple HDF5 files.
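A minimal sketch of that hybrid layout, using stdlib `sqlite3` for the metadata side (the schema and names here are made up for illustration): the relational catalog only records where each sensor's raw time series lives, while the bulk data itself would be appended to the referenced HDF5 files, e.g. with pandas.

```python
import sqlite3

def create_catalog(conn):
    # Meta-information only: which HDF5 file and key hold each series.
    conn.execute("""CREATE TABLE series (
                        sensor_id TEXT PRIMARY KEY,
                        hdf5_file TEXT NOT NULL,
                        hdf5_key  TEXT NOT NULL,
                        units     TEXT)""")

def register_sensor(conn, sensor_id, hdf5_file, hdf5_key, units=None):
    conn.execute("INSERT INTO series VALUES (?, ?, ?, ?)",
                 (sensor_id, hdf5_file, hdf5_key, units))

def locate(conn, sensor_id):
    # A REST handler would call this to decide which HDF5 file/key
    # to read, then fetch the raw data from there.
    return conn.execute("SELECT hdf5_file, hdf5_key FROM series"
                        " WHERE sensor_id = ?", (sensor_id,)).fetchone()
```

Queries that only need metadata (which sensors exist, what units, date ranges if you store them) never open an HDF5 file at all.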

There is one public REST API to access the data and the user doesn't have to care what happens behind the curtains.
That's also the approach we are taking for storing biological information.

Ümit
  • 17,379
  • 7
  • 55
  • 74
  • Thanks Ümit, it's good to know that the whole idea makes sense and there's other people looking in the same direction. It would be good to know about projects making use of parallel HDF5 in python. – prl900 Mar 20 '14 at 21:22
  • "HDF5 works fine for concurrent read only access." Not with pandas read_hdf. – agemO Jul 27 '18 at 11:59
9

I know the following is not a good answer to the question, but it is perfect for my needs, and I didn't find it implemented somewhere else:

from pandas import HDFStore
import os
import time

class SafeHDFStore(HDFStore):
    def __init__(self, *args, **kwargs):
        probe_interval = kwargs.pop("probe_interval", 1)
        self._lock = "%s.lock" % args[0]
        # Poll until we can create the lock file exclusively:
        # O_CREAT | O_EXCL makes os.open fail atomically if the file
        # already exists, so only one process at a time holds the lock.
        while True:
            try:
                self._flock = os.open(self._lock,
                                      os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                break
            except FileExistsError:
                time.sleep(probe_interval)

        HDFStore.__init__(self, *args, **kwargs)

    def __exit__(self, *args, **kwargs):
        # Close the store first, then release the lock, so the next
        # process in line sees a fully flushed file.
        HDFStore.__exit__(self, *args, **kwargs)
        os.close(self._flock)
        os.remove(self._lock)

I use this as

result = do_long_operations()
with SafeHDFStore('example.hdf') as store:
    # Only put inside this block the code which operates on the store
    store['result'] = result

and different processes/threads working on the same store will simply queue up.

Notice that if you instead naively operate on the store from multiple processes, the last one to close the store will "win", and what the others "think they have written" will be lost.

(I know I could instead just let one process manage all writes, but this solution avoids the overhead of pickling)
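The exclusive-creation trick the class relies on can be shown in isolation with only the stdlib: `os.open` with `O_CREAT | O_EXCL` is atomic, so exactly one caller can create the lock file, and everyone else gets `FileExistsError` until it is removed. (Function names here are just for the sketch.)

```python
import os

def try_acquire(lock_path):
    # Atomically create the lock file; fail if it already exists.
    try:
        return os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None   # somebody else holds the lock

def release(lock_path, fd):
    os.close(fd)
    os.remove(lock_path)   # lets the next waiter's try_acquire succeed
```

This is the non-blocking core; `SafeHDFStore.__init__` simply wraps it in a sleep-and-retry loop.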

EDIT: "probe_interval" can now be tuned (one second is too long if writes are frequent).

Pietro Battiston
  • 7,930
  • 3
  • 42
  • 45
  • I like the solution but it serializes the writes. I am wondering if there is way of managing HDF5 cluster similar to cassandra style cluster replication, where each node replicates other node's new/updated data. – Saikiran Yerram Mar 17 '15 at 14:50
  • The honest answer is "no idea"... but even assuming there was some sort of intelligent and reasonably efficient replication mechanism, I doubt it would be useful for pandas purposes - or better: any conflict solving procedure should know how pandas stores data. – Pietro Battiston Mar 17 '15 at 19:02
  • 1
    Not cluster but there is way to concurrently access/modify HDF5 using MPI. See [here](http://www.hdfgroup.org/HDF5/Tutor/parallel.html). You could imagine using it across network. – Saikiran Yerram Mar 17 '15 at 20:46
  • I know about that... but I expect conflict solving to be enormously simpler problem for concurrent access (in the same "stupid" way that a filesystem has concurrent access, at least) than for replication. – Pietro Battiston Mar 18 '15 at 22:44
  • @DennisGolomazov happy to know this! – Pietro Battiston Oct 22 '16 at 19:48
4

HDF Group has a REST service for HDF5 out now: http://hdfgroup.org/projects/hdfserver/

John Readey
  • 531
  • 3
  • 6