
The following code demonstrates a problem in the interaction between PyTables and threading. I'm creating an HDF5 file and reading it with 100 concurrent threads:

import threading
import pandas as pd
from pandas.io.pytables import HDFStore, get_store

filename = 'test.hdf'

with get_store(filename, mode='w') as store:
    store['x'] = pd.DataFrame({'y': range(10000)})

def process(i, filename):
    # print 'start', i
    with get_store(filename, mode='r') as store:
        df = store['x']
    # print 'end', i
    return df['y'].max()

threads = []
for i in range(100):
    t = threading.Thread(target=process, args=(i, filename))
    t.daemon = True
    t.start()
    threads.append(t)
for t in threads:
    t.join()

The program usually executes cleanly, but every now and then I get exceptions like this:

Exception in thread Thread-27:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "crash.py", line 13, in process
    with get_store(filename,mode='r') as store:
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 259, in get_store
    store = HDFStore(path, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 398, in __init__
    self.open(mode=mode, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/pytables.py", line 528, in open
    self._handle = tables.openFile(self._path, self._mode, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 298, in open_file
    for filehandle in _open_files.get_handlers_by_name(filename):
RuntimeError: Set changed size during iteration

or

[...]
  File "/usr/local/lib/python2.7/dist-packages/tables/_past.py", line 35, in oldfunc
    return obj(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tables/file.py", line 299, in open_file
    omode = filehandle.mode
AttributeError: 'File' object has no attribute 'mode'

While reducing the code to this test case I got a variety of different error messages, some of them indicating memory corruption.

Here are my library versions:

>>> pd.__version__
'0.13.1'
>>> tables.__version__
'3.1.0'

I have already had a threading error that occurred when writing files, and I solved it by recompiling hdf5 with the options `--enable-threadsafe --with-pthread`.

Can anyone reproduce the problem? How can it be solved?

Scis
Emanuele Paolini

3 Answers


Anthony has already pointed out that HDF5 (PyTables is essentially a wrapper around the HDF5 C library) is not thread-safe. If you want to access an HDF5 file from a web application, you basically have two options:

  1. Use a dedicated process that handles all the hdf5 I/O. Processes/threads of the web application must communicate with this process through, e.g., Unix Domain Sockets. The downside of this approach — obviously — is that it scales very badly. If one web request is accessing the hdf5 file, all other requests must wait.
  2. Implement a read-write locking mechanism that allows concurrent reading, but uses an exclusive lock for writing. Cf. http://en.wikipedia.org/wiki/Readers-writers_problem.
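The second approach can be sketched in pure Python with `threading.Condition`. The class and method names below are illustrative, not from any particular library; the serialized HDF5 calls would go between the acquire/release pairs:

```python
import threading

class ReadWriteLock:
    """Readers-writer lock: many concurrent readers, one exclusive writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0        # number of threads currently reading
        self._writer = False     # True while a writer holds the lock

    def acquire_read(self):
        with self._cond:
            # Readers wait only while a writer is active.
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                # Last reader out: wake a waiting writer.
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            # Writers wait for exclusive access: no readers, no writer.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

Readers wrap their HDF5 reads in `acquire_read`/`release_read` and writers take the exclusive lock, so reads proceed concurrently while every write is serialized against everything else.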

Note that with a mod_wsgi application — depending on the configuration — you have to deal with threads and processes!

I am also currently struggling with using HDF5 as a database backend for a web application. I think the second approach above provides a decent solution. But still, HDF5 is not a database system. If you want a real array database server with a Python interface, have a look at http://www.scidb.org. It is not nearly as lightweight as an HDF5-based solution, though.

weatherfrog
  • thanks for the suggestions. As I said, I have already recompiled the hdf5 library with the `--enable-threadsafe` option, so I think that hdf5 should be thread-safe. What I notice is that I get problems even if I don't access files but only use `numpy` to manage arrays... maybe the threading problem is there. My actual workaround is to put `threads=1` in the configuration of `mod_wsgi`: this seems to solve all problems. I think this means that I'm going multi-process but single-thread. I will have a look at `scidb` as well... – Emanuele Paolini Aug 08 '14 at 16:45

One bit that has not been mentioned yet: recompile HDF5 to be thread-safe using:

--enable-threadsafe --with-pthread=DIR

https://support.hdfgroup.org/HDF5/faq/threadsafe.html
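For reference, a source build along those lines might look like the following. The version number, install prefix, and pthread directory are placeholders for your own system:

```shell
# Hypothetical thread-safe build of HDF5 from source;
# adjust the tarball version and paths to your environment.
tar xf hdf5-1.8.13.tar.gz
cd hdf5-1.8.13
./configure --prefix=/usr/local \
            --enable-threadsafe \
            --with-pthread=/usr
make
make install
```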

I had some hard-to-find bugs in my keras code, which uses HDF5, and this was what solved it.

tsh

PyTables is not fully thread-safe. Use multiprocessing pools instead.

Anthony Scopatz
  • Can you elaborate on that? I know almost nothing about multiprocessing. My code is run on a web server (apache/mod_wsgi), which is why I need my code to be thread-safe. – Emanuele Paolini Aug 05 '14 at 06:08