
I'm using pandas on a web server (Apache + mod_wsgi + Django) and have a hard-to-reproduce bug which, I have now discovered, is caused by pandas not being thread-safe.

After a lot of code reduction I finally arrived at a short standalone program which reproduces the problem; you can see it below.

The point is this: contrary to the answer to this question, the example shows that pandas can crash even with very simple operations that do not modify a DataFrame. I can't imagine how such a simple code snippet could possibly be thread-unsafe...

The question is about using pandas and numpy in a web server. Is it possible? How am I supposed to fix my code that uses pandas? (An example of lock usage would be helpful.)

Here is the code that causes the segmentation fault:

import threading
import pandas as pd
import numpy as np

def let_crash(crash=True):
    t = 0.02 * np.arange(100000)  # OK with 10000
    data = pd.DataFrame({'t': t})
    if crash:
        data['t'] * 1.5  # CRASH
    else:
        data['t'].values * 1.5  # THIS IS OK!

if __name__ == '__main__':
    threads = []
    for i in range(100):
        if True:  # asynchronous
            t = threading.Thread(target=let_crash, args=())
            t.daemon = True
            t.start()
            threads.append(t)
        else:  # synchronous
            let_crash()
    for t in threads:
        t.join()

My environment: Python 2.7.3, numpy 1.8.0, pandas 0.13.1

Emanuele Paolini

2 Answers


See the caveat in the docs here: http://pandas.pydata.org/pandas-docs/dev/gotchas.html#thread-safety

pandas is not thread-safe because its underlying copy mechanism is not. NumPy, I believe, has an atomic copy operation, but pandas has a layer on top of it.

Copying is the basis of pandas operations (most operations generate a new object to return to the user).

It is not trivial to fix this, and a fix would come with a pretty heavy performance cost, so dealing with it properly would take a fair amount of work.

The easiest approach is simply not to share objects across threads, or to lock them on usage.
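
For the "example of lock usage" the question asks about, here is a minimal sketch adapted from the question's reproducer (the lock and the no_crash name are mine, not pandas API): every pandas call is serialized through one shared lock.

import threading

import numpy as np
import pandas as pd

# One process-wide lock; only one thread at a time may run pandas code.
PANDAS_LOCK = threading.Lock()

def no_crash():
    t = 0.02 * np.arange(100000)
    with PANDAS_LOCK:  # serialize DataFrame construction and arithmetic
        data = pd.DataFrame({'t': t})
        data['t'] * 1.5  # no two threads run this concurrently now

threads = [threading.Thread(target=no_crash) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

The obvious cost is that the lock removes all parallelism from the pandas sections, so this only pays off when the pandas work is a small fraction of each request.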

Jeff
    But no objects are being shared, his DataFrames are local to each thread... This looks terribly similar to [this](https://github.com/numpy/numpy/issues/4642) and the solution there was to be more careful about releasing the GIL, see [here](https://github.com/numpy/numpy/pull/4648). Are you sure you are not releasing the GIL somewhere where a call to a Python API function is needed? – Jaime Sep 11 '14 at 11:20
  • nope - these all involve numpy/numexpr (no C or Cython code involved here) so it could be a problem there – Jeff Sep 11 '14 at 12:11
  • Actually, there IS some Cython code involved, in the column access: ``data['t']`` does various kinds of checking / indexing. It should be thread-safe though. – Jeff Sep 11 '14 at 12:23
  • @Jaime when you speak of a "solution", do you refer to my code or to the underlying pandas/numpy code? Do you think I should report this as an issue to the pandas/numpy developers? – Emanuele Paolini Sep 11 '14 at 13:05
  • If Jeff is who I think he is, there now is at least one pandas and one numpy developer aware of this... It may still be worth creating a specific issue in github, see [here](https://github.com/pydata/pandas/issues). What I meant by solution, is that someone was having a similar segfault issue with threading and operations with a specific numpy dtype in numpy 1.8. That was a bug that is now fixed in numpy 1.9. The links in my comment point to the issue and the fix. – Jaime Sep 11 '14 at 14:07
  • @Jeff if the numpy operations involve record arrays with named fields, then it is very likely the same issue I linked above. I can reproduce the crash with 64-bit builds of np 1.8.1 and pd 0.14.1, but not with 32-bit builds of np 1.9 and pd 0.12. That's all I have easy access to. So it may either be a now-solved numpy issue, or a recently introduced pandas issue. Or something else altogether... – Jaime Sep 11 '14 at 14:17
  • @Jaime could be. The internals changed quite substantially in 0.13 (so its possible 0.12 was doing something slightly different). – Jeff Sep 11 '14 at 14:18
  • Works for me on numpy 1.10 (master) / 2.7 / 64-bit, and on 0.14.1 / 2.7 / numpy 1.8. Of course it's a threading issue so not always reproducible :) – Jeff Sep 11 '14 at 14:21
  • It doesn't crash on 2.7.8/numpy 1.8.2/pandas 0.14.1 but does crash on 3.4.1/numpy 1.8.2/pandas 0.14.1 *for me*. It twice gave a traceback: https://gist.github.com/Veedrac/2d85107f6d56b1281998. – Veedrac Sep 11 '14 at 15:16
  • @Veedrac looks like numexpr is crashing, which I don't think is thread-safe (it can itself use multiple threads/processes). – Jeff Sep 11 '14 at 15:18
  • It's thread-safe if you turn the number of threads it uses down to one... I'll try that. **...** Yup. Running `numexpr.set_num_threads(1)` fixes it. – Veedrac Sep 11 '14 at 15:19
  • yes, you can turn it off as well (to see if it's crashing on something else), something like ``from pandas.compute import expressions; expressions.set_use_numexpr(False)`` – Jeff Sep 11 '14 at 15:21
  • I spoke too soon. It only fixes it *most* of the time... I'll test your additional idea. I'll also note I don't have `numexpr` on Python 2. Maybe that's the difference. – Veedrac Sep 11 '14 at 15:22
  • Quick note: it's `pandas.computation`. That does stop any problems from occurring for me. – Veedrac Sep 11 '14 at 15:26
  • So far the problem has gone away by setting pandas not to use numexpr in a multithreaded env. – atejeda Apr 05 '16 at 21:59
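
Gathering the workaround from the comments above into one place, a sketch (module path as corrected in the comments, i.e. pandas 0.13/0.14-era; assumes numexpr is installed):

import numexpr
numexpr.set_num_threads(1)  # shrink numexpr's internal thread pool to one

# Or keep numexpr out of pandas expression evaluation entirely:
from pandas.computation import expressions
expressions.set_use_numexpr(False)

As Veedrac notes above, set_num_threads(1) alone only reduced the crashes for him; disabling numexpr in pandas stopped them entirely.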

Configure mod_wsgi to run in single-threaded mode:

WSGIDaemonProcess mysite processes=5 threads=1
WSGIProcessGroup mysite
WSGIApplicationGroup %{GLOBAL}

This uses mod_wsgi daemon mode so that processes/threads can be set independently of whatever Apache MPM you are using.

Graham Dumpleton
  • I have already tried this solution, but threads=1 makes the server hang when some requests take a long time to be served. – Emanuele Paolini Sep 12 '14 at 19:57
  • The processes provide the concurrency in that case, and that is why you have more than one process. How many processes did you actually specify? What is the average running time for your requests and what is your throughput? You need to know these to be able to properly provision enough capacity. If only specific URLs have this issue with multithreading, then you can vertically partition your application across multiple mod_wsgi daemon process groups and stick the unsafe URLs in single-threaded processes. See the following post: http://blog.dscpl.com.au/2014/02/vertically-partitioning-python-web.html – Graham Dumpleton Sep 12 '14 at 20:24
  • This is interesting. I was convinced that threads=1 was the wrong thing to do... I don't know where to look for the number of processes; I have the default configuration of Apache, but pstree tells me there are plenty (more than 100). The problem is with file uploads... when a client starts a file upload (which may take minutes) the server becomes unresponsive. I thought there was a correlation with the postgresql transaction and tried to use manual commit and things like that, but without success. – Emanuele Paolini Sep 13 '14 at 06:08
  • I thought a possible solution was to use two server processes, with one dedicated to file uploads, but I was afraid that two simultaneous uploads would block each other anyway. Eventually I was able to find the place where the multithreading was broken (the example shown in my question), and rewriting that part using numpy instead of pandas has solved it for now. But of course the problem can arise again in some other part of my code. – Emanuele Paolini Sep 13 '14 at 06:13
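
A hypothetical sketch of the vertical partitioning Graham describes in the comments above (the group names, the /upload URL, and the script path are all made up): the bulk of the site stays multithreaded, while only the thread-unsafe URLs run in a single-threaded daemon group.

# Multithreaded group for most of the site.
WSGIDaemonProcess mysite processes=5 threads=15
# Single-threaded group reserved for the pandas-heavy URLs.
WSGIDaemonProcess mysite-pandas processes=3 threads=1

WSGIScriptAlias / /path/to/mysite/wsgi.py
WSGIProcessGroup mysite
WSGIApplicationGroup %{GLOBAL}

# Route only the unsafe URLs to the single-threaded group.
<Location /upload>
WSGIProcessGroup mysite-pandas
</Location>

This way two uploads can still proceed in parallel, each in its own single-threaded process, so they do not block each other.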