
Calling pd.Series.reindex is not thread-safe (bug report). My question is: why is Series.reindex, which returns a copy and looks like a functionally pure operation, not thread-safe even when no one is writing to that object's data?

The operation I'm performing is:

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # Exception !

... which fails when f(s) is s.reindex(..., copy=True).

So why did the threaded call fail? This surprises me: if reindex involved any non-thread-safe step, such as lazily populating the Series' index, I would have expected that mutating work to have been done already by the successful call in the main thread.

Pandas does have an open issue stating that .copy is not thread-safe. However, the discussion there is about people reading from and writing to the object at the same time.

The maintainers marked the .reindex thread-safety issue as a duplicate of the .copy issue. I doubt that it has the same cause, but if .copy is the source, then I suspect almost all of pandas is not thread-safe in any situation, even for 'functionally pure' operations.

import traceback
import pandas as pd
import numpy as np
from multiprocessing.pool import ThreadPool

def f(arg):
    s,idx = arg
    try:
        # s.loc[idx].values   # No problem
        s.reindex(idx) # Fails
    except Exception:
        traceback.print_exc()
    return None


def gen_args(n=10000):
    a = np.arange(0, 3000000)
    for i in range(n):  # range works on both Python 2 and 3
        if i%1000 == 0:
            # print("?", i)
            s = pd.Series(data=a, index=a)
            f((s,a)) # <<< LOOK. IT WORKS HERE!!!
        yield s, np.arange(0,1000)

# for arg in gen_args():
#     f(arg)   # Works fine

t = ThreadPool(4)
for result in t.imap(f, gen_args(), chunksize=1):
    # print("==>", result)
    pass
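For comparison, here is a minimal sketch of the obvious workaround: serialising every reindex call on the shared Series behind a single lock. This assumes the failure comes from shared internal state that reindex touches, which I have not confirmed, and I have not verified that it removes the exception on the failing pandas version. The names `_reindex_lock`, `locked_reindex` and `run_locked` are hypothetical helpers, not part of pandas.

```python
import threading
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

# Hypothetical helper: one lock shared by every thread that touches the Series.
_reindex_lock = threading.Lock()


def locked_reindex(arg):
    """Reindex while holding the lock, so only one thread runs reindex at a time."""
    s, idx = arg
    with _reindex_lock:
        return s.reindex(idx)


def run_locked(n_tasks=8, size=10000):
    """Run n_tasks reindex calls on one shared Series from a pool of 4 threads."""
    a = np.arange(size)
    s = pd.Series(data=a, index=a)
    idx = np.arange(0, 1000)
    pool = ThreadPool(4)
    try:
        return pool.map(locked_reindex, [(s, idx)] * n_tasks)
    finally:
        pool.close()
        pool.join()
```

The lock removes any parallelism in the reindex step itself, so this is only useful as a diagnostic or a stopgap. It also does not answer the question of why an apparently pure operation needs the lock in the first place.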