Calling pd.Series.reindex is not thread safe (bug report).
My question is: why is Series.reindex (which returns a copy and seems like a functionally pure operation) not thread safe, even when no one is writing to that object's data?
The operation I'm performing is:
s = pd.Series(...)
f(s) # Success!
# Thread 1:
while True: f(s)
# Thread 2:
while True: f(s) # Exception !
... which fails for f(s): s.reindex(..., copy=True).
So, why did the threaded call fail? I'm surprised at this, because if there were any non-thread-safe steps, such as populating the Series' index, I would have thought they would already have done their mutating work in the main thread.
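One general way a call that looks read-only can still race is lazy initialization: the object builds hidden state on first use, and that first use may happen inside a worker thread rather than in the main thread. The sketch below is not pandas code. LazyLookup, position and _table are hypothetical names made up for illustration, and whether Series.reindex fails for this reason is an assumption, not something established here.

```python
class LazyLookup:
    """Hypothetical container, for illustration only (not pandas internals)."""

    def __init__(self, keys):
        self.keys = list(keys)
        self._table = None  # hidden cache: built on first lookup, not here

    def position(self, key):
        # Looks like a pure read, but the first call mutates the object.
        if self._table is None:
            self._table = {}  # published while still empty
            for i, k in enumerate(self.keys):
                # A second thread calling position() during this loop sees a
                # non-None but partially filled table and can raise KeyError.
                self._table[k] = i
        return self._table[key]


lookup = LazyLookup(range(5))
built_before = lookup._table is not None  # nothing has been built yet
pos = lookup.position(3)                  # first "read" builds the cache
built_after = lookup._table is not None
```

If an earlier call in the main thread happens to take a shortcut that never touches the cache, the first real build would still occur concurrently in the workers, which would be consistent with a warm-up call succeeding and the threaded calls failing.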
Pandas does have an open issue that .copy is not threadsafe. However, the discussion there is around issues of people reading and writing to the object at the same time.
The maintainers marked the .reindex not-thread-safe issue as a duplicate of the .copy issue. I'm skeptical that it has the same cause, but if .copy is the source, then I suspect almost all of pandas is not thread safe in any situation, even for 'functionally pure' operations.
import traceback
import pandas as pd
import numpy as np
from multiprocessing.pool import ThreadPool

def f(arg):
    s, idx = arg
    try:
        # s.loc[idx].values # No problem
        s.reindex(idx) # Fails
    except Exception:
        traceback.print_exc()
    return None

def gen_args(n=10000):
    a = np.arange(0, 3000000)
    for i in range(n):
        # Build a fresh Series every 1000 iterations
        if i % 1000 == 0:
            # print "?",i
            s = pd.Series(data=a, index=a)
            f((s, a)) # <<< LOOK. IT WORKS HERE!!!
        yield s, np.arange(0, 1000)

# for arg in gen_args():
#     f(arg) # Works fine

t = ThreadPool(4)
for result in t.imap(f, gen_args(), chunksize=1):
    # print "==>", result
    pass
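As a possible stopgap while the cause is unclear, the reindex calls on the shared Series can be serialized with a lock. This is a minimal sketch, assuming the failure only occurs when several threads call into the same Series at once. reindex_lock and f_locked are hypothetical names, and the array sizes are reduced so that it runs quickly.

```python
import threading
from multiprocessing.pool import ThreadPool

import numpy as np
import pandas as pd

# Hypothetical name: a single lock shared by every worker thread.
reindex_lock = threading.Lock()


def f_locked(arg):
    s, idx = arg
    # Only one thread at a time may call reindex on the shared Series.
    with reindex_lock:
        return s.reindex(idx)


a = np.arange(0, 1000)
s = pd.Series(data=a, index=a)
idx = np.arange(0, 10)

pool = ThreadPool(4)
try:
    # Eight tasks, all sharing the same Series object.
    results = pool.map(f_locked, [(s, idx)] * 8)
finally:
    pool.close()
    pool.join()
```

The lock removes any parallelism from the reindex step itself, so this trades throughput for safety. It also does not explain why the unguarded version fails.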