I wrote a program in Python that runs steps similar to, but more complicated than, the following:
STEP 1: Given a BATCH of lists of the same length, where each element of a list is the number of states that position may take, I need to DFS all the possible states (represented by 0, 1, 2, ...) of each list and collect them in one list. E.g. for the input [[1,2,1], [2,2,2]], the output of this step should be [[0,0,0],[0,1,0],[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]].
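STEP 1 may not need an explicit DFS at all: `itertools.product` enumerates the same combinations in the same order, and its loop runs in C, which alone can remove much of the Python-level overhead. A minimal sketch (`enumerate_states` is a hypothetical name for illustration):

```python
from itertools import product

def enumerate_states(batch):
    # For each list of per-position state counts, enumerate every combination.
    # itertools.product iterates in C, avoiding a Python-level recursive DFS.
    out = []
    for counts in batch:
        out.extend(list(combo) for combo in product(*(range(n) for n in counts)))
    return out

# Reproduces the example above: 2 combinations for [1,2,1], 8 for [2,2,2].
print(enumerate_states([[1, 2, 1], [2, 2, 2]]))
```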
STEP 2: Calculate some value related to the output of STEP 1, and return a dict with the form:
{"0,0,0": 0.1,
"0,1,0": 0.2,
"0,0,1": 0.56,
"0,1,1": 0.3242,
"1,0,0": 0.8987,
"1,0,1": 0.214,
"1,1,0": 0.2,
"1,1,1": 0.9}
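One inexpensive change to this dict's format: since every key is a fixed-length state, converting the string keys to tuples once spares STEP 3 from building an "a,b,c" string for every lookup. A sketch with made-up values (the real ones come from STEP 2):

```python
# Made-up sample of the STEP 2 result.
str_dict = {"0,0,0": 0.1, "0,1,0": 0.2, "0,0,1": 0.56}

# Convert once: tuple keys hash quickly, and lookups no longer need to
# join the state digits into a string first.
tuple_dict = {tuple(map(int, k.split(","))): v for k, v in str_dict.items()}

print(tuple_dict[(0, 1, 0)])  # 0.2
```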
STEP 3: In this step I need to process a BATCH of lists. While processing each list, I have to look up (read-only) the dict returned by STEP 2 very frequently, and generate one tuple per list in the batch.
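Because the keys are fixed-length states drawn from known ranges, the dict can also be flattened into a plain list indexed by a mixed-radix encoding, so each STEP 3 lookup becomes a list index instead of a string hash. A sketch under the assumption that the per-position state counts (`radices`) are known and the sample values are made up:

```python
def encode(state, radices):
    # Mixed-radix encoding: maps a state tuple to a unique flat index.
    idx = 0
    for digit, base in zip(state, radices):
        idx = idx * base + digit
    return idx

# Assumed per-position state counts, and a made-up sample of the STEP 2 dict.
radices = [2, 2, 2]
values = {"0,1,0": 0.2, "1,1,1": 0.9}

# Build the flat table once, before the hot STEP 3 loop.
size = 1
for base in radices:
    size *= base
table = [0.0] * size
for key, v in values.items():
    table[encode(tuple(map(int, key.split(","))), radices)] = v

print(table[encode((0, 1, 0), radices)])  # 0.2
```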
I found my program to be really slow in STEP 1 and STEP 3. STEP 2 can only be done in Python for certain reasons. What's more, the lists in a batch are independent of each other; they only share the same dict in STEP 3. So I want to use multithreading to process these lists in parallel.

Since Python has the GIL, the threading module doesn't help.
Then I tried multiprocessing, which was even slower (I guess because of context switching and data transfer).
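For what it's worth, multiprocessing overhead can often be cut down by sending the shared dict to each worker once via an `initializer` (instead of pickling it with every task) and amortizing per-task IPC with a large `chunksize`. A sketch under the assumption that STEP 3 boils down to dict lookups; `_process_one` is a stand-in for the real per-list work:

```python
import multiprocessing as mp

_shared = {}

def _init(value_dict):
    # Runs once per worker process: the dict crosses the process
    # boundary a single time, not once per task.
    _shared["values"] = value_dict

def _process_one(lst):
    # Stand-in for the real STEP 3 work: one read-only lookup per list.
    values = _shared["values"]
    key = ",".join(map(str, lst))
    return (key, values.get(key, 0.0))

def process_batch(batch, value_dict, workers=4):
    with mp.Pool(workers, initializer=_init, initargs=(value_dict,)) as pool:
        # A large chunksize batches many lists into one IPC round trip.
        return pool.map(_process_one, batch,
                        chunksize=max(1, len(batch) // workers))
```

Whether this beats single-process Python still depends on how heavy the real per-list work is relative to the pickling of inputs and results.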
Then I used C++11 to write a .so module with functions that receive and return PyObject pointers. I used POSIX threads, but it always raised a segmentation fault whenever I used more than one thread. I carefully read the Python/C API documentation and found that the GIL is still required when manipulating PyObjects (refer here), so this didn't help at all.
Then I tried Cython, declaring the types of all variables with cdef. It did speed things up, but not by much.
I'm lost in this problem. Can anyone help me? I'd be really grateful.