I need to run many simulations of the same model with varying parameters (or random number generator seeds). Previously I worked on a server with many cores, where I used Python's multiprocessing library with apply_async. This was very handy, as I could set the maximum number of cores to occupy and the simulations would simply queue up.

As I understand from other questions, multiprocessing works on PBS clusters as long as you stay on a single node, which is fine for now. However, my code doesn't always work.

To give you an idea of the kind of code I run:

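import multiprocessing as mp

import numpy as np
import pandas as pd

import functions_library as L

if __name__ == "__main__":

    N = 100    # number of simulations
    proc = 50  # maximum number of worker processes
    pool = mp.Pool(processes=proc)

    # Draw one reproducible seed per simulation from a master seed.
    seed = 342
    np.random.seed(seed)
    seeds = np.random.randint(low=1, high=100000, size=N)

    resul = []
    for SEED in seeds:
        SEED = int(SEED)  # convert the NumPy integer to a plain int
        # some_args is a placeholder for the actual arguments
        resul.append(pool.apply_async(L.some_function, args=(some_args)))
        print(SEED)

    # Block until every simulation has finished, then collect the results.
    results = [p.get() for p in resul]
    pool.close()
    pool.join()

    database = pd.DataFrame(results)
    database.to_csv("prova.csv")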

The function creates three networkx graphs of size N=10000, performs some computations on them, and then returns a short, simple Python dictionary.

The weird thing, which I cannot debug, is the following error message:

multiprocessing.pool.MaybeEncodingError: Error sending result: ''. Reason: 'RecursionError('maximum recursion depth exceeded while calling a Python object')'

What's strange is that I ran multiple instances of the code on different nodes. Three times the code worked correctly, whereas most of the time it returned the error above. I tried launching different numbers of parallel simulations, from 7 to 20 (the number of cores per node), but there doesn't seem to be a pattern, so I guess it's not a memory issue.

In other questions, similar errors seem to be related to pickling strange or large objects, but here the only thing the function returns is a short dictionary, so that shouldn't be the cause. I also tried increasing the allowed recursion depth with the sys library at the start of the script, but that didn't help, even with values up to 15000.

Any ideas on how to solve, or at least understand, this behavior?

Joel
tidus95
  • I guess we'll need to see sources of "some_function()" – Samuel Oct 27 '19 at 14:37
  • I basically use networkx to create three different random graphs and compute four different centrality measures on them. Some of these might be memory-intensive to compute, but that doesn't explain why it sometimes worked, or why running fewer computations in parallel doesn't help. Tell me if you need further details – tidus95 Oct 27 '19 at 20:10
  • Can you check whether you are using the `find_cliques_recursive()` method? – Samuel Oct 28 '19 at 16:42
  • 1
    Apparently it was related to eigenvector_centrality() not converging. When running outside of multiprocessing it correctly returns a networkx error, whereas inside it onyly this recursion error is returned. Thanks! I wrongly thought it was related to the pbs cluster and debugged it wrongly – tidus95 Oct 29 '19 at 09:51

1 Answer

It was related to eigenvector_centrality() not converging. When run outside multiprocessing, it correctly raises a networkx error, whereas inside a pool only this recursion error is reported.
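
One way to keep the original error message visible is to catch exceptions inside the worker and send back only plain data. The following is a minimal sketch, not the code from the question: `simulate`, `safe_simulate` and `ConvergenceError` are hypothetical names, with `simulate` standing in for `L.some_function` and `ConvergenceError` standing in for whatever the library raises when it fails to converge.

```python
import random


class ConvergenceError(Exception):
    """Hypothetical stand-in for a library error such as a failed convergence."""


def simulate(seed):
    # Hypothetical stand-in for L.some_function: fails for some seeds.
    rng = random.Random(seed)
    if seed % 3 == 0:
        raise ConvergenceError(f"did not converge for seed {seed}")
    return {"seed": seed, "value": rng.random()}


def safe_simulate(seed):
    """Run simulate() and always return a plain, picklable dictionary."""
    try:
        result = simulate(seed)
        result["error"] = None
        return result
    except Exception as exc:
        # Send only strings back to the parent, never the exception object.
        return {"seed": seed, "value": None,
                "error": f"{type(exc).__name__}: {exc}"}
```

Passing `safe_simulate` to `pool.apply_async` in place of the raw function means a failed run shows up as a row with a filled `error` field rather than aborting the whole batch at `p.get()`.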

I don't know whether this is odd behavior specific to this function, or whether multiprocessing sometimes cannot handle errors raised by certain libraries.
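
For context, a pool worker pickles whatever it sends back to the parent, including an exception raised inside the function, and multiprocessing reports a MaybeEncodingError when that pickling fails. The sketch below shows one way an exception can be unpicklable. `SelfReferencingError` and `pickle_failure` are hypothetical names made up for illustration, and the thread does not confirm that this is what happened with networkx here.

```python
import pickle


class SelfReferencingError(Exception):
    """Hypothetical exception that mistakenly stores itself in its own args."""

    def __init__(self, message):
        # Bug for illustration: `self` ends up inside self.args.
        super().__init__(self, message)


def pickle_failure(obj):
    """Return None if obj pickles, otherwise the name of the exception raised."""
    try:
        pickle.dumps(obj)
        return None
    except Exception as exc:
        return type(exc).__name__


if __name__ == "__main__":
    print(pickle_failure({"a": 1}))                    # a short dict pickles fine
    print(pickle_failure(ValueError("plain error")))   # so does an ordinary exception
    print(pickle_failure(SelfReferencingError("x")))   # expected to fail to pickle
```

On CPython the last call is expected to fail, typically with a RecursionError, because pickling the exception's args leads back to the exception itself. Running a check like this on the exception caught outside the pool is a quick way to tell whether the error object, rather than the return value, is what cannot be sent.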

tidus95