
I need to speed up processing time for tasks that involve rather large data sets, loaded from pickle files of up to 1.5 GB containing CSV data. I started with Python's multiprocessing, but for unpickleable class objects I had to switch to pathos. I got some parallel code running and it replicates the results obtained from the usual serial runs: so far so good. But processing speed is far from anything useful: the central main node runs at full throttle while the actual sub-processes (sometimes hundreds in total) run serially rather than in parallel, separated by huge gaps of time during which the main node is again running flat out - doing what? It worked 'best' with ProcessPool, while apipe and amap behaved alike, with no noticeable difference.

Below is my code excerpt, first the parallel attempt, then the serial portion. Both give the same result, but the parallel approach is much slower. Importantly, each parallel sub-process takes approximately the same amount of time as the corresponding iteration of the serial loop. All variables are preloaded much earlier in a long processing pipeline.

#### PARALLEL multiprocessing
if me.MP:

    import pathos
    cpuN     = pathos.multiprocessing.cpu_count() - 1
    pool     = pathos.pools.ProcessPool( cpuN)  # ThreadPool ParallelPool

    argsIn1  = []  # a mid-large complex dictionary
    argsIn2  = []  # a very large complex dictionary (CSV pickle of 400MB or more)
    argsIn3  = []  # a list of strings
    argsIn4  = []  # a brief string
    for Uscr in UID_lst:
        argsIn1.append( me)
        argsIn2.append( Db)
        argsIn3.append( UID_lst)
        argsIn4.append( Uscr)

    result_pool  = pool.amap( functionX, argsIn1, argsIn2, argsIn3, argsIn4)

    results      = result_pool.get()

    for result in results:
        [ meTU, U]  = result
        me.v[T][U]  = meTU[U]   # insert result !

#### SERIAL processing
else:

    for Uscr in UID_lst:

        meTU, U     = functionX( me, Db, UID_lst, Uscr)
        me.v[T][U]  = meTU[U]   # insert result !


I tested this code on two Linux machines: an i3 CPU (with 32GB RAM, Slackware 14.2, Python 3.7) and a 2*Xeon box (with 64GB RAM, Slackware current, Python 3.8). pathos 0.2.6 was installed with pip3. As said, both machines show the same speed problem with the code shown here.

What am I missing here?

ADDENDUM: it appears that only the very first PID does the entire job, working through all items in UID_lst, while the other 10 sub-processes sit idle, waiting for nothing, as seen with top and with os.getpid(). cpuN is 11 in this example.
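For reference, this is roughly how the PIDs were checked from inside the workers (the wrapper name is just illustrative; functionX is the original worker function):

    import os

    def functionX_traced(me, Db, UID_lst, Uscr):
        # purely diagnostic: report which OS process handles this item
        print(f"PID {os.getpid()} processing {Uscr}", flush=True)
        return functionX(me, Db, UID_lst, Uscr)

    # passed to pool.amap(...) in place of functionX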

ADDENDUM 2: sorry about this new revision, but running this code under a different load (many more jobs to solve) eventually did involve more than just one busy sub-process - though only after a long time! Here is a top output:

top - 14:09:28 up 19 days,  4:04,  3 users,  load average: 6.75, 6.20, 5.08
Tasks: 243 total,   6 running, 236 sleeping,   0 stopped,   1 zombie
%Cpu(s): 48.8 us,  1.2 sy,  0.0 ni, 49.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64061.6 total,   2873.6 free,  33490.9 used,  27697.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  29752.0 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 5441 userx     20   0 6597672   4.9g  63372 S 100.3   7.9  40:29.08 python3 do_Db_job
 5475 userx     20   0 6252176   4.7g   8828 R 100.0   7.5   9:24.46 python3 do_Db_job
 5473 userx     20   0 6260616   4.7g   8828 R 100.0   7.6  17:02.44 python3 do_Db_job
 5476 userx     20   0 6252432   4.7g   8828 R 100.0   7.5   5:37.52 python3 do_Db_job
 5477 userx     20   0 6252432   4.7g   8812 R 100.0   7.5   1:48.18 python3 do_Db_job
 5474 userx     20   0 6253008   4.7g   8828 R  99.7   7.5  13:13.38 python3 do_Db_job
 1353 userx     20   0    9412   4128   3376 S   0.0   0.0   0:59.63 sshd: userx@pts/0
 1354 userx     20   0    7960   4692   3360 S   0.0   0.0   0:00.20 -bash
 1369 userx     20   0    9780   4212   3040 S   0.0   0.0  31:16.80 sshd: userx@pts/1
 1370 userx     20   0    7940   4632   3324 S   0.0   0.0   0:00.16 -bash
 4545 userx     20   0    5016   3364   2296 R   0.0   0.0   3:01.76 top
 5437 userx     20   0   19920  13280   6544 S   0.0   0.0   0:00.07 python3
 5467 userx     20   0       0      0      0 Z   0.0   0.0   0:00.00 [git] <defunct>
 5468 userx     20   0 3911460   2.5g   9148 S   0.0   4.0  17:48.90 python3 do_Db_job
 5469 userx     20   0 3904568   2.5g   9148 S   0.0   4.0  16:13.19 python3 do_Db_job
 5470 userx     20   0 3905408   2.5g   9148 S   0.0   4.0  16:34.32 python3 do_Db_job
 5471 userx     20   0 3865764   2.4g   9148 S   0.0   3.9  18:35.39 python3 do_Db_job
 5472 userx     20   0 3872140   2.5g   9148 S   0.0   3.9  20:43.44 python3 do_Db_job
 5478 userx     20   0 3844492   2.4g   4252 S   0.0   3.9   0:00.00 python3 do_Db_job
27052 userx     20   0    9412   3784   3052 S   0.0   0.0   0:00.02 sshd: userx@pts/2
27054 userx     20   0    7932   4520   3224 S   0.0   0.0   0:00.01 -bash

It appears to me that at most 6 sub-processes run at any given time, which may correspond to psutil.cpu_count(logical=False) = 6 rather than pathos.multiprocessing.cpu_count() = 12...?
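For completeness, the two counts can be compared directly (psutil is assumed to be available; the pathos count wraps the standard multiprocessing count, which includes hyper-threads):

    import psutil
    import pathos

    print(pathos.multiprocessing.cpu_count())    # 12 on the 2*Xeon box (logical cores)
    print(psutil.cpu_count(logical=False))       # 6 (physical cores)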

  • this doesn't answer your question, but I would strongly recommend you use consistent naming for variables; usually `my_variable` for most everything, and `CamelCase` for classes. Your variables look like JavaScript names and mix underscores with camel case; a big no-no – 123 Oct 07 '20 at 19:06

1 Answer


Actually, the problem got solved - and it turns out it never was one in the first place at this stage of the code's development. The problem was elsewhere: the variables the worker processes are supplied with are very large, many gigabytes at times. This scenario keeps the main/central node busy forever dilling/undilling those variables (dill being pathos' counterpart to pickle), even on the new dual-Xeon machine, not to speak of the old i3 box. On the former I saw up to 6 or 7 workers active (out of 11), while the latter never got beyond 1 active worker; and even on the former machine it took a huge amount of time, dozens of minutes, before a few workers accumulated in top.

So I will need to adjust the code so that each worker re-reads the huge variables from disk/network itself. That also takes some time, but it makes sense to free the central node from this silly, repeated task and instead give it a chance to do the job it was designed for: scheduling and organizing the workers' show.
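As an illustration of the direction I am heading in (the db_path argument, the load_db helper and the per-process cache are placeholders, not code from the actual pipeline): pass only small arguments through the pool and let each worker unpickle the big dictionary itself, at most once per process.

    import pickle

    _Db_cache = {}                              # per-process cache, filled on first use

    def load_db(db_path):
        # each worker unpickles the large CSV data itself (once per process)
        # instead of receiving it dilled from the main node
        if db_path not in _Db_cache:
            with open(db_path, 'rb') as fh:
                _Db_cache[db_path] = pickle.load(fh)
        return _Db_cache[db_path]

    def functionX_standalone(me, db_path, UID_lst, Uscr):
        Db = load_db(db_path)                   # loaded locally in the worker process
        return functionX(me, Db, UID_lst, Uscr)

    # then, as before:
    # result_pool = pool.amap(functionX_standalone,
    #                         argsIn1, argsIn_paths, argsIn3, argsIn4)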

I am also happy to say that the resulting output, a CSV file (wc: 36722 1133356 90870757), is identical between the parallel run and the traditional serial version.

Having said this, I am truly surprised by how convenient it is to use python/pathos - and by not having to change the relevant worker code between the serial and parallel runs!

  • I'm glad you got it sorted. – Mike McKerns Oct 08 '20 at 16:12
  • pathos with ProcessPool and amap seems to work very nicely, thank you, Mike! But here's one more question: I compute hundreds or thousands of jobs through a fixed set of PIDs (CPU cores minus 1) at full speed, but at some point toward the end of the run an ever-increasing number of PIDs turns idle. Is there a way to keep the pool busy at full throttle until all jobs are done? It seems to me that the PIDs use pre-allocated sub-pools of jobs, and once its sub-pool is finished, a given PID turns idle. If the job loads are batched unequally, we may not max out our CPU bandwidth right up to the very end, right? – pisti Oct 15 '20 at 03:57
  • if you are running a lot of jobs with a long-lived pool, you might find that things slow down. In that case, you might try to clear the pool (with `clear`), which effectively resets the pool to a new instance. – Mike McKerns Oct 15 '20 at 12:15
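A minimal sketch of that suggestion, as I read it (the split into job_batches is hypothetical; `clear` is the call Mike mentions above):

    import pathos

    pool = pathos.pools.ProcessPool(cpuN)
    for batch_args in job_batches:              # job_batches: hypothetical chunks of the work
        results = pool.amap(functionX, *batch_args).get()
        # ... insert results into me.v ...
        pool.clear()                            # reset the cached pool to a fresh instance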