
I am having an issue with CPU affinity and linear integer programming in MOSEK. My program parallelizes using Python's multiprocessing module, so MOSEK runs concurrently in each process. The machine has 48 cores, so I run 48 concurrent processes using the Pool class. The MOSEK documentation states that the API is thread safe.

Below is the output from top shortly after starting the program. It shows that ~50% of the CPU is idle. Only the first 20 lines of the top output are shown.

top - 22:04:42 up 5 days, 14:38,  3 users,  load average: 10.67, 13.65, 6.29
Tasks: 613 total,  47 running, 566 sleeping,   0 stopped,   0 zombie
%Cpu(s): 46.3 us,  3.8 sy,  0.0 ni, 49.2 id,  0.7 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem:   503.863 total,  101.613 used,  402.250 free,    0.482 buffers
GiB Swap:   61.035 total,    0.000 used,   61.035 free.   96.250 cached Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
115517 njmeyer   20   0  171752  27912  11632 R  98.7  0.0   0:02.52 python
115522 njmeyer   20   0  171088  27472  11632 R  98.7  0.0   0:02.79 python
115547 njmeyer   20   0  171140  27460  11568 R  98.7  0.0   0:01.82 python
115550 njmeyer   20   0  171784  27880  11568 R  98.7  0.0   0:01.64 python
115540 njmeyer   20   0  171136  27456  11568 R  92.5  0.0   0:01.91 python
115551 njmeyer   20   0  371636  31100  11632 R  92.5  0.0   0:02.93 python
115539 njmeyer   20   0  171132  27452  11568 R  80.2  0.0   0:01.97 python
115515 njmeyer   20   0  171748  27908  11632 R  74.0  0.0   0:03.02 python
115538 njmeyer   20   0  171128  27512  11632 R  74.0  0.0   0:02.51 python
115558 njmeyer   20   0  171144  27528  11632 R  74.0  0.0   0:02.28 python
115554 njmeyer   20   0  527980  28728  11632 R  67.8  0.0   0:02.15 python
115524 njmeyer   20   0  527956  28676  11632 R  61.7  0.0   0:02.34 python
115526 njmeyer   20   0  527956  28704  11632 R  61.7  0.0   0:02.80 python

I checked the MOSEK parameters section of the documentation and didn't see anything related to CPU affinity. There are some flags related to multithreading within the optimizer. These flags are off by default, and redundantly setting them to off makes no difference.
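For concreteness, this is the kind of parameter call I mean. I'm using intpnt_multi_thread here only as an example of such a flag; the exact parameter names may differ by MOSEK version.

import mosek

with mosek.Env() as env, env.Task() as task:
    ## redundantly turn off multithreading inside the interior-point optimizer
    task.putintparam(mosek.iparam.intpnt_multi_thread, mosek.onoffkey.off)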

I checked the CPU affinity of the running Python jobs, and many of them are bound to the same CPU. The weird part is that I can't set the CPU affinity, or at least it appears to be changed back soon after I change it.
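For reference, the same check can be scripted by reading /proc/<pid>/status, which reports the same information as taskset -cp <pid>. The PIDs below are just a few of the worker PIDs from the top output above.

## print the allowed-CPU list for a few worker PIDs (same info as taskset -cp)
pids = [115517, 115522, 115526]

for pid in pids:
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Cpus_allowed_list"):
                print pid, line.strip()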

I picked one of the jobs and set its CPU affinity by running taskset -p 0xFFFFFFFFFFFF 115526. I did this 10 times with 1 second between calls. Here is the CPU affinity mask after each taskset call.

pid 115526's current affinity mask: 10
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 7
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 200000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 47
pid 115526's current affinity mask: ffffffffffff
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47
pid 115526's current affinity mask: 800000000000
pid 115526's new affinity mask: ffffffffffff
pid 115526's current affinity list: 0-47

It seems like something is continually changing the CPU affinity while the program is running.
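Decoding one of those masks confirms the job keeps getting pinned back to a single core. For example:

## a taskset mask is a bitmask over CPUs; 0x800000000000 has only bit 47 set
mask = int("800000000000", 16)
print [cpu for cpu in range(48) if mask >> cpu & 1]    ## prints [47]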

I have also tried setting the CPU affinity of the parent process, but it has the same effect.
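A workaround I considered is re-pinning each worker from inside the worker itself, for example with psutil. This is just a sketch to illustrate the idea; psutil is not part of my actual code below.

import os
import psutil

def pin_to_all_cpus():
    ## reset this process's affinity so it may run on any core
    psutil.Process(os.getpid()).cpu_affinity(range(psutil.cpu_count()))

## calling pin_to_all_cpus() at the top of reps() would re-pin each worker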

Here is the code I am running.

import mosek
import sys
import cPickle as pickle
import multiprocessing
import time

def mosekOptim(aCols,aVals,b,c,nCon,nVar,numTrt):
    """Solve the linear integer program.


    Solve the program
    max c' x
    s.t. Ax <= b

    """

    ## setup mosek
    with mosek.Env() as env, env.Task() as task:
        task.appendcons(nCon)
        task.appendvars(nVar)
        inf = float("inf")


        ## c
        for j,cj in enumerate(c):
            task.putcj(j,cj)


        ## bounds on A
        bkc = [mosek.boundkey.fx] + [mosek.boundkey.up
                                     for i in range(nCon-1)]

        blc = [float(numTrt)] + [-inf for i in range(nCon-1)]
        buc = b


        ## bounds on x
        bkx = [mosek.boundkey.ra for i in range(nVar)]
        blx = [0.0]*nVar
        bux = [1.0]*nVar

        for j,a in enumerate(zip(aCols,aVals)):
            task.putarow(j,a[0],a[1])

        for j,bc in enumerate(zip(bkc,blc,buc)):
            task.putconbound(j,bc[0],bc[1],bc[2])

        for j,bx in enumerate(zip(bkx,blx,bux)):
            task.putvarbound(j,bx[0],bx[1],bx[2])

        task.putobjsense(mosek.objsense.maximize)

        ## integer type
        task.putvartypelist(range(nVar),
                            [mosek.variabletype.type_int
                             for i in range(nVar)])

        ## collect the solver log so it can be printed if the solve fails
        mosekMsg = []
        task.set_Stream(mosek.streamtype.msg, mosekMsg.append)

        task.optimize()

        task.solutionsummary(mosek.streamtype.msg)

        prosta = task.getprosta(mosek.soltype.itg)
        solsta = task.getsolsta(mosek.soltype.itg)

        xx = mosek.array.zeros(nVar,float)
        task.getxx(mosek.soltype.itg,xx)

    if solsta not in [mosek.solsta.integer_optimal,
                      mosek.solsta.near_integer_optimal]:
        print "".join(mosekMsg)
        raise ValueError("Non optimal or infeasible.")
    else:
        return xx


def reps(secs,*args):
    start = time.time()
    while time.time() - start < secs:
        for i in range(100):
            mosekOptim(*args)


def main():
    with open("data.txt","r") as f:
        data = pickle.loads(f.read())

    args = (60,) + data

    pool = multiprocessing.Pool()
    jobs = []
    for i in range(multiprocessing.cpu_count()):
        jobs.append(pool.apply_async(reps,args=args))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

The code unpickles data I precomputed. These objects are the constraints and coefficients for the linear program. I have the code and the data file hosted in this GitHub repository: https://github.com/nickjmeyer/cpuAffinityMCVE.

Has anyone else experienced this behavior with MOSEK? Any suggestions for how to proceed?

nick
  • @ali_m I'm running it from the shell. The jobs are currently running and I've just been trying to tweak it as it's running. I found the parent process IDs using `ps -o ppid=103841` and then ran the same `taskset` command on those, but the result is the same as before. The CPU usage actually starts to rise if you watch `top`, but then quickly falls back to ~50%. So it looks like setting the parent CPU affinity works, but only for a couple of seconds. – nick Feb 29 '16 at 18:56
  • @ali_m I added the code snippet that starts the child processes and waits for them to finish. I also included the output of `pstree -s 103841` in the question. Here it is as well `init───screen───bash───python───python───2*[{python}]` – nick Feb 29 '16 at 21:25
  • @ali_m no worries about sounding repetitive. I'm happy to make sure. I posted an image of the htop output showing the tree. When hitting "A" on PID `103785` it shows all cpus selected. When hitting "A" on PID `103841` only 1 cpu is selected. I tried running `taskset -p 0xffffffffffff 103785` and again for `103841`. I checked htop after running both and the affinity for `103841` showed all selected, but after checking back in a few seconds it went down to a single cpu. – nick Feb 29 '16 at 21:46
  • @ali_m If I highlight all of them, then the cpu affinity in htop shows all cpus. But when isolating some of the children they are bound to a single cpu. – nick Feb 29 '16 at 22:08
  • Hmm, I've never come across that sort of behavior before. Maybe it is MKL-specific - it might be interesting to try to reproduce this using a version of numpy that isn't linked against MKL. Could you turn this into an [MCVE](http://stackoverflow.com/help/mcve) (you might be able to replace `wrapper` etc. with some trivial functions in order to illustrate the problem)? In the meantime I will tidy up some of my comments where you've added the details to your question already. – ali_m Feb 29 '16 at 22:16
  • @ali_m I have a MCVE. Though it doesn't appear to be a numpy issue. (I had assumed it was since so many people have had issues with it). It ended up being a [Mosek](http://www.mosek.com) issue. I have the code hosted as a [GitHub repo](https://github.com/nickjmeyer/cpuAffinityMCVE). The Mosek documentation says the [API is thread safe](http://docs.mosek.com/7.0/capi/The_optimizers_for_continuous_problems.html#sec-solve-parallel). Should I post the code here and revise the question? Or do you suggest I make a new post? – nick Mar 01 '16 at 23:12
  • Glad you managed to narrow the issue down. If I were you I would edit this post to reflect the new info and trim out any irrelevant details above. Be sure to update the title as well. I'm not familiar with MOSEK, so I doubt I'll be able to offer any more help here. Good luck! – ali_m Mar 01 '16 at 23:33
  • @ali_m Will do. Thank you for the help! – nick Mar 01 '16 at 23:43

1 Answer


I contacted MOSEK support, and they suggested setting MSK_IPAR_NUM_THREADS to 1. Each of my problems takes a fraction of a second to solve, so it never looked like any one solve was using multiple cores. I should have checked the docs for default values.

In my code, I added `task.putintparam(mosek.iparam.num_threads, 1)` right after the `with` statement. This fixed the problem.
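For context, the start of the task setup now looks roughly like this:

with mosek.Env() as env, env.Task() as task:
    ## force MOSEK to use a single thread in each worker process
    task.putintparam(mosek.iparam.num_threads, 1)

    task.appendcons(nCon)
    task.appendvars(nVar)
    ## ... rest of the model setup unchanged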

nick