
I am trying to find a way to execute CPU-intensive parallel jobs over a cluster. My objective is to schedule one job per core, so that every job hopefully gets 100% CPU utilization once scheduled. This is what I have come up with so far:

FILE build_sshlogin.sh

#!/bin/bash

serverprefix="compute-0-"
lastserver=15

function worker {
    server="$serverprefix$1"

    # Estimate the number of free cores on the remote node by sampling
    # /proc/stat over a 2-second window.
    free=$(ssh "$server" /bin/bash << 'EOF'
        cores=$(grep -c "cpu MHz" /proc/cpuinfo)
        stat=$(head -n 1 /proc/stat)
        work1=$(echo $stat | awk '{print $2+$3+$4}')
        total1=$(echo $stat | awk '{print $2+$3+$4+$5+$6+$7+$8}')
        sleep 2
        stat=$(head -n 1 /proc/stat)
        work2=$(echo $stat | awk '{print $2+$3+$4}')
        total2=$(echo $stat | awk '{print $2+$3+$4+$5+$6+$7+$8}')

        util=$(echo "($work2 - $work1) / ($total2 - $total1)" | bc -l)
        echo "$cores * (1 - $util)" | bc -l | xargs printf "%1.0f"
EOF
    )

    # Only offer the node to GNU parallel if it has at least one free core.
    if [ "${free:-0}" -gt 0 ]
    then
        echo "$free/$server"
    fi
}

export serverprefix
export -f worker

seq 0 $lastserver | parallel -k worker {}

This script is used by GNU parallel as follows:

parallel --sshloginfile <(./build_sshlogin.sh) --workdir $PWD command args {1} :::  $(seq $runs) 
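For reference, build_sshlogin.sh emits one line per node in GNU parallel's ncores/sshlogin format (nodes reporting no free cores are skipped), so the generated login file looks something like this (hypothetical free-core counts):

4/compute-0-0
2/compute-0-1
8/compute-0-3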

The problem with this technique is that if someone starts another CPU-intensive job on a server in the cluster without checking the CPU usage, the script will end up scheduling jobs to cores that are already in use. In addition, if the CPU usage has changed by the time the first jobs finish, the newly freed cores will not be considered by GNU parallel when it schedules the remaining jobs.

So my question is the following: Is there a way to make GNU parallel re-calculate the free cores/server before it schedules each job? Any other suggestions for solving the problem are welcome.

NOTE: In my cluster all cores have the same frequency. If someone can generalize to account for different frequencies, that's also welcome.

user000001

2 Answers


Look at --load, which is meant for exactly this situation.

Unfortunately it looks at the load average rather than CPU utilization. But if your cluster nodes do not have heavy disk I/O, then CPU utilization will be very close to the load average.

Since the load average changes slowly, you probably also need to use the new --delay option to give the load average time to rise.
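A minimal sketch of how that might look, assuming nodes.txt is a plain --sshloginfile listing of the cluster nodes, and that a 100% load threshold and a 10-second delay are reasonable starting points to tune:

# Only start a job on a node whose load is below 100% of its core
# count, and wait 10 seconds between job starts so the load average
# has time to reflect the jobs already running.
parallel --sshloginfile nodes.txt --load 100% --delay 10 \
    --workdir $PWD command args {1} ::: $(seq $runs)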

Ole Tange

Try mpstat

mpstat
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db)       07/09/2011

10:25:32 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
10:25:32 PM  all    5.68    0.00    0.49    2.03    0.01    0.02    0.00   91.77    146.55

That is an overall snapshot. For a per-core breakdown:

$ mpstat -P ALL
Linux 2.6.32-100.28.5.el6.x86_64 (dev-db)       07/09/2011      _x86_64_        (4 CPU)

10:28:04 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
10:28:04 PM  all    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.99
10:28:04 PM    0    0.01    0.00    0.01    0.01    0.00    0.00    0.00    0.00   99.98
10:28:04 PM    1    0.00    0.00    0.01    0.00    0.00    0.00    0.00    0.00   99.98
10:28:04 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
10:28:04 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00

There are lots of options; these two give the actual %idle per CPU. Check the man page.
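If you want to fold this into the worker function from the question, a rough sketch for one node (assuming the sysstat package that provides mpstat is installed on the nodes; "compute-0-3" is just a placeholder host name):

# Estimate free cores on one node from mpstat's %idle column.
server="compute-0-3"   # placeholder; substitute the real node name
free=$(ssh "$server" /bin/bash << 'EOF'
cores=$(grep -c "^processor" /proc/cpuinfo)
# Sample for 2 seconds; the last field of the "Average:" line is %idle.
idle=$(mpstat 2 1 | awk '/Average:/ {print $NF}')
echo "$cores * $idle / 100" | bc -l | xargs printf "%1.0f"
EOF
)
echo "$free/$server"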

jim mcnamara
  • Thanks for the answer; this may well improve the accuracy of the utilization estimate. Maybe I should edit my post a bit, because the main question is not so much how to estimate the CPU utilization, but how to tell GNU parallel to maintain an UPDATED view of the utilization over time. For example, if after an hour of execution GNU parallel decides to schedule the next job, I would like it to schedule based on the CPU utilization at that time, not on the utilization at the beginning of the execution. – user000001 Dec 27 '12 at 08:19