
I've implemented GNU parallel in one of our major scripts to perform data migrations between servers. Presently, the output is presented all at once (-u) in pretty colors, with periodic echoes of status from the function being executed, depending on which sequence is being run (e.g. 5/20: $username: rsyncing homedir or 5/20: $username: restoring account). These are all echoed directly to the terminal running the script, where they accumulate. Depending on how long a command runs, however, output can end up well out of order, and long-running rsync commands can be lost in the shuffle. But I don't want to wait for long-running processes to finish just to see the output of the following processes.

In short, my issue is keeping track of which arguments are being processed and are still running.

What I would like to do is send parallel into the background with (parallel args command {#} {} ::: $userlist) & and then track the progress of each of the running functions. My initial thought was to use ps and grep liberally, along with tput to rewrite the screen every few seconds. I usually run three jobs in parallel, so I want a screen that shows, for instance:

1/20: user1: syncing homedir
current file: /home/user1/www/cache/file12589015.php

12/20: user12: syncing homedir
current file: /home/user12/mail/joe/mailfile

5/20: user5: collecting information
current file: 

I can certainly put the above status output together, no problem; my current hangup is separating the output from the individual parallel processes into three different... pipes? variables? files? so that it can be parsed into the display above.
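One direction I'm considering is a logfile per job, keyed by parallel's job sequence number {#}. A rough sketch (migrate_account here is a stand-in for our real function):

# rough sketch: one logfile per job so each job's output can be
# parsed separately; migrate_account is a stand-in for the real function
temp=$(mktemp -d)
parallel -j3 "migrate_account {} > $temp/job{#}.log 2>&1" ::: $userlist &
# ...then poll $temp/job*.log every few seconds to build the status screen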

Andrej
  • Are you aware of --linebuffer --tag and of --tmux? – Ole Tange Aug 16 '16 at 17:00
  • --tag could be useful if someone has an idea about piping the output so that it can be analyzed periodically (although our current output shows all the arguments already neatly formatted), but --line-buffer and --tmux are not helpful in this case. I'm not particularly worried about lines from the output overlapping (this is rare), and scraping information from tmux rather than from the running process seems like an extra step (plus our machines don't have tmux installed by default) – Andrej Aug 17 '16 at 12:32
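A minimal illustration of what the two options from the comment do, with a dummy command standing in for the real migration function:

# --tag prefixes each output line with the argument that produced it;
# --line-buffer passes lines through as they are produced, instead of
# holding a job's output until the job exits
parallel --tag --line-buffer -j3 'echo {} starting; sleep 1; echo {} done' ::: user1 user5 user12
# output lines look like:  user1<TAB>user1 starting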

2 Answers


Not sure if this is much better:

echo hello im starting now
sleep 1
# make a scratch dir to hold one logfile per job
temp=$(mktemp -d)
# start parallel and send the whole run to the background; {log} is a
# user-defined replacement string that builds a logfile name from the
# job's arguments, and each job removes its own logfile when it finishes
parallel --rpl '{log} $_="Working on@arg"' -j3 background {} {#} ">$temp/{1log} 2>&1;rm $temp/{1log}" ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
# while parallel is still alive, redraw the last line of every live logfile
while kill -0 $! 2>/dev/null ; do
    cd "$temp"
    clear
    tail -vn1 *
    sleep 1
done
rm -rf "$temp"

It makes a logfile for each job, tails all the logfiles every second, and removes a job's logfile when that job is done.

The logfiles are named 'Working on ...'.
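Note that this assumes a background function is already defined and exported to the shells parallel spawns (for instance, the dummy load from the answer below):

# assumed helper: a dummy load standing in for the real job,
# exported so the subshells parallel spawns can call it
background() {
    echo "$3: starting sleep..."
    sleep "$2"
    echo "$3: $1 slept for $2"
}
export -f background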

Ole Tange
  • Tailing a logfile per process is a good idea, since it can be easily deleted immediately afterward. I think I will end up combining everything, using '>$temp/synclog.{}.log' as part of the background() command in parallel instead of the perl replacement, and printing that status line with the lsof and ps parts. – Andrej Aug 18 '16 at 15:18

I believe that this is close to what I need, though it isn't very tidy and probably isn't optimal:

#!/bin/bash

background() { # dummy load. $1 is text, $2 is a sleep duration, $3 is the job's sequence number
        echo "$3: starting sleep..."
        sleep "$2"
        echo "$3: $1 slept for $2"
}

progress() {
        echo "starting progress loop for pid $1..."
        while [ -d "/proc/$1" ]; do
                clear
                tput cup 0 0
                # find the running background() jobs, ignoring parallel itself and our own grep
                runningprocs=$(ps faux | grep background | grep -Ev '(parallel|grep)')
                numprocs=$(echo "$runningprocs" | wc -l)
                for each in $(seq 1 "$numprocs"); do
                        line=$(echo "$runningprocs" | sed -n "${each}p")
                        # the command line ends in "... text number seq >> logfile.log",
                        # so the sequence number is the 3rd field from the end
                        seq=$(echo "$line" | rev | awk '{print $3}' | rev)
                        # print select elements from the ps output (text, number, seq)
                        echo "working on $(echo "$line" | rev | awk '{print $3, $4, $5}' | rev)"
                        # print the last line of the log for that sequence number
                        grep "^$seq:" logfile.log | tail -n1
                        echo
                done
                sleep 1
        done
}

echo hello im starting now
sleep 1
export -f background
# start parallel and send the job to the background
parallel -u -j3 background {} {#} '>>' logfile.log ::: foo bar baz foo bar baz one two three one two three :::+ 5 6 5 3 4 6 7 2 5 4 6 2 &
pid=$!
progress "$pid"
echo finished!

I'd rather not depend on scraping all the information from ps, and would prefer to get the actual line output of each parallel process, but a guy's gotta do what a guy's gotta do. The regular output is sent to a logfile for parsing later on.
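Combining this with the accepted approach (as mentioned in my comment above), a sketch of what I'll probably end up with: one logfile per argument written by parallel, tailed in the progress loop instead of scraping ps. background is the exported dummy from the script above; the real migration function would take its place:

# sketch: a logfile per argument instead of the perl replacement string;
# tail the live logs rather than scraping ps
temp=$(mktemp -d)
parallel -j3 background {} {#} ">$temp/synclog.{1}.log 2>&1; rm $temp/synclog.{1}.log" ::: foo bar baz :::+ 5 6 5 &
while kill -0 $! 2>/dev/null; do
        clear
        tail -vn1 "$temp"/synclog.*.log 2>/dev/null
        sleep 1
done
rm -rf "$temp"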

Andrej