7

I have a bash script to upload data to a site. I was getting slow upload speeds, so I started running it in parallel, 5 at the same time, using xargs with -n 1 -P 5.

However, the problem is that the server asks me to solve a captcha if I run it 5 at a time, whereas it works fine with 1 at a time.

I figure that because all the processes start at exactly the same time, I'm getting flagged.

Anyway, here's the question: is there any way for me to add a wait (say 1 second) between starting processes in xargs/GNU parallel?

The only thing I could come up with is using `pgrep script | wc -l` to count the running script instances and sleeping for that number of seconds.

However, this is really not optimal. Are there any better ways of doing this?
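
For reference, the workaround described above would look roughly like this inside the upload script (a sketch; `upload.sh` is just a stand-in name for the asker's script):

    # Count how many instances of this script are already running and sleep that
    # many seconds before uploading, so simultaneous starts get staggered.
    sleep "$(pgrep -f upload.sh | wc -l)"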


5 Answers

6

If the upload takes a random amount of time, you just need the first 5 jobs to start with a 1-5 second delay:

cat list | parallel -j5 [ {#} -lt 6 ] \&\& sleep {#}\; upload {}
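
For what it's worth, `{#}` is GNU parallel's job sequence number, so jobs 1 through 5 each sleep for their own number of seconds before uploading, and later jobs run with no extra delay. The same command reads a little easier with the command string quoted instead of backslash-escaped (assuming, as above, that `list` holds the arguments and `upload` is the upload script):

    # Jobs 1-5 sleep 1-5 seconds before starting; jobs 6 and up start immediately.
    cat list | parallel -j5 '[ {#} -lt 6 ] && sleep {#}; upload {}'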
Ole Tange
  • Changed the beginning bit to find, and this is actually working! I don't even really understand it, but thanks a lot :) – Amir Mar 12 '12 at 17:24
2

Rather than using xargs, I think you just want a loop, as in

for i in {1..5}; do sleep 5; your-command & done

This forks off the commands every 5 seconds. For an increasing delay (if that's needed):

for i in {1..5}; do ((w=i*5)); sleep $w; your-command & done

Another alternative:

files="a.txt b.txt c.txt"
for i in $files; do upload-command "$i" & sleep 5; done
Jim Garrison
  • I'm using xargs with find like this: find . -type f -name "*.txt" -print0 | xargs -0 -n 1 -P 5 /path/to/script/ , and there are a lot more than 5 files that need to be processed. Don't think that'll work, sorry for not being clear enough >.< – Amir Mar 10 '12 at 08:03
  • @lelouch So /path/to/script is a script you have written. Why not re-write it to take 5 arguments and use -n 5 -P 1 as argument to xargs. – William Pursell Mar 10 '12 at 13:39
  • That didn't occur to me, but yeah that seems the only way to do this. Thanks :) – Amir Mar 10 '12 at 18:15
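
A rough sketch of what that rewrite could look like (hypothetical name `upload.sh`; the `curl` call is only a placeholder for whatever the real script does per file):

    #!/bin/bash
    # upload.sh - accept several files per invocation and pace the uploads itself
    for f in "$@"; do
        curl -F "file=@$f" https://example.com/upload   # placeholder for the real upload command
        sleep 1                                          # pause between uploads to avoid tripping the captcha
    done

It would then be driven with something like find . -type f -name "*.txt" -print0 | xargs -0 -n 5 -P 1 /path/to/upload.sh, so xargs hands each invocation 5 files and runs the invocations one after another.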
2

This might work for you (uses GNU parallel):

 find . -type f -name "*.txt" -print | parallel 'script {} & sleep 1'

Here's a terminal session showing an example run:

$ for x in {a..c};do for y in {1..3};do echo $x >>$x;done;done
$ ls
a  b  c
$ cat a
a
a
a
$ cat /tmp/job
#!/bin/bash
sed -i -e '1e date' -e 's/./\U&/' $1
sleep 5
sed -i '${p;s,.*,date,e}' $1
$ find . -type f -name "?" -print | parallel '/tmp/job {} & sleep 1'
$ cat ?
Sat Mar 10 20:25:10 GMT-1 2012
A
A
A
Sat Mar 10 20:25:15 GMT-1 2012
Sat Mar 10 20:25:09 GMT-1 2012
B
B
B
Sat Mar 10 20:25:14 GMT-1 2012
Sat Mar 10 20:25:08 GMT-1 2012
C
C
C
Sat Mar 10 20:25:13 GMT-1 2012

As you can see, each job is started a second apart: file c starts at 08 and finishes at 13, file b runs from 09 to 14, and file a from 10 to 15.

potong
  • Finally got that to work, needed a -q switch in parallel. However, this causes all the files to be processed at once since they go to the background. I tried 'sleep ; script {}', and that didn't work either. I think I'll need to do as Jim said... – Amir Mar 10 '12 at 18:10
  • I've included an example. The crux of the matter is the backgrounding of `/tmp/job {} & sleep 1` followed by the sleep for a second. N.B. this is GNU parallel, not moreutils. – potong Mar 10 '12 at 20:34
  • Yeah it works, but what I mean is, is there a way to limit the number of processes? -j5 doesn't work anymore; this will keep going until it has processed all the hundreds of files. – Amir Mar 11 '12 at 03:31
1

GNU parallel has a --delay option which can be used for this purpose. It prevents jobs from being started all at the same time and guarantees a minimum delay between starts. Using

cat list | parallel -j5 --delay 5s upload {}

will ensure that successive uploads are started at least 5 seconds apart.
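
Applied to the find pipeline from the question (a sketch; /path/to/script is the asker's upload script, and the 1-second spacing is what the question asked for):

    # -0 reads the NUL-separated file names, -j5 keeps 5 uploads running,
    # and --delay 1 makes sure consecutive jobs start at least a second apart.
    find . -type f -name "*.txt" -print0 | parallel -0 -j5 --delay 1 /path/to/script {}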

0

You can pause your script execution after every process using

read -p "Press [Enter] key to continue.."

Now you can decide, at your own will, when to start the next process.

I agree this involves manual intervention. But as there are only 5 processes to be started in this particular case, it should work out fine.

EDIT: As read stops your automation, you can use

sleep 5 

which will sleep for 5 seconds.
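
As Amir points out in the comments below, if the sleep simply goes at the top of the script, all 5 xargs workers will sleep together and then hit the server together. One way around that (a sketch, not part of the original answer) is to randomize the delay per process:

    # At the top of the upload script: wait a random 0-4 seconds before doing anything,
    # so the parallel instances don't all contact the server at the same instant.
    sleep $((RANDOM % 5))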

Pavan Manjunath
  • Unfortunately that won't work for me >.< I can do what I want to do from the browser without a problem, but I'm using bash & curl to automate it all. – Amir Mar 10 '12 at 04:36
  • But in this case since xargs starts all the processes almost instantly, won't they all just sleep for 5 seconds then start at once, causing the same problem? – Amir Mar 10 '12 at 04:56