Let's say I have a file with lots of URLs and I want to download them in parallel using an arbitrary number of processes. How can I do it with bash?
5 Answers
10
Have a look at `man xargs`:
-P max-procs, --max-procs=max-procs
Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
Solution:
xargs -P 20 -n 1 wget -nv < urls.txt
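A variant of the same idea uses xargs' -I placeholder so each input line is substituted into the command explicitly (a minimal sketch, assuming GNU xargs; urls.txt is just a placeholder filename):
xargs -P 20 -I {} wget -nv {} < urls.txt
With -I, xargs runs one command per input line, so -n 1 is no longer needed.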
-
Oh, that's very slick. I did not know about -P – Richard June Mar 16 '11 at 15:20
-
In case the original link vanishes, the recommended command (with useless use of cat removed) is: `xargs -P 20 -n 1 wget -nv < urls.txt` – Gordon Davisson Mar 16 '11 at 15:26
1
If you just want to grab each URL (regardless of number), then the answer is easy:
#!/bin/bash
URL_LIST="http://url1/ http://url2/"
for url in $URL_LIST ; do
    wget "${url}" >/dev/null &
done
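If the script should also block until every background download has exited, a single wait after the loop does that (the same sketch with the wait added; untested):
#!/bin/bash
URL_LIST="http://url1/ http://url2/"
for url in $URL_LIST ; do
    wget "${url}" >/dev/null &
done
wait   # returns once all background wget jobs have finished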
If you want to create only a limited number of pulls, say 10, then you would do something like this:
#!/bin/bash
URL_LIST="http://url1/ http://url2/"

function download() {
    # $1 = lock slot number, $2 = URL to fetch
    touch /tmp/dl-${1}.lck
    wget "${2}" >/dev/null
    rm -f /tmp/dl-${1}.lck
}

for url in $URL_LIST ; do
    while true ; do
        iter=0
        while [ $iter -lt 10 ] ; do
            if [ ! -f /tmp/dl-${iter}.lck ] ; then
                download ${iter} "${url}" &
                break 2
            fi
            let iter++
        done
        sleep 10s
    done
done
wait   # let the last downloads finish before the script exits
Do note I haven't actually tested it; I just banged it out in 15 minutes, but you should get the general idea.
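An alternative that skips the lock files entirely is bash's own job control; a rough, untested sketch assuming bash 4.3+ (for wait -n) and a placeholder urls.txt with one URL per line:
#!/bin/bash
MAX=10
while read -r url ; do
    # while MAX downloads are already running, wait for any one of them to exit
    while [ "$(jobs -rp | wc -l)" -ge "$MAX" ] ; do
        wait -n
    done
    wget -nv "$url" >/dev/null &
done < urls.txt
wait   # let the remaining downloads finish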

Richard June
1
You could use something like puf, which is designed for that sort of thing, or you could use wget/curl/lynx in combination with GNU parallel.
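For the GNU parallel route, a minimal sketch (assuming urls.txt holds one URL per line) would be:
parallel -j 10 wget -nv {} < urls.txt
where -j 10 caps the number of simultaneous jobs and {} is replaced with each input line.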

Cakemox
0
puf (http://puf.sourceforge.net/) does this "for a living" and has a nice running status of the complete process.

olemd
0
I do stuff like this a lot. I suggest two scripts: the parent only determines the appropriate loading factors and launches a new child when there is
1. more work to do, and
2. room under whatever limits you set on load average or bandwidth.
#!/bin/sh
# my pref lang is tcsh so, this is just a rough approximation
# I think with just a few debug runs, this could work fine.
# presumes a file with one url to download per line
#
WORKFILE=urls.txt    # placeholder: set to your file with one URL per line
NUMPARALLEL=4        # controls how many at once
#^tune above number to control CPU and bandwidth load, you
# will not finish fastest by doing 100 at once.
# Wed Mar 16 08:35:30 PDT 2011 , dianevm at gmail
while : ; do
    WORKLEFT=`wc -l < $WORKFILE`
    if [ $WORKLEFT -eq 0 ]; then
        echo finished | write sysadmin
        echo finished | Mail sysadmin
        exit 0
    fi
    NUMWORKERS=`ps auxwwf | grep WORKER | grep -v grep | wc -l`
    if [ $NUMWORKERS -lt $NUMPARALLEL ]; then   # time to fire off another 1
        WORKTODO=`head -1 $WORKFILE`
        WORKER "$WORKTODO" &   # worker could just be wget "$1", ncftp, curl
        tail -n +2 $WORKFILE > TMP
        SECSEPOCH=`date +%s`
        mv $WORKFILE $WORKFILE.$SECSEPOCH
        mv TMP $WORKFILE
    else   # we have NUMWORKERS or more running.
        sleep 5    # suggest this time be close to ~ 1/4 of script run time
    fi
done
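The WORKER it launches is left to you; as the inline comment says, it could just wrap wget, ncftp, or curl. A hypothetical minimal version:
#!/bin/sh
# WORKER: fetch the single URL passed as $1
wget -nv "$1" >/dev/null 2>&1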

dianevm
-
Oh, also, unless you have separate ISPs or bandwidth limitations or something, you usually are not going to get a faster total download speed by doing it in parallel – dianevm Mar 16 '11 at 15:40