Let's say I have a file with lots of URLs and I want to download them in parallel using an arbitrary number of processes. How can I do it with bash?
5 Answers
10
Have a look at `man xargs`:
-P max-procs, --max-procs=max-procs
Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time.
Solution:
xargs -P 20 -n 1 wget -nv < urls.txt
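A variant of the same idea uses xargs' -I placeholder so each input line is substituted into the command explicitly (a minimal sketch, assuming GNU xargs; urls.txt is just a placeholder filename):
xargs -P 20 -I {} wget -nv {} < urls.txt
With -I, xargs runs one command per input line, so -n 1 is no longer needed.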
-
Oh, that's very slick. I did not know about -P – Richard June Mar 16 '11 at 15:20
-
In case the original link vanishes, the recommended command (with useless use of cat removed) is: `xargs -P 20 -n 1 wget -nv < urls.txt` – Gordon Davisson Mar 16 '11 at 15:26
1
If you just want to grab each URL (regardless of number), then the answer is easy:
#!/bin/bash
URL_LIST="http://url1/ http://url2/"
for url in $URL_LIST ; do
    wget "${url}" >/dev/null &
done
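If the script should also block until every background download has exited, a single wait after the loop does that (the same sketch with the wait added; untested):
#!/bin/bash
URL_LIST="http://url1/ http://url2/"
for url in $URL_LIST ; do
    wget "${url}" >/dev/null &
done
wait   # returns once all background wget jobs have finished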
If you want to create only a limited number of pulls, say 10, then you would do something like this:
#!/bin/bash
URL_LIST="http://url1/ http://url2/"

function download() {
    # $1 = lock slot number, $2 = URL to fetch
    touch /tmp/dl-${1}.lck
    wget "${2}" >/dev/null
    rm -f /tmp/dl-${1}.lck
}

for url in $URL_LIST ; do
    while true ; do
        iter=0
        while [ $iter -lt 10 ] ; do
            if [ ! -f /tmp/dl-${iter}.lck ] ; then
                download ${iter} "${url}" &
                break 2
            fi
            let iter++
        done
        sleep 10s
    done
done
wait   # let the last downloads finish before the script exits
Do note I haven't actually tested it; I just banged it out in 15 minutes, but you should get the general idea.
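An alternative that skips the lock files entirely is bash's own job control; a rough, untested sketch assuming bash 4.3+ (for wait -n) and a placeholder urls.txt with one URL per line:
#!/bin/bash
MAX=10
while read -r url ; do
    # while MAX downloads are already running, wait for any one of them to exit
    while [ "$(jobs -rp | wc -l)" -ge "$MAX" ] ; do
        wait -n
    done
    wget -nv "$url" >/dev/null &
done < urls.txt
wait   # let the remaining downloads finish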

Richard June
1
You could use something like puf, which is designed for that sort of thing, or you could use wget/curl/lynx in combination with GNU parallel.
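For the GNU parallel route, a minimal sketch (assuming urls.txt holds one URL per line) would be:
parallel -j 10 wget -nv {} < urls.txt
where -j 10 caps the number of simultaneous jobs and {} is replaced with each input line.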

Cakemox
0
puf (http://puf.sourceforge.net/) does this "for a living" and has a nice running status of the complete process.

olemd
0
I do stuff like this a lot. I suggest two scripts: the parent only determines the appropriate loading factors and launches a new child when there is
1. more work to do, and
2. room under whatever limits you set on load average or bandwidth.
#!/bin/sh
# my pref lang is tcsh so, this is just a rough approximation
# I think with just a few debug runs, this could work fine.
# presumes a file with one url to download per line
#
WORKFILE=urls.txt    # placeholder: set to your file with one URL per line
NUMPARALLEL=4        # controls how many at once
#^tune above number to control CPU and bandwidth load, you
# will not finish fastest by doing 100 at once.
# Wed Mar 16 08:35:30 PDT 2011 , dianevm at gmail
while : ; do
    WORKLEFT=`wc -l < $WORKFILE`
    if [ $WORKLEFT -eq 0 ]; then
        echo finished | write sysadmin
        echo finished | Mail sysadmin
        exit 0
    fi
    NUMWORKERS=`ps auxwwf | grep WORKER | grep -v grep | wc -l`
    if [ $NUMWORKERS -lt $NUMPARALLEL ]; then   # time to fire off another 1
        WORKTODO=`head -1 $WORKFILE`
        WORKER "$WORKTODO" &   # worker could just be wget "$1", ncftp, curl
        tail -n +2 $WORKFILE > TMP
        SECSEPOCH=`date +%s`
        mv $WORKFILE $WORKFILE.$SECSEPOCH
        mv TMP $WORKFILE
    else   # we have NUMWORKERS or more running.
        sleep 5    # suggest this time be close to ~ 1/4 of script run time
    fi
done
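The WORKER it launches is left to you; as the inline comment says, it could just wrap wget, ncftp, or curl. A hypothetical minimal version:
#!/bin/sh
# WORKER: fetch the single URL passed as $1
wget -nv "$1" >/dev/null 2>&1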

dianevm
-
Oh, also, unless you have separate ISPs or bandwidth limitations or something, you usually are not going to get a faster total download speed by doing it in parallel – dianevm Mar 16 '11 at 15:40