8

I know how to use urllib to download a file. However, it's much faster, if the server allows it, to download several parts of the same file simultaneously and then merge them.

How do you do that in Python? If you can't do it easily with the standard lib, is there any lib that would let you do it?

Manoj Govindan
  • 72,339
  • 21
  • 134
  • 141
Bite code
  • 578,959
  • 113
  • 301
  • 329
  • 7
    Usually, it is *not* faster to download several parts of a file in parallel, because you are still limited to the bottleneck of your network connection. Only when the server allows limited bandwidth *per connection* will you get an improvement. – Sven Marnach Mar 14 '11 at 14:32
  • If you're somehow under the impression that this will achieve download speeds you get with a torrent, you'll be disappointed to find out that won't help you in a single client, single server case. It *is* faster *if* you're downloading from multiple sources *and* each has less upload bandwidth than your total download bandwidth. No download is faster than direct download from a single server with an (available) upload bandwidth greater than or equal to your download bandwidth. – André Caron Mar 14 '11 at 15:04
  • 3
    @Sven: That's false. In the real world, it usually is much faster, on most connections to most servers, to download with multiple streams. It's unfortunate, and the causes can be hard to track down, but it's there and a whole lot of people have no choice but to deal with it. – Glenn Maynard Mar 14 '11 at 15:46
  • 1
    +1 to Glenn. In theory it should not be faster, in practice it almost always is. – Bite code Mar 14 '11 at 16:14
  • @Glenn, e-satis: I did a few tests downloading random files from random servers and came to the conclusion that you generally get a speed-up as long as the bottleneck is not on the client side. Since my own internet connection is rather slow, my experience was that you usually *don't* get a speed-up. – Sven Marnach Mar 16 '11 at 12:35
  • 1
    @Sven: It's nowhere near that simple; your network isn't every network. – Glenn Maynard Mar 16 '11 at 15:05
  • @Glenn: Actually, my last comment was meant to concede the point to you, the last sentence only being an explanation how I came to the conclusion in my first comment. I still disagree with e-satis's "almost always", since if the bottleneck is on the client side, it's obviously impossible to get a speed-up, and this just happens to be the situation I'm usually in. – Sven Marnach Mar 16 '11 at 15:50

2 Answers

18

Although I agree with Gregory's suggestion of using an existing library, it's worth noting that you can do this by using the Range HTTP header. If the server accepts byte-range requests, you can start several threads to download multiple parts of the file in parallel. This snippet, for example, will only download bytes 0..65535 of the specified file:

import urllib2
url = 'http://example.com/test.zip'
req = urllib2.Request(url, headers={'Range':'bytes=0-65535'})
data = urllib2.urlopen(req).read()

You can determine the remote resource size and see whether the server supports ranged requests by sending a HEAD request:

import urllib2

class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

url = 'http://sstatic.net/stackoverflow/img/sprites.png'
req = HeadRequest(url)
response = urllib2.urlopen(req)
response.close()
print response.headers

The above prints:

Cache-Control: max-age=604800
Content-Length: 16542
Content-Type: image/png
Last-Modified: Thu, 10 Mar 2011 06:13:43 GMT
Accept-Ranges: bytes
ETag: "c434b24eeadecb1:0"
Date: Mon, 14 Mar 2011 16:08:02 GMT
Connection: close

From that we can see that the file size is 16542 bytes ('Content-Length') and the server supports byte ranges ('Accept-Ranges: bytes').
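
Putting the two snippets together, here is a rough, untested sketch (Python 2 and urllib2, as in the snippets above; the part count and helper names are just illustrative) of how you might split the file into byte ranges, download them in threads, and merge the pieces in order:

import urllib2
import threading

url = 'http://example.com/test.zip'
num_parts = 4  # arbitrary for this sketch

# 1. Get the total size (and implicitly check range support) with a HEAD request.
class HeadRequest(urllib2.Request):
    def get_method(self):
        return "HEAD"

response = urllib2.urlopen(HeadRequest(url))
size = int(response.headers['Content-Length'])
response.close()

# 2. Compute inclusive byte ranges; the last part absorbs any remainder.
part_size = size // num_parts
ranges = []
for i in range(num_parts):
    start = i * part_size
    end = size - 1 if i == num_parts - 1 else start + part_size - 1
    ranges.append((start, end))

# 3. Download each range in its own thread, keyed by its starting offset.
parts = {}

def fetch(start, end):
    req = urllib2.Request(url, headers={'Range': 'bytes=%d-%d' % (start, end)})
    parts[start] = urllib2.urlopen(req).read()

threads = [threading.Thread(target=fetch, args=r) for r in ranges]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 4. Concatenate the parts in offset order to rebuild the file.
data = ''.join(parts[start] for start, _ in ranges)
with open('test.zip', 'wb') as f:
    f.write(data)

The ranges are inclusive at both ends, so the first byte of each part starts exactly one past the last byte of the previous one; storing each piece under its starting offset makes the final ordering trivial.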

efotinis
  • 14,565
  • 6
  • 31
  • 36
  • so the file of the size 65535 bytes can be split into 5 buffers = 13107 Does that mean that range of each buffer will come out to be `req = urllib2.Request(url, headers={'Range':'bytes=0-13107'})` `req2 = urllib2.Request(url, headers={'Range':'bytes=13108-26214'})` `req3 = urllib2.Request(url, headers={'Range':'bytes=26215-39321'})` `req4 = urllib2.Request(url, headers={'Range':'bytes=39322-52429'})` `req5 = urllib2.Request(url, headers={'Range':'bytes=52430-65535'})` if yes how do I put them together using `data = urllib2.urlopen(req).read()` ? – Ciasto piekarz Jun 25 '14 at 06:00
  • @san, a simple way would be to start 5 threads, each one calling urlopen().read() for a range and storing the result in a thread-safe container (a list or dict will do in this case). Then wait in the main thread for those threads to finish (join()) and combine the parts. – efotinis Jun 25 '14 at 06:30
  • that is the question, how do I combine parts ? – Ciasto piekarz Jun 25 '14 at 09:24
  • @san: just add them together, making sure they are in order (by storing the starting offset of each part and sorting them based on that.) – efotinis Jun 25 '14 at 18:36
  • I am a bit unsure, does the range `{'Range':'bytes=0-13107'})` have to start with 0 or 1? considering the first few bytes may contain a file header with necessary information that I do not want to lose or corrupt the file !!! – Ciasto piekarz Jun 27 '14 at 03:45
  • @efotinis please help me with my first attempt: http://stackoverflow.com/questions/24585885/tryng-to-split-the-file-download-buffer-to-into-separate-threads – Ciasto piekarz Jul 05 '14 at 11:25
6

PycURL can do it. PycURL is a Python interface to libcurl. It can be used to fetch objects identified by a URL from a Python program, similar to the urllib Python module. PycURL is mature, very fast, and supports a lot of features.
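
As a rough illustration (untested, and not part of the original answer; the URL is the same placeholder used above), each part could be fetched through its own libcurl handle by setting the RANGE option, and the pieces concatenated afterwards. For real parallelism you could run these in threads as in the previous answer, or use libcurl's multi interface via pycurl.CurlMulti:

import pycurl
from io import BytesIO

url = 'http://example.com/test.zip'

def fetch_range(start, end):
    """Download bytes start..end (inclusive) of url into memory."""
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.RANGE, '%d-%d' % (start, end))  # maps to CURLOPT_RANGE
    c.setopt(c.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    return buf.getvalue()

# Fetch two halves of the first 128 KiB and stitch them back together.
data = fetch_range(0, 65535) + fetch_range(65536, 131071)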

Gregory
  • 473
  • 8
  • 14