
I have a task to download GBs of data from a website. The data is in the form of .gz files, each about 45 MB in size.

The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4mb/sec.

But, just to play around, I was also using Python to build my own URL parser and downloader.

Downloading via Python's urlretrieve is damn slow, possibly 4 times as slow as wget. The download rate is 500kb/sec. I use HTMLParser for parsing the href tags.
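A stripped-down sketch of this kind of HTMLParser + urlretrieve downloader (the URL below is a placeholder; this is just the general shape, not the actual script):

import urllib
import urlparse
from HTMLParser import HTMLParser

class LinkParser(HTMLParser):
    # collect the href attribute of every <a> tag on the page
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

base = 'http://example.com/data/'            # placeholder index page
parser = LinkParser()
parser.feed(urllib.urlopen(base).read())

for href in parser.links:
    if href.endswith('.gz'):
        urllib.urlretrieve(urlparse.urljoin(base, href), href.split('/')[-1])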

I am not sure why this is happening. Are there any settings for this?

Thanks

Kapil D
  • Have you tried comparing CPU usage and tcpdump output? – brian-brazil Jun 10 '09 at 10:33
  • What is tcpdump? How do I get it? – Kapil D Jun 10 '09 at 12:11
  • I would ignore transfer speeds (megabytes/MB and megabits/Mb are completely different!) and compare the two using the commands `time wget http://example.com/file` and `time python urlretrieve_downloader.py` – dbr Jun 10 '09 at 19:47
  • Ahh, I meant 500Kb only.. sorry for the lower case, my bad... both are in bytes... .5MB/sec and 4Mb/sec – Kapil D Jun 10 '09 at 22:16
  • Both are in bytes? So you have a 32-megabit connection? Probably not. I'm pretty sure it's 500 kilobytes and 4 megabits. Seems too convenient to have an exact 1/8 slowdown. – Kenan Banks Jun 11 '09 at 00:18
  • yeah, I am using the college server for download.. – Kapil D Jun 11 '09 at 00:52
  • how are you measuring the wget download speed? Is wget showing a status message you can post here? – Kenan Banks Jun 11 '09 at 06:56

8 Answers


Probably a unit math error on your part.

Just noticing that 500 KB/s (kilobytes) is equal to 4 Mb/s (megabits).
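Spelled out: 500 kilobytes per second × 8 bits per byte = 4,000 kilobits per second = 4 megabits per second, so the two rates quoted in the question describe essentially the same throughput.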

Kenan Banks

urllib works for me as fast as wget does. Try this code: it shows the progress as a percentage, just like wget.

import sys, urllib

def reporthook(block_count, block_size, total_size):
    # the trailing ',' is important: it stops print from adding a newline,
    # so the '\r' lets each update overwrite the same line (like wget)
    print "% 3.1f%% of %d bytes\r" % (
        min(100, float(block_count * block_size) / total_size * 100), total_size),
    # you could also use sys.stdout.write() instead of print
    sys.stdout.flush()

for url in sys.argv[1:]:
    filename = url[url.rfind('/') + 1:]
    print url, "->", filename
    urllib.urlretrieve(url, filename, reporthook)
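If you save the script under any name you like (say fetch.py), run it with one or more URLs on the command line, e.g. `python fetch.py http://example.com/file1.gz http://example.com/file2.gz` (placeholder URLs).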
Xuan
import subprocess

# shell out to wget and let it handle the recursion and mirroring
myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])
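One possible refinement (not something this answer relies on): subprocess.check_call, available since Python 2.5, behaves like subprocess.call but raises CalledProcessError when wget exits with a non-zero status, so failed downloads are harder to miss:

subprocess.check_call(["wget", "-r", "-np", "-A", "files", myurl])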
SilentGhost
nosklo

As for the HTML parsing, the fastest/easiest you will probably get is using lxml. As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP 1.1 keep-alive connections and gzip compression. There is also pycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I've never used that.

You could also try to download several files concurrently, but keep in mind that trying to optimize your download times too far may not be very polite towards the website in question.
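Purely as an illustration (the listing URL, the .gz filter and the thread count below are assumptions, not taken from this answer), combining lxml link extraction with a handful of download threads might look roughly like this:

import threading
import urllib
import urllib2
import urlparse
import lxml.html

base = 'http://example.com/data/'            # hypothetical listing page

# pull every href out of the index page and keep only the .gz links
html = urllib2.urlopen(base).read()
hrefs = lxml.html.fromstring(html).xpath('//a/@href')
gz_urls = [urlparse.urljoin(base, h) for h in hrefs if h.endswith('.gz')]

def fetch(url):
    # save each file under the last path component of its URL
    urllib.urlretrieve(url, url.rsplit('/', 1)[-1])

# keep the number of parallel downloads small to stay polite to the server
workers = [threading.Thread(target=fetch, args=(u,)) for u in gz_urls[:4]]
for w in workers:
    w.start()
for w in workers:
    w.join()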

Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink"

Kenan Banks

Transfer speeds can easily be misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times in case you're behind a proxy which caches the URL on the second attempt.

For small files, wget will take slightly longer due to the external process's startup time, but for larger files that should become irrelevant.

from time import time
import urllib
import subprocess

target = "http://example.com" # change this to a more useful URL

wget_start = time()

proc = subprocess.Popen(["wget", target])
proc.communicate()

wget_end = time()


url_start = time()
urllib.urlretrieve(target)
url_end = time()

print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s"  % (url_end - url_start)
dbr

Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.

The result is that it takes nearly the same time for both of them to download the same file. Sometimes urllib2 performs even better.

The advantage of wget lies in its dynamic progress bar, which shows the percentage finished and the current download speed while transferring.

The file size in my test was 5 MB. I haven't used any cache module in Python, and I am not aware of how wget behaves when downloading a big file.
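For reference, a minimal sketch of what the urllib2 side of such a test can look like (the URL and the 64 KB block size are placeholders, not the actual test setup):

import time
import urllib2

url = 'http://example.com/file.gz'      # hypothetical test file
start = time.time()
resp = urllib2.urlopen(url)
out = open('file.gz', 'wb')
while True:
    chunk = resp.read(64 * 1024)        # read 64 KB at a time
    if not chunk:
        break
    out.write(chunk)
out.close()
print "urllib2.urlopen took %.1f seconds" % (time.time() - start)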

Kaihua

You can use wget -k (together with the -r and -np flags from the question) to convert the links in the downloaded pages into relative links, so the mirrored pages still work when browsed locally.

Alex

There shouldn't be a difference really. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure python?

Corey Goldberg