5

Recently I've been working on a program that downloads manga from an online manga website. It works, but it's a bit slow, so I decided to use multithreading/multiprocessing to speed up the downloads. Here are my questions:

  1. Which one is better? (This is a Python 3 program.)

  2. Multiprocessing, I think, will definitely work. If I use multiprocessing, what is a suitable number of processes? Does it relate to the number of cores in my CPU?

  3. Multithreading will probably work too. This download job obviously spends a lot of time waiting for pics to be downloaded, so I think that when one thread starts waiting, Python will switch to another thread. Am I correct?
    I've read "Inside the New GIL" by David M. Beazley. What's the influence of the GIL if I use multithreading?

laike9m
  • I guess IO-bound tasks shouldn't be affected by the GIL, right? I don't know what you mean by "slow": is it the processing time or the download time that's slow? – Shang Wang Mar 12 '13 at 02:29
  • what I mean is download time – laike9m Mar 12 '13 at 02:38
  • Then I think both are OK. But I guess multithreading/multiprocessing only makes sense if there's a limit on the download speed of each connection. – Shang Wang Mar 12 '13 at 02:45
  • Can you explain more about "only makes sense if there's a limit on the download speed of each connection"? Thanks. – laike9m Mar 12 '13 at 02:52
  • Because some download servers put a throughput limit on each connection. If you have 2 threads downloading at the same time, there will be 2 connections, which should make your download faster. But if they don't have the limit, then your 2 threads are just sharing your bandwidth, which doesn't achieve any speed up. – Shang Wang Mar 12 '13 at 02:57
  • I know what you mean, and you're right. Well, things here are a bit different. It's true I'm downloading a manga of maybe 300MB, but in fact what I download is piles of pics. For example, connection1 makes a request to download pic001, and then it requests to download pic002. I've thought about the "shared bandwidth" thing, but I'm not really sure the bandwidth is completely used while downloading a single pic. And other manga downloading software all uses multithreading; I don't think they'd do that for no reason. – laike9m Mar 12 '13 at 03:15
  • One thing to keep in mind: Traditionally web browsers limit themselves to something like 8 total connections, 4 to a given domain, 2 to a given protocol:host:port. You can go a little beyond that, but that's the kind of ballpark generic small-ish websites are designed to deal with. – abarnert Mar 16 '13 at 04:01
  • @abarnert This is new to me. I guess what you mean is that I should limit my threads to 8 or fewer? – laike9m Mar 16 '13 at 08:03
  • @laike9m: I'd _start_ with 4 or 8. You can tweak it and see if higher numbers actually give you any real performance benefit. But there's a good chance it won't—in which case you should stick with a small number. More concurrent connections means more strain on the server, and if it's not hurting you to play nice, why not play nice? – abarnert Mar 16 '13 at 08:52

3 Answers

4

You're probably going to be bound by either the server's upload pipe (if you have a faster connection) or your download pipe (if you have a slower connection).

There's significant startup latency associated with TCP connections. To avoid this, HTTP servers can recycle connections for requesting multiple resources. So there are two ways for your client to avoid this latency hit:

(a) Download several resources over a single TCP connection so your program only suffers the latency once, when downloading the first file

(b) Download a single resource per TCP connection, and use multiple connections so that hopefully at every point in time, at least one of them will be downloading at full speed

With option (a), you want to look into how to reuse connections with whatever HTTP library you're using; any good one will have a way to do it. http://python-requests.org/ is a good Python HTTP library.
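
A minimal sketch of option (a), assuming the `requests` library and placeholder URLs/filenames: a single `Session` keeps the underlying TCP connection alive between requests, so the handshake cost is paid once per host rather than once per image.

    import requests

    urls = ["http://example.com/manga/%03d.jpg" % i for i in range(1, 21)]  # placeholders

    session = requests.Session()              # reuses connections via keep-alive
    for url in urls:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
        with open(url.rsplit("/", 1)[-1], "wb") as f:
            f.write(resp.content)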

For option (b), you probably do want a multithread/multiprocess route. I'd suggest only 2-3 simultaneous threads, since any more will likely just result in sharing bandwidth among the connections, and raise the risk of getting banned for multiple downloads.
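
For a rough idea of option (b), here's a sketch using the standard library's `concurrent.futures` (Python 3.2+) together with `requests`; the URL list and worker count are placeholders to tune for your site.

    from concurrent.futures import ThreadPoolExecutor
    import requests

    urls = ["http://example.com/manga/%03d.jpg" % i for i in range(1, 21)]  # placeholders

    def download(url):
        # each call blocks on network I/O, releasing the GIL while it waits
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return url, resp.content

    with ThreadPoolExecutor(max_workers=3) as pool:   # 2-3 connections, as suggested above
        for url, data in pool.map(download, urls):
            with open(url.rsplit("/", 1)[-1], "wb") as f:
                f.write(data)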

The GIL doesn't really matter for this use case, since your code will be doing almost no processing, spending most of its time waiting for bytes to arrive over the network.

The lazy way to do this is to avoid Python entirely, because most UNIX-like environments have good building blocks for it. (If you're on Windows, your best choices for this approach would be msys, cygwin, or a VirtualBox running some flavor of Linux; I personally like Linux Mint.) If you have a list of the URLs you want to download, one per line, in a text file, try this:

cat myfile.txt | xargs -n 1 --max-procs 3 --verbose wget

The "xargs" command with these parameters will take a whitespace-delimited URL's on stdin (in this case coming from myfile.txt) and run "wget" on each of them. It will allow up to 3 "wget" subprocesses to run at a time, when one of them completes (or errors out), it will read another line and launch another subprocess, until all the input URL's are exhausted. If you need cookies or other complicated stuff, curl might be a better choice than wget.

picomancer
  • Thank you for your detailed explanation, especially the TCP latency part. I've heard of the Requests lib, but in this program I use httplib2. And I chose Python because it can deal with strings easily, which is an important part of the program. – laike9m Mar 15 '13 at 09:45
  • And you talked about bandwidth sharing, which is a problem that bothers me. In this situation (downloading pics), does one thread use the full bandwidth or just a part of it? Actually, what I'm asking is: is there a way to find this out? If you could help me with this, it would be very helpful. Thanks again. – laike9m Mar 15 '13 at 09:56
  • TCP builds reliable transport on top of the unreliable IP protocol. Basically the connection starts slow, then sends packets faster and faster, until some of them are lost. (TCP requests retransmission of lost packets, so this is no big deal.) Then it keeps going at that speed, occasionally trying to speed up again (and backing off if this results in packet loss). The end result is a self-balancing system with a data flow rate more-or-less exactly equal to what the connections involved can sustain. See any good book or tutorial on networking for more. – picomancer Mar 15 '13 at 19:32
  • @abarnert Thanks. Yes, my requests are GETs, and I guess most websites (especially comic websites) are browser-oriented sites? As for (b), I really don't know if the website has disabled it. – laike9m Mar 16 '13 at 08:04
  • @laike9m: Sorry, I've been sidetracking things on something that may be irrelevant. Are you downloading lots of little files (individual page images), or a smaller number of big files (60-page CBR archives)? If it's the latter, keepalives aren't a big deal. – abarnert Mar 16 '13 at 08:48
  • @abarnert It's the former case. Usually a manga consists of hundreds of images (even more than a thousand), and they are on different pages. – laike9m Mar 16 '13 at 09:15
  • @abarnert On this page [link](https://code.google.com/p/httplib2/), it says "(httplib2) Supports HTTP 1.1 Keep-Alive, keeping the socket open and performing multiple requests over the same connection if possible." Does it mean that when I finish downloading img001 and begin to download img002, httplib2 is able to keep the 2nd request in the same connection as the 1st request? – laike9m Mar 16 '13 at 09:19
  • @laike9m: Yes, that's basically what it means. Assuming the server doesn't stop listening a few seconds after you send the first request, or disconnect you as soon as it finishes sending the first response, `httplib2` will send the second request on the same socket, and if it arrives within the server's timeout window, you'll get the second response on the same socket. – abarnert Mar 16 '13 at 09:34
  • @laike9m: It's easy to test each of the servers you care about (there probably aren't too many) to see which support keepalives (and which of those support full pipelining, and which do support keepalives but their timeouts are so short that it's useless, and so on, if it turns out to matter). – abarnert Mar 16 '13 at 09:36
  • @abarnert It doesn't seem so easy to me... could you tell me what commands or tools I should use, and I'll test the server myself? – laike9m Mar 16 '13 at 09:48
  • @laike9m: Honestly, I'd just test it with netcat, but I can't explain how to do that in a comment. Let me sleep on it and see if I have a good easy answer tomorrow. – abarnert Mar 16 '13 at 09:49
  • @abarnert Thank you so much for your comments, I really learned a lot! – laike9m Mar 16 '13 at 09:52
  • @abarnert I read some passages about how to use netcat, but none involves testing whether the server supports keep-alive. So I hope that you can provide a good answer. Should I raise another question for you? – laike9m Mar 17 '13 at 14:07
  • @laike9m: You have to know how to craft an appropriate HTTP request, and read a response, but the basics are: Copy the request into your clipboard, netcat the server, and paste it. When the response comes back, paste it again. If you get two responses, the server obviously supports keepalive. Then try a pipeline request (change the headers, and paste twice in a row without waiting—or just cat the requests through a pipe). If either fails, then you have to figure out why, by looking at the responses and/or the timing. – abarnert Mar 18 '13 at 17:08
  • @laike9m: If you don't know HTTP well enough that the details are obvious, it's probably easier to find a library or tool that can test things at a higher level, taking care of the details for you. I believe libcurl (and its python bindings) can try pipelining, falling back to traditional keepalive or separate connections automatically as appropriate, and take care of quirks of well-known servers, and give you enough feedback to know what it's doing. – abarnert Mar 18 '13 at 17:10
  • @laike9m: Finally… do you really need to know the details? I don't think there's much harm in trying _both_. Create a pool of, say, 4 or 8 connections, and also tell `requests` (or whatever you plan to use) to reuse connections as well as possible, tune the pool size (and maybe some header details) to get the best performance, and if it's good enough, you're done. – abarnert Mar 18 '13 at 17:11
  • @abarnert Well, in fact this is what I'd done before raising this question (I made a Process pool containing 4 processes). I thought it would be much faster than using a single process, and it did become faster in most cases, but sometimes it was much slower (it even got stuck). I'm not sure about the reason, but then I saw your comment about testing if the server supports keep-alive. Though I think network speed instability is the main reason, it's worth knowing more about the server side. (Have to go to sleep; different time zones make life hard :) – laike9m Mar 18 '13 at 17:33
  • Four separate processes can't be sharing connections. (Well, they _could_, but it would be pretty hard to do that by accident…) So everything I said is basically a red herring. Maybe the server is just slow and clunky, or maybe your network connection is. In that case, writing the code to handle timeouts and failures (and possibly resume downloads, if possible) is probably more important than concurrency or keepalives. – abarnert Mar 18 '13 at 17:49
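
The netcat test described in the comments above can be approximated in Python with a raw socket: send the same HTTP/1.1 request twice on one connection and see whether two responses come back. This is only a rough sketch with a placeholder host and path; a server that closes the connection after the first response will make the second attempt return nothing.

    import socket

    HOST, PATH = "example.com", "/manga/001.jpg"       # placeholders
    REQUEST = ("GET %s HTTP/1.1\r\nHost: %s\r\nConnection: keep-alive\r\n\r\n"
               % (PATH, HOST)).encode("ascii")

    def read_until_quiet(sock, timeout=3.0):
        # read whatever the server sends until it goes quiet or closes
        sock.settimeout(timeout)
        data = b""
        try:
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
        except socket.timeout:
            pass
        return data

    sock = socket.create_connection((HOST, 80), timeout=10)
    sock.sendall(REQUEST)
    first = read_until_quiet(sock)
    try:
        sock.sendall(REQUEST)                          # second request, same socket
        second = read_until_quiet(sock)
    except OSError:
        second = b""                                   # server already closed the connection
    sock.close()

    # a status line in `second` means the server kept the connection alive
    print("first: ", first[:15])
    print("second:", second[:15])
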
1

It doesn't really matter. It is indeed true that threads waiting on IO won't get in the way of other threads running, and since downloading over the Internet is an IO-bound task, there's no real reason to try to spread your execution threads over multiple CPUs. Given that and the fact that threads are more light-weight than processes, it might be better to use threads, but you honestly aren't going to notice the difference.
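
To see how little the choice matters in code, here is a minimal sketch, assuming only the standard library and placeholder URLs: `multiprocessing.dummy` exposes the `Pool` API backed by threads, so switching to real processes is a one-line import change.

    from multiprocessing.dummy import Pool    # thread-backed Pool
    # from multiprocessing import Pool        # same API, real processes
    from urllib.request import urlopen

    urls = ["http://example.com/manga/%03d.jpg" % i for i in range(1, 21)]  # placeholders

    def fetch(url):
        # blocks on network I/O; the GIL is released while waiting
        return urlopen(url).read()

    pool = Pool(4)                            # 4 worker threads
    images = pool.map(fetch, urls)
    pool.close()
    pool.join()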

How many threads you should use depends on how hard you want to hit the website. Be courteous and take care that your scraping isn't viewed as a DOS attack.

Cairnarvon
  • So how many threads would be suitable, assuming it isn't viewed as an attack? (Would 10 be OK?) What I download sometimes reaches several million bytes; otherwise I wouldn't think of using multithreading. – laike9m Mar 12 '13 at 02:50
  • It really depends on how many individual things you're downloading, how many threads your system can comfortably handle, and how good your Internet connection is. If it were me, I'd benchmark 16, 32, and 64 threads to begin with and make a decision based on that. Always benchmark questions of performance. – Cairnarvon Mar 12 '13 at 02:57
0

You don't really need multithreading for this kind of task. You could try single-threaded async programming using something like Twisted.
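
For a rough idea of what that looks like, here is a minimal sketch using Twisted's `getPage` (Twisted's Python 3 support was still incomplete at the time, so this is the Python 2 era API); the URLs are placeholders.

    from twisted.internet import reactor, defer
    from twisted.web.client import getPage

    urls = ["http://example.com/manga/%03d.jpg" % i for i in range(1, 21)]  # placeholders

    def save(body, filename):
        # callback: write the downloaded bytes to disk
        with open(filename, "wb") as f:
            f.write(body)

    downloads = []
    for url in urls:
        d = getPage(url)
        d.addCallback(save, url.rsplit("/", 1)[-1])
        downloads.append(d)

    # stop the reactor once every download has finished or failed
    defer.DeferredList(downloads).addCallback(lambda _: reactor.stop())
    reactor.run()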

zzk
  • @laike9m: Twisted is amazingly cool, and definitely worth learning. But for this case, threads will have a much lower learning curve, so I wouldn't bother. Put it on your list of things to look into for your next project. (Also check out [PEP3156](http://www.python.org/dev/peps/pep-3156/)/tulip and [`gevent`](http://www.gevent.org) while you're at it.) – abarnert Mar 16 '13 at 08:49