
I'm using the program from here to download many URLs at once. It works fine, but the order of the URLs in the result queue is not the same as their order in the urls list, and it isn't constant either (it changes from run to run).

What can I do to either make their order constant, or to know which URL belongs to which entry in the queue that is received?

Thanks.

quilby

2 Answers


Change fetch to read like this:

def fetch(url):
    # Return the URL along with the body so each result can be matched up later.
    return (url, urllib2.urlopen(url).read())

Then, instead of a queue full of strings, each one containing a result, you get a queue full of tuples, each containing the URL followed by its result.

You aren't going to be able to get back a queue in which things are always in the same order, because multithreading is not deterministic about completion order. So the best thing to do is make sure each result is tagged so you can identify it later.
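A minimal runnable sketch of the tagged-queue idea. The threading/queue setup and the offline `download` stand-in are assumptions here so the example runs without network access; the real program would use `urllib2.urlopen(url).read()` (or `urllib.request` on Python 3) in its place:

```python
import threading
import queue

# Stand-in for urllib2.urlopen(url).read() so this runs offline.
def download(url):
    return "body of " + url

# Tag each result with its URL so results can be matched up later,
# regardless of the order the threads finish in.
def fetch(url, out_queue):
    out_queue.put((url, download(url)))

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
]

q = queue.Queue()
threads = [threading.Thread(target=fetch, args=(u, q)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The queue may hold the tuples in any order, but each body is keyed
# by its URL, so a dict recovers the association.
results = dict(q.get() for _ in urls)
print(results['http://www.bing.com/'])  # body of http://www.bing.com/
```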

Omnifarious

You can just add the index number to the URL...

urls = [
    (0, 'http://www.google.com/'),
    (1, 'http://www.lycos.com/'),
    (2, 'http://www.bing.com/'),
    (3, 'http://www.altavista.com/'),
    (4, 'http://achewood.com/'),
]

def fetch(index, url):
    data = urllib2.urlopen(url).read()
    # ... do whatever you need using index ...
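As a sketch of what "using index" can buy you: since the results arrive in completion order, the index lets you slot them back into the original order afterward. The threading/queue plumbing and the offline `download` stand-in below are assumptions; the real fetch would do the actual download:

```python
import threading
import queue

# Stand-in downloader so the sketch runs offline; the real code would
# use urllib2.urlopen(url).read() (or urllib.request on Python 3).
def download(url):
    return "data from " + url

urls = [
    (0, 'http://www.google.com/'),
    (1, 'http://www.lycos.com/'),
    (2, 'http://www.bing.com/'),
]

result_q = queue.Queue()

def fetch(index, url):
    # Tag the result with its index so it can be slotted back in order.
    result_q.put((index, download(url)))

threads = [threading.Thread(target=fetch, args=pair) for pair in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Results arrive in whatever order the threads finished; the index
# restores the original list ordering.
ordered = [None] * len(urls)
while not result_q.empty():
    i, data = result_q.get()
    ordered[i] = data
```

After this loop, `ordered[0]` is always the result for `urls[0]`, and so on, no matter which thread finished first.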
6502
  • Can you please elaborate? I don't understand how this will help. The URLs are already indexed; you can do `urls.index(google.com)`... Thanks – quilby Feb 20 '11 at 21:50
  • If you need to do some other multithreaded processing just after receiving the specific URL then having the index may be helpful inside the fetch function. If you just need to know which data is from which url then you can just have fetch returning `(url, data)` as suggested by Omnifarious. – 6502 Feb 20 '11 at 22:05