I'm going to write a multithreaded crawler that is planned to run on about 10M pages. To speed things up, I need to fetch about 10 different pages simultaneously.
Each crawler thread will use a different proxy and push its results onto a queue; on the other side, a few more workers will fetch the results from the queue, parse them, and insert them into a DB.
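Here's a rough sketch of the producer/consumer layout I have in mind (the proxy URLs are placeholders, and the parsing/DB step is stubbed out):

```python
import threading
import urllib2
from Queue import Queue

url_queue = Queue()     # URLs waiting to be fetched
result_queue = Queue()  # raw pages waiting to be parsed and stored

def crawler(proxy_url):
    # Each crawler thread builds its own opener bound to one proxy.
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy_url}))
    while True:
        url = url_queue.get()
        try:
            result_queue.put((url, opener.open(url).read()))
        except urllib2.URLError:
            pass  # real code would log and/or retry here
        finally:
            url_queue.task_done()

def db_worker():
    while True:
        url, html = result_queue.get()
        # parse html and insert the record into the DB here
        result_queue.task_done()

for proxy in ['http://proxy1:8080', 'http://proxy2:8080']:
    t = threading.Thread(target=crawler, args=(proxy,))
    t.daemon = True
    t.start()
```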
Is that the right approach? Will I have problems storing too many results in the queue? Should I be worried about locks? (I'm using the Queue module.) Which HTTP library would best fit my needs: httplib2 or urllib2?
When creating each thread, should I pass a new instance of the request object to each thread, or should I share one request object among the threads and call its "getPage" function from within each thread?
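To make that last question concrete, here are the two options as I understand them (Fetcher and getPage are stand-ins for my own request class):

```python
import threading

class Fetcher(object):
    """Placeholder for my request class."""
    def __init__(self, proxy):
        self.proxy = proxy
    def getPage(self, url):
        pass  # fetch url through self.proxy

urls = ['http://example.com/a', 'http://example.com/b']
proxies = ['http://proxy1:8080', 'http://proxy2:8080']

# Option A: a new Fetcher instance per thread
threads = [threading.Thread(target=Fetcher(p).getPage, args=(u,))
           for p, u in zip(proxies, urls)]

# Option B: one shared Fetcher whose getPage is called from every thread
shared = Fetcher(proxies[0])
threads = [threading.Thread(target=shared.getPage, args=(u,)) for u in urls]
```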
Thanks.