
I'm going to write a multithreaded crawler that is planned to run on about 10M pages. To speed things up, I need to fetch about 10 different pages simultaneously.

Each of the crawler threads will use a different proxy and push the results to a queue; on the other side I'll have a few more workers that will fetch the results from the queue, parse them, and insert them into a DB.
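Here is a minimal sketch of that layout to make the question concrete, assuming Python 3's threading/queue/urllib and placeholder URL lists, proxy addresses, and parse/insert steps:

    import queue
    import threading
    import urllib.request

    # Bounded queue: put() blocks when it is full, so crawlers can't run
    # arbitrarily far ahead of the DB workers. Queue is thread-safe on its own.
    results = queue.Queue(maxsize=1000)
    SENTINEL = object()  # tells a DB worker to stop

    def crawler(urls, proxy):
        # One opener per thread, bound to this thread's proxy (placeholder address)
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
        for url in urls:
            try:
                html = opener.open(url, timeout=10).read()
                results.put((url, html))
            except Exception:
                pass  # log / retry in a real crawler

    def db_worker():
        while True:
            item = results.get()
            if item is SENTINEL:
                break
            url, html = item
            # parse(html) and insert into the DB here (placeholders)

    # url_chunks and proxies are hypothetical: one URL batch and one proxy per thread
    url_chunks = [["http://example.com/a"], ["http://example.com/b"]]
    proxies = ["http://127.0.0.1:8080", "http://127.0.0.1:8081"]

    crawlers = [threading.Thread(target=crawler, args=(chunk, p))
                for chunk, p in zip(url_chunks, proxies)]
    workers = [threading.Thread(target=db_worker) for _ in range(2)]
    for t in crawlers + workers:
        t.start()
    for t in crawlers:
        t.join()
    for _ in workers:
        results.put(SENTINEL)  # one sentinel per DB worker
    for t in workers:
        t.join()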

Is that the right approach? Will I have problems if too many results pile up in the queue? Should I be worried about locks (I'm using the queue module)? Which HTTP library would be best for my needs (httplib2/urllib2)?

When creating each thread, should I pass a new instance of the request object to each thread, or should I share one request object and call its "getPage" function from within the threads?

Thanks.

YSY

2 Answers


Try the requests library (see the documentation section on proxies).
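For example, a minimal sketch of a proxied request with requests (the proxy addresses are placeholders; requests picks the entry matching the URL scheme):

    import requests

    proxies = {
        "http": "http://10.10.1.10:3128",
        "https": "http://10.10.1.10:1080",
    }

    response = requests.get("http://example.com", proxies=proxies, timeout=10)
    print(response.status_code, len(response.text))

Each crawler thread can use its own proxies dict (or its own requests.Session) so every thread goes out through a different proxy.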


Scrapy's the way to go.

Here's a page describing how to set up the proxy middleware to use multiple proxies: http://mahmoud.abdel-fattah.net/2012/04/16/using-scrapy-with-different-many-proxies/
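In case the link goes stale, the idea is that a downloader middleware sets request.meta["proxy"], which Scrapy's built-in HttpProxyMiddleware then honours. A minimal sketch, assuming a hypothetical myproject.middlewares module and placeholder proxy URLs:

    # middlewares.py -- rotate across a list of proxies, one per request
    import random

    class RandomProxyMiddleware(object):
        # Placeholder proxy list; in practice load it from settings or a file
        PROXIES = [
            "http://proxy1.example.com:8080",
            "http://proxy2.example.com:8080",
        ]

        def process_request(self, request, spider):
            request.meta["proxy"] = random.choice(self.PROXIES)

    # settings.py -- enable the middleware
    DOWNLOADER_MIDDLEWARES = {
        "myproject.middlewares.RandomProxyMiddleware": 100,
    }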

Acorn