I'm going to write a multithreaded crawler that is planned to run on about 10M pages. To speed things up, I need to fetch about 10 different pages simultaneously.
Each crawler thread will use a different proxy and push its results onto a queue; on the other side, a few more workers will fetch the results from the queue, parse them, and insert them into a DB.
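Here's a rough sketch of the producer/consumer layout I have in mind (the proxy URLs are placeholders, and the parsing/DB step is stubbed out):

```python
import threading
import urllib2
from Queue import Queue

url_queue = Queue()     # URLs waiting to be fetched
result_queue = Queue()  # raw pages waiting to be parsed and stored

def crawler(proxy_url):
    # Each crawler thread builds its own opener bound to one proxy.
    opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy_url}))
    while True:
        url = url_queue.get()
        try:
            result_queue.put((url, opener.open(url).read()))
        except urllib2.URLError:
            pass  # real code would log and/or retry here
        finally:
            url_queue.task_done()

def db_worker():
    while True:
        url, html = result_queue.get()
        # parse html and insert the record into the DB here
        result_queue.task_done()

for proxy in ['http://proxy1:8080', 'http://proxy2:8080']:
    t = threading.Thread(target=crawler, args=(proxy,))
    t.daemon = True
    t.start()
```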
Is that the right approach? Will I have problems storing too many results in the queue? Should I be worried about locks? (I'm using the Queue module.) Which HTTP library would best fit my needs: httplib2 or urllib2?
When creating each thread, should I pass a new instance of the request object to each thread, or should I share one request object among the threads and call its "getPage" function from within each thread?
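To make that last question concrete, here are the two options as I understand them (Fetcher and getPage are stand-ins for my own request class):

```python
import threading

class Fetcher(object):
    """Placeholder for my request class."""
    def __init__(self, proxy):
        self.proxy = proxy
    def getPage(self, url):
        pass  # fetch url through self.proxy

urls = ['http://example.com/a', 'http://example.com/b']
proxies = ['http://proxy1:8080', 'http://proxy2:8080']

# Option A: a new Fetcher instance per thread
threads = [threading.Thread(target=Fetcher(p).getPage, args=(u,))
           for p, u in zip(proxies, urls)]

# Option B: one shared Fetcher whose getPage is called from every thread
shared = Fetcher(proxies[0])
threads = [threading.Thread(target=shared.getPage, args=(u,)) for u in urls]
```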
Thanks.