
I am making a web crawler, and I have some "sleep" functions that make the crawl quite long. For now I am doing:

for speciality in range(1,25):
    for year in range(1997, 2017):
        for quarter in [1,2]:
            deal_with(driver, year, quarter, speciality, ok)

The deal_with function opens several webpages, waiting a few seconds for the complete HTML to download before moving on. The execution time is therefore very long: there are 24 * 20 * 2 = 960 loops, each taking no less than a minute.

I would like to use my 4 physical cores (8 threads) to take advantage of parallelism. I have read about tornado, multiprocessing, joblib... and can't really make up my mind about an easy solution to adapt to my code.

Any insight welcome :-)

Romain Jouin
    You don't really need many CPUs to handle this kind of workload, since web crawling is mostly I/O bound. Try using Tornado – BlackBear Nov 20 '16 at 17:59
  • Well, he could be doing text mining which can be CPU bound. – brandon Nov 20 '16 at 18:08
  • For now my solution is to have several notebooks open, each taking care of a subset of the speciality range... – Romain Jouin Nov 20 '16 at 18:16
  • The easiest would probably be using the `thread` module to download more than one webpage at the same time (assuming your web connection has the bandwidth). – martineau Nov 20 '16 at 19:16
  • Where is your sleep? Inside the `deal_with` method? Would a solution to move to one loop help? – Parfait Nov 20 '16 at 19:22

2 Answers


If you're using Python 3, I would check out the asyncio module. I believe you can just decorate deal_with with @asyncio.coroutine. You will likely have to adjust what deal_with does to work properly with the event loop as well.
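
A minimal sketch of the idea, assuming deal_with can be rewritten so that every blocking wait inside it becomes an awaited asyncio.sleep (on Python 3.5+ the async/await syntax replaces the @asyncio.coroutine decorator). The driver and ok names below are placeholders for the objects from the question, and deal_with_async is a hypothetical coroutine version of deal_with:

import asyncio

driver = ok = None  # placeholders for the objects from the question

# hypothetical coroutine version of deal_with: while one iteration awaits its
# page, the event loop is free to run the others
async def deal_with_async(driver, year, quarter, speciality, ok):
    # ... trigger the page load here ...
    await asyncio.sleep(5)  # non-blocking wait instead of time.sleep(5)
    # ... parse the downloaded HTML and move on ...

async def main():
    tasks = [deal_with_async(driver, year, quarter, speciality, ok)
             for speciality in range(1, 25)
             for year in range(1997, 2017)
             for quarter in (1, 2)]
    await asyncio.gather(*tasks)

asyncio.run(main())  # Python 3.7+; use loop.run_until_complete(main()) on older versions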

Alex

tl;dr: Investing in any of these options without fully understanding the bottlenecks you are facing will not help you.

At the end of the day, there are only two fundamental approaches to scaling out a task like this:

Multiprocessing

You launch a number of Python processes, and distribute tasks to each of them. This is the approach you think will help you right now.

Some sample code for how this works, though you could use any appropriate wrapper:

import multiprocessing

# general rule of thumb: launch twice as many processes as cores

process_pool = multiprocessing.Pool(8) # launches 8 processes

# generate a list of all inputs you wish to feed to this pool

inputs = []

for speciality in range(1,25):
    for year in range(1997, 2017):
        for quarter in [1,2]:
            inputs.append((driver, year, quarter, speciality, ok))

# feed your list of inputs to your process_pool and print the results when done
# (starmap, available since Python 3.3, unpacks each tuple into deal_with's five arguments)
print(process_pool.starmap(deal_with, inputs))

If this is all you wanted, you can stop reading now.

Asynchronous Execution

Here, you are content with a single thread or process, but you don't want it to be sitting idle waiting for stuff like network reads or disk seeks to come back - you want it to go on and do other, more important things while it's waiting.

True native asynchronous I/O support is provided in Python 3 and does not exist in Python 2.7 outside of the Twisted networking library.

import concurrent.futures

# generate a list of all inputs you wish to feed to this pool

inputs = []

for speciality in range(1,25):
    for year in range(1997, 2017):
        for quarter in [1,2]:
            inputs.append((driver, year, quarter, speciality, ok))

# produce a pool of processes, and make sure they don't block each other
# - get back an object representing something yet to be resolved, that will
# only be updated when data comes in.

with concurrent.futures.ProcessPoolExecutor() as executor:
    # submit() takes the callable plus its arguments, so unpack each tuple
    outputs = [executor.submit(deal_with, *input_tuple) for input_tuple in inputs]

    # wait for all of them to finish - not ideal, since it defeats the purpose
    # in production, but sufficient for an example

    for future_object in concurrent.futures.as_completed(outputs):
        print(future_object.result())  # do something with each result

So What's the Difference?

My main point here is to emphasise that choosing from a list of technologies isn't as hard as figuring out where the real bottleneck is.

In the examples above, there isn't any difference. Both follow a simple pattern:

  1. Have a lot of workers
  2. Allow these workers to pick something from a queue of tasks right away
  3. When one is free, set it to work on the next task right away.

Thus, there is no conceptual difference at all if you follow these examples verbatim, even though they use entirely different technologies and claim to use entirely different techniques.

Any technology you pick will be for naught if you write it in this pattern - even though you'll get some speedup, you will be sorely disappointed if you expected a massive performance boost.

Why is this pattern bad? Because it doesn't solve your problem.

Your problem is simple: you have to wait. While your process is waiting for something to come back, it can't do anything else! It can't call more pages for you. It can't process an incoming task. All it can do is wait.

Having more processes that ultimately wait is not the true solution. An army that has to march to Waterloo will not be faster if you split it into regiments - each regiment eventually has to sleep, though they may sleep at different times and for different lengths, and all of them will arrive at roughly the same time.

What you need is an army that never sleeps.

So What Should You Do?

Abstract all I/O-bound tasks into something non-blocking - those waits are your true bottleneck. If you're waiting for a network response, don't let the poor process just sit there - give it something to do.

Your task is made somewhat difficult in that reading from a socket is blocking by default; that is simply how operating systems work. Thankfully, you don't need Python 3 to solve it (though that is always the preferred solution) - the asyncore library (though Twisted is superior to it in every way) already exists in Python 2.7 to make network reads and writes happen truly in the background.

There is one and only one case where true multiprocessing needs to be used in Python, and that's if you are doing CPU-bound or CPU-intensive work. From your description, it doesn't sound like that's the case.

In short, you should edit your deal_with function to avoid the blocking wait. Make that wait happen in the background, if needed, using a suitable abstraction from Twisted or asyncore, as sketched below. But don't let it consume your process completely.
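
A rough illustration of that idea with Twisted (not the answer's original code): deal_with_nonblocking and the fixed 5-second delay are hypothetical stand-ins for whatever deal_with actually waits on, and the None arguments are placeholders for the driver and ok objects from the question.

from twisted.internet import defer, reactor, task

@defer.inlineCallbacks
def deal_with_nonblocking(driver, year, quarter, speciality, ok):
    # start the page load here, then wait without blocking the reactor,
    # so the other iterations can keep running during the delay
    yield task.deferLater(reactor, 5, lambda: None)  # non-blocking "sleep"
    # parse the downloaded HTML here

def crawl_everything():
    deferreds = [deal_with_nonblocking(None, year, quarter, speciality, None)
                 for speciality in range(1, 25)
                 for year in range(1997, 2017)
                 for quarter in (1, 2)]
    # stop the reactor once every page has been dealt with
    defer.gatherResults(deferreds).addCallback(lambda _: reactor.stop())

crawl_everything()
reactor.run()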

Akshat Mahajan