
I am using the wikipedia Python package to scrape data on a particular topic.

q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
     'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

In the example above, I've searched for NASA. Now I need to obtain the summary for each element in the list.

import wikipedia

ny = []
for title in q:
    # Each iteration makes a blocking network request, one after another.
    ny.append(wikipedia.page(title).summary)

Traversing each element of the list and retrieving its summary this way takes almost 40-60 seconds in total (even with a good network connection).

I don't know much about multiprocessing/multithreading. How can I speed up the execution considerably? Any help will be appreciated.

1 Answer


You can use a process pool (see the multiprocessing documentation).

Here is an example based on your code:

import wikipedia
from multiprocessing import Pool


q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
     'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

def f(q_i):
    # Fetch the page and return only its summary.
    y = wikipedia.page(q_i)
    return y.summary

# The guard is required on platforms that spawn worker processes
# (Windows, macOS), so workers can import this module safely.
if __name__ == '__main__':
    with Pool(5) as p:
        ny = p.map(f, q)

Basically, f is applied to each element of q in separate processes. You choose the number of worker processes when creating the pool (5 in my example).
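Since this workload is network-bound rather than CPU-bound, threads are often a better fit than processes: they avoid the cost of spawning interpreters and pickling results between them. Here is a minimal sketch using concurrent.futures.ThreadPoolExecutor from the standard library; the function name fetch_summary and the worker count of 5 are my own choices, chosen to mirror the pool size above:

import wikipedia
from concurrent.futures import ThreadPoolExecutor

q = ['NASA', 'NASA_insignia', 'NASA_spinoff_technologies', 'NASA_facilities', 'NASA_Pathfinder',
     'List_of_NASA_missions', 'Langley_Research_Center', 'NASA-TLX', 'Budget_of_NASA', 'NASA_(disambiguation)']

def fetch_summary(title):
    # Each call spends most of its time waiting on network I/O,
    # so threads can overlap that waiting despite the GIL.
    return wikipedia.page(title).summary

with ThreadPoolExecutor(max_workers=5) as executor:
    # map() preserves input order, so ny lines up with q
    # exactly like the sequential loop.
    ny = list(executor.map(fetch_summary, q))

Unlike multiprocessing, this needs no __main__ guard and behaves the same in an interactive session.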

grovina