
I tried to parallelize a URL-fetching job, since it would otherwise take a massive amount of time to process the 300k URLs I want to fetch. Somehow my code stops working after a random amount of time, and I do not know why. Can you help me? I have already done some research on this but could not find anything that helped much. Normally I can process around 20k links, but then it freezes: no error, no further processing of links, and the program keeps running. Maybe all the workers are congested with bad links? Any way to figure this out?

from concurrent import futures

urls = list(datafull['SOURCEURL'])
#datafull['SOURCEURL'].apply(html_reader)

with futures.ThreadPoolExecutor(max_workers=50) as executor:
    pages = executor.map(html_reader, urls)
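
A variant like this (just a sketch, reusing the same html_reader and urls as above) would at least show progress and surface per-URL failures instead of hiding them:

with futures.ThreadPoolExecutor(max_workers=50) as executor:
    # Map each future back to its URL so problem links can be identified
    future_to_url = {executor.submit(html_reader, url): url for url in urls}
    done = 0
    for future in futures.as_completed(future_to_url):
        done += 1
        try:
            future.result()
        except Exception as exc:
            print('failed:', future_to_url[future], exc)
        if done % 1000 == 0:
            print('processed', done, 'of', len(urls))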

My html_reader function:

import os
import time
import requests

def html_reader(url):
    try:
        os.chdir('/Users/benni/PycharmProjects/Untitled Folder/HTML raw')
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        r = requests.get(url, headers=headers)
        data = r.text
        # Build a file name from the URL
        url = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        name = 'htmlreader_' + url + '.html'
        f = open(name, 'a')
        f.write(str(data))
        f.close()
        print(time.time(), ' ', url)
        return data
    except Exception:
        pass
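
One thing to note: requests.get is called without a timeout, so a stalled server can block a worker thread forever, and the bare except swallows every error. A minimal debug variant (a sketch, not the code above; the short User-Agent string is a placeholder) passes a timeout and prints failures:

def html_reader_debug(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    try:
        # timeout=(connect, read) in seconds: give up instead of blocking forever
        r = requests.get(url, headers=headers, timeout=(5, 30))
        r.raise_for_status()
        return r.text
    except Exception as exc:
        # Print failures instead of silently swallowing them
        print('failed:', url, exc)
        return None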

Thanks a lot!


1 Answer


I modified and cleaned up your code a bit; you could try this version. In the main block you need to set the n_jobs parameter to the number of cores available on your machine minus one. I use the joblib library for multiprocessing, so you need to have it installed.

import requests as rq
import os
from joblib import Parallel, delayed

os.chdir('/Users/benni/PycharmProjects/Untitled Folder/HTML raw')
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}


def write_data(url, html_text):
    # Derive a file name from the URL so parallel workers
    # do not overwrite each other's output
    name = 'htmlreader_' + str(url).replace('/', '').replace('http:', '').replace('https:', '') + '.html'
    with open(name, 'w+') as fp:
        fp.write(html_text)


def get_html(url):
    resp = rq.get(url, headers=headers)
    if resp.status_code == 200:
        write_data(url, resp.text)
    else:
        print("No data received for: {0}".format(url))

if __name__ == "__main__":
    urls = list(datafull['SOURCEURL'])
    # Set n_jobs to the number of cores on your machine minus 1
    Parallel(n_jobs=3)(delayed(get_html)(link) for link in urls)
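
If you do not want to hard-code the core count, joblib also accepts negative values for n_jobs: -1 uses all cores and -2 uses all cores but one, e.g.:

Parallel(n_jobs=-2)(delayed(get_html)(link) for link in urls)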
  • I already tried this. However, it's not working at all. Nothing is happening. Any idea why it's not working? I am running macOS. – inneb Mar 05 '18 at 15:11