I am trying to parallelize a URL-fetching job, since processing the 300k URLs I have sequentially would take a massive amount of time. Somehow my code stops working after a random amount of time and I do not know why. Can you help me? I already did some research on this but could not find anything that helped much. It usually gets through around 20k links and then freezes: no error is raised, no further links are processed, but the program keeps running. Maybe all the workers are stuck on bad links? Is there any way to figure this out?
from concurrent import futures

urls = list(datafull['SOURCEURL'])
# datafull['SOURCEURL'].apply(html_reader)
with futures.ThreadPoolExecutor(max_workers=50) as executor:
    pages = executor.map(html_reader, urls)
My html_reader function:
import os
import time

import requests

def html_reader(url):
    try:
        # all downloaded pages are written into this folder
        os.chdir('/Users/benni/PycharmProjects/Untitled Folder/HTML raw')
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        r = requests.get(url, headers=headers)
        data = r.text
        # build a file name from the URL
        url = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        name = 'htmlreader_' + url + '.html'
        f = open(name, 'a')
        f.write(str(data))
        f.close()
        print(time.time(), ' ', url)
        return data
    except Exception:
        pass
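To try to narrow this down, I was thinking of a debug variant that adds a request timeout and logs failures instead of silently swallowing them, roughly like the sketch below. The 10-second timeout, the fetch_errors.log file name, the SAVE_DIR constant and the html_reader_debug name are just values I made up for testing, not anything I have run at scale yet:

import logging
import os
import time

import requests

logging.basicConfig(filename='fetch_errors.log', level=logging.INFO)

SAVE_DIR = '/Users/benni/PycharmProjects/Untitled Folder/HTML raw'

def html_reader_debug(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
        # timeout so a single slow or dead link cannot block a worker forever
        r = requests.get(url, headers=headers, timeout=10)
        data = r.text
        safe = str(url).replace('/', '').replace('http:', '').replace('https:', '')
        # build an absolute path instead of calling os.chdir from every thread
        path = os.path.join(SAVE_DIR, 'htmlreader_' + safe + '.html')
        with open(path, 'a') as f:
            f.write(str(data))
        print(time.time(), ' ', url)
        return data
    except Exception:
        # record the failing URL and traceback instead of dropping it
        logging.exception('failed on %s', url)
        return None

Would that be a reasonable way to find out whether stuck requests are the problem, or is there a better way to see what the threads are actually doing when it freezes?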
Thanks a lot!