A script reads a list of URLs, puts them into a Queue, and then processes them with python-newspaper3k. I have a lot of different URLs, many of them pointing to not very popular websites. The problem is that the processing never finishes: sometimes it reaches the end, but usually a few worker threads run into some problem and never stop. The hang happens when newspaper tries to download and parse the HTML. Here is the code.
First I load the URLs into the Queue (that step is sketched after the function below), and then each worker downloads and parses the HTML with newspaper:
import sys
import time
from queue import Queue, Empty
from threading import Thread
from newspaper import Article

def grab_data_from_queue():
    while True:
        if q.empty():  # stop once the queue has been drained
            break
        try:
            urlinit = q.get(timeout=10)  # get the next item from the queue
        except Empty:
            break
        try:
            if urlinit is None:
                print('urlinit is None')
                continue
            url = urlinit.split("\t")[0]
            url = url.strip('/')
            if ',' in url:
                print(', in url')
                continue
            datecsv = urlinit.split("\t\t\t\t\t")[1]
            url2 = url
            time_started = time.time()
            timelimit = 2
            #page = requests.get(url)
            #page.raise_for_status()
            if len(url) > 30:
                if photo == 'wp':  # 'photo' is set elsewhere in the script
                    article = Article(url, browser_user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0')
                else:
                    article = Article(url, browser_user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0', fetch_images=False)
                imgUrl = ""
                #response = get(url, timeout=10)
                #article.set_html(response.content)
                article.download()
                article.parse()  # this is where some threads never return
                print(str(q.qsize()) + " parse passed")
        except Exception as e:
            print('failed on ' + str(urlinit) + ' => ' + str(e))
        finally:
            q.task_done()  # always mark the item done so q.join() can return
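For reference, the loading step looks roughly like this. This is a sketch, not my exact code: the file name urls.txt and the tab-separated layout (URL first, date later on the same line) are assumptions based on the split("\t") calls above.

q = Queue()
with open('urls.txt') as f:      # hypothetical input file
    for line in f:
        line = line.rstrip('\n')
        if line:
            q.put(line)          # each worker splits the line itself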
Then I start the threads:
for i in range(4):  # number of worker threads
    try:
        t1 = Thread(target=grab_data_from_queue)  # target is the function above
        t1.daemon = True
        t1.start()  # start the thread
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(str(exc_tb.tb_lineno) + ' => ' + str(e))
q.join()
Is there a way to find out which URL causes the problem and takes too long to finish? And if I can't identify the URL, is it possible to stop the daemon threads?
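To make the first part of the question concrete: I imagine some per-thread bookkeeping that records which URL each thread is currently working on, so that when the script stalls I can inspect it. A minimal sketch of what I mean (the in_flight dict and the track/untrack helpers are illustrative, not part of my real code):

import threading
import time

in_flight = {}  # thread name -> URL currently being processed

def track(url):
    in_flight[threading.current_thread().name] = url

def untrack():
    in_flight.pop(threading.current_thread().name, None)

# in grab_data_from_queue, wrap the slow calls:
#     track(url)
#     article.download()
#     article.parse()
#     untrack()

# in the main thread, poll periodically instead of blocking on q.join():
while threading.active_count() > 1:
    time.sleep(30)
    if in_flight:
        print('still working on: ' + str(in_flight))

That would at least tell me which URLs never return from download()/parse(), but I still don't know how to kill the stuck daemon threads afterwards.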