
A script reads a list of URLs, loads them into a Queue, and processes each one with python-newspaper3k. The list is large and varied, and many of the URLs point to fairly obscure websites. The problem is that the processing never finishes: sometimes it does reach the end, but usually a few threads run into some problem and never stop, apparently while newspaper is downloading or parsing the HTML. The code is below.

Here I load the URLs into the Queue and then, using newspaper, download and parse each page.

import sys
import time
from queue import Queue, Empty
from threading import Thread
from newspaper import Article

q = Queue()  # filled elsewhere with lines of the form "url\t\t\t\t\tdate"

def grab_data_from_queue():
    while True:
        try:
            urlinit = q.get(timeout=10)  # get the next item from the queue
        except Empty:
            break  # queue drained, let the thread exit
        try:
            if urlinit is None:
                print('urlinit is None')
                continue  # nothing to parse, move on to the next item
            url = urlinit.split("\t")[0].strip('/')
            if ',' in url:
                print(', in url')
                continue  # skip malformed URLs
            datecsv = urlinit.split("\t\t\t\t\t")[1]
            url2 = url
            time_started = time.time()
            timelimit = 2

            if len(url) > 30:

                if photo == 'wp':  # photo is set elsewhere in the script
                    article = Article(url, browser_user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0')
                else:
                    article = Article(url, browser_user_agent='Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0', fetch_images=False)
                    imgUrl = ""

                article.download()
                article.parse()
                print(str(q.qsize()) + " parse passed")
        except Exception as e:
            print('failed on ' + repr(urlinit) + ': ' + str(e))
        finally:
            q.task_done()  # must run once per get(), otherwise q.join() blocks forever
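
As far as I can tell, newspaper can take a per-request timeout through its Config object, which should at least bound the download step (a sketch using the same user agent as above; I don't think it bounds the parse step):

from newspaper import Article, Config

config = Config()
config.browser_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0'
config.request_timeout = 10  # seconds for the HTTP request itself
config.fetch_images = False

article = Article(url, config=config)
article.download()
article.parse()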

Then I start the worker threads:

for i in range(4): # number of worker threads
    try:
        t1 = Thread(target=grab_data_from_queue) # target is the function above
        t1.daemon = True # daemon, so a hung worker cannot keep the process alive
        t1.start() # start the thread
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(str(exc_tb.tb_lineno) + ' => ' + str(e))


q.join()
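
Since the workers are daemons, the process itself can exit even if one of them hangs. If I collected the threads into a list instead of relying only on q.join(), I think the main thread could join them against a deadline and report any stragglers (a sketch; the threads list is my own addition and is not in the code above):

threads = []  # each t1 from the loop above would be appended here

deadline = time.time() + 600  # assumed overall time budget, in seconds
for t in threads:
    t.join(timeout=max(0, deadline - time.time()))
for t in threads:
    if t.is_alive():
        print(t.name + ' is still running past the deadline')
# daemon threads do not keep the interpreter alive, so the script can exit here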

Is there a way to find out which URL is the problem, i.e. which one takes too long to finish? And if I can't find the URL, is it possible to stop a hung daemon thread?
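
What I have in mind is something like a shared table of what each worker is currently processing, plus a watchdog thread that prints anything that has been in flight too long, but I don't know if this is the right approach (a sketch; in_progress and watchdog are hypothetical names, and each worker would have to update the table just before article.download()):

import threading

in_progress = {}  # thread name -> (url, start time)

# inside grab_data_from_queue, just before article.download():
#     in_progress[threading.current_thread().name] = (url, time.time())

def watchdog(limit=60):
    while True:
        now = time.time()
        for name, (url, started) in list(in_progress.items()):
            if now - started > limit:
                print('%s stuck for %ds on %s' % (name, int(now - started), url))
        time.sleep(10)

threading.Thread(target=watchdog, daemon=True).start()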
