I am parsing multiple pages at once using the lxml module, with this piece of code:
import urllib2
import Queue
import multiprocessing
import lxml.html

def read_and_parse_url(url, queue):
    """ Read and parse the url """
    data = urllib2.urlopen(url).read()
    root = lxml.html.fromstring(data)
    queue.put(root)

def fetch_parallel(urls_to_load):
    """ Read and parse urls in parallel """
    result = Queue.Queue()
    processes = [multiprocessing.Process(target=read_and_parse_url, args=(url, result))
                 for url in urls_to_load]
    for p in processes:
        p.start()
    for p in processes:
        p.join(15)  # 15 seconds timeout
    return result
Using the Queue module (result = Queue.Queue()), after it runs I check qsize() and the size is zero, as if I had never put any data there (it's supposed to be 50+).
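Roughly how I call it, simplified (the URL list here is just a placeholder; the real code builds 50+ URLs elsewhere):

urls = ["http://example.com/page%d" % i for i in range(50)]   # placeholder URLs
parsed_urls = fetch_parallel(urls)
print parsed_urls.qsize()   # prints 0 when result is a Queue.Queue(), even though every worker calls put()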
If I create the queue with result = multiprocessing.Queue() instead, qsize() shows the size properly, but then I have a new problem: when I call get() on the queue, I get this error:
Traceback (most recent call last):
File "test.py", line 329, in <module>
d = scrape()
File "test.py", line 172, in scrape
print parsed_urls.get()
File "lxml.etree.pyx", line 1021, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:37950)
File "lxml.etree.pyx", line 863, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:36699)
File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:10557)
AssertionError: invalid Element proxy at 36856848
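Stripped down, this is what I believe is the minimal case that triggers it (the URL is just a placeholder):

import multiprocessing
import urllib2
import lxml.html

def worker(queue):
    data = urllib2.urlopen("http://example.com").read()   # placeholder URL
    queue.put(lxml.html.fromstring(data))                 # put the parsed root into the queue

q = multiprocessing.Queue()
p = multiprocessing.Process(target=worker, args=(q,))
p.start()
root = q.get()   # get() returns something...
print root       # ...but printing it (i.e. repr) raises "invalid Element proxy"
p.join()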
Some notes:
- parsed_urls is just the queue
- when I was using the threading module, everything worked perfectly. The only problem is that I couldn't kill threads in an easy way, so I switched to the multiprocessing module (roughly as sketched below).
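For reference, the threading version that worked was essentially the same code with Thread in place of Process; a rough sketch (the function name here is mine, not from my actual code):

import threading
import Queue

def fetch_parallel_threaded(urls_to_load):
    """ Rough sketch of the threading version that worked. """
    result = Queue.Queue()
    threads = [threading.Thread(target=read_and_parse_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join(15)   # same 15 second timeout
    return result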
What's wrong with using the Queue module together with the multiprocessing module? It doesn't seem to work.
Any clues? I've searched for pretty much all of this and couldn't find any answers.