I am parsing multiple pages at once using the lxml module, with this piece of code:
import urllib2
import Queue
import multiprocessing
import lxml.html

def read_and_parse_url(url, queue):
    """ Read and parse the url """
    data = urllib2.urlopen(url).read()
    root = lxml.html.fromstring(data)
    queue.put(root)

def fetch_parallel(urls_to_load):
    """ Read and parse urls in parallel """
    result = Queue.Queue()
    processes = [multiprocessing.Process(target=read_and_parse_url, args=(url, result))
                 for url in urls_to_load]
    for p in processes:
        p.start()
    for p in processes:
        p.join(15)  # 15 seconds timeout
    return result
Using the Queue module (result = Queue.Queue()), after it runs I check qsize() and the size is zero, as if I had never put any data there (it's supposed to be 50+).
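Roughly how I call it, simplified (the URL list here is just a placeholder; the real code builds 50+ URLs elsewhere):

urls = ["http://example.com/page%d" % i for i in range(50)]   # placeholder URLs
parsed_urls = fetch_parallel(urls)
print parsed_urls.qsize()   # prints 0 when result is a Queue.Queue(), even though every worker calls put()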
If I create the queue with result = multiprocessing.Queue() instead, qsize() shows the size properly, but then I have a new problem: when I call get() on the queue, I get this error:
Traceback (most recent call last):
File "test.py", line 329, in <module>
d = scrape()
File "test.py", line 172, in scrape
print parsed_urls.get()
File "lxml.etree.pyx", line 1021, in lxml.etree._Element.__repr__ (src/lxml/lxml.etree.c:37950)
File "lxml.etree.pyx", line 863, in lxml.etree._Element.tag.__get__ (src/lxml/lxml.etree.c:36699)
File "apihelpers.pxi", line 15, in lxml.etree._assertValidNode (src/lxml/lxml.etree.c:10557)
AssertionError: invalid Element proxy at 36856848
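Stripped down, this is what I believe is the minimal case that triggers it (the URL is just a placeholder):

import multiprocessing
import urllib2
import lxml.html

def worker(queue):
    data = urllib2.urlopen("http://example.com").read()   # placeholder URL
    queue.put(lxml.html.fromstring(data))                 # put the parsed root into the queue

q = multiprocessing.Queue()
p = multiprocessing.Process(target=worker, args=(q,))
p.start()
root = q.get()   # get() returns something...
print root       # ...but printing it (i.e. repr) raises "invalid Element proxy"
p.join()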
Some notes:
- parsed_urls is just the queue
- when I was using the threading module, everything worked perfectly. The only problem is that I couldn't kill threads in an easy way, so I switched to the multiprocessing module (roughly as sketched below).
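For reference, the threading version that worked was essentially the same code with Thread in place of Process; a rough sketch (the function name here is mine, not from my actual code):

import threading
import Queue

def fetch_parallel_threaded(urls_to_load):
    """ Rough sketch of the threading version that worked. """
    result = Queue.Queue()
    threads = [threading.Thread(target=read_and_parse_url, args=(url, result))
               for url in urls_to_load]
    for t in threads:
        t.start()
    for t in threads:
        t.join(15)   # same 15 second timeout
    return result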
What's wrong with using the Queue module together with the multiprocessing module? It doesn't seem to work.
Any clues? I've searched for pretty much all of this and couldn't find any answers.