I am trying to extract every thread title from this mailing list while recording how many replies each thread has.
According to Firebug, the XPath to the <ul> that contains all the titles is:
/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul
However, if I paste this directly into the Scrapy shell, it yields an empty list:
scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
response.xpath('/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul')
[]
After some trial and error (I couldn't figure out from the documentation how to list the immediate sub-elements of a given Selector; please let me know if you know of a way), I figured out that the "tbody" elements didn't work in the XPath. By removing them, I was able to navigate as far as /td:
almost_email_threads = response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td')
However, if I now attempt to reach "ul" from there, it does not work:
almost_email_threads.xpath('/ul')
[]
Now, what confuses me the most is that running:
response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td//ul')
will give me the ul's, but not in the same order as they appear on the website. It skips some threads and returns others out of order. Furthermore, it seems impossible to count the number of replies per thread.
What am I missing here? It's been a while since I've used Scrapy, but I don't recall it being this hard to figure out, and for whatever reason no tutorials turn up for me on either Bing or Google.