I am trying to extract the title of every thread on this mailing list and record how many replies each thread has.

According to Firebug, the XPath to the <ul> that contains all the titles is:

/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul

However, if I paste this directly into the Scrapy shell, it returns an empty list:

scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
response.xpath('/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul')
[]

After some trial and error, I figured out that the "tbody" elements did not work in the XPath. (I couldn't find any way in the documentation to list the immediate sub-elements of a given Selector; please let me know if you know of one.) By removing them, I was able to navigate down to the td:

almost_email_threads = response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td')

However, if I now try to reach the "ul" from there, it does not work:

almost_email_threads.xpath('/ul')
[]

Now, what confuses me the most is that running:

response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td//ul')

will give me the ul elements, but not in the order in which they appear on the website: some threads are skipped and others come out in a different order. Furthermore, it seems impossible to count the number of replies per thread.

What am I missing here? It's been a while since I've used Scrapy, but I don't recall it being this hard to figure out, and for whatever reason tutorials do not turn up on either Bing or Google for me.

1 Answer

I have never used Firebug, but looking at the HTML page you refer to, I'd say that the following XPath expression will give you all top-level threads:

//li[not(ancestor::li) and ./a/@name]

Starting from each of these list elements, you then count its descendant list items to get the number of replies to that thread.

Using the Scrapy shell, this results in:

> scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
In [1]: threads = response.xpath('//li[not(ancestor::li) and ./a/@name]')
In [2]: for thread in threads:
   ...:     print thread, len(thread.xpath('descendant::li'))
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="0" href="0">Testing</a> <em'> 0
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="1" href="1">full disclosure'> 4
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="3" href="3">The Death Of TC'> 1
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="7" href="7">Re: Announcing '> 24
[...]

Regarding your question on how to list all sub-elements of a given selector: the result of running an XPath query on a selector is a SelectorList, and each element of that list is itself a Selector. So you can simply apply XPath again, for example to list all the children:

In [3]: thread.xpath('child::*')
Out[3]: 
[<Selector xpath='child::*' data=u'<a name="309" href="309">it\'s all about '>,
 <Selector xpath='child::*' data=u'<em>Florin Andrei (Jul 31)</em>'>,
 <Selector xpath='child::*' data=u'<ul>\n<li><a name="313" href="313">it\'s a'>]
Markus
  • Thank you. It seems to work here. I will wait to see whether anyone knows how to list the selectors in Scrapy or wants to chip in with other advice; if not, I will accept your answer. I upvoted your answer as well. Do you know why both my question and your answer were downvoted? It's hard to improve when people just downvote without commenting. – Oeufcoque Penteano Aug 02 '16 at 19:47
  • I have updated the answer to show you how to get the number of replies and the children of a selector. Regarding the downvote on your question, I can only speculate; it might be argued that it contains no MCV example and has at least one formatting error. Having a look at http://stackoverflow.com/help/how-to-ask might help. But I think it is just plain rude to downvote without giving an explanatory comment. – Markus Aug 03 '16 at 07:42