Scrapy Shell XPath

Question

I am trying to get the links and categories from this news feed website: http://www.npr.org/rss/#feeds

This is my xpath in scrapy shell:

a = sel.xpath('//ul[@class="rsslinks"]/li/a/@href').extract()

b = sel.xpath('//ul[@class="rsslinks"]/li/a/text()').extract()

But the length of b is one less than the length of a. I don't know what I am missing here, and the mismatch shifts the data out of alignment.

As a result, the category name "Most Emailed Stories" ends up paired with the link for "News Headlines".

Any help would be appreciated.

alecxe · Accepted Answer · 2015-01-05T22:54:59.997

4

This is because of the first link in the results:

<a class="iconlink xml" href="/rss/rss.php?id=1001" target="blank"><strong>News Headlines</strong></a>

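As you can see, this a element has no direct child text nodes, only a single strong element. With a single slash, text() selects only text nodes that are direct children of a, so your XPath returns nothing for this link.

Add another slash to get all descendant text nodes of the a tag: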

//ul[@class="rsslinks"]/li/a//text()
                         HERE^
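
The difference between a/text() and a//text() can be reproduced without Scrapy. The sketch below is only an analogy built on Python's standard-library ElementTree, not Scrapy's selector API: the markup is a simplified, hypothetical stand-in for the real page, the second href is made up, and direct_text and all_text are illustrative helper names. It also pairs each href with its label per a element, so the two values cannot drift out of step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real NPR HTML).
# The second href is invented for illustration.
SAMPLE = """
<div>
  <ul class="rsslinks">
    <li><a class="iconlink xml" href="/rss/rss.php?id=1001" target="blank"><strong>News Headlines</strong></a></li>
    <li><a class="iconlink xml" href="/rss/rss.php?id=EXAMPLE">Most Emailed Stories</a></li>
  </ul>
</div>
"""


def direct_text(a):
    # Rough analogue of a/text(): only text that sits directly inside <a>.
    parts = [a.text or ""] + [child.tail or "" for child in a]
    return "".join(parts).strip()


def all_text(a):
    # Rough analogue of a//text(): text of <a> and all of its descendants.
    return "".join(a.itertext()).strip()


root = ET.fromstring(SAMPLE)
links = root.findall(".//ul[@class='rsslinks']/li/a")

# Direct text only: the first link yields an empty label.
direct = [direct_text(a) for a in links]

# Extract href and label from the same element so they stay aligned.
pairs = [(a.get("href"), all_text(a)) for a in links]

print(direct)
print(pairs)
```

In the Scrapy shell, the same per-link pairing would presumably be done by looping over sel.xpath('//ul[@class="rsslinks"]/li/a') and running the relative expressions @href and .//text() on each selected a element.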

edited Jan 05 '15 at 22:54

answered Jan 05 '15 at 19:24


In my opinion, the wording of your answer is misleading because it suggests that there is a `text` element in the sense of an element node named "text", in the same way as `strong`. Please make it clear that text nodes are not element nodes. – Mathias Müller Jan 05 '15 at 22:54
@MathiasMüller very good point, thanks, I think it should be better now. And, btw, thanks for contributing to the `xpath` tag - learning a lot from your answers. – alecxe Jan 05 '15 at 22:55
+1, it's fine now. And ditto, I can only return the compliment! – Mathias Müller Jan 05 '15 at 23:07

score 1 · Answer 2 · answered Jan 05 '15 at 19:25


The text for /rss/rss.php?id=1001, labeled News Headlines, sits one level further down, inside a <strong> element, while the text of the other links is a direct child of the a tag.

answered Jan 05 '15 at 19:25

Camron_Godbout


Yes, my bad! Didn't notice that, thanks for the help! – m0rpheu5 Jan 05 '15 at 20:16
