2

I am trying to extract the links and category names from the NPR RSS feeds page at http://www.npr.org/rss/#feeds.

These are my XPath expressions in the Scrapy shell:

a = sel.xpath('//ul[@class="rsslinks"]/li/a/@href').extract()

b = sel.xpath('//ul[@class="rsslinks"]/li/a/text()').extract()

But the length of b is one less than the length of a, so the links and category names end up misaligned. I don't know what I am missing here.

In the image below, the category name is "Most Emailed Stories" but the link is the one for "News Headlines".

Any help would be appreciated.

[Image: Xpath Screen]

alecxe
m0rpheu5

2 Answers

4

This is because of the first link in the results:

<a class="iconlink xml" href="/rss/rss.php?id=1001" target="blank"><strong>News Headlines</strong></a>

As you can see, this a tag has no direct child text nodes, only a single strong element, so a/text() does not match anything inside it.

Add another slash to get all descendant text nodes of the a tag:

//ul[@class="rsslinks"]/li/a//text()
                         HERE^
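
To see the difference without hitting the live page, here is a minimal sketch using only Python's standard library rather than Scrapy. The markup is a simplified, hypothetical copy of the list (the second href is a placeholder, not a real NPR URL). In ElementTree, `a.text` behaves roughly like `a/text()` for this markup, and `a.itertext()` behaves like `a//text()`.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the NPR list markup.
# The second href is a placeholder, not a real feed URL.
HTML = """
<ul class="rsslinks">
  <li><a href="/rss/rss.php?id=1001"><strong>News Headlines</strong></a></li>
  <li><a href="/rss/placeholder">Most Emailed Stories</a></li>
</ul>
"""

root = ET.fromstring(HTML)          # root is the <ul> element
anchors = root.findall("./li/a")    # same elements as ul/li/a

# Rough analogue of a/text(): only text sitting directly inside <a>.
# The first <a> has none, because its text lives inside <strong>.
direct_text = [a.text for a in anchors if a.text]

# Rough analogue of a//text(): text of <a> and all of its descendants.
all_text = ["".join(a.itertext()) for a in anchors]

print(len(anchors), len(direct_text), len(all_text))
```

With this markup, `direct_text` is one item shorter than the list of anchors, which mirrors the mismatch in the question, while `all_text` has one entry per link.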
alecxe
  • In my opinion, the wording of your answer is misleading because it suggests that there is a `text` element in the sense of an element node named "text", in the same way as `strong.` Please make it clear that text nodes are not element nodes. – Mathias Müller Jan 05 '15 at 22:54
  • @MathiasMüller very good point, thanks, I think it should be better now. And, btw, thanks for contributing to `xpath` tag - learning a lot from your answers. – alecxe Jan 05 '15 at 22:55
  • + 1, it's fine now. And dito, I can only return the compliment! – Mathias Müller Jan 05 '15 at 23:07
1

The label for /rss/rss.php?id=1001, "News Headlines", is nested one level deeper, inside a <strong> element, whereas the labels of the other links are direct children of their a tags.
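
Because the nesting depth can differ from link to link, one way to keep each label attached to its own link is to walk the a elements one at a time instead of building two separate lists. The following is a sketch under assumptions: it uses the standard library instead of Scrapy, the helper name `link_label_pairs` and the markup are hypothetical, and the second href is a placeholder.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the NPR list markup.
HTML = """
<ul class="rsslinks">
  <li><a href="/rss/rss.php?id=1001"><strong>News Headlines</strong></a></li>
  <li><a href="/rss/placeholder">Most Emailed Stories</a></li>
</ul>
"""


def link_label_pairs(markup):
    """Return (href, label) tuples, one per <a>, whatever the nesting depth."""
    root = ET.fromstring(markup)
    pairs = []
    for a in root.iter("a"):
        # Join every text fragment under this <a> so nested tags are included.
        label = "".join(a.itertext()).strip()
        pairs.append((a.get("href"), label))
    return pairs


pairs = link_label_pairs(HTML)
for href, label in pairs:
    print(href, label)
```

Each tuple is built from a single a element, so a link cannot be paired with a neighbouring link's label even if some labels are wrapped in extra tags.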

Camron_Godbout