
I'm running this spider on a site. It works fine, but a number of the hrefs it picks up have "#" as the link.

How can I skip or drop those "#" links altogether? I'm currently writing the links to a file, and using lstrip leaves an empty string ("") in the file. I've also tried i.replace, but that still writes a blank line to the file.

  • Can you clarify your intention? Would you like to remove the '#' from the string, or ignore those links altogether? – omri_saadon Feb 20 '17 at 23:00
  • It's generally a good idea to post the relevant portions of code *here*, rather than host them at some 3rd party link. – Nick T Feb 20 '17 at 23:09

2 Answers


You're yielding an item for everything that matches your selector, including the links that are empty once the "#" is stripped. Yield conditionally instead, so convert:

for i in selector.extract():
    yield {"url": i.lstrip('#')}

into something like

for i in selector.extract():
    url = i.lstrip('#')
    if url:
        yield {"url": url}
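
Note that lstrip('#') also turns a link such as "#top" into "top". If the goal is to drop fragment-only links entirely rather than trim them, a variant along these lines may fit better. This is only a sketch: iter_real_links is a hypothetical helper name, and it takes a plain list of strings standing in for selector.extract().

```python
def iter_real_links(hrefs):
    """Yield one item per href, skipping empty and fragment-only links.

    `hrefs` is assumed to be a list of strings, e.g. the result of
    selector.extract() in the spider (hypothetical stand-in here).
    """
    for href in hrefs:
        href = href.strip()
        # Skip "", "#" and in-page anchors such as "#top".
        if not href or href.startswith("#"):
            continue
        yield {"url": href}


if __name__ == "__main__":
    sample = ["#", "/a", "", "#top", " /b#frag "]
    for item in iter_real_links(sample):
        print(item)
```

Inside the spider, the same condition could replace the `if url:` check, so that nothing is yielded (and nothing is written to the file) for those links.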
Nick T

To skip those links, change the XPath expression so that it extracts the href attribute only when its value does not contain "#":

selector = response.xpath('//*/a[not(contains(@href, "#"))]/@href')
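
One thing to be aware of: contains() matches "#" anywhere in the value, so this expression also drops links like "/docs#install". If only fragment-only links should go, the standard XPath 1.0 function starts-with() could be used in the predicate instead. The sketch below mirrors both predicates in plain Python to show the difference; the helper names and sample hrefs are made up for illustration.

```python
# Plain-Python stand-ins for the two XPath predicates (hypothetical names).

def keep_if_no_hash_anywhere(href):
    # Mirrors not(contains(@href, "#")): rejects any href containing "#".
    return "#" not in href


def keep_if_not_fragment_only(href):
    # Mirrors not(starts-with(@href, "#")): rejects only hrefs that begin with "#".
    return not href.startswith("#")


hrefs = ["#", "#top", "/about", "/docs#install"]  # made-up sample data

print([h for h in hrefs if keep_if_no_hash_anywhere(h)])
# ['/about']
print([h for h in hrefs if keep_if_not_fragment_only(h)])
# ['/about', '/docs#install']
```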
zet5