Dropping "#" links from Scrapy crawl

Question

I'm running this spider on a site. It works fine, but one problem I'm running into is that there are a number of hrefs with "#" as the link.

How can I skip or drop those "#" links altogether? I'm currently writing the links to a file, and using lstrip writes an empty string ("") to the file. I've also tried i.replace, but that still leaves a blank line in the file.

Can you clarify your intention? Would you like to remove the '#' from the string, or ignore those links altogether? – omri_saadon, Feb 20 '17 at 23:00
It's generally a good idea to post the relevant portions of code *here*, rather than host them at some 3rd party link. — Nick T, Feb 20 '17 at 23:09

score 1 · Answer 1 · answered Feb 20 '17 at 23:07

You're yielding an item for everything that matches your selector, including the links that end up empty. Yield conditionally instead, so convert:

for i in selector.extract():
    yield {"url": i.lstrip('#')}

into something like

for i in selector.extract():
    url = i.lstrip('#')
    if url:
        yield {"url": url}
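
One thing to watch with `lstrip('#')`: it turns an href like `#top` into `top`, which then passes the `if url:` check. If the goal is to drop every link that only points within the current page, a check on the original string may be safer. Below is a minimal sketch of that idea; `iter_page_links` is a hypothetical helper name (not part of Scrapy), and it assumes `hrefs` is the list of strings returned by `selector.extract()`.

```python
def iter_page_links(hrefs):
    """Yield items only for hrefs that point somewhere other than the current page."""
    for href in hrefs:
        href = href.strip()
        # Skip empty hrefs and fragment-only links such as "#" or "#top".
        if not href or href.startswith("#"):
            continue
        yield {"url": href}
```

Inside the spider's callback this could be used as `yield from iter_page_links(selector.extract())`.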

score 0 · Answer 2 · answered Feb 22 '17 at 08:24

To skip those links, change the XPath expression so that it extracts the href attribute only when it does not contain "#":

selector = response.xpath('//*/a[not(contains(@href, "#"))]/@href')
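
Note that `contains(@href, "#")` also excludes links such as `page.html#section`, which do lead to another page. If those should be kept, one option is to filter on the Python side instead. The sketch below is only an illustration under that assumption: `drop_fragment_only` is a hypothetical helper name, and it uses the standard library's `urllib.parse.urldefrag` to split off the fragment and keep whatever URL remains.

```python
from urllib.parse import urldefrag


def drop_fragment_only(hrefs):
    """Return hrefs with their fragments removed, skipping fragment-only links."""
    kept = []
    for href in hrefs:
        # urldefrag splits "page.html#section" into ("page.html", "section").
        url, _fragment = urldefrag(href.strip())
        if url:  # empty when the href was just "#" or "#section"
            kept.append(url)
    return kept
```

With this approach the XPath can stay as a plain `//a/@href`, and the filtering happens after `extract()`.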

zet5

That worked perfectly! I had actually tried `not(contains())` previously, but it seems my syntax was wrong. Thanks! – Christopher Smith Feb 23 '17 at 19:20
