
I tried this and it lists all the URLs on my website:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            # Make relative links absolute against the site root.
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)

I want to list only the URLs whose pages contain some text, say "Scrapy Test". Any help will be appreciated.

Roshan Chhetri

1 Answer


If you already have all the URLs (as you say in your comment) but want to filter them by a substring, then try:

if 'Scrapy Test' in url:
    print url
    yield Request(url)
Steve
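Note that this snippet filters on the URL string itself. If the goal is instead to keep only the URLs whose pages contain the text (as the question asks), a minimal sketch is to test the downloaded body inside the parse callback. This reuses the URL constant, imports, and Python 2 / legacy Scrapy style from the question's spider, and assumes the phrase "Scrapy Test" appears verbatim in the raw HTML:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Report this page's URL only when its body contains the target text.
        # (Assumes the phrase appears literally in the HTML source.)
        if 'Scrapy Test' in response.body:
            print response.url
        # Keep following links either way, so the whole site is still crawled.
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse)

With this approach every page is still fetched once; the text test only controls which URLs get printed.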