
I tried this and it lists all the URLs on my website:

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.http import Request

DOMAIN = 'example.com'
URL = 'http://%s' % DOMAIN

class MySpider(BaseSpider):
    name = DOMAIN
    allowed_domains = [DOMAIN]
    start_urls = [
        URL
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for url in hxs.select('//a/@href').extract():
            # Make relative links absolute against the site root.
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            print url
            yield Request(url, callback=self.parse)

I want to list only the URLs whose pages contain some text, say "Scrapy Test". Any help will be appreciated.

Roshan Chhetri

1 Answer


If you already have all the URLs (as you say in your comment) but want to filter them by a substring, then try:

if 'Scrapy Test' in url:
    print url
    yield Request(url)
Steve
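Note that this snippet filters on the URL string itself. If the goal is instead to keep only the URLs whose pages contain the text (as the question asks), a minimal sketch is to test the downloaded body inside the parse callback. This reuses the URL constant, imports, and Python 2 / legacy Scrapy style from the question's spider, and assumes the phrase "Scrapy Test" appears verbatim in the raw HTML:

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Report this page's URL only when its body contains the target text.
        # (Assumes the phrase appears literally in the HTML source.)
        if 'Scrapy Test' in response.body:
            print response.url
        # Keep following links either way, so the whole site is still crawled.
        for url in hxs.select('//a/@href').extract():
            if not (url.startswith('http://') or url.startswith('https://')):
                url = URL + url
            yield Request(url, callback=self.parse)

With this approach every page is still fetched once; the text test only controls which URLs get printed.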