
I am very new to Scrapy and I have not used regular expressions before.

The following is my spider.py code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

Now if we look at start_urls, all three URLs are the same except that they differ in the integer value 1, 2, 3 and so on; there is no fixed upper limit, it depends on the URLs present on the site. I know that we can use CrawlSpider and construct a regular expression for the URL like below:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    import re

    class ExampleSpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = [
            "http://www.example.com/bookstore/new/1?filter=bookstore",
            "http://www.example.com/bookstore/new/2?filter=bookstore",
            "http://www.example.com/bookstore/new/3?filter=bookstore",
        ]

        rules = (
            Rule(SgmlLinkExtractor(allow=(........,))),
        )

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
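Something like the pattern below is what I have in mind (just a sketch, I am not sure it is correct). I checked it with Python's re module:

    import re

    # hypothetical pattern: \d+ should cover any page number, and the
    # literal '?' in the URL has to be escaped
    pattern = re.compile(r'/bookstore/new/\d+\?filter=bookstore')

    print bool(pattern.search("http://www.example.com/bookstore/new/12?filter=bookstore"))  # prints True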

Can you please guide me on how to construct a CrawlSpider Rule for the above start_urls list?

– Shiva Krishna Bavandla

2 Answers


If I understand you correctly, you want a lot of start URLs with a certain pattern.

If so, you can override the BaseSpider.start_requests method:

from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # build the start URLs programmatically instead of listing them all
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

    ...
– warvariuc
  • Thank you very much, it's very useful for me; I got my output. – Shiva Krishna Bavandla May 25 '12 at 04:05
  • Sure, I will vote. How can we make that value for xrange unlimited? Actually I have 20 items per page at most (if extra items are added on the page, a URL with an incremented integer is created, as shown in the above example), and the integer present in the URL will continue to increase. So is there a way to make that range infinite? – Shiva Krishna Bavandla May 25 '12 at 06:06
  • An infinite number means infinite processing time. If you are still sure, make a loop: `i = 0; while True: yield ...` – warvariuc May 25 '12 at 08:13
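For illustration, warvariuc's `while True` suggestion could look like the sketch below. It is only a sketch: the crawl will not stop on its own, so in practice you would close the spider once a page comes back empty.

from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # open-ended generator: yields page 0, 1, 2, ... with no upper bound;
        # Scrapy consumes it lazily, but nothing here ever stops the crawl
        i = 0
        while True:
            yield self.make_requests_from_url(
                "http://www.example.com/bookstore/new/%d?filter=bookstore" % i)
            i += 1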

If you are using CrawlSpider, it's usually not a good idea to override the parse method.

A Rule object can filter the URLs you are interested in from the ones you do not care about.

See CrawlSpider in the docs for reference.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        # [0-9]+ so pages with more than one digit also match
        Rule(SgmlLinkExtractor(allow=('\/new\/[0-9]+\?',)), callback='parse_bookstore'),
    )

    def parse_bookstore(self, response):
        hxs = HtmlXPathSelector(response)
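As a quick check of what the rule will follow, you can run the link extractor by hand, for example in `scrapy shell` (a sketch; `response` is the object scrapy shell gives you):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# inside `scrapy shell http://www.example.com/bookstore` a `response`
# object is already defined; extract_links() returns the Link objects
# the Rule above would schedule
for link in SgmlLinkExtractor(allow=('\/new\/[0-9]+\?',)).extract_links(response):
    print link.url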
– pjob