
I'm new to using Scrapy and I wanted to understand how the rules are being used within the CrawlSpider.

If I have a rule where I'm crawling through the Yellow Pages for cupcake listings in Tucson, AZ, how does yielding a URL request activate the rule - specifically, how does it activate the restrict_xpaths attribute?

Thanks.

OfLettersAndNumbers
  • Rule? Or XPath? What do you mean? – Nabin Aug 17 '14 at 07:50
  • @Nabin, my question was that I didn't understand how the spider uses "Rule" when it's crawling through the webpage. When it sends the HTTP GET request and receives the page back, does it run through the rules first, or does the callback get triggered first? – OfLettersAndNumbers Aug 22 '14 at 21:10

1 Answer


The rules attribute of a CrawlSpider specifies how to extract links from a page and which callbacks should be called for those links. They are handled by the default parse() method implemented in that class -- see scrapy/contrib/spiders/crawl.py in the Scrapy source.

So, whenever you want to trigger the rules for a URL, you just need to yield a scrapy.Request(url, self.parse), and the Scrapy engine will send a request to that URL and apply the rules to the response.
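
For instance, here is a minimal sketch of that idea. The spider name, URL and XPath below are made up for illustration, and the import paths match the example further down (newer Scrapy versions expose them as scrapy.spiders and scrapy.linkextractors instead):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class CupcakeSpider(CrawlSpider):
    # everything here (name, URL, XPath) is a placeholder
    name = 'cupcakes'
    start_urls = ['http://www.yellowpages.com/tucson-az/cupcakes']
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//div[@class='search-results']"),
             callback='parse_listing'),
    )

    def parse_listing(self, response):
        # scrape the listing here; to push another URL through the rules,
        # hand it back to the engine with self.parse as the callback:
        yield scrapy.Request('http://www.yellowpages.com/tucson-az/cupcakes?page=2',
                             self.parse)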

The extraction of the links (which may or may not use restrict_xpaths) is done by the LinkExtractor object registered for that rule. It basically searches for all the <a> and <area> elements in the whole page, or only inside the elements obtained by applying the restrict_xpaths expressions, if that attribute is set.
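
You can see this behaviour in isolation, for example from a scrapy shell session (a small sketch; the XPath is just an example, and `response` is the object the shell gives you):

from scrapy.contrib.linkextractors import LinkExtractor  # scrapy.linkextractors in newer versions

# only <a>/<area> elements inside the matched <ul> are considered
extractor = LinkExtractor(restrict_xpaths="//ul[@class='menu-categories']")
for link in extractor.extract_links(response):
    print(link.url, link.text)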

Example:

Say you have a CrawlSpider like so:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'myspider'  # every spider needs a name
    start_urls = ['http://someurlhere.com']
    rules = (
        # follow the links inside the category/subcategory menus, routing
        # their responses back through parse() so the rules run on them too
        Rule(
            LinkExtractor(restrict_xpaths=[
                "//ul[@class='menu-categories']",
                "//ul[@class='menu-subcategories']"]),
            callback='parse'
        ),
        # product page links (allow takes a regex, so '.' and '?' are escaped)
        Rule(
            LinkExtractor(allow=r'/product\.php\?id=\d+'),
            callback='parse_product_page'
        ),
    )

    def parse_product_page(self, response):
        # extract and yield the product item here
        pass
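
If you saved this as, say, myspider.py (and pointed start_urls at a real site), you could run it with scrapy runspider myspider.py and watch in the log which requests the rules generate.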

The engine starts by sending requests to the URLs in start_urls, executing the default callback (the parse() method of CrawlSpider) on their responses.

For each response, the parse() method runs the link extractors on it to get the links from the page: it calls LinkExtractor.extract_links(response) for each rule to get the URLs, and then yields scrapy.Request(url, <rule_callback>) objects.
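
In other words (a simplified paraphrase of that logic, not the actual Scrapy source, which also de-duplicates links and applies the process_links/process_request hooks):

import scrapy

def follow_rules(spider, response):
    # simplified version of what CrawlSpider does with its rules for a response
    for rule in spider._rules:
        for link in rule.link_extractor.extract_links(response):
            # each extracted link becomes a new request wired to the callback
            # registered for that rule
            yield scrapy.Request(link.url, callback=rule.callback)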

The example code is a skeleton for a spider that crawls an e-commerce site, following the links of product categories and subcategories to get the links for each of the product pages.

With the rules registered in this particular spider, it would crawl the links inside the "categories" and "subcategories" lists with the parse() method as the callback (which triggers the crawl rules again for those pages), and the links matching the regular expression /product\.php\?id=\d+ with the callback parse_product_page() -- which finally scrapes the product data.

As you can see, pretty powerful stuff. =)


Elias Dorneles
    Thanks @elias for your help. This is extremely helpful! A few follow up questions: **1)** Does LinkExtractor essentially parse through the html response and search for links that fall under "//ul[@class='menu-categories']" and "//ul[@class='menu-subcategories']"? Won't that return a large list of results? **2)** What does it mean when you have "//" under the restrict_xpaths? Does that mean it's under "/html/body"? – OfLettersAndNumbers Aug 18 '14 at 08:51
    @SammyLee: You're welcome. **1)** Yes, that's precisely what it does. It may return lots of results, depending on the pages. You don't need to worry about duplication though, because the framework has a dupefilter. **2)** In Xpath, `//elem` means all the elements with name `elem` in the document. This is a nice XPath tutorial: http://zvon.org/comp/r/tut-XPath_1.html Btw, if my answer was useful, you can mark it as accepted. – Elias Dorneles Aug 18 '14 at 12:36
  • Thanks for your response. For the links that do match the restrict_xpaths, where and how are they stored? Just curious to see how they are grabbed to create the next request to yield. – OfLettersAndNumbers Aug 18 '14 at 16:25
  • @SammyLee AFAIK, the links only live in memory. You can see the code that does the extraction and yield the requests here: https://github.com/scrapy/scrapy/blob/master/scrapy/contrib/spiders/crawl.py#L47 – Elias Dorneles Aug 18 '14 at 17:10
  • @elias Can you please expand on the order in which the `LinkExtractors` are evaluated? Thank you. Also, I have created a question regarding `CrawlSpider` if you can take a look at it when you get a chance: http://stackoverflow.com/questions/43417048/in-which-order-do-the-rules-get-evaluated-in-the-crawlspider –  Apr 14 '17 at 18:41