Is it possible to take a URL and use it with a regular expression to generate requests (Scrapy)

Question

I wanted to ask whether Scrapy has an option to crawl websites using only a URL and regular expressions. When you want to extract certain information, you usually (though not always) need rules to extract links and follow them to the page where the information is. What I mean is: is it possible to take a URL, use it with regular expressions to generate requests, and then parse the results?

For example, let's take this URL:

http://www.example.com/date/2014/news/117

Let's say that each article is identified by the last part of the URL, “/117”. To my mind it would be easier to write a regular expression for the URL:

http://www.example.com/date/2014/news/\d+

If you could make HTTP requests to the pages matching this regular expression, it would make life very simple in some cases. Is there such a way?

What will that regular expression be matched against? Where do you intend to get the URLs to try to match it? — Guy Gavriely, Mar 07 '14 at 23:48
That is the answer I am looking for. Is there such an option? An example of such a solution would be nice too. I don't have any idea at the moment. start_urls could be a likely option, but I am just guessing. — Vy.Iv, Mar 08 '14 at 00:35
http://stackoverflow.com/questions/10738560/constructing-a-regular-expression-for-url-in-strat-url-list-in-scrapy-framework/10742895#10742895 — warvariuc, Mar 08 '14 at 08:20

score 1 · Answer 1 · answered Mar 08 '14 at 00:43

CrawlSpider with the right link extractor can do just that; see this example from the Scrapy docs:

# Import paths as used in the Scrapy releases of that period.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    ...
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    ...
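
Note that a regular expression can only accept or reject URLs the spider has already found; it cannot enumerate pages by itself. Here is a minimal sketch in plain Python (no Scrapy import), assuming the article IDs from the question are plain integers. The names filter_article_urls and candidate_urls, the ID range, and the sample links are all hypothetical, for illustration only. The first function mimics roughly what an allow pattern does to links extracted from a page; the second shows the alternative of building the URLs directly when the IDs are known in advance.

```python
import re

# Pattern from the question, with the scheme written as "http://" and dots escaped.
# It is anchored here so that only whole URLs match (an assumption of this sketch).
ARTICLE_URL = re.compile(r"^http://www\.example\.com/date/2014/news/\d+$")


def filter_article_urls(urls):
    """Keep only the URLs that look like article pages."""
    return [u for u in urls if ARTICLE_URL.search(u)]


def candidate_urls(first_id, last_id):
    """Build article URLs for a hypothetical inclusive range of numeric IDs."""
    template = "http://www.example.com/date/2014/news/{}"
    return [template.format(i) for i in range(first_id, last_id + 1)]


# Hypothetical links, as if they had been extracted from a listing page.
links_found_on_page = [
    "http://www.example.com/date/2014/news/117",
    "http://www.example.com/date/2014/news/",
    "http://www.example.com/about",
]

print(filter_article_urls(links_found_on_page))
print(candidate_urls(117, 119))
```

In a spider, a list like the one returned by candidate_urls could serve as start_urls, while the pattern itself could go into the allow argument of a rule like the ones above.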
