I wanted to ask whether Scrapy has an option to crawl websites using only a URL and regular expressions. When you want to extract certain information, you usually (though not always) need rules to extract links and follow those links to the page that holds the information you need. What I mean is: is it possible to take a URL, combine it with a regular expression to generate requests, and then parse the results?

For example, let's take this URL:

http://www.example.com/date/2014/news/117

Let's say that each article is identified by the last part of the URL, “/117”. To my mind, it would be easier to write a regular expression for the URL:

http://www.example.com/date/2014/news/\d+

If you could make HTTP requests to the pages that match this regular expression, it would make life very simple in some cases. Is there such a way?

Vy.Iv
  • What will that regular expression be matched against? Where do you intend to take the URLs from to try to match it? – Guy Gavriely Mar 07 '14 at 23:48
  • That is the answer I am looking for. Is there such an option? An example of such a solution would be nice too. I don't have any idea at this moment. start_urls could be a likely option, but I am just guessing. – Vy.Iv Mar 08 '14 at 00:35
  • http://stackoverflow.com/questions/10738560/constructing-a-regular-expression-for-url-in-strat-url-list-in-scrapy-framework/10742895#10742895 – warvariuc Mar 08 '14 at 08:20

1 Answer

CrawlSpider with the right link extractor can do just that; see this example from the Scrapy docs:

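# Import paths for the Scrapy releases current in 2014. Newer releases
# provide scrapy.spiders.CrawlSpider/Rule and scrapy.linkextractors.LinkExtractor.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    ...
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )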

    ...
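
To connect this back to the URL in the question: a regular expression can only select URLs, it cannot produce them, so the candidate URLs have to come from somewhere. Below is a minimal standard-library sketch of the two usual sources, under stated assumptions. The link list is made up for illustration, the helper names `filter_article_links` and `candidate_urls` are hypothetical, and the second helper assumes article ids are sequential integers, which the question does not confirm.

```python
import re

# Pattern from the question, written with an http:// scheme and escaped dots.
ARTICLE_URL = re.compile(r"http://www\.example\.com/date/2014/news/\d+$")


def filter_article_links(links):
    """Keep only links that match the article pattern.

    Uses re.search, which is also how a link extractor's `allow`
    patterns are tested against each extracted URL.
    """
    return [url for url in links if ARTICLE_URL.search(url)]


def candidate_urls(first_id, last_id):
    """Build article URLs from a numeric range (assumes sequential ids)."""
    template = "http://www.example.com/date/2014/news/{}"
    return [template.format(i) for i in range(first_id, last_id + 1)]


# Hypothetical links, as they might be found on a listing page.
found = [
    "http://www.example.com/date/2014/news/117",
    "http://www.example.com/date/2014/news/118",
    "http://www.example.com/date/2014/about",
    "http://www.example.com/date/2013/news/5",
]

print(filter_article_links(found))
print(candidate_urls(117, 119))
```

In a spider, a list built like `candidate_urls(...)` could be assigned to `start_urls`, while the filtering step is what a rule's `allow` pattern does for you on links found while crawling.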
Guy Gavriely