
I don't understand how Scrapy rules work. Say I want to crawl a site, following the links that contain "category"; I then want to open the URLs that contain "product" and pass them through to a callback. How do I write this?

What is wrong with this?

rules = (
    Rule(SgmlLinkExtractor(allow=r'.*?categoryId.*'), follow=True),
    Rule(SgmlLinkExtractor(allow=r'.*?productId.*'), callback='parse_item'),
)

I'm getting the following error:

    Traceback (most recent call last):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
        for x in result:
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 128, in extract_links
        links = self._extract_links(body, response.url, response.encoding, base_url)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
        self.feed(response_text)
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 104, in feed
        self.goahead(0)
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 174, in goahead
        k = self.parse_declaration(i)
      File "/home/scraper/.fakeroot/lib/python2.7/markupbase.py", line 140, in parse_declaration
        "unexpected %r char in declaration" % rawdata[j])
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 111, in error
        raise SGMLParseError(message)
    sgmllib.SGMLParseError: unexpected '=' char in declaration
  • It looks OK, except that a second `follow=True` is necessary only if you want to follow the "category" links within the "product" page. You could take out the first `follow=True`, as the default behavior is to follow links when no callback is defined. Have you encountered any problem with these rules? – R. Max Jan 13 '14 at 02:59
  • sgmllib.SGMLParseError: unexpected '=' char in declaration – Crypto Jan 13 '14 at 07:32
  • Updated with full traceback. – Crypto Jan 13 '14 at 08:03
  • 1
    I think it has to do with the HTML response, not the way you declare you crawl rules. They look fine. – Balthazar Rouberol Jan 13 '14 at 08:09
  • How do I go about fixing that? When I tried the rules on another website, they did work. So something strange is going on with this website. – Crypto Jan 13 '14 at 09:06
  • @Crypto here is a related question: http://stackoverflow.com/questions/12352674/python-unable-to-retrieve-form-with-urllib-or-mechanize You could use a middleware to modify the response's body, replacing the offending HTML declaration. It would be helpful if you could share either the web page or the piece of HTML code that causes this error. This might be fixed in future versions of Scrapy. – R. Max Jan 13 '14 at 16:47
  • The site in question is gnc.com. The site is already returning an invalid encoding in the Content-Type header, so I used a middleware to change that to utf-8 in order to get around the UnicodeEncodingError. But then I'm getting these new errors. – Crypto Jan 13 '14 at 19:10
  • I wrote a middleware to remove the DOCTYPE declaration from response.body and it is working properly now. The DOCTYPE declaration seems to be correct though, so I'm not sure if this is a bug in scrapy. – Crypto Jan 14 '14 at 06:23
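
For reference, a minimal sketch of such a DOCTYPE-stripping downloader middleware might look like the following. The class and module names are hypothetical, and the regex assumes a simple `<!DOCTYPE ...>` declaration with no internal subset:

    import re

    class StripDoctypeMiddleware(object):
        """Hypothetical middleware: strip the DOCTYPE declaration so the
        sgmllib-based link extractor doesn't choke while parsing the body."""

        _doctype = re.compile(r'<!DOCTYPE[^>]*>', re.IGNORECASE)

        def process_response(self, request, response, spider):
            if 'text/html' in response.headers.get('Content-Type', ''):
                # Response objects are immutable; replace() returns a copy
                # with the modified body.
                return response.replace(body=self._doctype.sub('', response.body))
            return response

It would then be enabled in settings.py (the priority number 543 is arbitrary):

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.StripDoctypeMiddleware': 543,
    }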

1 Answer


Try passing each `allow` pattern inside a tuple instead of as a bare string:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'.*?categoryId.*',)), follow=True),
    Rule(SgmlLinkExtractor(allow=(r'.*?productId.*',)), callback='parse_item'),
)
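
For context, a minimal CrawlSpider wiring these rules together might look like the sketch below (the spider name, domain, and `parse_item` body are placeholders):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class ProductSpider(CrawlSpider):
        name = 'products'                     # placeholder
        allowed_domains = ['example.com']     # placeholder
        start_urls = ['http://example.com/']

        rules = (
            # No callback: just follow "category" links to reach more pages.
            Rule(SgmlLinkExtractor(allow=(r'.*?categoryId.*',)), follow=True),
            # Callback set: scrape "product" pages; follow defaults to False here.
            Rule(SgmlLinkExtractor(allow=(r'.*?productId.*',)), callback='parse_item'),
        )

        def parse_item(self, response):
            # Extraction logic goes here.
            pass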