
I have integrated scrapy-splash into my CrawlSpider via process_request in the rules, like this:

def process_request(self, request):
    request.meta['splash'] = {
        'args': {
            # set rendering arguments here
            'html': 1,
        }
    }
    return request
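
For reference, the rule is wired up roughly like this (the spider name, start URL and LinkExtractor settings below are placeholders, not my real ones):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = 'my_spider'
    start_urls = ['https://example.com']

    rules = (
        # follow=True so extracted links keep being crawled;
        # process_request refers to the method above by name.
        Rule(LinkExtractor(), callback='parse_item',
             process_request='process_request', follow=True),
    )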

The problem is that the crawl renders only the URLs at the first depth. I also wonder how I can get the response even when it has a bad HTTP status code or is a redirect.

Thanks in advance,

1 Answer

Your problem may be related to this: https://github.com/scrapy-plugins/scrapy-splash/issues/92

In short, the likely cause is that CrawlSpider's built-in link following (_requests_to_follow) only accepts plain HtmlResponse objects, while Splash returns SplashTextResponse/SplashJsonResponse, so no links are extracted past the first depth. As a workaround, try adding the link following to your parsing callback yourself:

# At the top of your spider module:
from scrapy.http import HtmlResponse
from scrapy_splash import SplashRequest, SplashTextResponse

def parse_item(self, response):
    """Parse response into an item and also create new requests."""
    page = RescrapItem()
    ...
    yield page

    # Re-implement CrawlSpider's link following here, accepting Splash
    # responses and emitting SplashRequest so every depth gets rendered.
    if isinstance(response, (HtmlResponse, SplashTextResponse)):
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = SplashRequest(url=link.url, callback=self._response_downloaded,
                                  args=SPLASH_RENDER_ARGS)
                r.meta.update(rule=rule, link_text=link.text)
                yield rule.process_request(r)
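
Here SPLASH_RENDER_ARGS is assumed to be the same rendering-arguments dict you already set in process_request, e.g.:

SPLASH_RENDER_ARGS = {'html': 1}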

In case you wonder why this can return both items and new requests, here is the relevant passage from the docs: https://doc.scrapy.org/en/latest/topics/spiders.html

In the callback function, you parse the response (web page) and return either dicts with extracted data, Item objects, Request objects, or an iterable of these objects. Those Requests will also contain a callback (maybe the same) and will then be downloaded by Scrapy and then their response handled by the specified callback.
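
In other words, a single callback may freely mix the two. A minimal sketch (the item fields and next-page selector are made up):

def parse_item(self, response):
    # Yield extracted data as a dict (or an Item object)...
    yield {'title': response.css('h1::text').get()}

    # ...and, from the same callback, yield follow-up requests.
    next_page = response.css('a.next::attr(href)').get()
    if next_page is not None:
        yield response.follow(next_page, callback=self.parse_item)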

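As for getting a response even with a bad HTTP status code or a redirect: that is controlled by Scrapy's HttpError and Redirect middlewares rather than by Splash. A minimal sketch of the usual knobs (the status codes are just examples):

class MySpider(CrawlSpider):
    # Let these non-2xx responses reach your callbacks instead of being
    # filtered out by the HttpError middleware.
    handle_httpstatus_list = [404, 500]

    def process_request(self, request):
        request.meta['splash'] = {'args': {'html': 1}}
        # Per-request alternatives: let every status code through...
        request.meta['handle_httpstatus_all'] = True
        # ...and keep 3xx responses instead of following the redirect.
        request.meta['dont_redirect'] = True
        return request
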
– Hieu