
Recently, I've been learning to use Scrapy with Splash to crawl dynamic websites.

Here is my spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest

class InfoSpider(scrapy.Spider):
    name = 'info'
    url = 'https://hackerone.com/kubernetes'

    def start_requests(self):
        yield SplashRequest(
            self.url,
            callback=self.parse,
            endpoint='render.html',
            args={
                'wait': 15
            })

    def parse(self, response):
        result = response.css('strong span').getall()
        self.log(result)
        if result:
            self.log("FOUND!")
        else:
            self.log("NOT FOUND!")

However, the response returned by Splash is still not the same as the one I inspect in the browser.

The settings for Splash are correct, as I have tested it on localhost:8050. Here is the content of my settings.py:

BOT_NAME = 'hackerone'

SPIDER_MODULES = ['hackerone.spiders']
NEWSPIDER_MODULE = 'hackerone.spiders'

SPLASH_URL = 'http://localhost:8050'

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0'

ROBOTSTXT_OBEY = False

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware':
    810,
}

The output in PowerShell:

scrapy crawl info
2020-01-15 15:41:42 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: hackerone)
2020-01-15 15:41:42 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, 
w3lib 1.21.0, Twisted 19.10.0, Python 3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-01-15 15:41:42 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'hackerone', 'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter', 'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage', 'NEWSPIDER_MODULE': 'hackerone.spiders', 'SPIDER_MODULES': ['hackerone.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0'}
2020-01-15 15:41:42 [scrapy.extensions.telnet] INFO: Telnet Password: 9895b012d3e5c3ae
2020-01-15 15:41:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-01-15 15:41:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy_splash.SplashCookiesMiddleware',
 'scrapy_splash.SplashMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-01-15 15:41:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy_splash.SplashDeduplicateArgsMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-01-15 15:41:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-01-15 15:41:42 [scrapy.core.engine] INFO: Spider opened
2020-01-15 15:41:42 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-01-15 15:41:42 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-01-15 15:42:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://hackerone.com/kubernetes via http://localhost:8050/render.html> (referer: None)
2020-01-15 15:42:00 [info] DEBUG: []
2020-01-15 15:42:00 [info] DEBUG: NOT FOUND!
2020-01-15 15:42:00 [scrapy.core.engine] INFO: Closing spider (finished)
2020-01-15 15:42:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 4264,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 18.242975,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2020, 1, 15, 7, 42, 0, 625909),
 'log_count/DEBUG': 3,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'splash/render.html/request_count': 1,
 'splash/render.html/response_count/200': 1,
 'start_time': datetime.datetime(2020, 1, 15, 7, 41, 42, 382934)}
2020-01-15 15:42:00 [scrapy.core.engine] INFO: Spider closed (finished)

The output in Docker:

2020-01-15 07:42:00.504447 [events] {"path": "/render.html", "rendertime": 18.063398361206055, "maxrss": 339420, "load": [0.0, 0.0, 0.0], "fds": 66, "active": 0, "qsize": 0, "_id": 140145667498672, "method": "POST", "timestamp": 1579074120, "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0", "args": {"headers": {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"}, "url": "https://hackerone.com/kubernetes", "wait": 15, "uid": 140145667498672}, "status_code": 200, "client_ip": "172.17.0.1"}
2020-01-15 07:42:00.504754 [-] "172.17.0.1" - - [15/Jan/2020:07:42:00 +0000] "POST /render.html HTTP/1.1" 200 4141 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0"

I don't know what's wrong with the code. The elements are not present in the final HTML returned by Splash. Any advice would be highly appreciated.

JoeyLyu

1 Answer


When using SplashRequest, you need to tell Splash explicitly what you want back from the request. In your case you want the HTML produced after the JavaScript has rendered, so you can use a Lua script like this with the execute endpoint:

function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(15))
    return splash:html()
end
Ahmed Buksh