
I ran into a problem while crawling an entire website, using Splash to render each target page. Some pages were not rendered completely, so I failed to get information that should have been there once the render job had finished. In other words, I only get part of the information from some render results, although I can get the complete information from others.

Here is my code:

yield SplashRequest(url, self.splash_parse, args={"wait": 3}, endpoint="render.html")

settings:
SPLASH_URL = 'XXX'  
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
Brook

1 Answer


I am replying to this late because the question has no answer and because it shows up in Google search results.

I had a similar problem, and the only solution I found (besides increasing the wait argument, which may or may not work and is not reliable) is to use the execute endpoint with a custom Lua script that waits for a specific element. If this sounds unnecessarily complex, it is. Scrapy and Splash are not well designed in my opinion, but I have not found anything better for my needs yet.

My Lua script looks something like this:

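lua_base = '''
function main(splash)
  splash:init_cookies(splash.args.cookies)
  splash:go(splash.args.url)

  -- Poll until the target element appears in the DOM
  while not splash:select("{}") do
    splash:wait(0.1)
  end
  -- Brief extra pause before taking the snapshot
  splash:wait(0.1)
  return {{
    cookies = splash:get_cookies(),
    html = splash:html(),
  }}
end
'''
# CSS selector of the element to wait for (example value)
css = 'table > tr > td.mydata'
# The doubled braces in lua_base are literal braces; the single {} receives the selector
lua_script = lua_base.format(css)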

I then generate requests like this:

        yield SplashRequest(link, self.parse, endpoint='execute',
                            args={
                                'wait': 0.1,
                                'images': 0,
                                'lua_source': lua_script,
                            })

It is very ugly, but it works.
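
One caveat with the loop above: if the element never appears, the script keeps polling until Splash itself aborts the request. A minimal sketch of a bounded variant follows. The names LUA_WAIT_SCRIPT, splash_wait_args, css, max_wait and poll are my own and purely illustrative. The selector is passed through splash.args instead of str.format, so no brace doubling or quote escaping is needed.

```python
# Lua script that waits for an element, but gives up after max_wait seconds.
# css, max_wait and poll are hypothetical argument names read from splash.args.
LUA_WAIT_SCRIPT = '''
function main(splash)
  splash:init_cookies(splash.args.cookies)
  splash:go(splash.args.url)

  -- Poll for the element, raising an error once max_wait is exceeded
  local waited = 0
  while not splash:select(splash.args.css) do
    if waited >= splash.args.max_wait then
      error("timed out waiting for " .. splash.args.css)
    end
    splash:wait(splash.args.poll)
    waited = waited + splash.args.poll
  end

  return {
    cookies = splash:get_cookies(),
    html = splash:html(),
  }
end
'''


def splash_wait_args(css, max_wait=10.0, poll=0.1):
    """Build the args dict for a SplashRequest to the execute endpoint."""
    return {
        'lua_source': LUA_WAIT_SCRIPT,
        'css': css,            # selector the script waits for
        'max_wait': max_wait,  # seconds before the script raises an error
        'poll': poll,          # seconds between checks
    }
```

It would be used as SplashRequest(link, self.parse, endpoint='execute', args=splash_wait_args('td.mydata')). When the wait times out, the Lua error makes Splash return an error response instead of the page, so the spider needs an errback or some other handling for that case.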

comodoro