I was exploring Scrapy + Splash and ran into an issue: SplashRequest is not rendering the JavaScript and returns exactly the same response as scrapy.Request. The webpage I want to scrape is this. I want some fields from the webpage for my course project.

I am unable to get the final HTML after the JavaScript is rendered, even after waiting with 'wait': '30'. In fact, the result is the same as with scrapy.Request. The same code works perfectly for another website that I have tried, i.e. this, so I believe the settings are fine.

This is the spider definition:

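import scrapy
from scrapy_splash import SplashRequest
from bs4 import BeautifulSoup

from .. import IndeedItem


class IndeedSpider(scrapy.Spider):
    name = "indeed"

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider finish its own setup before adding attributes.
        super().__init__(*args, **kwargs)
        # Headers forwarded with the request that Splash makes to the site.
        self.headers = {
            "Host": "www.naukri.com",
            "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0",
        }

    def start_requests(self):
        # Ask Splash for the rendered page; 'wait' is the delay in seconds
        # after the page loads.
        yield SplashRequest(
            url="https://www.naukri.com/job-listings-Sr-Python-Developer-Rackspace-Gurgaon-4-to-9-years-270819005015",
            endpoint='render.html',
            headers=self.headers,
            args={
                'wait': 3,
            },
        )

    def parse(self, response):
        # Name the parser explicitly so BeautifulSoup does not have to guess one.
        soup = BeautifulSoup(response.body, 'html.parser')
        it = IndeedItem()
        # For now the whole parsed document is stored, to inspect what Splash returned.
        it['job_title'] = soup
        yield it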

The relevant part of settings.py is:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810
}


SPLASH_URL = 'http://localhost:8050/'

And the output file is here

I do not know what to make of the output; it has embedded JavaScript in it. Opening it in a browser shows that very little has been rendered (the title only). How can I get the rendered HTML for this website? Any help is much appreciated.
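To put a number on "very little has been rendered", a small helper can summarise a response: its title, how many script tags it contains, and how much visible text there is outside scripts and styles. This is only a sketch using the standard library; the names PageSummary and summarize are made up for illustration and are not part of Scrapy or Splash.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collect the title, the <script> count and the amount of visible text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.script_count = 0
        self.text_chars = 0
        self._open = []  # open <script>, <style> or <title> tags

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style", "title"):
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        current = self._open[-1] if self._open else None
        if current == "title":
            self.title += data.strip()
        elif current is None:
            # Text outside script/style/title counts as visible text.
            self.text_chars += len(data.strip())


def summarize(html):
    """Return a small dict describing how much of the page is real content."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title,
        "scripts": parser.script_count,
        "text_chars": parser.text_chars,
    }
```

Calling summarize(response.text) in parse for both the SplashRequest and the plain scrapy.Request response would show whether the two really are identical, and whether the page is mostly script with almost no visible text.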

Fenil
  • Have you tried using Splash directly, not through scrapy-splash? Does it render the website as expected? – Gallaecio Dec 18 '19 at 14:22
  • Yes @Gallaecio, I have. Running it directly from Splash gives me a Captcha; it's the same when I run Scrapy-Splash without `User-Agent`. Adding `User-Agent` to the SplashRequest gives the output in the question above. I'm assuming that adding `User-Agent` helps bypass the Captcha. – Fenil Dec 19 '19 at 04:28
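The direct check discussed in the comments can also be done from a script, with the same `User-Agent`, so that scrapy-splash is taken out of the picture. The sketch below assumes Splash is listening on http://localhost:8050 as in settings.py, and relies on the render.html endpoint accepting a JSON POST body with `url`, `wait` and `headers`. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed to be the same Splash instance as SPLASH_URL in settings.py.
SPLASH_URL = "http://localhost:8050"


def build_render_request(page_url, user_agent, wait=3.0, splash_url=SPLASH_URL):
    """Build a POST request for Splash's render.html endpoint."""
    payload = {
        "url": page_url,
        "wait": wait,
        # Custom headers are only honoured by Splash for JSON POST requests.
        "headers": {"User-Agent": user_agent},
    }
    return urllib.request.Request(
        splash_url.rstrip("/") + "/render.html",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_rendered_html(page_url, user_agent, wait=3.0):
    """Send the request to a running Splash instance and return the HTML."""
    request = build_render_request(page_url, user_agent, wait)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8", errors="replace")
```

If fetch_rendered_html returns the same script-heavy page as the spider, the problem would seem to lie in how Splash renders this site rather than in the scrapy-splash settings.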

0 Answers