
I want to make a general scraper which can crawl and scrape all data from any type of website, including AJAX websites. I have searched the internet extensively but could not find any proper link that explains how Scrapy and Splash together can scrape AJAX websites (including pagination, form data, and clicking a button before the page is displayed). Every link I have found says that JavaScript websites can be rendered using Splash, but there is no good tutorial/explanation of using Splash to render JS websites. Please don't give me solutions that involve driving a browser (I want to do everything programmatically; headless browser suggestions are welcome, but I want to use Splash).

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class FlipSpider(CrawlSpider):
    name = "flip"
    allowed_domains = ["www.amazon.com"]

    start_urls = ['https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=mobile']

    rules = (Rule(LinkExtractor(), callback='lol', follow=True),)

    def parse_start_url(self, response):
        # Re-request the start URL through Splash so its JavaScript is rendered.
        yield scrapy.Request(response.url,
                             self.lol,
                             meta={'splash': {'endpoint': 'render.html',
                                              'args': {'wait': 5, 'iframes': 1}}})

    def lol(self, response):
        """
        Some code
        """
  • Have you followed [splash doc](https://github.com/scrapy-plugins/scrapy-splash#installation)? What is your problem exactly? – Adrien Blanquer Jun 08 '17 at 12:50
  • Yes I did. Splash doc just mentions the commands we can use. I want to know how to use them to run a website's javascript to get the dynamic content... – Rohan Jun 08 '17 at 12:53
  • Well if you don't have a specific question or problem about splash I won't copy paste the doc... If you refer to the doc you should be able to crawl a JavaScript based website – Adrien Blanquer Jun 08 '17 at 12:58
  • Okay. What I want to do is make a general scraper which can handle pagination (infinite scrolling), scraping data from form-filling pages, and clicking a button before the page is displayed. What I have read is that a POST request is sent which loads the data into the browser. I want to know how to make these POST requests with Splash for the above-mentioned problems. How to do this? – Rohan Jun 08 '17 at 13:08

3 Answers


The problem with Splash and pagination is the following:

I wasn't able to produce a Lua script that delivers the new webpage (after clicking the pagination link) in the form of a Response object rather than pure HTML.

So, my solution is the following: click the link, extract the newly generated URL, and direct the crawler to this new URL.

So, on the page that has the pagination link, I execute

yield SplashRequest(url=response.url, callback=self.get_url, endpoint="execute", args={'lua_source': script})

with the following Lua script:

def parse_categories(self, response):
    script = """
             function main(splash)
                 assert(splash:go(splash.args.url))
                 splash:wait(1)
                 splash:runjs('document.querySelectorAll(".next-page")[0].click()')
                 splash:wait(1)
                 return splash:url()  
             end
             """

and the get_url function

def get_url(self, response):
    # The Lua script returned splash:url() as plain text, so the response body
    # is the newly generated URL itself.
    yield SplashRequest(url=response.body_as_unicode(), callback=self.parse_categories)

This way I was able to loop my queries.

In the same way, if you don't expect a new URL, your Lua script can just produce pure HTML that you then have to work over with regex (which is bad), but this is the best I was able to do.
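
For reference, a minimal sketch of that HTML-returning variant (the spider name, the ".next-page" selector, and the item selector below are placeholder assumptions, and a Selector is used in place of the regex mentioned above):

import scrapy
from scrapy_splash import SplashRequest

# Same click-and-wait script as above, but returning the rendered markup.
html_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(1)
    splash:runjs('document.querySelectorAll(".next-page")[0].click()')
    splash:wait(1)
    return splash:html()
end
"""

class HtmlPaginationSpider(scrapy.Spider):
    name = "html_pagination_sketch"
    start_urls = ["https://example.com/listing"]

    def parse(self, response):
        yield SplashRequest(response.url, self.parse_rendered,
                            endpoint="execute",
                            args={"lua_source": html_script})

    def parse_rendered(self, response):
        # The string returned by the Lua script arrives as the response body;
        # wrapping it in a Selector avoids hand-rolled regex.
        sel = scrapy.Selector(text=response.text)
        for title in sel.css("h2.item-title::text").extract():
            yield {"title": title}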


You can emulate behaviors, like a click or a scroll, by writing a JavaScript function and telling Splash to execute that script when it renders your page.

A little example:

You define a JavaScript function that selects an element in the page and then clicks on it:

(source: splash doc)

# Get button element dimensions with javascript and perform mouse click.
_script = """
function main(splash)
    assert(splash:go(splash.args.url))
    local get_dimensions = splash:jsfunc([[
        function () {
            var rect = document.getElementById('button').getClientRects()[0];
            return {"x": rect.left, "y": rect.top}
        }
    ]])
    splash:set_viewport_full()
    splash:wait(0.1)
    local dimensions = get_dimensions()
    splash:mouse_click(dimensions.x, dimensions.y)

    -- Wait split second to allow event to propagate.
    splash:wait(0.1)
    return splash:html()
end
"""

Then, when you make the request, you modify the endpoint, set it to "execute", and add "lua_source": _script to the args.

Example:

def parse(self, response):
    yield SplashRequest(response.url, self.parse_elem,
                        endpoint="execute",
                        args={"lua_source": _script})

You will find all the information about Splash scripting here

  • Thanks! Good explanation. I was wondering if we can execute all the JavaScript on a webpage using Scrapy + Splash? – Rohan Jun 08 '17 at 14:02

I just answered a similar question here: scraping ajax based pagination. My solution is to get the current and last pages and then replace the page variable in the request URL.
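
Roughly, a sketch of that idea, assuming a hypothetical site whose pagination is driven by a page query parameter and whose current/last page numbers can be read from the markup (the URL, selectors, and parameter name are placeholders):

import scrapy
from w3lib.url import add_or_replace_parameter

class PageParamSpider(scrapy.Spider):
    name = "page_param_sketch"
    start_urls = ["https://example.com/search?page=1"]

    def parse(self, response):
        # ... extract and yield the items on the current page here ...

        # Read the current and last page numbers, then rewrite the page parameter.
        current = int(response.css(".pagination .current::text").extract_first("1"))
        last = int(response.css(".pagination .last::text").extract_first("1"))
        if current < last:
            next_url = add_or_replace_parameter(response.url, "page", str(current + 1))
            yield scrapy.Request(next_url, callback=self.parse)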

Also, the other thing you can do is look at the Network tab in the browser dev tools and see if you can identify any API that is called. If you look at the requests under XHR, you can see those that return JSON.

You can then call the API directly and parse the JSON/HTML response. Here is the link from the Scrapy docs: The Network-tool.
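
As a rough sketch, once such an endpoint is spotted under XHR, the spider can hit it directly (the URL and the JSON field names below are hypothetical):

import json
import scrapy

class XhrApiSpider(scrapy.Spider):
    name = "xhr_api_sketch"
    # Hypothetical JSON endpoint found on the Network tab of the dev tools.
    start_urls = ["https://example.com/api/products?page=1"]

    def parse(self, response):
        data = json.loads(response.text)
        for item in data.get("results", []):
            yield {"name": item.get("name"), "price": item.get("price")}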