
I am fetching data from a page that uses JavaScript to link to new pages. I am using Scrapy + Splash to fetch this data; however, for some reason, the links are not being followed.

Here is the code for my spider:

import scrapy
from scrapy_splash import SplashRequest

script = """
    function main(splash, args)
        local javascript = args.javascript
        assert(splash:runjs(javascript))
        splash:wait(0.5)

        return {
               html = splash:html()
        }
    end
"""


page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=0&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"


class MySpider(scrapy.Spider):
    name = "foo_crawler"          
    download_delay = 5.0

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        #'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'
    }




    def start_requests(self):
        yield SplashRequest(url=page_url, 
                                callback=self.parse
                            )



    # Parses first page of ticker, and processes all maturities
    def parse(self, response):
        try:
            self.extract_data_from_page(response)

            href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href')
            print("href: {0}".format(href))

            if href:
                javascript = href.extract_first().split(':')[1].strip()

                yield SplashRequest(response.url, self.parse, 
                                    cookies={'store_language':'en'},
                                    endpoint='execute', 
                                    args = {'lua_source': script, 'javascript': javascript })

        except Exception as err:
            print("The following error occured: {0}".format(err))



    def extract_data_from_page(self, response):
        url = response.url
        page_num = url.split('page=')[1].split('&')[0]
        print("extract_data_from_page() called on page: {0}.".format(url))
        filename = "page_{0}.html".format(page_num)
        with open(filename, 'w') as f:
            f.write(response.text)




    def handle_error(self, failure):
        print("Error: {0}".format(failure))

Only the first page is fetched, and I'm unable to get the subsequent pages by 'clicking' through the links at the bottom of the page.

How do I fix this so I can click through the pages given at the bottom of the page?


2 Answers


Your code looks fine. The only issue is that, since the yielded requests have the same URL, they are ignored by the duplicate filter. Just uncomment the DUPEFILTER_CLASS setting and try again:

custom_settings = {
    ...
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}
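Alternatively, if you would rather keep the default duplicate filter for other requests, a minimal (untested) variant of the question's parse method marks just the paging request as exempt with dont_filter=True, a standard scrapy.Request argument that SplashRequest passes through:

def parse(self, response):
    ...
    if href:
        javascript = href.extract_first().split(':')[1].strip()
        # dont_filter=True exempts this request from the duplicate filter,
        # so repeated requests to the same URL are not dropped.
        yield SplashRequest(response.url, self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='execute',
                            args={'lua_source': script, 'javascript': javascript},
                            dont_filter=True)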

EDIT: to browse the data pages without running JavaScript, you can do it like this:

page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=%s&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"

page_number_regex = re.compile(r"'frmRow',(\d+),")
...
def start_requests(self):
    yield SplashRequest(url=page_url % 0,
                        callback=self.parse)
...
if href:
    javascript = href.extract_first().split(':')[1].strip()
    matched = re.search(self.page_number_regex, javascript)
    if matched:
        yield SplashRequest(page_url % matched.group(1), self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='execute',
                            args={'lua_source': script, 'javascript': javascript})

I'm looking forward to a solution using javascript though.

  • Yes, the yielded requests have the same URL - which is different behaviour from the browser. It seems the javascript is not being run. The problem is (as you quite rightly spotted), that the URL does not change - whereas it does when links are 'clicked' in the browser. The challenge is how to replicate this 'clicking' behaviour using Scrapy + Splash – Homunculus Reticulli Feb 26 '19 at 12:39
  • Ok sorry I was slow to grasp the issue. I don't have a solution to make scrapy run the javascript atm, but if you want to browse all data pages, you can simply extract the next page number from the javascript fragment and pass it in the yielded request. See my answer below – matthieu.cham Feb 26 '19 at 13:57

You can use the page query-string parameter. It starts at 0, so the first page is page=0. You can find the total number of pages by looking at:

<div class="paging">
  <p class="floatsx">&nbsp;Page 1 of 157 </p>
</div>

That way you know to call pages 0-156.
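
If you go that route, here is a minimal, untested sketch of that idea; the spider name PagedSpider, the helper parse_first, and base_url are made up for illustration, and it assumes the table data is present in the Splash-rendered HTML. It requests page=0 first, reads the total from the "Page 1 of 157" text, and then yields one request per remaining page:

import re

import scrapy
from scrapy_splash import SplashRequest

base_url = ("https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/"
            "exchange-insight/trade-data.html?page={page}&pageOffBook=0"
            "&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1")


class PagedSpider(scrapy.Spider):
    name = "paged_trade_data"  # hypothetical name

    def start_requests(self):
        # Fetch page 0 first so the total page count can be read from it.
        yield SplashRequest(base_url.format(page=0), callback=self.parse_first)

    def parse_first(self, response):
        # The paging div reads e.g. "Page 1 of 157".
        text = response.xpath('//div[@class="paging"]/p[@class="floatsx"]/text()').extract_first() or ""
        match = re.search(r"of\s+(\d+)", text)
        total = int(match.group(1)) if match else 1

        yield from self.parse(response)
        # page is zero-indexed, so the remaining pages are 1 .. total-1.
        for page in range(1, total):
            yield SplashRequest(base_url.format(page=page), callback=self.parse)

    def parse(self, response):
        # Extract the trade data from each page here.
        yield {"url": response.url}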
