
I am writing a scrapy-splash program and I need to click the display button on the webpage, shown in the image below, in order to display the data for 10th Edition so I can scrape it. I have the code I tried below, but it does not work. The information I need is only accessible after clicking the display button. UPDATE: I am still struggling with this and have to believe there is a way to do it. I do not want to scrape the JSON directly because that could be a red flag to the site owners.

import scrapy
from ..items import NameItem

class LoginSpider(scrapy.Spider):
    name = "LoginSpider"
    start_urls = ["http://www.starcitygames.com/buylist/"]

    def parse(self, response):

        return scrapy.FormRequest.from_response(
            response,
            formcss='#existing_users form',
            formdata={'ex_usr_email': 'email123@example.com', 'ex_usr_pass': 'password123'},
            callback=self.after_login
        )


    def after_login(self, response):
        item = NameItem()
        display_button = response.xpath('//a[contains(., "- Display>>")]/@href').get()
        response.follow(display_button, self.parse)
        item["Name"] = response.css("div.bl-result-title::text").get()
        return item

Snapshot of Webpage HTML Code


1 Answer


Your code can't work because there is no anchor element and no href attribute to follow. Clicking the button sends an XMLHttpRequest to http://www.starcitygames.com/buylist/search?search-type=category&id=5061, and the data you want is in the JSON response.

  1. To check the request URL and response, open Dev Tools -> Network -> XHR and click Display.
  2. In the Headers tab you will find the request URL, and in the Preview or Response tabs you can inspect the JSON.
  3. As you can see, you'll need a category id to build the request URL. You can find it by parsing the script element matched by this XPath: //script[contains(., "categories")] (see the sketch after this list).
  4. Then you can send your request from the spider to http://www.starcitygames.com/buylist/search?search-type=category&id=5061 and get the data you want.
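
For step 3, here is a minimal sketch of pulling the category ids out of that script element. It assumes the categories are embedded in the script as a JSON-like array of objects with "name" and "id" keys, so the regular expression may need adjusting to the real markup:

import json
import re

def extract_category_ids(response):
    # Grab the inline script element that mentions the buylist categories.
    script = response.xpath('//script[contains(., "categories")]/text()').get()
    if not script:
        return {}
    # Assumption: the categories appear as a JSON array somewhere in the
    # script, e.g. categories = [{"name": "10th Edition", "id": "5061"}, ...].
    match = re.search(r'categories\s*[=:]\s*(\[.*?\])', script, re.DOTALL)
    if not match:
        return {}
    categories = json.loads(match.group(1))
    # Map category names (e.g. "10th Edition") to their ids.
    return {c["name"]: c["id"] for c in categories}

Once you have an id, you can confirm the endpoint directly: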
$ curl 'http://www.starcitygames.com/buylist/search?search-type=category&id=5061'
{"ok":true,"search":"10th Edition","results":[[{"id":"46269","name":"Abundance","subtitle":null,"condition":"NM\/M","foil":true,"is_parent":false,"language":"English","price":"20.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"},{"id":"176986","name":"Abundance","subtitle":null,"condition":"PL","foil":true,"is_parent":false,"language":"English","price":"12.000","rarity":"Rare","image":"cardscans\/MTG\/10E\/en\/foil\/Abundance.jpg"}....

As you can see, you don't even need to log in to the website or use Splash.
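
If you already know which categories you want, a minimal sketch of a plain Scrapy spider (no login, no Splash) could look like the following; the hardcoded category id and the yielded fields are only illustrative:

import json

import scrapy


class BuylistSpider(scrapy.Spider):
    name = "buylist"
    # Assumption: 5061 is the category id for "10th Edition", taken from the
    # request URL above; add more search URLs here for other categories.
    start_urls = [
        "http://www.starcitygames.com/buylist/search?search-type=category&id=5061"
    ]

    def parse(self, response):
        data = json.loads(response.text)
        # "results" is a list of lists of card objects in the JSON shown above.
        for group in data.get("results", []):
            for card in group:
                yield {
                    "name": card["name"],
                    "condition": card["condition"],
                    "price": card["price"],
                    "foil": card["foil"],
                }

Each request here is a normal GET to the same endpoint the page itself calls, which is why neither the login form nor Splash is needed.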

  • That sounds like it would work, but I am still a little confused. Should I set that URL as my start_url? Or should I have my program redirect to that page from the original URL? If I just passed that URL as my start_url, I would be a little concerned about the website owners identifying this program as a web crawler, since most people do not visit the JSON data page of a website. – Tim Jun 25 '19 at 17:58
  • @tnorth2620 If you know which categories you want to scrape, yes, you can hardcode them in the start_urls list. It is much more probable to get detected (if that were a concern) if you use a headless browser. With this solution, you are just making a request to a web API, similar to what the web application you are trying to scrape does. If this works (and it does), stick to it, because it is the simplest and most effective technique in your case. If you still want to use Splash, you should start here: https://splash.readthedocs.io/en/stable/ and here: https://github.com/scrapy-plugins/scrapy-splash – Ionut-Cezar Ciubotariu Jun 28 '19 at 18:08
  • For more information on handling dynamic content with plain Scrapy, see https://docs.scrapy.org/en/latest/topics/dynamic-content.html – Gallaecio Nov 20 '19 at 09:00