
I'm trying to write a Scrapy spider that returns a list of URLs from a site for every page containing a certain class, which I would select with response.css(".class"). I'm not sure this is even possible, though, because the class only appears on the page when a user is logged in.

I've gone through guides on writing Scrapy spiders and, as a sanity check that I didn't write the spider wrong, I've gotten it to return a list of selectors using a different class that I know is on the page whether or not a user is logged in. I really just want to know if this is possible and, if so, what steps I can take to get there.

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        print(response.css(".class"))

The code I have so far is obviously very basic and barely edited from the generated template, as I'm still in the testing phase. Ideally, I want to get a list of selectors which would, if this is possible, then give me a list of URLs for each page where the class is found. All I'm looking for is the URLs of the pages that contain the defined class.
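To make the goal concrete, this is roughly the end result I'm imagining (just a sketch, with .classname as a placeholder for the class I'd actually be looking for):

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # If the page contains the target class, record this page's URL
        if response.css('.classname'):
            yield {'url': response.url}
        # Follow links on the page so other pages get checked too
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)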

ecclark1
  • You've omitted so much, there's no context here to figure out what you're doing. At least share the URL, so we can have a look. https://stackoverflow.com/help/minimal-reproducible-example – abdusco Jul 17 '19 at 09:44
  • Make a POST request to that login endpoint, once that resolves, scrape the page. – abdusco Jul 17 '19 at 09:45
  • If the information you want to scrape is only available to a logged user you need to make your spider log into the site and collect the cookies (Scrapy manages the cookies for you), see here: https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login and here https://stackoverflow.com/a/5850928/10683132 – Luiz Rodrigues da Silva Jul 17 '19 at 11:11

1 Answer


I did not fully understand your problem. I assume you want to get the URLs of links that have a specific class attribute. If that is what you want to do, you can change the definition of the spider's parse method:

def parse(self, response):
    # Extract the href of every <a> element whose class attribute
    # is exactly "classname"
    for url in response.css('a[class="classname"]::attr(href)').getall():
        print(url)
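Note that this selects links (the <a> elements) that carry the class themselves. If the class instead marks the page as a whole and you want that page's own URL, a minimal variation (again with .classname standing in for your class) is:

def parse(self, response):
    # If the class appears anywhere on the page, report this page's URL
    if response.css('.classname'):
        print(response.url)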

If the information you want to scrape is only available when you are logged in to the target website, then you should make a form request for authentication:

class LoginSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['http://www.example.com/login']

    def parse(self, response):
        # Fill in and submit the login form found on the login page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'yourusername', 'password': 'yourpassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # response.body is bytes, so compare against a bytes literal
        if b"login failed" in response.body:
            self.logger.error("Login failed")
            return
        # Scrapy requests need a full URL, including the scheme
        return scrapy.Request(
            url="http://www.webpageyouwanttoscrape.com",
            callback=self.get_all_urls,
        )

    def get_all_urls(self, response):
        for url in response.css('a[class="classname"]::attr(href)').getall():
            print(url)
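Printing is fine for testing, but if you yield items instead, Scrapy's feed exports can collect the URLs into a file for you (e.g. scrapy crawl testspider -o urls.json). A sketch of get_all_urls rewritten that way:

def get_all_urls(self, response):
    for url in response.css('a[class="classname"]::attr(href)').getall():
        # Yield an item instead of printing so the URLs can be exported
        yield {'url': url}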

For more information about form requests, see the Scrapy documentation: https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login