
I'm trying to write a Scrapy spider that returns a list of URLs from a site for every page containing a certain class, which I would select with response.css(".class"). I'm not sure this is even possible, though, because the class only appears on the page when a user is logged in.

I've gone through guides on writing Scrapy spiders and, as a sanity check that I didn't write the spider wrong, I've gotten it to return a list of selectors using a different class that I know is on the page whether or not a user is logged in. I really just want to know if this is possible and, if so, what steps I can take to get there.

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        print(response.css(".class"))

The code I have so far is obviously very basic and barely edited from the generated template, as I'm still in the testing phase. Ideally, I want to get a list of selectors which would, if this is possible, then give me a list of URLs for each page where the class is found. All I'm looking for is the URLs of the pages that contain the defined class.
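To make the goal concrete, this is roughly the end result I'm imagining (just a sketch, with .classname as a placeholder for the class I'd actually be looking for):

import scrapy


class TestSpider(scrapy.Spider):
    name = 'testspider'
    allowed_domains = ['www.example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        # If the page contains the target class, record this page's URL
        if response.css('.classname'):
            yield {'url': response.url}
        # Follow links on the page so other pages get checked too
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)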

ecclark1
  • You've omitted so much, there's no context here to figure out what you're doing. At least share the URL, so we can have a look. https://stackoverflow.com/help/minimal-reproducible-example – abdusco Jul 17 '19 at 09:44
  • Make a POST request to that login endpoint, once that resolves, scrape the page. – abdusco Jul 17 '19 at 09:45
  • If the information you want to scrape is only available to a logged user you need to make your spider log into the site and collect the cookies (Scrapy manages the cookies for you), see here: https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login and here https://stackoverflow.com/a/5850928/10683132 – Luiz Rodrigues da Silva Jul 17 '19 at 11:11

1 Answer


I did not fully understand your problem. I assume you want to get the URLs of links that have a specific class attribute. If that is what you want to do, you can change the definition of the spider's parse method:

def parse(self, response):
    # Extract the href of every <a> element whose class attribute
    # is exactly "classname"
    for url in response.css('a[class="classname"]::attr(href)').getall():
        print(url)
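Note that this selects links (the <a> elements) that carry the class themselves. If the class instead marks the page as a whole and you want that page's own URL, a minimal variation (again with .classname standing in for your class) is:

def parse(self, response):
    # If the class appears anywhere on the page, report this page's URL
    if response.css('.classname'):
        print(response.url)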

If the information you want to scrape is only available when you are logged in to the target website, then you should make a form request for authentication:

class LoginSpider(scrapy.Spider):
    name = 'testspider'
    start_urls = ['http://www.example.com/login']

    def parse(self, response):
        # Fill in and submit the login form found on the login page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'yourusername', 'password': 'yourpassword'},
            callback=self.after_login
        )

    def after_login(self, response):
        # response.body is bytes, so compare against a bytes literal
        if b"login failed" in response.body:
            self.logger.error("Login failed")
            return
        # Scrapy requests need a full URL, including the scheme
        return scrapy.Request(
            url="http://www.webpageyouwanttoscrape.com",
            callback=self.get_all_urls,
        )

    def get_all_urls(self, response):
        for url in response.css('a[class="classname"]::attr(href)').getall():
            print(url)
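Printing is fine for testing, but if you yield items instead, Scrapy's feed exports can collect the URLs into a file for you (e.g. scrapy crawl testspider -o urls.json). A sketch of get_all_urls rewritten that way:

def get_all_urls(self, response):
    for url in response.css('a[class="classname"]::attr(href)').getall():
        # Yield an item instead of printing so the URLs can be exported
        yield {'url': url}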

For more information about form requests, see the Scrapy documentation: https://doc.scrapy.org/en/latest/topics/request-response.html#using-formrequest-from-response-to-simulate-a-user-login