I'm a beginner learning how to webscrape using Scrapy in Python. Can someone point out what's wrong? My goal is to scrape all the subsequent pages.

from indeed.items import IndeedItem
import scrapy

class IndeedSpider(scrapy.Spider):
    name = "ind"
    allowed_domains = ["https://www.indeed.com"]
    start_urls = ['https://www.indeed.com/jobs?q=analytics+intern&start=']

    def parse(self, response):
        job_card = response.css('.jobsearch-SerpJobCard')
        for job in job_card:
            item = IndeedItem()

            job_title = job.css('.jobtitle::attr(title)').extract()
            company_name = job.css('.company .turnstileLink::text').extract()
            if not company_name:
                company_name = job.css('span.company::text').extract()

            item['job_title'] = job_title
            item['company_name'] = company_name
            yield item

        next_page_extension = response.css('ul.pagination-list a::attr(href)').get()
        if next_page_extension is not None:
            next_page = response.urljoin(next_page_extension)
            yield scrapy.Request(next_page, callback=self.parse)
filo babo
  • Hi! We need more details! What is going wrong? Does it run and not give you the output you expect, or does it report an error? – tomjn Apr 26 '21 at 10:31
  • You have two problems. First, `allowed_domains` shouldn't include `"https://www."` (see [here](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains)). Second, your `next_page_extension` is always picking the first item in the navigation. For the second page this sends `scrapy` back to page 1, which it will filter as a duplicate request. – tomjn Apr 26 '21 at 16:43

2 Answers

Your code looks pretty good overall, but I can see two issues with it:

1 - The allowed_domains attribute expects just the domain, not a full URL. Running your spider as-is, you might see something like this in your logs:

2021-04-28 21:10:55 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.indeed.com': <GET https://www.indeed.com/jobs?q=analytics+intern&start=10>

That means Scrapy is ignoring that request because its domain doesn't match allowed_domains. To fix that, simply use:

allowed_domains = ["indeed.com"]

(more about it in the [Scrapy docs](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains))
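
With only that change, the top of your spider would look like this (everything else stays exactly as you already have it):

from indeed.items import IndeedItem
import scrapy

class IndeedSpider(scrapy.Spider):
    name = "ind"
    # A bare domain here, so the offsite middleware stops filtering the requests
    allowed_domains = ["indeed.com"]
    start_urls = ['https://www.indeed.com/jobs?q=analytics+intern&start=']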

2 - The selector you're using for the pagination will always match the first link in the pagination widget. You could use .getall() instead, or select the anchor labelled "Next". E.g.:

next_page_extension = response.css(
    'ul.pagination-list a[aria-label=Next]::attr(href)'
).get()
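
Putting the two fixes together, the end of your parse method would look roughly like this (a sketch based on the code in your question; the aria-label value depends on Indeed's current markup):

    def parse(self, response):
        # ... yield the IndeedItem objects exactly as before ...

        # Follow only the link labelled "Next"; the first pagination link
        # points back to page 1 and would be filtered as a duplicate request.
        next_page_extension = response.css(
            'ul.pagination-list a[aria-label=Next]::attr(href)'
        ).get()
        if next_page_extension is not None:
            next_page = response.urljoin(next_page_extension)
            yield scrapy.Request(next_page, callback=self.parse)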

Thiago Curvelo

The content on that site is dynamically generated using JavaScript, and Scrapy alone doesn't handle JavaScript. You'd need something like Selenium + Scrapy, Splash + Scrapy (as the Scrapy docs suggest), or other means. Selenium is more beginner friendly, and there are plenty of tutorials on how to use it with Scrapy.
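
If you do try the Selenium route, one common way to wire it into Scrapy is the scrapy-selenium package; here is a minimal sketch (the package isn't mentioned in the question, and the driver name, executable path and wait time below are assumptions you'd adjust for your own setup):

# settings.py -- driver name, path and arguments are placeholders for your setup
SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/geckodriver'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# in the spider: use SeleniumRequest so the page is rendered by a real browser
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(
        url='https://www.indeed.com/jobs?q=analytics+intern&start=',
        callback=self.parse,
        wait_time=5,  # give any JavaScript a few seconds to finish
    )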

St_Mute