I'm a beginner learning how to webscrape using Scrapy in Python. Can someone point out what's wrong? My goal is to scrape all the subsequent pages.

from indeed.items import IndeedItem
import scrapy

class IndeedSpider(scrapy.Spider):
    name = "ind"
    allowed_domains = ["https://www.indeed.com"]
    start_urls = ['https://www.indeed.com/jobs?q=analytics+intern&start=']

    def parse(self, response):
        job_card = response.css('.jobsearch-SerpJobCard')
        for job in job_card:
            item = IndeedItem()

            job_title = job.css('.jobtitle::attr(title)').extract()
            company_name = job.css('.company .turnstileLink::text').extract()
            if not company_name:
                company_name = job.css('span.company::text').extract()

            item['job_title'] = job_title
            item['company_name'] = company_name
            yield item

        next_page_extension = response.css('ul.pagination-list a::attr(href)').get()
        if next_page_extension is not None:
            next_page = response.urljoin(next_page_extension)
            yield scrapy.Request(next_page, callback=self.parse)
filo babo
  • Hi! We need more details! What is going wrong? Does it run and not give you the output you expect, or does it report an error? – tomjn Apr 26 '21 at 10:31
  • You have two problems. First, `allowed_domains` shouldn't include `"https://www."` (see [here](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains)). Second, your `next_page_extension` is always picking the first item in the navigation. For the second page this sends `scrapy` back to page 1, which it will filter as a duplicate request. – tomjn Apr 26 '21 at 16:43

2 Answers

Your code looks pretty good overall, but I can see two issues with it:

1 - The allowed_domains attribute expects just the domain, not a full URL. Running your spider as-is, you might see something like this in your logs:

2021-04-28 21:10:55 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.indeed.com': <GET https://www.indeed.com/jobs?q=analytics+intern&start=10>

That means Scrapy is ignoring that request because its domain doesn't match allowed_domains. To fix that, simply use:

allowed_domains = ["indeed.com"]

(more about it in the [Scrapy docs](https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains))
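
With only that change, the top of your spider would look like this (everything else stays exactly as you already have it):

from indeed.items import IndeedItem
import scrapy

class IndeedSpider(scrapy.Spider):
    name = "ind"
    # A bare domain here, so the offsite middleware stops filtering the requests
    allowed_domains = ["indeed.com"]
    start_urls = ['https://www.indeed.com/jobs?q=analytics+intern&start=']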

2 - The selector you're using for the pagination will always match the first link in the pagination widget. You could use .getall() instead, or select the anchor labelled "Next". E.g.:

next_page_extension = response.css(
    'ul.pagination-list a[aria-label=Next]::attr(href)'
).get()
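
Putting the two fixes together, the end of your parse method would look roughly like this (a sketch based on the code in your question; the aria-label value depends on Indeed's current markup):

    def parse(self, response):
        # ... yield the IndeedItem objects exactly as before ...

        # Follow only the link labelled "Next"; the first pagination link
        # points back to page 1 and would be filtered as a duplicate request.
        next_page_extension = response.css(
            'ul.pagination-list a[aria-label=Next]::attr(href)'
        ).get()
        if next_page_extension is not None:
            next_page = response.urljoin(next_page_extension)
            yield scrapy.Request(next_page, callback=self.parse)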

Thiago Curvelo

The content on that site is dynamically generated using JavaScript, and Scrapy alone doesn't handle JavaScript. You'd need something like Selenium + Scrapy, Splash + Scrapy (as the Scrapy docs suggest), or other means. Selenium is more beginner friendly, and there are plenty of tutorials on how to use it with Scrapy.
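
If you do try the Selenium route, one common way to wire it into Scrapy is the scrapy-selenium package; here is a minimal sketch (the package isn't mentioned in the question, and the driver name, executable path and wait time below are assumptions you'd adjust for your own setup):

# settings.py -- driver name, path and arguments are placeholders for your setup
SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = '/usr/local/bin/geckodriver'
SELENIUM_DRIVER_ARGUMENTS = ['-headless']
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}

# in the spider: use SeleniumRequest so the page is rendered by a real browser
from scrapy_selenium import SeleniumRequest

def start_requests(self):
    yield SeleniumRequest(
        url='https://www.indeed.com/jobs?q=analytics+intern&start=',
        callback=self.parse,
        wait_time=5,  # give any JavaScript a few seconds to finish
    )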

St_Mute