
I am building a Scrapy spider, WuzzufLinks, that scrapes all the links to specific jobs on a job website at this URL: https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt

After scraping the links, I would like to send them to another spider WuzzufSpider, which scrapes data from inside each link. The start_urls would be the first link in the scraped list, and the next_page would be the following link, and so on.

I have thought of importing WuzzufLinks into WuzzufSpider and then accessing its data:

import scrapy
from ..items import WuzzufscraperItem


class WuzzuflinksSpider(scrapy.Spider):
    name = 'WuzzufLinks'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        items = WuzzufscraperItem()

        jobURL = response.css('h2[class=css-m604qf] a::attr(href)').extract()

        items['jobURL'] = jobURL

        yield items

        next_page = 'https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(WuzzuflinksSpider.page_number)
        if WuzzuflinksSpider.page_number <= 100:
            yield response.follow(next_page, callback = self.parse)
            WuzzuflinksSpider.page_number += 1
# WuzzufSpider

import scrapy
from ..items import WuzzufscraperItem
from spiders.WuzzufLinks import WuzzuflinksSpider


class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    parseClass = WuzzuflinksSpider().parse()
    start_urls = []

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()

        # next_page and if statement here

Regardless of whether I have written the outlined parts correctly, I have realized that accessing jobURL would return an empty value, since it is only a temporary container. I have thought of saving the scraped links in another file and then importing them into WuzzufSpider, but I don't know whether the import is valid and whether they would still be a list:

# links.xml

<?xml version="1.0" encoding="utf-8"?>
<items>
<item><jobURL><value>/jobs/p/P5A2NWkkWfv6-Sales-Operations-Specialist-Amreyah-Cement---InterCement-Alexandria-Egypt?o=1&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/pEmZ96R097N3-Senior-Laravel-Developer-Learnovia-Cairo-Egypt?o=2&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/IgHkjP37ymQp-French-Talent-Acquisition-Specialist-Guide-Academy-Giza-Egypt?o=3&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/zOLTqLqegEZe-Export-Sales-Representative-packtec-Cairo-Egypt?o=4&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/U3Q1TDpxzsJJ-Finishing-Site-Engineer--Assiut-Assiut-Egypt?o=5&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/7aQ4QxtYV8N6-Senior-QC-Automation-Engineer-FlairsTech-Cairo-Egypt?o=6&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/qHWyGU7ClMG6-Technical-Office-Engineer-Cairo-Egypt?o=7&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/ptN7qnERUvPT-B2B-Sales-Representative-Smart-Zone-Cairo-Egypt?o=8&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/VUVc0ZAyUNYU-Digital-Marketing-supervisor-National-Trade-Distribution-Cairo-Egypt?o=9&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/WzJhyeVpT5jb-Receptionist-Value-Cairo-Egypt?o=10&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/PAdZOdzWjqbr-Insurance-Specialist-Bancassuranc---Sohag-Allianz-Sohag-Egypt?o=11&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/nJD6YbE4QjNX-Senior-Research-And-Development-Specialist-Cairo-Egypt?o=12&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/DVvMG4BFWEeI-Technical-Sales-Engineer-Masria-Group-Cairo-Egypt?o=13&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/3RtCveEFjveW-Technical-Office-Engineer-Masria-Group-Cairo-Egypt?o=14&amp;l=sp&amp;t=sj&amp;a=search-v3</value><value>/jobs/p/kswGaw4kXTe8-Administrator-Kreston-Cairo-Egypt?o=15&amp;l=sp&amp;t=sj&amp;a=search-v3</value></jobURL></item>
</items>
# WuzzufSpider

import scrapy
from ..items import WuzzufscraperItem
from links import jobURL


class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    start_urls = [jobURL[0]]

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()
        
        # next_page and if statement here

Is there a way to make the second method work, or is a completely different approach needed?

I have checked the questions Scrapy: Pass data between 2 spiders and Pass scraped URL's from one spider to another. I understand that I can do all of the work in one spider, and that there is a way to save to a database or a temporary file in order to send data to another spider. However, I am not yet very experienced and don't understand how to implement such changes, so marking this question as a duplicate won't help me. Thank you for your help.

Aya Noaman

1 Answer


First of all, you can keep crawling the URLs from the same spider, and honestly I don't see a reason for you not to.

Anyway, if you really want to have two spiders, where the output of the first will be the input of the second, you can do something like this:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.signalmanager import dispatcher
from scrapy import signals
from twisted.internet import reactor, defer


# grab all the products urls
class ExampleSpider(scrapy.Spider):
    name = "exampleSpider"
    start_urls = ['https://scrapingclub.com/exercise/list_basic']

    def parse(self, response):
        all_urls = response.xpath('//div[@class="card"]/a/@href').getall()
        for url in all_urls:
            yield {'url': 'https://scrapingclub.com' + url}


# get the product's details
class ExampleSpider2(scrapy.Spider):
    name = "exampleSpider2"

    def parse(self, response):
        title = response.xpath('//h3/text()').get()
        price = response.xpath('//div[@class="card-body"]//h4//text()').get()
        yield {
            'title': title,
            'price': price
        }


if __name__ == "__main__":
    # this will be the yielded items from the first spider
    output = []

    def get_output(item):
        output.append(item)

    configure_logging()
    settings = get_project_settings()
    settings['USER_AGENT'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    runner = CrawlerRunner(settings)

    # run spiders sequentially
    # (https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process)
    @defer.inlineCallbacks
    def crawl():
        dispatcher.connect(get_output, signal=signals.item_scraped)
        yield runner.crawl('exampleSpider')
        urls = [url['url'] for url in output]   # create a list of the urls from the first spider

        # crawl the second spider with the urls from the first spider
        yield runner.crawl('exampleSpider2', start_urls=urls)
        reactor.stop()

    crawl()
    reactor.run()

Run this and you will see that you first get the results from the first spider, and that those results are passed as the "start_urls" for the second spider. (Keyword arguments passed to runner.crawl() are forwarded to the spider's constructor, which is why "start_urls" can be set this way.)
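
If you would rather go through a temporary file instead, as mentioned in the question, a minimal sketch of the second spider could look like this. It assumes the first spider's items were exported with Scrapy's built-in JSON feed export (e.g. scrapy crawl WuzzufLinks -O links.json); the file name and the placeholder parse method are just examples:

import json

import scrapy


class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'

    def start_requests(self):
        # links.json is assumed to be the feed export of the first spider, e.g.:
        #   scrapy crawl WuzzufLinks -O links.json
        # which produces a JSON list of items like {"jobURL": ["/jobs/p/...", ...]}
        with open('links.json', encoding='utf-8') as f:
            exported = json.load(f)
        for item in exported:
            for url in item.get('jobURL', []):
                # the scraped hrefs are relative, so prepend the domain
                yield scrapy.Request('https://wuzzuf.net' + url, callback=self.parse)

    def parse(self, response):
        # fill in the job-page selectors here, as in the question's WuzzufSpider
        yield {'jobURL': response.url}

The same idea works with a database or any other storage: the second spider only needs a start_requests method that reads the saved URLs and yields a Request for each one.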

EDIT:

Doing it all in the same spider: see how we loop over all the URLs in "parse" and scrape each one in the "parse_item" function. I filled in some of the values you want to scrape as an example, so just fill in the rest and you're done.

import scrapy
# from ..items import WuzzufscraperItem


class WuzzufscraperItem(scrapy.Item):
    title = scrapy.Field()
    company = scrapy.Field()
    location = scrapy.Field()
    country = scrapy.Field()
    jobURL = scrapy.Field()
    date = scrapy.Field()
    careerLevel = scrapy.Field()
    experienceNeeded = scrapy.Field()
    jobType = scrapy.Field()
    jobFunction = scrapy.Field()
    salary = scrapy.Field()
    description = scrapy.Field()
    requirements = scrapy.Field()
    skills = scrapy.Field()
    industry = scrapy.Field()


class WuzzuflinksSpider(scrapy.Spider):
    name = 'WuzzufLinks'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        all_urls = response.css('h2[class=css-m604qf] a::attr(href)').getall()

        if all_urls:
            for url in all_urls:
                yield response.follow(url=url, callback=self.parse_item)

        next_page = 'https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(WuzzuflinksSpider.page_number)

        if WuzzuflinksSpider.page_number <= 100:
            yield response.follow(next_page)
            WuzzuflinksSpider.page_number += 1

    def parse_item(self, response):
        items = WuzzufscraperItem()
        # CSS selectors

        # Some values as an example:
        items['title'] = response.xpath('(//h1)[last()]/text()').get(default='')
        items['company'] = response.xpath('(//a[@class="css-p7pghv"])[last()]/text()').get(default='')
        items['location'] = response.xpath('(//strong[@class="css-9geu3q"])[last()]/text()').get(default='')
        items['country'] = response.xpath('//meta[@property="og:country_name"]/@content').get(default='')
        items['jobURL'] = response.url

        # items['date'] = response.css('').get(default='')
        # items['careerLevel'] = response.css('').get(default='')
        # items['experienceNeeded'] = response.css('').get(default='')
        # items['jobType'] = response.css('').get(default='')
        # items['jobFunction'] = response.css('').get(default='')
        # items['salary'] = response.css('').get(default='')
        # items['description'] = response.css('').get(default='')
        # items['requirements'] = response.css('').get(default='')
        # items['skills'] = response.css('').get(default='')
        # items['industry'] = response.css('').get(default='')

        yield items
SuperUser
  • It's not that I don't *want* to have two spiders, I just didn't know it was possible to combine them into one. I'll try this way, thanks a lot; but if having one spider is actually simpler, I'll definitely look into it. – Aya Noaman Dec 30 '21 at 16:11
  • So all of this is written only in one file in the spiders folder? Could you please explain how running one spider uses the `__main__` statement, or is this just assuming I don't have any other files in my Scrapy project? – Aya Noaman Dec 30 '21 at 17:25
  • Yes, it's all written in the same file in the spiders folder. I only did it so you could see everything in one place instead of writing each file individually. If I really wanted to use it then I would've separated it into multiple files. Read about `__main__` [here](https://www.freecodecamp.org/news/if-name-main-python-example/). – SuperUser Dec 30 '21 at 17:33
  • @AyaNoaman If you edit your post and add the code as text then I'll help you write this as a single spider. I still want to keep the current answer since it can help other people. – SuperUser Dec 30 '21 at 17:36
  • I've edited and posted the code now; I'll also try to work out one spider while you get to this. – Aya Noaman Dec 31 '21 at 15:40
  • @AyaNoaman Please see the edit. – SuperUser Jan 01 '22 at 10:43
  • So parse_item is just used for scraping items, not looping through links? How does the spider recognize that it must follow https://wuzzuf.net/ + url after it is done scraping links? (The URLs lack the domain name.) – Aya Noaman Jan 01 '22 at 12:09
  • Unless you're following them on the go using `yield response.follow(url=url, callback=self.parse_item)` and scraping the links before proceeding to the next page of links? – Aya Noaman Jan 01 '22 at 12:12
  • In that case I think I can just make the url part `url='https://wuzzuf.net'+url` – Aya Noaman Jan 01 '22 at 12:13
  • `parse_item` is used to scrape the job webpage. The looping is in the `parse` function. I used `response.follow`; if you want to use the absolute URL, use `scrapy.Request`. Whatever you prefer. – SuperUser Jan 01 '22 at 12:41
  • so adding the domain as a string to the rest of the URL won't work using `response.follow`? – Aya Noaman Jan 01 '22 at 12:46
  • With `response.follow` the relative URL will work; with `scrapy.Request` it won't. Since I used `response.follow` I can use the relative URL, and I don't see a reason not to, so I didn't add the domain (see the short sketch after these comments). @AyaNoaman If you're happy with my answer then please accept it. – SuperUser Jan 01 '22 at 13:15
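
To illustrate what the last few comments are about, here is a minimal sketch of the link loop from the "parse" method in the edit, written both ways (the commented-out line is the scrapy.Request alternative, which needs the absolute URL built by hand):

    def parse(self, response):
        for url in response.css('h2[class=css-m604qf] a::attr(href)').getall():
            # response.follow() resolves the relative href against response.url
            yield response.follow(url=url, callback=self.parse_item)

            # with scrapy.Request the absolute URL has to be built manually, e.g.:
            # yield scrapy.Request('https://wuzzuf.net' + url, callback=self.parse_item)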