I am building a Scrapy spider, WuzzufLinks, that scrapes the links to individual job postings from this job-site search page:
https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt
After scraping the links, I would like to send them to another spider, WuzzufSpider, which scrapes the data inside each link. Its start_urls would be the first link in the scraped list, next_page would be the following link, and so on.
I have thought of importing WuzzufLinks into WuzzufSpider and then accessing its data:
import scrapy
from ..items import WuzzufscraperItem


class WuzzuflinksSpider(scrapy.Spider):
    name = 'WuzzufLinks'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        items = WuzzufscraperItem()
        jobURL = response.css('h2[class=css-m604qf] a::attr(href)').extract()
        items['jobURL'] = jobURL
        yield items

        next_page = 'https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(WuzzuflinksSpider.page_number)
        if WuzzuflinksSpider.page_number <= 100:
            yield response.follow(next_page, callback=self.parse)
            WuzzuflinksSpider.page_number += 1
# WuzzufSpider
import scrapy
from ..items import WuzzufscraperItem
from spiders.WuzzufLinks import WuzzuflinksSpider


class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    parseClass = WuzzuflinksSpider().parse()
    start_urls = []

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()

        # next_page and if statement here
Regardless of whether I have written the outlined parts correctly, I have realized that accessing jobURL this way would return an empty value, since the item is only a temporary container. I have also thought of saving the scraped links to a separate file and then importing them into WuzzufSpider, but I don't know whether that import is valid and whether the links would still be a list:
# links.xml
<?xml version="1.0" encoding="utf-8"?>
<items>
<item><jobURL><value>/jobs/p/P5A2NWkkWfv6-Sales-Operations-Specialist-Amreyah-Cement---InterCement-Alexandria-Egypt?o=1&l=sp&t=sj&a=search-v3</value><value>/jobs/p/pEmZ96R097N3-Senior-Laravel-Developer-Learnovia-Cairo-Egypt?o=2&l=sp&t=sj&a=search-v3</value><value>/jobs/p/IgHkjP37ymQp-French-Talent-Acquisition-Specialist-Guide-Academy-Giza-Egypt?o=3&l=sp&t=sj&a=search-v3</value><value>/jobs/p/zOLTqLqegEZe-Export-Sales-Representative-packtec-Cairo-Egypt?o=4&l=sp&t=sj&a=search-v3</value><value>/jobs/p/U3Q1TDpxzsJJ-Finishing-Site-Engineer--Assiut-Assiut-Egypt?o=5&l=sp&t=sj&a=search-v3</value><value>/jobs/p/7aQ4QxtYV8N6-Senior-QC-Automation-Engineer-FlairsTech-Cairo-Egypt?o=6&l=sp&t=sj&a=search-v3</value><value>/jobs/p/qHWyGU7ClMG6-Technical-Office-Engineer-Cairo-Egypt?o=7&l=sp&t=sj&a=search-v3</value><value>/jobs/p/ptN7qnERUvPT-B2B-Sales-Representative-Smart-Zone-Cairo-Egypt?o=8&l=sp&t=sj&a=search-v3</value><value>/jobs/p/VUVc0ZAyUNYU-Digital-Marketing-supervisor-National-Trade-Distribution-Cairo-Egypt?o=9&l=sp&t=sj&a=search-v3</value><value>/jobs/p/WzJhyeVpT5jb-Receptionist-Value-Cairo-Egypt?o=10&l=sp&t=sj&a=search-v3</value><value>/jobs/p/PAdZOdzWjqbr-Insurance-Specialist-Bancassuranc---Sohag-Allianz-Sohag-Egypt?o=11&l=sp&t=sj&a=search-v3</value><value>/jobs/p/nJD6YbE4QjNX-Senior-Research-And-Development-Specialist-Cairo-Egypt?o=12&l=sp&t=sj&a=search-v3</value><value>/jobs/p/DVvMG4BFWEeI-Technical-Sales-Engineer-Masria-Group-Cairo-Egypt?o=13&l=sp&t=sj&a=search-v3</value><value>/jobs/p/3RtCveEFjveW-Technical-Office-Engineer-Masria-Group-Cairo-Egypt?o=14&l=sp&t=sj&a=search-v3</value><value>/jobs/p/kswGaw4kXTe8-Administrator-Kreston-Cairo-Egypt?o=15&l=sp&t=sj&a=search-v3</value></jobURL></item>
</items>
# WuzzufSpider
import scrapy
from ..items import WuzzufscraperItem
from links import jobURL


class WuzzufspiderSpider(scrapy.Spider):
    name = 'WuzzufSpider'
    start_urls = [jobURL[0]]

    def parse(self, response):
        items = WuzzufscraperItem()
        # CSS selectors
        title = response.css('').extract()
        company = response.css('').extract()
        location = response.css('').extract()
        country = response.css('').extract()
        date = response.css('').extract()
        careerLevel = response.css('').extract()
        experienceNeeded = response.css('').extract()
        jobType = response.css('').extract()
        jobFunction = response.css('').extract()
        salary = response.css('').extract()
        description = response.css('').extract()
        requirements = response.css('').extract()
        skills = response.css('').extract()
        industry = response.css('').extract()
        jobURL = response.css('').extract()

        # next_page and if statement here
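If importing from links.xml like that is not valid, would something like the following work instead? It is an untested sketch: I would parse the file with the standard library and join the relative paths onto the site's domain, since the scraped hrefs start with /jobs/p/... rather than being full URLs. The file path, the <value> layout, and the assumption that the exporter escaped & as &amp; (a raw & would make the XML not well-formed) are all just my guesses about the export shown above.
# load_links.py (hypothetical helper, lives next to the spiders)
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = 'https://wuzzuf.net'

def load_job_urls(path='links.xml'):
    # Parse the exported feed and collect the text of every <value> element
    tree = ET.parse(path)
    relative_urls = [value.text for value in tree.iter('value') if value.text]
    # The scraped hrefs are relative, so join them onto the domain
    return [urljoin(BASE_URL, url) for url in relative_urls]

# In WuzzufSpider this could then be used as:
#   start_urls = load_job_urls()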
Is there a way to make the second method work, or is there a completely different approach?
I have checked the questions Scrapy: Pass data between 2 spiders and Pass scraped URL's from one spider to another. I understand that I could do all of the work in one spider, and that I could save the data to a database or a temporary file in order to send it to another spider. However, I am not yet very experienced and don't understand how to implement such changes, so marking this question as a duplicate won't help me. Thank you for your help.
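For reference, this is roughly what I understand the one-spider approach to mean: the same spider follows each job link to a second callback instead of yielding the links as an item. This is only a sketch based on my current understanding (the combined spider name and parse_job callback are made up, and the detail selectors are still placeholders):
import scrapy
from ..items import WuzzufscraperItem


class WuzzufCombinedSpider(scrapy.Spider):
    # Hypothetical combined spider: collects the links and follows each one
    name = 'WuzzufCombined'
    page_number = 1
    start_urls = ['https://wuzzuf.net/search/jobs/?filters%5Bcountry%5D%5B0%5D=Egypt']

    def parse(self, response):
        # Follow every job link on the results page to parse_job
        for href in response.css('h2[class=css-m604qf] a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_job)

        # Then request the next results page, as in WuzzufLinks
        if self.page_number <= 100:
            next_page = ('https://wuzzuf.net/search/jobs/'
                         '?filters%5Bcountry%5D%5B0%5D=Egypt&start=' + str(self.page_number))
            self.page_number += 1
            yield response.follow(next_page, callback=self.parse)

    def parse_job(self, response):
        # Same placeholders as in WuzzufSpider; the selectors still need to be filled in
        items = WuzzufscraperItem()
        items['jobURL'] = response.url
        # title = response.css('').extract()
        # ...
        yield items
Is that the right direction, or is the database/temporary-file route the better way to keep the two spiders separate?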