I am looking for some help with my Scrapy project. I want to use Scrapy to code a generic spider that would crawl multiple websites from a list. I was hoping to keep the list in a separate file, because it's quite large. For each website, the spider will navigate through internal links, and on each page it will collect every external link.
There are too many websites to create one spider per website. I want to scrape only external links, meaning "absolute" links whose domain name is different from the domain of the website where the link is found (subdomains would still count as internal links from my point of view).
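To make it concrete, this is the kind of check I have in mind for deciding whether a link is external (the file name start_domains.txt and the helper is_external are just names I made up for this sketch):

from urllib.parse import urlparse

def load_start_domains(path="start_domains.txt"):
    # one domain per line, e.g. "loremipsum.io" (file name is hypothetical)
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def is_external(link_url, crawled_domain):
    # a link is external when its host is neither the crawled domain
    # nor one of its subdomains
    netloc = urlparse(link_url).netloc.lower()
    crawled_domain = crawled_domain.lower()
    return not (netloc == crawled_domain or netloc.endswith("." + crawled_domain))

So is_external("https://facebook.com/page", "loremipsum.io") would be True, while a link to blog.loremipsum.io would be treated as internal.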
Eventually, I want to export the results in a CSV with the following fields:
- domain of the website being crawled (from the list),
- page_url (where the external link was found),
- external_link

If the same external link is found several times on the same page, it should be deduplicated. I'm not sure yet, but I might also want to deduplicate external links at the website scope at some point. A sketch of the item and CSV export I have in mind follows just below.
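For the CSV, I imagine yielding plain dict items and letting Scrapy's feed export write the file, roughly like this (external_links.csv is a placeholder name, and is_external / self.current_domain are the hypothetical helpers from the sketch above):

custom_settings = {
    # let Scrapy's built-in feed export produce the CSV (output path is a placeholder)
    'FEEDS': {'external_links.csv': {'format': 'csv'}},
    'FEED_EXPORT_FIELDS': ['domain', 'page_url', 'external_link'],
}

def parse(self, response):
    seen_on_page = set()  # dedupe external links within a single page
    for link in self.link_extractor.extract_links(response):
        if is_external(link.url, self.current_domain) and link.url not in seen_on_page:
            seen_on_page.add(link.url)
            yield {
                'domain': self.current_domain,
                'page_url': response.url,
                'external_link': link.url,
            }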
At some point, I would also like to:
- filter out certain external links so that they are not collected, such as facebook.com/... etc. (see the sketch after this list),
- run the script from Zyte.com. I believe that constrains me to a certain structure for the code, rather than just a standalone script; any suggestion on that aspect would really help too.
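For the filtering, I assume LinkExtractor's deny_domains argument could do it (the list here is only an example):

# links pointing at these domains would be skipped (example list, not exhaustive)
DENY_DOMAINS = ['facebook.com', 'twitter.com']

link_extractor = LinkExtractor(unique=True, deny_domains=DENY_DOMAINS)

As for Zyte (Scrapy Cloud), my understanding is that the spider has to live in a regular Scrapy project (scrapy.cfg, settings.py, a spiders/ package) deployed with shub, rather than a standalone script started via CrawlerProcess, but I may be wrong about that.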
After a lot of research, I found this reference: https://coderedirect.com/questions/369975/dynamic-rules-based-on-start-urls-for-scrapy-crawlspider
But it wasn't clear to me how to make it work, because it doesn't include a full version of the code.
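From what I understood of that reference, the idea is to derive allowed_domains and the crawling rules from the start URLs when the spider is initialised. This is how I picture it (a minimal sketch, assuming the URLs come from a hypothetical start_urls.txt; parse_page is a name I made up):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from urllib.parse import urlparse

class GenericLinksSpider(CrawlSpider):
    name = 'generic_external_links'

    def __init__(self, *args, **kwargs):
        # load the start URLs from a separate file (file name is a placeholder)
        with open('start_urls.txt') as f:
            self.start_urls = [line.strip() for line in f if line.strip()]
        # only crawl the domains of the start URLs, i.e. internal links
        self.allowed_domains = [urlparse(url).netloc for url in self.start_urls]
        # rules must exist before CrawlSpider.__init__ compiles them
        self.rules = (
            Rule(LinkExtractor(allow_domains=self.allowed_domains),
                 callback='parse_page', follow=True),
        )
        super().__init__(*args, **kwargs)

    def parse_page(self, response):
        # this is where the external links of each visited page would be collected
        pass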
So far, the code I have developed is below, but I am stuck, as it does not fulfill my needs:
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
# import the link extractor
from scrapy.linkextractors import LinkExtractor
import os


class LinksSpider(scrapy.Spider):
    name = 'publishers_websites'
    start_urls = ['https://loremipsum.io/']
    allowed_domains = ['loremipsum.io']

    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
    }

    # remove the output of a previous run (runs at class definition time)
    try:
        os.remove('publishers_websites.txt')
    except OSError:
        pass

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        domain = 'https://loremipsum.io/'
        all_links = self.link_extractor.extract_links(response)
        for link in all_links:
            if domain not in link.url:
                # append the (page, external link) pair to a text file
                with open('publishers_websites.txt', 'a+') as f:
                    f.write(f"\n{str(response.request.url), str(link.url)}")
            yield response.follow(url=link, callback=self.parse)


if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LinksSpider)
    process.start()
There aren't many answers to this problem out there, and my Python skills are not good enough to solve it by myself.
I would be very grateful for any help I receive.