
I am looking for some help with my Scrapy project. I want to use Scrapy to code a generic spider that would crawl multiple websites from a list. I was hoping to keep the list in a separate file, because it is quite large. For each website, the spider will navigate through internal links and, on each page, collect every external link.

I believe there are too many websites to create one spider per website. I want to scrape only external links, meaning "absolute" links whose domain name is different from the domain of the website where the link is found (subdomains would still count as internal links from my point of view).
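
To make that concrete, here is roughly the check I have in mind for "external" (just a sketch; is_external is my own helper name, and something like tldextract would handle registered domains more robustly):

from urllib.parse import urlparse


def is_external(link_url, site_domain):
    """Sketch only: a link is external if its host is neither the site's domain
    nor one of its subdomains (so blog.loremipsum.io would still count as internal)."""
    host = urlparse(link_url).netloc.lower()
    site_domain = site_domain.lower()
    return bool(host) and host != site_domain and not host.endswith('.' + site_domain)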

Eventually, I want to export the results in a CSV with the following fields:

  • domain of the website being crawled (from the list),
  • page_url (where the external link was found),
  • external_link (the external link itself). If the same external link is found several times on the same page, it should be deduplicated. I am not sure yet, but I might also want to dedup external links at the website scope at some point.

At some point, I would also like to:

  • filter out certain external links so they are not collected, such as facebook.com/... etc.,
  • run the script from Zyte.com. I believe this constrains me to follow a certain code structure, rather than just writing a standalone script. Any suggestion on that aspect would really help too.
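
To make the target output concrete, here is a minimal sketch of what I am aiming for (the field names are the ones listed above; FEEDS/FEED_EXPORT_FIELDS are just one way to get the CSV, and the deny list is only an example):

import scrapy


class ExternalLinksSketchSpider(scrapy.Spider):
    # Sketch only: illustrates the CSV columns, the per-page dedup and a small
    # deny list; it does not yet handle multiple websites or the external check.
    name = 'external_links_sketch'
    start_urls = ['https://loremipsum.io/']
    custom_settings = {
        'FEEDS': {'external_links.csv': {'format': 'csv'}},
        # column order of the exported CSV
        'FEED_EXPORT_FIELDS': ['domain', 'page_url', 'external_link'],
    }
    EXCLUDED_DOMAINS = ('facebook.com',)  # example deny list, to be extended

    def parse(self, response):
        seen = set()  # dedup identical links found on the same page
        for href in response.css('a::attr(href)').getall():
            url = response.urljoin(href)
            if url in seen or any(d in url for d in self.EXCLUDED_DOMAINS):
                continue
            seen.add(url)
            yield {'domain': 'loremipsum.io',
                   'page_url': response.url,
                   'external_link': url}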

After a lot of research, I found this reference: https://coderedirect.com/questions/369975/dynamic-rules-based-on-start-urls-for-scrapy-crawlspider

But it wasn't clear to me how to make it work, because it's missing a full version of the code.

So far, the code I developed is as below, but I am stuck, as it does not fulfill my needs:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy import Request
# import the link extractor
from scrapy.linkextractors import LinkExtractor
import os


class LinksSpider(scrapy.Spider):
    name = 'publishers_websites'
    start_urls = ['https://loremipsum.io/']
    allowed_domains = ['loremipsum.io']
    # identify the crawler with a browser user agent and throttle the crawl
    custom_settings = {
        'USER_AGENT': "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36",
        'CONCURRENT_REQUESTS': 2,
        'AUTOTHROTTLE_ENABLED': True,
    }

    # remove the output file of a previous run (runs once, when the class is defined)
    try:
        os.remove('publishers_websites.txt')
    except OSError:
        pass

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.link_extractor = LinkExtractor(unique=True)

    def parse(self, response):
        domain = 'https://loremipsum.io/'
        all_links = self.link_extractor.extract_links(response)
        for link in all_links:
            # naive external check: the crawled domain does not appear in the link url
            if domain not in link.url:
                with open('publishers_websites.txt', 'a+') as f:
                    f.write(f"\n{response.request.url}, {link.url}")

            yield response.follow(url=link, callback=self.parse)

if __name__ == "__main__":
    process = CrawlerProcess()
    process.crawl(LinksSpider)
    process.start()

There aren't many answers to my problem, and my Python skills are not good enough to solve it by myself.

I would be very grateful for any help I receive.

Alban

3 Answers


Please read about CrawlSpider and rules.

For example:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class example(CrawlSpider):
    name = "example_spider"
    start_urls = ['https://example.com']
    rules = (Rule(LinkExtractor(), callback='parse_urls', follow=True),)

    def parse_urls(self, response):
        for url in response.xpath('//a/@href').getall():
            if url:
                yield {
                    'url': url
                }

Maybe you'll want to add a function to check whether it's a valid URL, or extend relative URLs to full URLs. But generally speaking, this example should work.
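
For example, the relative-to-absolute conversion and a basic validity check could look like this (just a sketch, written as a drop-in for parse_urls above):

from urllib.parse import urlparse

# drop-in replacement for parse_urls in the CrawlSpider above (sketch only)
def parse_urls(self, response):
    for href in response.xpath('//a/@href').getall():
        url = response.urljoin(href)  # turn relative urls into absolute ones
        # keep only http(s) links, skipping mailto:, javascript:, tel:, etc.
        if urlparse(url).scheme in ('http', 'https'):
            yield {'url': url}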

(Just create an __init__ function to load your file into start_urls, plus anything else you want to add.)

(And I don't know anything about Zyte...)

Edit1:

You can also use another link extractor inside parse_urls if that's more comfortable for you.
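
For instance, a second LinkExtractor inside parse_urls could keep only links pointing outside the crawled domain (a sketch; deny_domains would need to hold the domain of the site actually being crawled):

from scrapy.linkextractors import LinkExtractor

# sketch: drop-in parse_urls that yields only external links
def parse_urls(self, response):
    external_extractor = LinkExtractor(deny_domains=['example.com'], unique=True)
    for link in external_extractor.extract_links(response):
        yield {'page_url': response.url, 'external_link': link.url}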

Edit2:

As for getting the URLs from a file, you can do it in the __init__ function:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class example(CrawlSpider):
    name = "example_spider"

    def __init__(self, *args, **kwargs):
        # the rule must be defined before calling the parent __init__, which compiles it
        self.rules = (Rule(LinkExtractor(allow_domains=['example.com']), callback='parse_urls', follow=True),)
        with open('urlsfile.txt', 'r') as f:
            self.start_urls = [line.strip() for line in f.readlines()]
        super(example, self).__init__(*args, **kwargs)

    def parse_urls(self, response):
        for url in response.xpath('//a/@href').getall():
            if url:
                yield {
                    'url': url
                }
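
If every line of urlsfile.txt is a different website, the allowed domains can also be built from the same file instead of being hard-coded; roughly something like this (a sketch, assuming one URL or bare domain per line):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class example(CrawlSpider):
    # variation on the example above (sketch): derive allowed domains from urlsfile.txt
    name = "example_spider"

    def __init__(self, *args, **kwargs):
        with open('urlsfile.txt', 'r') as f:
            lines = [line.strip() for line in f if line.strip()]
        # accept either full URLs or bare domains in the file
        self.start_urls = [u if u.startswith('http') else f'https://{u}' for u in lines]
        domains = [urlparse(u).netloc for u in self.start_urls]
        # a single rule restricted to every listed domain keeps the crawl on those sites
        self.rules = (Rule(LinkExtractor(allow_domains=domains),
                           callback='parse_urls', follow=True),)
        super().__init__(*args, **kwargs)

    def parse_urls(self, response):
        for url in response.xpath('//a/@href').getall():
            if url:
                yield {'url': url}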

SuperUser
  • Thanks a lot for your kind help! OK for using CrawlSpider as the class. However, I still do not understand how I can pass a list of hundreds of websites that I would like to crawl with the same spider (at the same time, one by one, whatever). I think I can't really mix the different domains I want to crawl in the start_urls list. Any idea? Thanks – Alban Nov 09 '21 at 16:30
  • Thanks, I know how to generate the start_urls with your second edit. However, it's not clear to me how to "contain" the spider so that it will only navigate across pages whose URLs are on the domains in urlsfile.txt. After a quick test, the current script navigated to other domains, ready to scrape the whole internet :) I want to scrape external links while crawling internal pages. How would you do that? Thanks – Alban Nov 10 '21 at 07:56
  • @Alban you are 100% right, I forgot to add allow_domains to the link extractor. I fixed it. You can first open the file and add the domain from each line to the rule. – SuperUser Nov 10 '21 at 08:27
  • Shouldn't the allow_domains be in the for loop to dynamically use the domains in the file? I see in your example that it's static – Alban Nov 10 '21 at 08:31
  • @Alban yes, I said it in the previous comment. – SuperUser Nov 10 '21 at 09:09

When you want to crawl a list of links, you have to pass them in the start_urls variable.

import scrapy


class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)

Notice the start_urls list: it automatically starts the crawling, so there is no need to tell Scrapy to read each URL. There is another method, in which you generate the requests in a start_requests function:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').getall():
            yield MyItem(title=h3)

        for href in response.xpath('//a/@href').getall():
            yield scrapy.Request(response.urljoin(href), self.parse)

Nevertheless, if your list of links is very big, you can load it into a data frame with pandas and loop over it, in case you do not want to hard-code the links in a list.

Cheers

Geomario
  • Hello, thanks for helping. In my case, the problem is that my list contains a different website in each row. Can you elaborate on how you would handle such a case? I wonder how to iterate over each domain/website of the list and run the spider with the allowed_domains/start_urls that are relevant for each website. – Alban Nov 10 '21 at 08:28
  • Can you provide an example of your CSV file? Additionally, you are mentioning rows and lists. A row is part of a data frame, and a list has a group of indexed items. – Geomario Nov 10 '21 at 08:36
  • I need to see the file you have from which you want to crawl the URLs. Otherwise it is complicated to tell you without having seen the data type. – Geomario Nov 10 '21 at 08:37
  • I just meant a list actually, in a text file with a domain on each line such as: domain1.com domain2.com domain3.com ... domain999.com – Alban Nov 10 '21 at 09:09
import pandas as pd
import scrapy


class Url_Spider(scrapy.Spider):
    name = 'url_page'

    def start_requests(self):
        # the CSV file holds one URL per row in a 'link' column
        df = pd.read_csv('list.csv')

        urls = df['link']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        """Parse here what u need."""
        pass
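
Given the question, the empty parse could be filled in along these lines (a sketch only; the external-link test is a naive host comparison, and list.csv with a 'link' column is assumed as above):

from urllib.parse import urlparse

# sketch: a possible parse body for the spider above
def parse(self, response):
    site = urlparse(response.url).netloc  # domain of the page being crawled
    seen = set()                          # dedup links within this page
    for href in response.css('a::attr(href)').getall():
        url = response.urljoin(href)
        host = urlparse(url).netloc
        # external = different host and not a subdomain of the crawled site
        if host and host != site and not host.endswith('.' + site) and url not in seen:
            seen.add(url)
            yield {'domain': site, 'page_url': response.url, 'external_link': url}
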
Geomario