
I've created a script using Scrapy to recursively parse all the links from the left-hand sidebar of this webpage. Recursion is necessary because most of the links there have sublinks of their own, and so on.

The following script appears to scrape all the links accordingly. However, what I can't do is reuse the links from `unique_links` within the `parse_content` method. If I try to use the links while the recursion is still going on, the script ends up processing lots of duplicate links in `parse_content`. I've added an imaginary block of code after the two comment lines within the `parse` method to show what I wish to do.

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.amazon.de/-/en/gp/bestsellers/automotive/ref=zg_bs_nav_0"]
    unique_links = set()

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,callback=self.parse,dont_filter=True)

    def parse(self,response):
        soup = BeautifulSoup(response.text,"lxml")
        link_list = []
        for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
            item_link = item.get("href")
            link_list.append(item_link)
            self.unique_links.add(item_link)  # a set has no append(); use add()
            
        #THE FOLLOWING IS SOMETHING I WANTED TO DO WITH THE `unique_links` IF I COULD EXECUTE THE FOLLOWING BLOCK
        #AFTER ALL THE LINKS ARE STORED IN `unique_links`

        for new_link in self.unique_links:
            yield scrapy.Request(new_link,callback=self.parse_content,dont_filter=True)

    def parse_content(self,response):
        soup = BeautifulSoup(response.text,"lxml")
        for item in soup.select("span.a-list-item > .a-section a.a-link-normal"):
            print(item.get("href"))

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(mySpider)
    c.start()

How can I reuse the links in `unique_links` within the `parse_content` method?

EDIT: I'm terribly sorry if I still haven't been able to clarify what I want to achieve. In any case, this is how I solved it. Any better approach is welcome.

class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.amazon.de/-/en/gp/bestsellers/automotive/ref=zg_bs_nav_0"]
    unique_links = set()


    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,callback=self.parse,dont_filter=True)

    def parse(self,response):
        soup = BeautifulSoup(response.text,"lxml")
        link_list = []
        for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
            item_link = item.get("href")
            link_list.append(item_link)
            if item_link not in self.unique_links:
                yield scrapy.Request(item_link,callback=self.parse_content,dont_filter=True)
            self.unique_links.add(item_link)

        for new_link in link_list:
            yield scrapy.Request(new_link,callback=self.parse,dont_filter=True)

    def parse_content(self,response):
        # soup = BeautifulSoup(response.text,"lxml")
        # for item in soup.select("span.a-list-item > .a-section a.a-link-normal"):
        #     print(item.get("href"))
        print("------>",response.url)

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(mySpider)
    c.start()
SMTH
  • Does this answer your question? [Running code when Scrapy spider has finished crawling](https://stackoverflow.com/questions/17363458/running-code-when-scrapy-spider-has-finished-crawling) – Nja Oct 13 '20 at 08:14
  • Nope, not at all. I'll reuse the links within the script when it has the ability to print the links. – SMTH Oct 13 '20 at 08:18
  • I think you can do that with what is suggested here: https://doc.scrapy.org/en/latest/topics/extensions.html . Define and enable a custom extension. In particular, it explains how to capture the spider close signal. A logging function is run after crawling ends; you can replace that with your custom code. – Nja Oct 13 '20 at 08:29
  • You are leading me down the wrong route. I do not want to close the spider after collecting those links; rather, I wish to reuse the links within the script for further processing. I edited my question to reflect the same. Btw, what you are suggesting can be done using `closed(self, reason)` without making use of signals. Thanks. – SMTH Oct 13 '20 at 08:36
  • Which kind of processing do you need to do? You would close the spider when *all* the links, even those produced by recursive requests, have been completed. That code wouldn't close the spider but would capture when it is closed by the fact that no more links to scrape are available. – Nja Oct 13 '20 at 08:41
  • Please check out the edit @Nja. Thanks. – SMTH Oct 13 '20 at 12:38
  • Thank you, now it's clear. So your requirement is that you need to post-process all the links in `unique_links` at once? Checking inside the `parse` method whether a specific link is unique and doing the post-processing right there wouldn't be OK, would it? – Nja Oct 13 '20 at 12:58
  • Yes, you got it right @Nja. Thanks. – SMTH Oct 13 '20 at 13:01
  • @SMTH Why did you set `dont_filter=True`? Scrapy has a dupefilter enabled by default that works more efficiently than your `unique_links` set. – Georgiy Oct 13 '20 at 16:31
  • Do you suggest I get rid of `dont_filter=True` from all the methods @Georgiy? I appreciate any corrected approach. Btw, am I not doing things the right way now, other than the `dont_filter=True` thing? – SMTH Oct 13 '20 at 16:42
  • To let you know: I'm using proxies in the middleware and the requests go through them, so I use `dont_filter=True` because, as you know, not all the proxies are working ones and it is sometimes necessary for the script to retry a single link a couple of times to get a response. – SMTH Oct 13 '20 at 16:46
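
A minimal sketch of the approach these last comments point at, assuming the dead proxies surface as failed requests that Scrapy's built-in RetryMiddleware can retry: drop `dont_filter=True` and the manual `unique_links` set, merge the recursion and the per-page work into a single callback (each URL is requested only once), and let the default dupefilter discard duplicate URLs. The spider name and retry values are illustrative, not from the post.

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess


class DedupSpider(scrapy.Spider):  # hypothetical name, not from the post
    name = "dedupspider"
    start_urls = ["https://www.amazon.de/-/en/gp/bestsellers/automotive/ref=zg_bs_nav_0"]
    custom_settings = {
        "RETRY_ENABLED": True,  # RetryMiddleware re-issues failed requests on its own
        "RETRY_TIMES": 5,       # illustrative: retries per failing request
    }

    def parse(self, response):
        # Per-page work goes here (the parse_content logic), since each URL is
        # only requested once when the dupefilter is left enabled.
        print("------>", response.url)
        soup = BeautifulSoup(response.text, "lxml")
        for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
            # No dont_filter: the default dupefilter drops duplicate URLs, while
            # failed requests (e.g. through dead proxies) are still retried by
            # RetryMiddleware, which bypasses the dupefilter for its retries.
            yield response.follow(item.get("href"), callback=self.parse)


if __name__ == "__main__":
    c = CrawlerProcess({"USER_AGENT": "Mozilla/5.0", "LOG_LEVEL": "ERROR"})
    c.crawl(DedupSpider)
    c.start()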

3 Answers

class mySpider(scrapy.Spider):
    def closed(self, reason):
        do_something()  # placeholder: post-process self.unique_links here

Scrapy calls a spider's `closed(self, reason)` method right before the spider closes, so the code above should work.
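
For illustration, a minimal sketch of how this could slot into the spider from the question (my assumption; the omitted methods stay exactly as posted, and the final comment marks where the post-processing would go):

import scrapy

class mySpider(scrapy.Spider):
    name = "myspider"
    unique_links = set()

    # ... start_requests / parse / parse_content exactly as in the question ...

    def closed(self, reason):
        # Scrapy calls this once the spider has finished, i.e. after every
        # recursive request has been handled, so unique_links is final here.
        print(f"Spider closed ({reason}); {len(self.unique_links)} unique links collected")
        # post-process self.unique_links here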

Michael Savchenko

You declared the `unique_links` set as a class variable:

class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.amazon.de/-/en/gp/bestsellers/automotive/ref=zg_bs_nav_0"]
    unique_links = set()

This means you can access the variable after scraping has ended (`CrawlerProcess.start()` blocks until the crawl is finished):

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
    })
    c.crawl(mySpider)
    c.start()
    unique_lists_from_spider = mySpider.unique_links
    print(unique_lists_from_spider)
Georgiy

As suggested in the [Scrapy extensions](https://doc.scrapy.org/en/latest/topics/extensions.html) documentation, you can connect your spider to a Scrapy signal. The signal we're interested in is `spider_closed`.

The code below works:

import scrapy
from bs4 import BeautifulSoup
from scrapy.crawler import CrawlerProcess
from scrapy import signals #import signals

class mySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ["https://www.amazon.de/-/en/gp/bestsellers/automotive/ref=zg_bs_nav_0"]
    unique_links = set()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(mySpider, cls).from_crawler(crawler, *args, **kwargs)
        # connect your spider to the spider_closed signal
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url,callback=self.parse,dont_filter=True)

    def parse(self,response):
        soup = BeautifulSoup(response.text,"lxml")
        link_list = []
        for item in soup.select("li:has(> span.zg_selected) + ul > li > a[href]"):
            item_link = item.get("href")
            link_list.append(item_link)
            self.unique_links.add(item_link) #set method is add
            
        for new_link in self.unique_links:
            yield scrapy.Request(new_link,callback=self.parse_content,dont_filter=True)

    def parse_content(self,response):
        soup = BeautifulSoup(response.text,"lxml")
        # for item in soup.select("span.a-list-item > .a-section a.a-link-normal"):
        #     print(item.get("href"))

    # code executed after spider ends crawling
    def spider_closed(self, spider):
        print("[*] Spider closed: no more links to process")
        print("[*] Unique collected links are:")
        print(spider.unique_links)
        ###
        # Put here the block of code to run on links after they were all collected
        ###

if __name__ == "__main__":
    c = CrawlerProcess({
        'USER_AGENT':'Mozilla/5.0',
        'LOG_LEVEL':'ERROR',
    })
    c.crawl(mySpider)
    c.start()
Nja