
I'm learning Scrapy and I have a small project.

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield response.follow(link, self.parse)

    if some_condition:
        yield {'url': response.url}  # Store some data

So I open a page, extract all the links from it, and store some data if the page contains the data I'm after. The problem: Scrapy filters out requests for URLs it has already seen, so once I have processed http://example.com/some_page, it will be skipped the next time it's encountered. My task is to process it again anyway: I want to know that the page has already been processed, because in that case I need to store some other data. It should be something like:

def parse(self, response):
    if is_duplicate:
        yield {}  # Store some other data
    else:
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse)

        if some_condition:
            yield {'url': response.url}  # Store some data
GhostKU

1 Answer


First, you need to keep track of the links you have already visited, and second, you have to tell Scrapy that you want to visit the same pages repeatedly.

Change the code this way:

import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    # ... name, start_urls, etc. stay as in your project

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.visited_links = set()  # URLs processed during this crawl

    def parse(self, response):
        if response.url in self.visited_links:
            yield {}  # Store some other data
        else:
            self.visited_links.add(response.url)

            links = LinkExtractor().extract_links(response)
            for link in links:
                # dont_filter=True bypasses Scrapy's duplicate-request filter
                yield response.follow(link, self.parse, dont_filter=True)

            if some_condition:
                yield {'url': response.url}  # Store some data

The visited_links set added in the constructor keeps track of the links you have already visited. (Here I assume your spider class is named MySpider; you didn't share that part of the code.) In parse, you first check whether the link has already been visited, i.e. whether its URL is in the visited_links set. If not, you add it to the set, and when yielding a new Request (via response.follow), you pass dont_filter=True to instruct Scrapy not to filter out the duplicate request.
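As a quick way to try this end to end, here is a minimal runner sketch; it assumes MySpider also defines name and start_urls (which weren't shown above) and uses Scrapy's standard CrawlerProcess and FEEDS APIs to run the spider from a plain script and write items to JSON:

# Minimal runner sketch; assumes MySpider defines name and start_urls.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Write every yielded item to a local JSON file.
    'FEEDS': {'items.json': {'format': 'json'}},
})
process.crawl(MySpider)
process.start()  # Blocks until the crawl finishes.

Keep in mind that visited_links lives in memory, so it resets on every run; dont_filter=True only bypasses the per-crawl duplicate filter and persists nothing between runs.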

Tomáš Linhart