
I'm learning Scrapy and I have a small project.

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield response.follow(link, self.parse)

    if some_condition:
        yield {'url': response.url}  # Store some data

So I open a page, extract all the links from it, and store some data if the page contains the data I'm after. The problem: Scrapy filters out requests for URLs it has already seen, so once I have processed http://example.com/some_page, it will be skipped the next time it's encountered. My task is to process it again anyway: I want to know that the page has already been processed, because in that case I need to store some other data. It should be something like:

def parse(self, response):
    if is_duplicate:
        yield {}  # Store some other data
    else:
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse)

        if some_condition:
            yield {'url': response.url}  # Store some data
GhostKU

1 Answer


First, you need to keep track of the links you have already visited, and second, you have to tell Scrapy that you want to visit the same pages repeatedly.

Change the code this way:

import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    # ... name, start_urls, etc. stay as in your project

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.visited_links = set()  # URLs processed during this crawl

    def parse(self, response):
        if response.url in self.visited_links:
            yield {}  # Store some other data
        else:
            self.visited_links.add(response.url)

            links = LinkExtractor().extract_links(response)
            for link in links:
                # dont_filter=True bypasses Scrapy's duplicate-request filter
                yield response.follow(link, self.parse, dont_filter=True)

            if some_condition:
                yield {'url': response.url}  # Store some data

The visited_links set added in the constructor keeps track of the links you have already visited. (Here I assume your spider class is named MySpider; you didn't share that part of the code.) In parse, you first check whether the link has already been visited, i.e. whether its URL is in the visited_links set. If not, you add it to the set, and when yielding a new Request (via response.follow), you pass dont_filter=True to instruct Scrapy not to filter out the duplicate request.
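As a quick way to try this end to end, here is a minimal runner sketch; it assumes MySpider also defines name and start_urls (which weren't shown above) and uses Scrapy's standard CrawlerProcess and FEEDS APIs to run the spider from a plain script and write items to JSON:

# Minimal runner sketch; assumes MySpider defines name and start_urls.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    # Write every yielded item to a local JSON file.
    'FEEDS': {'items.json': {'format': 'json'}},
})
process.crawl(MySpider)
process.start()  # Blocks until the crawl finishes.

Keep in mind that visited_links lives in memory, so it resets on every run; dont_filter=True only bypasses the per-crawl duplicate filter and persists nothing between runs.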

Tomáš Linhart