I'm learning Scrapy and I have a small project.
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    # Follow every link found on the page
    links = LinkExtractor().extract_links(response)
    for link in links:
        yield response.follow(link, self.parse)
    if some_condition:
        yield {'url': response.url}  # Store some data
So I open a page, get all the links from it, and store some data if the page has anything I need. The problem is that once I have processed http://example.com/some_page, Scrapy will skip that URL the next time it comes up. My task is to process it again anyway: I want to know that the page has already been processed, because in that case I need to store some other data. It should be something like:
def parse(self, response):
    if is_duplicate:
        yield {}  # Store some other data
    else:
        links = LinkExtractor().extract_links(response)
        for link in links:
            yield response.follow(link, self.parse)
        if some_condition:
            yield {'url': response.url}  # Store some data
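To make what I mean more concrete, here is a rough sketch of the kind of spider I have in mind (I'm not sure this is the right approach): I keep my own visited set on the spider and pass dont_filter=True to response.follow so Scrapy's duplicate filter doesn't silently drop the repeated requests. The visited attribute and the some_condition check are placeholders I made up, not anything built into Scrapy:

import scrapy
from scrapy.linkextractors import LinkExtractor

class MySpider(scrapy.Spider):
    name = 'my_spider'
    start_urls = ['http://example.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.visited = set()  # URLs this spider has already parsed (my own attribute)

    def parse(self, response):
        if response.url in self.visited:
            # This page was processed before
            yield {}  # Store some other data
            return

        self.visited.add(response.url)
        links = LinkExtractor().extract_links(response)
        for link in links:
            # dont_filter=True so Scrapy's built-in duplicate filter
            # does not skip pages I want to visit again
            yield response.follow(link, self.parse, dont_filter=True)

        some_condition = bool(response.css('.data'))  # placeholder for my real check
        if some_condition:
            yield {'url': response.url}  # Store some data

The idea is that keeping the set in spider memory lets me decide inside parse whether the page is a duplicate, instead of relying on the dupefilter, but I don't know if this is the idiomatic way to do it in Scrapy.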