1

I am using Portia to crawl the articles of a website. How can I get only the latest articles each day when I run the Portia spider?

My idea is to take the datetime from each article and compare it with the current datetime. Is there a better way?
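The date comparison described above could be sketched roughly as follows. This is only an illustration: the `date` field name and the `%Y-%m-%d %H:%M` format are assumptions, and they would need to match whatever the spider actually extracts from the site.

```python
from datetime import date, datetime

# Assumed format of the scraped date string; adjust to the real site.
DATE_FORMAT = "%Y-%m-%d %H:%M"


def published_on(item, day):
    """Return True if the item's (hypothetical) 'date' field falls on `day`."""
    published = datetime.strptime(item["date"], DATE_FORMAT).date()
    return published == day


def latest_articles(items, day=None):
    """Keep only the items published on `day` (defaults to today)."""
    day = day or date.today()
    return [item for item in items if published_on(item, day)]
```

One drawback of this approach is that every article page still has to be downloaded before its date can be checked.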

gangzi

1 Answer

2

It depends on how the website is structured, but if every article has its own URL, you can skip URLs already visited in previous crawls by using the deltafetch spider middleware.

To enable it, install scrapylib and add this to your settings.py:

SPIDER_MIDDLEWARES = {
    'scrapylib.deltafetch.DeltaFetch': 100,
}
DELTAFETCH_ENABLED = True
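To illustrate the general idea of skipping pages seen in earlier runs, here is a minimal standalone sketch. It is not the DeltaFetch implementation: the `SeenUrls` class, the `filter_new` helper and the JSON file used for storage are all hypothetical names chosen for this example.

```python
import hashlib
import json
import os


class SeenUrls:
    """Hypothetical store of URL hashes that persists between crawls."""

    def __init__(self, path):
        self.path = path
        self.keys = set()
        # Load hashes recorded by a previous run, if any.
        if os.path.exists(path):
            with open(path) as f:
                self.keys = set(json.load(f))

    @staticmethod
    def key(url):
        # Store a fixed-size hash of the URL instead of the URL itself.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def is_new(self, url):
        return self.key(url) not in self.keys

    def mark(self, url):
        self.keys.add(self.key(url))

    def save(self):
        with open(self.path, "w") as f:
            json.dump(sorted(self.keys), f)


def filter_new(urls, store):
    """Return the URLs not seen before and record them as seen."""
    fresh = [u for u in urls if store.is_new(u)]
    for u in fresh:
        store.mark(u)
    return fresh
```

Calling `filter_new` at the start of each run and `store.save()` at the end would leave only unseen article URLs to fetch, which is the effect the middleware aims for.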
David Bengoa