I'm writing a spider with Scrapy to crawl a website. The index page is a list of links such as www.link1.com, www.link2.com, www.link3.com. The site is updated very often, so my crawler is part of a process that runs every hour, and I would like to crawl only the new links that I haven't crawled yet. My problem is that Scrapy does not process the links in a predictable order when it follows them. Is it possible to force Scrapy to crawl in order, i.e. link 1, then link 2, then link 3? That way I could save the last link I crawled and, when the process starts again, just compare the current link 1 with the link 1 from the previous run.
I hope this is understandable, and sorry for my poor English.
Thanks in advance for any response.
EDIT:
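from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SymantecSpider(CrawlSpider):
    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    start_urls = [
        'http://www.symantec.com/security_response/landing/vulnerabilities.jsp'
    ]
    # Follow only the links in the table that comes after the "mrgnMD" div
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="mrgnMD"]/following-sibling::table')), callback='parse_item')]

    def parse_item(self, response):
        # Append the URL of every crawled page to a log file
        with open("test.t", "a") as f:
            f.write(response.url + "\n")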
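
To make the goal concrete, here is a minimal sketch of the comparison step described in the question. It is not tied to Scrapy and rests on some assumptions: the index page lists the newest link first, the links are available as plain URL strings in page order, and the state file name `last_seen.txt` and all three function names are hypothetical.

```python
import os

STATE_FILE = "last_seen.txt"  # hypothetical file holding the newest URL of the last run


def load_last_seen(path):
    """Return the URL saved by the previous run, or None if there is none."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return f.read().strip() or None


def save_last_seen(path, url):
    """Remember the newest URL so the next run knows where to stop."""
    with open(path, "w") as f:
        f.write(url)


def select_new_links(urls, last_seen):
    """Return the URLs listed before last_seen (assumes newest first).

    If last_seen is None or is no longer on the page, every URL is returned.
    """
    new_urls = []
    for url in urls:
        if url == last_seen:
            break
        new_urls.append(url)
    return new_urls
```

For example, `select_new_links(["www.link1.com", "www.link2.com", "www.link3.com"], "www.link2.com")` keeps only `www.link1.com`. With this approach the order in which the pages are later downloaded would no longer matter, because the filtering happens on the list taken from the index page. Something similar could probably be adapted to the `process_links` argument of `Rule`, which receives the extracted links before the requests are scheduled, but that part is untested here.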