
I'm writing a spider with Scrapy to crawl a website. The index page is a list of links like www.link1.com, www.link2.com, www.link3.com, and the site is updated really often, so my crawler is part of a process that runs every hour, but I would like to crawl only the new links that I haven't crawled yet. My problem is that Scrapy randomises the order in which it treats each link when going deep. Is it possible to force Scrapy to crawl in order? Like 1, then 2, and then 3, so that I can save the last link I've crawled and, when starting the process again, just compare link 1 with the former link 1?

Hope this is understandable, sorry for my poor English.

Thanks in advance.

EDIT :

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SymantecSpider(CrawlSpider):

    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    start_urls = [
        'http://www.symantec.com/security_response/landing/vulnerabilities.jsp'
        ]
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="mrgnMD"]/following-sibling::table')), callback='parse_item')]

    def parse_item(self, response):
        open("test.t", "ab").write(response.url + "\n")
  • Why don't you just save the links that you scrape somewhere, then check them later to make sure you don't scrape the same site twice? –  Jul 26 '12 at 15:34
  • Hm, because the list of links I crawl is pretty long and my bot is meant to be running for a long time, so I might have to save a very large amount of links, and then the comparison between saved links and new ones might take a very long time in a few months – Nils Jul 26 '12 at 15:38
  • In fact, there are actually 1400 links on the page, so what I would like to do is crawl them all the first time, but then when my spider is recalled, like 1 hour later, just check if there's a new link and crawl it if there is – Nils Jul 26 '12 at 15:49
  • One thing you might do is, every time you scrape the page, get a hash code of the content (the list of links). Then in an hour, read the content again to get another hash code, and if the hashes are the same, don't scrape at all. Another thing you can do is save all the scraped values in a SQLite database and query that database every time you scrape. This will be faster than you think for only 1400 links. That way, you can also save useful data in the database, like when a link was scraped, so you can re-scrape it every week for example (a rough sketch of this idea follows these comments). –  Jul 26 '12 at 16:11
  • Thanks a lot for your answers, I'll try that. Just to be sure, there's no way to force Scrapy to treat the links in the order they appear on the page? This would save me a lot of time if I could just do that! This is my starting url: http://www.symantec.com/security_response/landing/vulnerabilities.jsp and I'd like Scrapy to crawl the links below Vulnerabilities and treat them in the order they appear, but it seems that my callback_method doesn't want to do that =) – Nils Jul 26 '12 at 16:27
  • Maybe there is, I don't know. I thought I'd share how I would approach the problem. Unfortunately, scrapy is not a very active tag on this site, and that's why you haven't gotten a single answer yet (answers appear below, here we are just posting comments). But hopefully someone will come along and give you a proper answer. I will add the Python tag to your question so you can get some more views (and because this is a Python question). –  Jul 26 '12 at 16:41
  • Ok, thanks a lot for your time and answers, I will post the answer or the workaround I use for this as soon as I solve the problem – Nils Jul 26 '12 at 17:34
  • I think the links are scraped in order, it's just callbacks are called out of order, because some requests are downloaded faster. See also [this question](http://stackoverflow.com/q/6566322/248296) – warvariuc Jul 27 '12 at 07:26
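
A rough sketch of the hash-and-SQLite idea from the comment above; the database file name, table name and helper function names here are illustrative assumptions, not something given in this thread:

import hashlib
import sqlite3

def page_fingerprint(html):
    # Hash the raw index page; if the hash matches the one from the previous
    # run, nothing on the page changed and there is nothing new to crawl.
    return hashlib.sha1(html).hexdigest()

def filter_new_links(links, db_path='seen_links.db'):
    # Return only the links that have not been stored yet, and remember them
    # together with the time they were first seen.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY, first_seen TEXT)")
    new_links = []
    for url in links:
        if conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone() is None:
            conn.execute("INSERT INTO seen (url, first_seen) VALUES (?, datetime('now'))", (url,))
            new_links.append(url)
    conn.commit()
    conn.close()
    return new_links

On the next run, passing the freshly extracted link list through filter_new_links would yield only the links that have appeared since the previous crawl, so for 1400 links the lookups stay cheap.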

2 Answers


Try this example.
Construct a list and append all the links to it.
Then pop them one by one to get your requests in order.

I recommend doing something like @Hassan mentioned and piping your contents to a database.

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy import log


class SymantecSpider(BaseSpider):
    name = 'symantecSpider'
    allowed_domains = ['symantec.com']
    allLinks = []
    base_url = "http://www.symantec.com"

    def start_requests(self):
        return [Request('http://www.symantec.com/security_response/landing/vulnerabilities.jsp', callback=self.parseMgr)]

    def parseMgr(self, response):
        # Grab all the links and store them as a single list inside allLinks
        hxs = HtmlXPathSelector(response)
        self.allLinks.append(hxs.select(
            "//table[@class='defaultTableStyle tableFontMD tableNoBorder']/tbody/tr/td[2]/a/@href"
        ).extract())
        return Request(self.base_url + self.allLinks[0].pop(0), callback=self.pageParser)

    # Cycle through the allLinks[] in order
    def pageParser(self, response):
        log.msg('response: %s' % response.url, level=log.INFO)
        return Request(self.base_url + self.allLinks[0].pop(0), callback=self.pageParser)
user1460015
  • Don't you have to return a `Request` in `parseMgr` for each link? – Hakim Jan 07 '14 at 16:04
  • Good idea (I got it)! `parseMgr` handles only the first link, and then it's `pageParser` that handles the rest. – Hakim Jan 07 '14 at 16:30
  • You should just add a condition `if not empty list` in `pageParser` for the last link in the list (see the sketch below). – Hakim Jan 07 '14 at 16:45
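
A minimal sketch of the guard suggested in that last comment, reusing the allLinks, base_url and pageParser names from the answer above; only the emptiness check is new:

    def pageParser(self, response):
        log.msg('response: %s' % response.url, level=log.INFO)
        # Only schedule the next request while links remain; without this check
        # the last callback would raise an IndexError on pop().
        if self.allLinks[0]:
            return Request(self.base_url + self.allLinks[0].pop(0), callback=self.pageParser)

When the list runs out, the spider simply stops returning requests and the crawl finishes cleanly.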

SgmlLinkExtractor will extract links in the same order they appear on the page.

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

links = SgmlLinkExtractor(
    restrict_xpaths='//div[@class="mrgnMD"]/following-sibling::table',
).extract_links(response)

You can use them in the rules in your CrawlSpider:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ThreatSpider(CrawlSpider):
    name = 'threats'
    start_urls = [
        'http://www.symantec.com/security_response/landing/vulnerabilities.jsp',
    ]
    rules = (Rule(SgmlLinkExtractor(
                restrict_xpaths='//div[@class="mrgnMD"]/following-sibling::table'),
                callback='parse_threats'),)
Steven Almeroth
  • Hi, thank you for your answer. I've tried your code, but it appears that the order of the urls is still wrong, did I miss something? Here's the code I'm using: – Nils Jul 26 '12 at 20:07
  • I'll edit my question with the code, can't figure out how to post code in a comment :X – Nils Jul 26 '12 at 20:31