
I would like my spider to crawl the start_urls website entirely before following links more deeply into other websites.

The crawler's aim is to find expired domains.

For example, if I create a page with 500 URLs (450 expired and 50 active websites), the crawler must insert every URL into the database before following any of them.

Currently, the crawler follows the first live website it finds and stops crawling the start_urls website.

This is my configuration:

self.custom_settings = {
    'RETRY_ENABLED': False,
    'DEPTH_LIMIT' : 0,
    'DEPTH_PRIORITY' : 1,
    'CONCURRENT_REQUESTS_PER_DOMAIN' : 64,
    'CONCURRENT_REQUESTS' : 128,
    'REACTOR_THREADPOOL_MAXSIZE' : 30,
}

Settings:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
LOG_LEVEL = 'INFO'
DUPEFILTER_CLASS = 'dirbot.custom_filters.BLOOMDupeFilter'

Crawler :

rules = (
    Rule(LxmlLinkExtractor(allow=('.com', '.fr', '.net', '.org', '.info', '.casino', '.co'),
        deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free')),
        callback='parse_obj',
        process_request='add_errback',
        follow=True),
)

from twisted.internet.error import DNSLookupError
import tldextract

def add_errback(self, request):
    # attach an errback to every request produced by the Rule
    return request.replace(errback=self.errback_httpbin)

def errback_httpbin(self, failure):
    # a DNS lookup failure is a strong signal the domain is expired
    if failure.check(DNSLookupError):
        request = failure.request
        ext = tldextract.extract(request.url)
        domain = ext.registered_domain
        if domain != '':
            self.checkDomain(domain)
  • What happens if you specify your settings in the `settings.py` file, just for testing? I think `custom_settings` needs to be a class attribute; please confirm by assigning those settings in `settings.py` directly. – eLRuLL Mar 17 '16 at 14:56
  • You want me to put the custom_settings in settings.py? I think my custom_settings is already read when my spider starts. – Pixel Mar 17 '16 at 15:10
  • You're right! I changed where self.custom_settings is defined and now everything is OK. – Pixel Mar 17 '16 at 15:18
  • Hope I helped; I'll add the answer then. – eLRuLL Mar 17 '16 at 15:18

1 Answer


custom_settings needs to be defined as a class attribute in order to override the project settings in settings.py.
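As a minimal sketch of why the placement matters (plain Python, spider names hypothetical): Scrapy reads `custom_settings` from the spider *class*, before `__init__` ever runs, so an instance attribute assigned inside `__init__` is never seen:

```python
class RightSpider:
    # class attribute: visible through the class itself, before any
    # instance exists, which is when Scrapy merges spider settings
    custom_settings = {'DEPTH_PRIORITY': 1}

class WrongSpider:
    def __init__(self):
        # instance attribute: only set when __init__ runs, which is
        # after settings have already been merged -- it is ignored
        self.custom_settings = {'DEPTH_PRIORITY': 1}

# What the framework effectively does when collecting spider settings:
print(getattr(RightSpider, 'custom_settings', None))  # {'DEPTH_PRIORITY': 1}
print(getattr(WrongSpider, 'custom_settings', None))  # None
```

So moving the dictionary out of `__init__` and onto the class body is exactly the fix the comments describe.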

  • Do you know why the spider doesn't continue crawling after the first page? The crawler still stops even with DEPTH_LIMIT: 0. – Pixel Mar 17 '16 at 15:24