I would like my spider to crawl the start_urls website entirely before following links more deeply. The crawler's goal is to find expired domains.

For example, I create a page with 500 URLs (450 expired and 50 active websites); the crawler must insert every URL into the database before following any of them. Currently, the crawler follows the first live website it finds and stops crawling the start_urls website.
This is my configuration:

self.custom_settings = {
    'RETRY_ENABLED': False,
    'DEPTH_LIMIT': 0,
    'DEPTH_PRIORITY': 1,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 64,
    'CONCURRENT_REQUESTS': 128,
    'REACTOR_THREADPOOL_MAXSIZE': 30,
}
Settings:
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
LOG_LEVEL = 'INFO'
DUPEFILTER_CLASS = 'dirbot.custom_filters.BLOOMDupeFilter'
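
For reference, this is a minimal, self-contained sketch of how I understand the breadth-first settings are supposed to fit together (the spider name, start URL and callback below are placeholders, not my real project):

from scrapy.spiders import CrawlSpider

class ExampleSpider(CrawlSpider):
    name = 'example'
    start_urls = ['http://example.com/domain-list.html']

    # custom_settings has to be a class attribute: Scrapy reads it before
    # the spider is instantiated.
    custom_settings = {
        'RETRY_ENABLED': False,
        'DEPTH_LIMIT': 0,
        # A positive DEPTH_PRIORITY together with FIFO queues switches the
        # scheduler from the default depth-first order to breadth-first.
        'DEPTH_PRIORITY': 1,
        'SCHEDULER_DISK_QUEUE': 'scrapy.squeues.PickleFifoDiskQueue',
        'SCHEDULER_MEMORY_QUEUE': 'scrapy.squeues.FifoMemoryQueue',
    }

    def parse_obj(self, response):
        # placeholder for the real parsing / database insert
        pass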
Crawler:

rules = (
    Rule(
        LxmlLinkExtractor(
            # allow/deny values are regular expressions matched against the URL
            allow=('.com', '.fr', '.net', '.org', '.info', '.casino', '.co'),
            deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free')),
        callback='parse_obj',
        process_request='add_errback',
        follow=True),
)
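
To sanity-check the allow/deny patterns outside of a crawl, they can be exercised against a static page like this (the HTML and URLs below are just stand-ins):

from scrapy.http import HtmlResponse
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

extractor = LxmlLinkExtractor(
    allow=('.com', '.fr', '.net', '.org', '.info', '.casino', '.co'),
    deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free'))

html = (b'<a href="http://expired-example.com/">expired</a>'
        b'<a href="http://www.facebook.com/page">denied</a>')
response = HtmlResponse(url='http://example.com/domain-list.html',
                        body=html, encoding='utf-8')

for link in extractor.extract_links(response):
    print(link.url)  # the facebook URL is dropped by the deny patterns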
def add_errback(self, request):
    return request.replace(errback=self.errback_httpbin)
def errback_httpbin(self, failure):
    # A DNS lookup failure suggests the domain is no longer registered.
    if failure.check(DNSLookupError):
        request = failure.request
        ext = tldextract.extract(request.url)
        domain = ext.registered_domain
        if domain != '':
            self.checkDomain(domain)
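
checkDomain is my own helper that stores the candidate expired domain. A hypothetical, simplified sketch of that kind of method (the sqlite3 file and table name are illustrative only, not my actual storage code):

import sqlite3

def checkDomain(self, domain):
    # Hypothetical sketch: persist a domain whose DNS lookup failed,
    # i.e. a candidate expired domain. The real database layer may differ.
    conn = sqlite3.connect('domains.db')
    conn.execute('CREATE TABLE IF NOT EXISTS expired_domains (domain TEXT UNIQUE)')
    conn.execute('INSERT OR IGNORE INTO expired_domains (domain) VALUES (?)', (domain,))
    conn.commit()
    conn.close()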