I'm using Scrapy, on Scrapinghub, to scrape a few thousand websites. When scraping a single website, request durations stay pretty short (< 100 ms).

But I also have a spider that is responsible for 'validating' around 10k URLs (I'm testing a bunch of different domains, with and without www.). All it does is scrape the homepage and discard the URL if the status isn't 200 or a redirect.

I've noticed that when running this spider several times in a row, I get inconsistent results (not the same number of items and requests).

When looking at the request logs, I can see that the request durations gradually climb, drop back down to lower values, then climb even higher, to the point of triggering a user timeout on some URLs.

I'm using CONCURRENT_REQUESTS values usually > 100 (I've tried 100, 200, 500 and 1000).

Here are the duration logs. Nothing times out here because there are only 100 URLs, but I need to run this validation on 10k URLs, and this duration instability is a worry:

    {"time": 1535517660373, "duration": 26, "status": 400}
    {"time": 1535517661582, "duration": 26, "status": 400}
    {"time": 1535517663724, "duration": 26, "status": 400}
    {"time": 1535517663897, "duration": 26, "status": 400}
    {"time": 1535517665046, "duration": 46, "status": 200}
    {"time": 1535517657573, "duration": 50, "status": 200}
    {"time": 1535517657615, "duration": 83, "status": 200}
    {"time": 1535517657616, "duration": 85, "status": 200}
    {"time": 1535517657822, "duration": 112, "status": 200}
    {"time": 1535517657831, "duration": 112, "status": 200}
    {"time": 1535517657816, "duration": 120, "status": 200}
    {"time": 1535517657837, "duration": 121, "status": 200}
    {"time": 1535517658470, "duration": 130, "status": 200}
    {"time": 1535517663093, "duration": 135, "status": 302}
    {"time": 1535517658133, "duration": 149, "status": 200}
    {"time": 1535517657862, "duration": 153, "status": 200}
    {"time": 1535517657933, "duration": 228, "status": 200}
    {"time": 1535517658362, "duration": 230, "status": 200}
    {"time": 1535517657946, "duration": 258, "status": 200}
    {"time": 1535517657989, "duration": 269, "status": 200}
    {"time": 1535517657967, "duration": 271, "status": 200}
    {"time": 1535517658108, "duration": 389, "status": 200}
    {"time": 1535517665893, "duration": 433, "status": 404}
    {"time": 1535517658142, "duration": 467, "status": 200}
    {"time": 1535517658350, "duration": 467, "status": 200}
    {"time": 1535517668501, "duration": 526, "status": 200}
    {"time": 1535517658216, "duration": 543, "status": 200}
    {"time": 1535517658312, "duration": 670, "status": 200}
    {"time": 1535517658342, "duration": 678, "status": 200}
    {"time": 1535517658347, "duration": 679, "status": 200}
    {"time": 1535517658291, "duration": 682, "status": 200}
    {"time": 1535517658345, "duration": 684, "status": 200}
    {"time": 1535517658310, "duration": 688, "status": 200}
    {"time": 1535517658333, "duration": 688, "status": 200}
    {"time": 1535517658336, "duration": 689, "status": 200}
    {"time": 1535517658317, "duration": 690, "status": 200}
    {"time": 1535517658314, "duration": 694, "status": 200}
    {"time": 1535517658339, "duration": 696, "status": 200}
    {"time": 1535517658319, "duration": 697, "status": 200}
    {"time": 1535517658315, "duration": 701, "status": 200}
    {"time": 1535517658349, "duration": 701, "status": 200}
    {"time": 1535517658322, "duration": 703, "status": 200}
    {"time": 1535517658327, "duration": 703, "status": 200}
    {"time": 1535517658377, "duration": 704, "status": 200}
    {"time": 1535517658309, "duration": 708, "status": 200}
    {"time": 1535517658376, "duration": 710, "status": 200}
    {"time": 1535517658374, "duration": 711, "status": 200}
    {"time": 1535517658335, "duration": 717, "status": 200}
    {"time": 1535517658344, "duration": 720, "status": 200}
    {"time": 1535517658338, "duration": 728, "status": 200}
    {"time": 1535517658372, "duration": 728, "status": 200}
    {"time": 1535517658324, "duration": 732, "status": 200}
    {"time": 1535517658360, "duration": 748, "status": 200}
    {"time": 1535517658341, "duration": 753, "status": 200}
    {"time": 1535517658396, "duration": 797, "status": 200}
    {"time": 1535517658408, "duration": 801, "status": 200}
    {"time": 1535517658529, "duration": 938, "status": 200}
    {"time": 1535517658579, "duration": 994, "status": 200}
    {"time": 1535517658607, "duration": 996, "status": 200}
    {"time": 1535517658604, "duration": 1001, "status": 200}
    {"time": 1535517658611, "duration": 1006, "status": 200}
    {"time": 1535517658606, "duration": 1022, "status": 200}
    {"time": 1535517658707, "duration": 1104, "status": 200}
    {"time": 1535517658634, "duration": 1110, "status": 200}
    {"time": 1535517658772, "duration": 1166, "status": 200}
    {"time": 1535517658859, "duration": 1236, "status": 200}
    {"time": 1535517658956, "duration": 1348, "status": 200}
    {"time": 1535517659025, "duration": 1358, "status": 200}
    {"time": 1535517658958, "duration": 1368, "status": 200}
    {"time": 1535517658959, "duration": 1373, "status": 200}
    {"time": 1535517658985, "duration": 1408, "status": 200}
    {"time": 1535517658960, "duration": 1426, "status": 200}
    {"time": 1535517659349, "duration": 1445, "status": 200}
    {"time": 1535517659469, "duration": 1583, "status": 200}
    {"time": 1535517659283, "duration": 1694, "status": 200}
    {"time": 1535517659278, "duration": 1712, "status": 200}
    {"time": 1535517659620, "duration": 2033, "status": 200}
    {"time": 1535517660588, "duration": 2400, "status": 200}
    {"time": 1535517660353, "duration": 2819, "status": 200}
    {"time": 1535517660756, "duration": 3194, "status": 200}
    {"time": 1535517660752, "duration": 3214, "status": 200}
    {"time": 1535517661403, "duration": 3216, "status": 200}
    {"time": 1535517660889, "duration": 3316, "status": 200}
    {"time": 1535517661535, "duration": 3371, "status": 200}
    {"time": 1535517661407, "duration": 3848, "status": 200}
    {"time": 1535517661966, "duration": 4436, "status": 200}
    {"time": 1535517662355, "duration": 4463, "status": 200}
    {"time": 1535517662153, "duration": 4613, "status": 200}
    {"time": 1535517662336, "duration": 4814, "status": 200}
    {"time": 1535517664132, "duration": 6594, "status": 200}
    {"time": 1535517681367, "duration": 23480, "status": 200}
    {"time": 1535517683665, "duration": 26104, "status": 200}
    {"time": 1535517685281, "duration": 27744, "status": 200}
    {"time": 1535517691127, "duration": 33598, "status": 200}
    {"time": 1535517692933, "duration": 35454, "status": 200}
    {"time": 1535517693278, "duration": 35764, "status": 200}
    {"time": 1535517693337, "duration": 35812, "status": 200}
    {"time": 1535517693972, "duration": 36459, "status": 200}
    {"time": 1535517694212, "duration": 36701, "status": 200}
    {"time": 1535517694576, "duration": 37071, "status": 200}

My spider:

from scrapy.spiders import Spider
from scrapy import Request
import pkgutil
from ...utils.parse import parse
from ...utils.errback_httpbin import errback_httpbin


class QuotesSpider(Spider):
    name = "validation_2"
    rotate_user_agent = True

    def start_requests(self):
        urls = pkgutil.get_data("qwarx_spiders", "resources/urls_100.txt").decode('utf-8').splitlines()
        for url in urls:
            yield Request(url=url, callback=self.parse, errback=self.errback_httpbin)

    def parse(self, response):
        return parse(self, response)

    def errback_httpbin(self, failure):
        return errback_httpbin(self, failure)

The parse method:

from ..items.broad import URL
from scrapy.exceptions import NotSupported


def getDomain(url):
    spltAr = url.split("://")
    i = (0, 1)[len(spltAr) > 1]
    dm = spltAr[i].split("?")[0].split('/')[0].split(':')[0].lower()
    return dm.replace('www.', '')


def parse(self, response):
    item = URL()
    id = {}

    id['url'] = response.url
    id['domain'] = getDomain(response.url)
    try:
        id['title'] = response.xpath("//title/text()").extract_first()
        if id['title'] is not None:
            id['title'] = id['title'].strip()
    except (AttributeError, NotSupported) as e:
        yield None

    meta_names = response.xpath("//meta/@name").extract()
    meta_properties = response.xpath("//meta/@property").extract()
    meta = {}
    content = {}

    if 'description' in meta_names:
        meta['description'] = response.xpath("//meta[@name='description']/@content").extract_first()
    else:
        if 'og:description' in meta_properties:
            meta['description'] = response.xpath("//meta[@property='og:description']/@content").extract_first()
        else:
            meta['description'] = ''

    if 'og:image' in meta_names:
        meta['image'] = response.xpath("//meta[@name='og:image']/@content").extract_first()
    else:
        if 'og:image' in meta_properties:
            meta['image'] = response.xpath("//meta[@property='og:image']/@content").extract_first()
        else:
            meta['image'] = ''

    content['p'] = response.xpath('//p/text()').extract_first()
    if content['p'] is not None:
        content['p'] = list(map(lambda x: x.strip()[:150], response.xpath('//p/text()').extract()))[:4]

        if 'redirect_urls' in response.meta:
            meta['redirect_urls'] = response.meta['redirect_urls']

    item['id'] = id
    item['content'] = content
    item['meta'] = meta

    yield item

The errback_httpbin function:

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


def errback_httpbin(self, failure):
    # log all errback failures,
    # in case you want to do something special for some errors,
    # you may need the failure's type
    self.logger.error(repr(failure))

    # if isinstance(failure.value, HttpError):
    if failure.check(HttpError):
        # you can get the response
        response = failure.value.response
        self.logger.error('HttpError on %s', response.url)

    # elif isinstance(failure.value, DNSLookupError):
    elif failure.check(DNSLookupError):
        # this is the original request
        request = failure.request
        self.logger.error('DNSLookupError on %s', request.url)

    # elif isinstance(failure.value, TimeoutError):
    elif failure.check(TimeoutError):
        request = failure.request
        self.logger.error('TimeoutError on %s', request.url)

settings.py:

SPIDER_MODULES = ['qwarx_spiders.spiders.broad', 'qwarx_spiders.spiders.custom', 'qwarx_spiders.spiders.validation']
NEWSPIDER_MODULE = 'qwarx_spiders.spiders'

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': True,
}

DOWNLOADER_MIDDLEWARES = {
    'qwarx_spiders.middlewares.FilterDomainbyLimitMiddleware': 200,
    'qwarx_spiders.middlewares.RotateUserAgentMiddleware': 110,
}

ITEM_PIPELINES = {
    'qwarx_spiders.pipelines.DuplicatesPipeline': 300,
}

EXTENSIONS = {
    'scrapy_dotpersistence.DotScrapyPersistence': 0
}

BOT_NAME = 'Qwarx'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 ' \
             '(KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.3'

ROBOTSTXT_OBEY = False
LOG_LEVEL = 'INFO'

CONCURRENT_REQUESTS = 1000
REACTOR_THREADPOOL_MAXSIZE = 1000

DOWNLOAD_DELAY = 0

COOKIES_ENABLED = False
REDIRECT_ENABLED = True
AJAXCRAWL_ENABLED = True
AUTOTHROTTLE_ENABLED = False
RETRY_ENABLED = True
DOWNLOAD_TIMEOUT = 60
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 100000

CRAWL_LIMIT_PER_DOMAIN = 100000

URLLENGTH_LIMIT = 180

USER_AGENT_CHOICES = [
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.62 Safari/537.36',
    'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; WOW64; Trident/6.0)',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20140205 Firefox/24.0 Iceweasel/24.3.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0',
    'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
]


  • Could you show your spider's source? Scrapy is asynchronous, and tracking response time can be a bit difficult if you don't know when it started. When you `yield Request` you only schedule it; who knows when it will be picked up by the downloader. Thus you need to do the actual tracking in downloader middlewares: mark the request when it's leaving the downloader and when it's coming back to it. – Granitosaurus Aug 30 '18 at 02:55
  • Thank you for looking into it, I just added my spider and settings code. Also, I went with very high numbers for CONCURRENT_REQUESTS and THREADPOOL_MAXSIZE, but even with lower numbers the result is more or less the same; it just goes quicker that way. – romain-lavoix Aug 30 '18 at 03:48
  • And one last thing: running this spider on my local env gives me consistent results (always the same number of scraped items). That's not the case when running on Scrapinghub. – romain-lavoix Aug 30 '18 at 04:09
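
Following Granitosaurus's comment about measuring at the downloader boundary, here is a minimal sketch of a downloader middleware that timestamps a request when it is handed towards the downloader and computes the elapsed time when the response comes back. The class name and the `latency_start` / `download_latency_ms` meta keys are illustrative and not part of the original project; Scrapy also records its own `download_latency` value in `response.meta` for downloaded responses, which may already be enough here.

    import time


    class RequestLatencyMiddleware(object):
        # Hypothetical middleware for rough latency tracking at the downloader
        # boundary, independent of how long requests sit in the scheduler.

        def process_request(self, request, spider):
            # Timestamp taken when the request passes through this middleware.
            request.meta['latency_start'] = time.time()

        def process_response(self, request, response, spider):
            start = request.meta.get('latency_start')
            if start is not None:
                # Elapsed download time in milliseconds; callbacks can read it
                # later through response.meta['download_latency_ms'].
                request.meta['download_latency_ms'] = int((time.time() - start) * 1000)
            return response

It would be enabled through DOWNLOADER_MIDDLEWARES like the other custom middlewares; a high order number keeps the measurement close to the actual download rather than the scheduling queue.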

1 Answer


So I found a solution to my problem.
I had a bunch of 'false negatives' when crawling lots of domains, meaning that when running the validation crawl on 10k URLs several times in a row, I would never get the same number of results.
However, I have since set up a rotating proxy system (through Crawlera), and the crawl is now completely stable.
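
For reference, this is roughly what enabling Crawlera looks like with the scrapy-crawlera downloader middleware; the order value follows the package's documentation, and the API key is a placeholder:

    DOWNLOADER_MIDDLEWARES = {
        'qwarx_spiders.middlewares.FilterDomainbyLimitMiddleware': 200,
        'qwarx_spiders.middlewares.RotateUserAgentMiddleware': 110,
        'scrapy_crawlera.CrawleraMiddleware': 610,
    }

    CRAWLERA_ENABLED = True
    CRAWLERA_APIKEY = '<crawlera-api-key>'  # placeholder

Routing the requests through rotating proxy IPs presumably prevents the target sites from throttling a single outbound IP, which would explain why the per-run item counts stabilized.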
