
I've got a set of 25,000+ URLs that I need to scrape. I'm consistently seeing that after about 22,000 URLs the crawl rate drops drastically.

Take a look at these log lines to get some perspective:

2016-04-18 00:14:06 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:15:06 [scrapy] INFO: Crawled 5324 pages (at 5324 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:16:06 [scrapy] INFO: Crawled 9475 pages (at 4151 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:17:06 [scrapy] INFO: Crawled 14416 pages (at 4941 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:18:07 [scrapy] INFO: Crawled 20575 pages (at 6159 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:19:06 [scrapy] INFO: Crawled 22036 pages (at 1461 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:20:06 [scrapy] INFO: Crawled 22106 pages (at 70 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:21:06 [scrapy] INFO: Crawled 22146 pages (at 40 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:22:06 [scrapy] INFO: Crawled 22189 pages (at 43 pages/min), scraped 0 items (at 0 items/min)
2016-04-18 00:23:06 [scrapy] INFO: Crawled 22229 pages (at 40 pages/min), scraped 0 items (at 0 items/min)

Here are my settings:

# -*- coding: utf-8 -*-

BOT_NAME = 'crawler'

SPIDER_MODULES = ['crawler.spiders']
NEWSPIDER_MODULE = 'crawler.spiders'

CONCURRENT_REQUESTS = 10
REACTOR_THREADPOOL_MAXSIZE = 100
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
RETRY_ENABLED = False
DOWNLOAD_TIMEOUT = 15
DNSCACHE_ENABLED = True
DNSCACHE_SIZE = 1024000
DNS_TIMEOUT = 10
DOWNLOAD_MAXSIZE = 1024000 # ~1 MB
DOWNLOAD_WARNSIZE = 819200 # 800 KB
REDIRECT_MAX_TIMES = 3
METAREFRESH_MAXDELAY = 10
ROBOTSTXT_OBEY = True
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36' #Chrome 41

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

#DOWNLOAD_DELAY = 1
#AUTOTHROTTLE_ENABLED = True
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 604800 # 7 days
COMPRESSION_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 550,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'crawler.middlewares.RandomizeProxies': 740,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware': 830,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}

PROXY_LIST = '/etc/scrapyd/proxy_list.txt'

Some observations:

  • Memory and CPU consumption is less than 10%
  • tcptrack shows no unusual network activity
  • iostat shows negligible disk I/O

What can I look at to debug this?

  • Have you tried changing your logging level to see if anything unexpected is happening? – RattleyCooper Apr 20 '16 at 18:06
  • Were those URLs on the same site(s)? Perhaps you got rate-limited by that/those site(s) after 22,000 hits? Try scraping from multiple different IP addresses and see if it isn't faster. Try asking those sites to whitelist your IP for scraping. (I presume your own ISP or network itself isn't rate-limiting you.) – smci Apr 20 '16 at 18:13
  • Are the TCP connections closed after they are used? – ozOli Apr 20 '16 at 18:14
  • Can you try **strace** to see what's going on under the hood? I hope you are on Linux :) – Anupam Saini Apr 20 '16 at 18:18
  • @DuckPuncher - the current log level is at INFO - shouldn't that suffice? – HyderA Apr 20 '16 at 18:18
  • @ozOli: Yes, tcptrack shows connections closing as they're completed and the connection count is proportionate with the speed of the crawler, there's no backlog. – HyderA Apr 20 '16 at 18:18
  • @smci: No, the URLs are varied and even when slowed down, there's no backlog of URLs from the same domain. – HyderA Apr 20 '16 at 18:19
  • @HyderA, well not if it's throwing an error and it's not logging it. But other than that maybe try scraping less. Those servers are probably blacklisting you or limiting your requests. Use some private proxies with unique user agents and that should _help_. It looks like you are scraping a ton, so you'll probably need a lot of proxies to stay under their radar. – RattleyCooper Apr 20 '16 at 18:26
  • @DuckPuncher - regardless, shouldn't it time out after 15 seconds based on the timeout settings? – HyderA Apr 20 '16 at 18:28
  • @AnupamSaini - just tried with strace. I see a significant drop in the strace output as well, coinciding with the slowdown in the crawler, which tells me there's probably no loop issue. The strace output so far looks fine - though I have to go through a larger sample set - there's simply too much output from strace. – HyderA Apr 20 '16 at 18:36
  • @HyderA, well yeah, but if you are getting blocked or limited and 100 pages time out, then that is 25 minutes of waiting for timeouts. And that is just 100 pages. This could easily be the issue if you are scraping pages from the same domain. – RattleyCooper Apr 20 '16 at 18:43
  • @HyderA, I suggest reading [this article](https://learn.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/) about writing web crawlers. Letting loose on a website like you are doing will get you blacklisted at best, and a cease and desist letter at worst. – RattleyCooper Apr 20 '16 at 18:47
  • @DuckPuncher Sounds right - let me double-check what the lagging URLs are and get back to you. I wasn't considering the combined queue + timeout - I figured they should all just time out within 15 seconds. – HyderA Apr 20 '16 at 18:47
  • @DuckPuncher - thanks for the link; however, these are my clients' websites, so I have an agreement with them on preset hit rates and I'm well below them. Also, I entered the user agent setting for simplicity's sake - I use rotating proxies and user agents to prevent their IDS from blacklisting me. I keep track of failures as well, and my failure rate is relatively low - which wouldn't be the case with a high number of blacklist responses. – HyderA Apr 20 '16 at 18:52
  • @HyderA Dang. I really don't know what would cause this then. It looks like you are monitoring pretty much everything you can without any luck, and I don't know of anything else that would cause this. Can you give them an IP address or unique user agent that is specific to your scraper so they can whitelist it? (If you haven't already done that.) At this point I'll wait for someone else to chime in, as I don't think I have much else to offer. – RattleyCooper Apr 20 '16 at 18:56
  • @DuckPuncher - I have given them the proxy addresses which they (hopefully) have already whitelisted. Like I said, low failure rate, so there's unlikely to be a blacklisting issue. – HyderA Apr 20 '16 at 18:59
  • @DuckPuncher - I really do appreciate you taking the time to help out :) There could very well be a timeout backlog based on what you mentioned earlier. – HyderA Apr 20 '16 at 19:00

1 Answer


Turns out the issue was with one particular domain that was causing a backlog. The URL queue would fill up with requests waiting on responses from that domain, and since only one concurrent request per IP/domain was being allowed, they were processed one at a time.
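For anyone hitting the same wall: the caps that interact here are ordinary Scrapy settings. A minimal sketch of the knobs involved (Scrapy's defaults shown for the per-domain/per-IP caps; these are not my production values):

# Settings governing the per-domain backlog (illustrative values).
# When CONCURRENT_REQUESTS_PER_IP > 0, the per-domain cap is ignored and
# requests are throttled per target IP instead.
CONCURRENT_REQUESTS = 10              # global cap across all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # Scrapy default; my proxy setup effectively allowed 1
CONCURRENT_REQUESTS_PER_IP = 0        # 0 = disabled, so the per-domain cap applies
DOWNLOAD_TIMEOUT = 15                 # a stalled request can hold its slot for up to 15 s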

I turned up logging on my proxies and tailed their combined output and it was clear as day.
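If you'd rather watch this from inside Scrapy than from the proxy logs, a small extension can periodically log the per-domain backlog. This is a sketch, not my production code, and it assumes Scrapy 1.x internals where crawler.engine.downloader.slots maps each slot key (usually the domain) to an object with queue and active attributes - that attribute is not a stable public API.

# extensions.py - hypothetical diagnostic extension (module path and class name are my own).
import logging

from twisted.internet import task
from scrapy import signals

logger = logging.getLogger(__name__)


class SlotBacklogLogger(object):
    """Periodically log the downloader slots with the largest backlogs."""

    def __init__(self, crawler, interval=60.0):
        self.crawler = crawler
        self.interval = interval
        self.loop = None

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler)
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        self.loop = task.LoopingCall(self.log_slots)
        self.loop.start(self.interval, now=False)

    def spider_closed(self, spider):
        if self.loop and self.loop.running:
            self.loop.stop()

    def log_slots(self):
        # Downloader internals: one slot per domain (or per IP), each with a queue
        # of pending requests and a set of in-flight ones.
        slots = self.crawler.engine.downloader.slots
        busiest = sorted(slots.items(),
                         key=lambda kv: len(kv[1].queue) + len(kv[1].active),
                         reverse=True)[:5]
        for key, slot in busiest:
            logger.info("slot %s: %d queued, %d active",
                        key, len(slot.queue), len(slot.active))

Enable it with EXTENSIONS = {'crawler.extensions.SlotBacklogLogger': 500} (the dotted path is hypothetical). A single slow domain then shows up as one slot with a large queued count while every other slot sits near zero.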

I wouldn't have figured this out without the comment conversation above - thanks to @DuckPuncher and @smci
