
I have a working Scrapy spider deployed on an Amazon EC2 instance (c4.xlarge), running under scrapyd.

No matter what I do, I can't seem to top ~200 processed items per minute (according to scrapy logs).

I tried playing around with the scrapyd concurrency settings, but nothing helped. I also lowered scrapyd's max_proc_per_cpu to 1 to avoid context switching, and tried running separate Scrapy crawlers from the command line; all of them together still produce the same aggregate of around 200 items per minute.
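
For reference, a minimal sketch of the kind of settings I've been adjusting; the values below are just examples, not my actual configuration (max_proc_per_cpu itself lives in scrapyd.conf under the [scrapyd] section, not in settings.py):

    # settings.py -- example values only, not my actual configuration
    CONCURRENT_REQUESTS = 64             # total concurrent requests the downloader handles
    CONCURRENT_REQUESTS_PER_DOMAIN = 32  # per-domain cap; often the effective limit
    DOWNLOAD_DELAY = 0                   # any non-zero delay directly caps requests per minute
    AUTOTHROTTLE_ENABLED = False         # when enabled, AutoThrottle dynamically slows the crawl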

I can see from the Scrapy logs that the aggregate number of web pages hit increases almost linearly, but the number of scraped items per minute stays stuck at around 200.

Any tips? Has anybody come across this before? Have I missed a setting somewhere?

  • Limiting by items isn't realistic; only a limit by requests can be specified. Have you tried making a test project that just yields items very quickly (e.g. 1,000 items at a time)? – eLRuLL Nov 08 '15 at 16:55
  • You seem to be right; I tried isolating the problem, but it's a bit elusive. When all I do is return 10K items per page, then yes, the item count goes through the roof, but in a real scan scenario, no matter how much I play with it, it stays around 200 items. The weird thing is that it's a strong server and nothing seems to be under heavy load (CPU, network, RAM), so I can't understand what's holding it back. – Daniel Dubovski Nov 09 '15 at 18:21
  • Do you have the throttle middleware enabled? Have you set a download delay in your settings? Usually the bottleneck in these situations is bandwidth. Servers may also rate-limit you if you're making too many requests, doing the throttling on their end. – Rejected Nov 09 '15 at 21:43
  • OK, my bad: it seems the site I was scraping was throttling me (status code 429). It was stupid of me not to look at the error callback. So if somebody else sees this, check your errback (see the sketch after these comments)! – Daniel Dubovski Nov 12 '15 at 12:06
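
For anyone who lands here, a minimal sketch of the kind of errback that would have surfaced the 429 responses (the spider name and URL are placeholders, not from the original project):

    # Minimal sketch: surfacing non-2xx responses (e.g. 429 Too Many Requests) in an errback.
    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError


    class StatusCheckSpider(scrapy.Spider):
        name = "status_check"
        start_urls = ["https://example.com/"]  # placeholder URL

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, errback=self.errback)

        def parse(self, response):
            # normal item extraction goes here
            pass

        def errback(self, failure):
            # HttpError wraps non-2xx responses dropped by HttpErrorMiddleware;
            # a 429 from the target site shows up here rather than in parse().
            if failure.check(HttpError):
                response = failure.value.response
                self.logger.warning("Got status %s for %s", response.status, response.url)
            else:
                self.logger.error(repr(failure))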

0 Answers