I have followed Splash's FAQ for production setups and my system currently looks like this:

  • 1 Scrapy container with 6 concurrent requests.
  • 1 HAProxy container that load-balances requests across the Splash containers.
  • 2 Splash containers with 3 slots each.
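In numbers, the Scrapy concurrency matches the total number of Splash slots. A rough sketch of the Scrapy side follows, assuming the scrapy-splash plugin (its `SPLASH_URL` setting) is in use. The names `SPLASH_CONTAINERS`, `SLOTS_PER_CONTAINER`, and `total_splash_slots` are hypothetical helpers for illustration, not Scrapy or Splash settings.

```python
# settings.py (sketch; assumes the scrapy-splash plugin is installed)

# Scrapy talks to HAProxy, which forwards to the Splash containers.
SPLASH_URL = "http://haproxy:8050"

# Standard Scrapy setting: maximum number of requests in flight.
CONCURRENT_REQUESTS = 6

# Hypothetical bookkeeping values, not read by Scrapy or Splash.
SPLASH_CONTAINERS = 2
SLOTS_PER_CONTAINER = 3


def total_splash_slots(containers: int, slots_per_container: int) -> int:
    """Number of render jobs the Splash pool can run at the same time."""
    return containers * slots_per_container


TOTAL_SLOTS = total_splash_slots(SPLASH_CONTAINERS, SLOTS_PER_CONTAINER)

# With these values, every Scrapy request can in principle get its own slot.
CONCURRENCY_FITS_SLOTS = CONCURRENT_REQUESTS <= TOTAL_SLOTS
```

So on paper there is no oversubscription: 6 concurrent requests against 6 slots.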

I monitor the setup with docker stats, and it never exceeds 7% CPU usage or 55% memory usage.

Even so, I still get a lot of log lines like this:

DEBUG: Retrying <GET https://the/url/ via http://haproxy:8050/execute> (failed 1 times): 504 Gateway Time-out

For every successful request I get 6-7 of these timeouts.

I have experimented with changing the number of slots per Splash container and the number of concurrent requests. I've also tried running a single Splash container behind HAProxy. I keep getting these errors.

I'm running on an AWS EC2 t2.micro instance, which has 1 GB of memory.

I suspect that the issue is still related to the Splash instances getting flooded. Is there any advice you can give me to reduce the load on the Splash instances? Is there a good ratio between slots and concurrent requests? Should I throttle requests?
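To make the throttling question concrete, this is the kind of configuration I have in mind. It is only a sketch: the setting names are standard Scrapy settings (including the AutoThrottle extension), but the helper `throttled_settings`, the `headroom` factor, and all numeric values are illustrative assumptions, not tested recommendations.

```python
def throttled_settings(total_slots: int, headroom: float = 0.5) -> dict:
    """Build Scrapy settings that keep concurrency below the Splash slot count.

    `total_slots` is the number of Splash slots behind the load balancer;
    `headroom` (hypothetical knob) is the fraction of those slots to use.
    """
    concurrency = max(1, int(total_slots * headroom))
    return {
        # Fewer requests in flight than there are Splash slots.
        "CONCURRENT_REQUESTS": concurrency,
        # Fixed minimum delay (seconds) between requests to the same host.
        "DOWNLOAD_DELAY": 1.0,
        # Let Scrapy adapt the delay to the observed response latency.
        "AUTOTHROTTLE_ENABLED": True,
        "AUTOTHROTTLE_START_DELAY": 1.0,
        "AUTOTHROTTLE_MAX_DELAY": 30.0,
        "AUTOTHROTTLE_TARGET_CONCURRENCY": float(concurrency),
        # Limit how often a timed-out request is retried.
        "RETRY_TIMES": 2,
    }


# Example for the setup described above: 2 containers x 3 slots = 6 slots.
settings = throttled_settings(total_slots=6)
```

These values could go into settings.py or a spider's `custom_settings`. Whether leaving half the slots idle is a sensible ratio is exactly what I'm unsure about.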

Gallaecio
Marcus Lind
  • Same problem here. Did you find any solution? – reisdev Sep 17 '19 at 23:24
  • did anyone figure out a solution? @reisdev – Luca Guarro Nov 08 '20 at 15:44
  • I ended up using Scrapinghub's paid version of Splash (https://www.scrapinghub.com/splash/), which resolved the issue. So the problem is probably related to the production config, or to something else that differs between my local setup and Scrapinghub's setup. – Marcus Lind Nov 10 '20 at 18:47

0 Answers