6

Looking for some help with ElastiCache.

We're using ElastiCache Redis to run a Resque-based queuing system, which means the workload is a mix of sorted sets and lists.

In normal operation everything is OK and we see good response times and throughput. CPU is around 7-10%, and Get+Set commands are around 120-140K operations. (All metrics are CloudWatch based.)

But when the system experiences a (mild) burst of data, enqueuing several thousand messages, the server becomes nearly unresponsive:

  • CPU is steady at 100% utilization (the metric says 50, because Redis is using a single core)
  • the number of operations drops to ~10K
  • response times slow to a matter of SECONDS per request

We would expect that even if the CPU were loaded to that extent, throughput would stay the same. That is what we see when running Redis locally: Redis can saturate the CPU, but throughput stays high, since it is natively single-threaded and no context switching occurs.

AFAWK we do NOT impose any limits, and there is no persistence or replication. We're using the basic config.

Instance size: cache.r3.large. We are not using periodic snapshotting.

Shlomi Hassan
  • How is your memory? If Redis needs to swap, it can slow down to seconds per request. We raise an alert when system free memory is below 8%. This is not related to any Redis limits you might have set up. – Tw Bert Mar 08 '16 at 09:19
  • 1
    The new LUA-pop script is missing a LIMIT : https://github.com/gresrun/jesque/issues/101 – user3041539 Mar 10 '16 at 15:53

1 Answer

3

This looks characteristic of a rogue Lua script. A defect in such a script can cause heavy CPU load while degrading overall throughput: Redis runs each script atomically on its single command thread, so every other client waits until the script returns.

Are you using any? Try looking in the Redis slow log for one.
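One way to check the slow log for script calls is sketched below. It is a minimal example under some assumptions, not a definitive diagnostic: it assumes the redis-py client, that `SLOWLOG GET` is reachable from your client, and that entries have the shape redis-py's `slowlog_get()` returns (dicts with `id`, `duration` in microseconds, and `command`). The helper names `slow_script_calls` and `fetch_slowlog` and the 10 ms threshold are made up for illustration.

```python
from typing import Any, Dict, Iterable, List

# Commands that invoke Lua scripts on the server.
SCRIPT_COMMANDS = ("EVAL", "EVALSHA")


def _as_text(value: Any) -> str:
    """redis-py returns bytes unless decode_responses=True; accept both."""
    if isinstance(value, bytes):
        return value.decode("utf-8", "replace")
    return str(value)


def slow_script_calls(entries: Iterable[Dict[str, Any]],
                      min_duration_us: int = 10_000) -> List[Dict[str, Any]]:
    """Return slow-log entries that are Lua script calls, slowest first.

    The threshold is in microseconds (the unit the slow log uses);
    10_000 us = 10 ms is an arbitrary illustrative default.
    """
    hits = []
    for entry in entries:
        command = _as_text(entry["command"])
        parts = command.split(None, 1)
        name = parts[0].upper() if parts else ""
        if name in SCRIPT_COMMANDS and entry["duration"] >= min_duration_us:
            hits.append({
                "id": entry["id"],
                "duration_us": entry["duration"],
                "command": command,
            })
    return sorted(hits, key=lambda h: h["duration_us"], reverse=True)


def fetch_slowlog(host: str, port: int = 6379, count: int = 128):
    """Fetch recent slow-log entries (hypothetical helper; needs `pip install redis`)."""
    import redis  # imported lazily so the filter above works without the package
    client = redis.Redis(host=host, port=port)
    return client.slowlog_get(count)
```

Typical use would be `slow_script_calls(fetch_slowlog("your-endpoint"))`, where the endpoint is a placeholder. If `EVAL`/`EVALSHA` entries with durations in the hundreds of milliseconds show up during a burst, the script is a likely culprit.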