
I am seeing some very slow cache reads in my Rails app. Both Redis (redis-rails) and Memcached (dalli) produced the same results.

It looks like only the first call to Rails.cache causes the slowness (averaging 500ms).

I am using Skylight to instrument my app and see a graph like this:

[Skylight trace graph]

I have a Rails.cache.fetch call in this code path, but when I benchmark it, it averages around 8ms, which matches what memcache-top shows for my average call time.

I thought this might be Dalli connections opening slowly, but benchmarking that didn't show anything slow either. I'm at a loss for what else to check.
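
For reference, the connection benchmark was roughly this (a sketch, run as the first command in a fresh Rails console so no connection exists yet; the endpoint is my actual ElastiCache host):

  require 'dalli'

  # A cold client: the first get has to open the TCP connection,
  # so this times connect + one round trip.
  client = Dalli::Client.new('my-cache.begfpc.0001.use1.cache.amazonaws.com:11211')
  puts Benchmark.ms { client.get('something') }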

Does anyone have any good techniques for tracking this sort of thing down in a Rails app?

Edit #1

The Memcached servers are stored in ENV['MEMCACHE_SERVERS'], and all of them are in the us-east-1 region.

Cache config looks like:

  config.cache_store = :dalli_store, nil, { expires_in: 1.day, compress: true }
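
Passing nil as the server list makes Dalli fall back to ENV['MEMCACHE_SERVERS'] itself, so this is roughly equivalent to wiring the servers in explicitly:

  # Equivalent explicit form: Dalli treats ENV['MEMCACHE_SERVERS'] as a
  # comma-separated server list when the first argument is nil.
  servers = ENV['MEMCACHE_SERVERS'].split(',')
  config.cache_store = :dalli_store, servers, { expires_in: 1.day, compress: true }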

I ran something like:

Benchmark.ms { 100_000.times { Rails.cache.fetch('something') } } / 100_000

which put the average at something on the order of 8ms per call when running on one of my web servers.

To test my theory that the first request is slow, I opened a console on my web server and ran the following as the very first command:

irb(main):002:0> Benchmark.ms { Rails.cache.fetch('someth') { 1 } }
Dalli::Server#connect my-cache.begfpc.0001.use1.cache.amazonaws.com:11211
=> 12.043342

Edit #2

OK, I split out the fetch into a read and a write and tracked them independently with StatsD. The averages sit around what I would expect, but the max times on the read are very spiky and get up into the 500ms range.

http://s16.postimg.org/5xlmihs79/Screen_Shot_2014_12_19_at_6_51_16_PM.png
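
The split looks roughly like this (a sketch; timed_fetch is a hypothetical helper name, and it assumes the statsd-ruby gem with an agent on the default port):

  require 'statsd'

  STATSD = Statsd.new('127.0.0.1', 8125)

  # Hypothetical helper: behaves like Rails.cache.fetch, but times the read
  # and the write separately and reports each to StatsD in milliseconds.
  def timed_fetch(key)
    value = nil
    STATSD.timing('cache.read', Benchmark.ms { value = Rails.cache.read(key) })
    return value unless value.nil?

    value = yield
    STATSD.timing('cache.write', Benchmark.ms { Rails.cache.write(key, value) })
    value
  end

Call sites use timed_fetch('some_key') { compute_the_value } wherever Rails.cache.fetch was used before.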

  • Can you post your Dalli/Rails Cache config? Are your app server and cache server/service geographically close or separated (e.g. AWS us-east-1 and AWS us-west-2)? Can you do a bundle console and check the cache directly? What sort of performance do you see in that case? – Peter Goldstein Dec 20 '14 at 00:39
  • Updated with cache config and console performance. When I see a trace with multiple cache calls, it seems like only the very first cache read takes a long time; subsequent calls seem normal. – Trae Robrock Dec 20 '14 at 00:52
  • So a distribution of read times with a long tail (occasional spiky highs) would indicate to me a performance limit on one side or the other. Are you sure you're not hitting a resource limit on the Memcached machine? Alternately, what does the load look like on your app server during these long reads? – Peter Goldstein Dec 20 '14 at 07:13
  • Load on the app servers is pretty consistent, although we don't have high traffic volumes, so it is a little spiky itself. I am using ElastiCache for the Memcached server and have only limited insight into system stats; CPU utilization does not peak with the slow reads. I am wondering if it could be a network issue, since I am using t2.micro instances for Memcached. – Trae Robrock Dec 20 '14 at 16:09
  • I added another graph to track the number of calls/s to cache read, and it looks like that directly correlates with the slowness. Maybe the network is getting bogged down somehow; the numbers just don't seem high enough to cause that, though. – Trae Robrock Dec 20 '14 at 17:07
  • Moving to an m3.medium instance brought the spikes down to the 100-200ms range. That still seems very high, but it might be a case of my web boxes having low network performance now. – Trae Robrock Dec 20 '14 at 19:22
  • OK, all of that strongly indicates that this is a performance problem. Certainly t2.micros could be too small; they're definitely unsuitable for a production instance. Are your ElastiCache servers in the same availability zone as the app servers? You can see fluctuations in network performance between availability zones. And how frequent are the spikes? – Peter Goldstein Dec 21 '14 at 05:53
