
Currently I am running a production environment with 4 dedicated memcached servers, each with 48 GB of RAM (42 GB dedicated to memcached). Right now they are doing fine, but traffic and content are growing and will surely keep growing next year too.

What are your thoughts on strategies for scaling memcached further? How have you done it so far?

Do you add more RAM to the boxes up to their full capacity, effectively doubling the cache pool on the same number of boxes? Or do you scale horizontally by adding more of the same boxes, with the same amount of RAM?

The current boxes can surely handle more RAM, as their CPU load is quite low and the only bottleneck is memory, but I wonder whether it wouldn't be a better strategy to distribute the cache, making things more redundant and minimizing the impact on the cache of losing one box (losing 48 GB of cache versus losing 96 GB). How would you (or how did you) handle this decision?
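To put numbers on that impact, here is a quick back-of-the-envelope sketch (plain Python; the node counts and sizes are just the two options above, adjust to taste):

```python
# Back-of-the-envelope: fraction of the cache pool lost when one node dies,
# comparing "scale up" (4 x 96 GB) against "scale out" (8 x 48 GB).
# Node counts and sizes are hypothetical; plug in your own.

def single_failure_impact(node_count, gb_per_node):
    total_gb = node_count * gb_per_node
    # clients shard keys across the pool, so one node holds roughly 1/N of the cache
    return total_gb, gb_per_node, gb_per_node / total_gb

for nodes, gb in [(4, 96), (8, 48)]:
    total, lost_gb, lost_frac = single_failure_impact(nodes, gb)
    print(f"{nodes} x {gb} GB: pool = {total} GB, "
          f"one failure loses {lost_gb} GB ({lost_frac:.1%} of the cache)")
```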

Kenny Rasschaert
danakim
  • Are you stuck on memcache, or would you be open to other options? – Mike Sep 29 '11 at 12:38
  • Stuck with memcache. It has served us well up until now, so there is no reason to change. And I am pretty sure it can scale quite nicely; the question is which would be the best way. – danakim Sep 29 '11 at 14:55

2 Answers


When I've done this, there is usually a break-even point between box size (rack space cost), the expense of high-density memory, and failure scenario handling. This almost always ends up with a configuration below the maximum memory density (and usually not the fastest chips available), which, as you mentioned, reduces the impact of a node failure and usually makes the boxes more cost-effective. Some costs/things to consider when making this choice:

  • node cost (cpu/mem/etc)
  • rack space cost
  • administrative overhead/cost
  • failure scenarios (are you trying to do N+1?)

I have also maxed out existing boxes as clusters grow (usually when they are pretty small), since in the short term it can be significantly cheaper to buy some more memory, which buys you time to make larger architectural decisions.
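A rough sketch of how those factors can be weighed against each other; every price, size and rack cost below is a made-up placeholder, the point is only the shape of the comparison under an N+1 budget:

```python
# Rough cost-per-usable-GB comparison under an N+1 failure budget.
# All prices, sizes and rack costs are placeholders; substitute your own quotes.

options = [
    # (label, node_count, gb_per_node, cost_per_node_usd, rack_units_per_node)
    ("scale out: 8 x 48 GB", 8, 48, 3500, 1),
    ("scale up:  4 x 96 GB", 4, 96, 6500, 1),  # high-density DIMMs tend to carry a premium
]

RACK_COST_PER_U_PER_YEAR = 300  # placeholder hosting cost

for label, nodes, gb, node_cost, ru in options:
    usable_gb = (nodes - 1) * gb        # N+1: size the pool to survive one node failure
    capex = nodes * node_cost
    yearly_rack = nodes * ru * RACK_COST_PER_U_PER_YEAR
    print(f"{label}: usable = {usable_gb} GB, capex = ${capex}, "
          f"rack/yr = ${yearly_rack}, capex per usable GB = ${capex / usable_gb:.2f}")
```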

polynomial
  • Thanks a lot for the advice! I think I might go for just scaling horizontally as the high density memory chips are quite expensive and the impact of one node failure is quite high. I would rather spread this around. – danakim Oct 03 '11 at 11:27

I so want to know what it is you're moving that consumes over 100 GB of memory while not maxing out your NICs.

Memcache scales fairly linearly between machines, so the questions you have to ask are:

  • Is my system bus currently saturated?
    • This might not show up as CPU usage; DMA transfers won't register that way
  • How expensive is the high-density memory versus a new box containing the increased amount of memory?
    • Full cost of rack space, power consumption, etc.
  • Do you see a fundamental difference between losing 25% of your cache 1% of the time and 12.5% of your cache 2% of the time? (Randomly chosen failure rates; see the quick calculation below.)
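To put that last bullet in numbers (using those same randomly chosen failure rates, not measurements):

```python
# Expected fraction of the cache unavailable at any given moment, for both scenarios.
# The failure rates are the arbitrary ones from the bullet above, not real data.

scenarios = [
    ("fewer, bigger nodes",  0.25,  0.01),  # lose 25% of the cache, 1% of the time
    ("more, smaller nodes", 0.125, 0.02),   # lose 12.5% of the cache, 2% of the time
]

for label, lost_fraction, downtime_fraction in scenarios:
    expected = lost_fraction * downtime_fraction
    print(f"{label}: expected cache loss = {expected:.2%}, "
          f"worst single event = {lost_fraction:.1%}")
```

The expected loss comes out the same either way; the difference is how big the worst single hit to your hit rate is when a box does go down.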

Scaling is 10% intuition, 70% measuring and adapting, and 20% going back and trying something else.

Load 'em up until they max out the weakest link or stop being cost-effective. They may or may not already be there.
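One way to see whether a box is already against its weakest link is to pull the raw counters over the memcached text protocol. A minimal sketch (the host list is a placeholder for your own pool):

```python
import socket

# Dump the memcached stats that usually point at the bottleneck:
# memory (bytes vs limit_maxbytes, evictions), hit rate and network volume.
HOSTS = [("cache1.example.com", 11211), ("cache2.example.com", 11211)]  # placeholders
INTERESTING = {"bytes", "limit_maxbytes", "evictions", "get_hits", "get_misses",
               "bytes_read", "bytes_written", "curr_connections"}

for host, port in HOSTS:
    with socket.create_connection((host, port), timeout=2) as conn:
        conn.sendall(b"stats\r\n")
        data = b""
        while not data.endswith(b"END\r\n"):
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    print(f"== {host} ==")
    for line in data.decode().splitlines():
        if line.startswith("STAT "):
            _, name, value = line.split(maxsplit=2)
            if name in INTERESTING:
                print(f"  {name}: {value}")
```

If evictions climb while the NICs stay quiet, memory really is the weakest link; if bytes_read/bytes_written approach wire speed, more RAM per box won't help.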

Jeff Ferland
  • Thanks a lot for the advice! As I told polynomial above, I am going to go for more boxes instead of more memory. Going for 96 GB of RAM is quite expensive, and looking at the impact of a node failure on the application, I would like to minimize that. And regarding your question: each box has 48 GB of RAM and a gigabit link, and their connections peak at about 150 Mbps, so there is room to grow the RAM, at least in terms of network bandwidth. – danakim Oct 03 '11 at 11:34