I'm running into an issue with a python-memcached client that connects to 3 memcached nodes on ElastiCache. Some of my cache values have an infinite TTL and are overwritten when the data source is updated. The cache is also written to on cache misses.
The issue is that stale cached values are sometimes returned by memcached. My best guess as to what's happening is:
- "foo" gets written to memcached A.
- memcached A is temporarily unavailable to process #1, so that process marks it as failed.
- process #1 uses memcached B which has a cache miss, so it writes "bar" to memcached B and returns that value.
- process #2 is able to connect to memcached A and doesn't know process #1 marked it as a bad node, so it connects and returns "foo".
- whenever a process is able to connect to memcached A, "foo" gets returned, but whenever that process has temporarily marked A as dead, it connects to memcached B and "bar" gets returned.
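To make the sequence above concrete, here is a minimal in-memory simulation. It does not use python-memcached; `FakeNode`, `FailoverClient`, and `read_through` are hypothetical names I made up, and the "walk to the next live node" rule is a simplification of the real rehash-on-failure logic. It only assumes that each process tracks dead servers privately.

```python
import hashlib


class FakeNode:
    """In-memory stand-in for one memcached server (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class FailoverClient:
    """Toy client: each process keeps its own private set of dead nodes."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.dead = set()  # local to this client, never shared across processes

    def _pick(self, key):
        # Hash the key to a node; if that node is marked dead locally,
        # walk forward to the next live node (simplified failover).
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        for offset in range(len(self.nodes)):
            node = self.nodes[(start + offset) % len(self.nodes)]
            if node.name not in self.dead:
                return node
        return None

    def get(self, key):
        node = self._pick(key)
        return None if node is None else node.data.get(key)

    def set(self, key, value):
        node = self._pick(key)
        if node is not None:
            node.data[key] = value


def read_through(client, key, source_value):
    """Return the cached value, filling the cache from the source on a miss."""
    value = client.get(key)
    if value is None:
        client.set(key, source_value)
        value = source_value
    return value


nodes = [FakeNode("A"), FakeNode("B"), FakeNode("C")]
p1 = FailoverClient(nodes)  # process #1
p2 = FailoverClient(nodes)  # process #2

home = p1._pick("k").name                    # the node the key normally hashes to
p1.set("k", "foo")                           # "foo" is written to the home node
p1.dead.add(home)                            # process #1 marks the home node as failed
seen_by_p1 = read_through(p1, "k", "bar")    # miss on the fallback node, writes "bar" there
seen_by_p2 = p2.get("k")                     # process #2 still reads from the home node
p1.dead.discard(home)                        # process #1 retries the home node later
seen_by_p1_after = p1.get("k")               # and is back to the old value
```

In this toy model process #1 sees "bar" while the home node is marked dead, and both processes see "foo" otherwise, which matches the flip-flopping I observe.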
Here's the line where a failure results in a new server being selected: https://github.com/linsomniac/python-memcached/blob/release-1.57/memcache.py#L413
I looked at the hashing client for pymemcache and I think it'll do the same thing: temporarily remove a memcached host and try to use another one.
This makes sense when a host is going to be removed permanently, but it doesn't make sense to me when a host might just be unavailable for a few seconds. Am I missing something? Are infinite TTLs a memcached anti-pattern?
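For comparison, this is the behaviour I would have expected for a short outage, sketched as a toy client over plain dicts. `NoFailoverClient` is a hypothetical name, not an API of python-memcached or pymemcache; the assumption is that a key always maps to the same node and an unreachable node is simply reported as a miss.

```python
import hashlib


class NoFailoverClient:
    """Toy client over plain dicts: a key always maps to the same node."""

    def __init__(self, stores):
        self.stores = stores  # list of dicts standing in for memcached nodes
        self.down = set()     # indexes this process currently cannot reach

    def _index(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.stores)

    def get(self, key):
        i = self._index(key)
        if i in self.down:
            return None       # report a miss instead of asking another node
        return self.stores[i].get(key)

    def set(self, key, value):
        i = self._index(key)
        if i in self.down:
            return False      # drop the write; nothing lands on a fallback node
        self.stores[i][key] = value
        return True


stores = [{}, {}, {}]
client = NoFailoverClient(stores)
client.set("k", "foo")
home_index = client._index("k")

client.down.add(home_index)           # the home node becomes unreachable
during_outage = client.get("k")       # miss, so the caller goes to the data source
wrote = client.set("k", "bar")        # write is dropped rather than redirected
client.down.discard(home_index)       # the home node is reachable again
after_outage = client.get("k")
copies = sum("k" in s for s in stores)
```

Here only one copy of the key ever exists, so two processes cannot disagree about which node holds it. It would not help if the data source were updated during the outage, though: the overwrite would be dropped and, with an infinite TTL, the old value would stay on the home node.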