1

I have a Redis database with many millions of keys in it. Over time, the keys that I have written to and read from have changed, and so there are many keys that I am simply not using any more. Most don't have any kind of TTL either.

I want to get a sense for what percentage of the keys in the Redis database is not in use any more. I was thinking I could use hyperloglog to estimate the cardinality of the number of keys that are being written to, but it seems like a lot of work to do a PFADD for every key that gets written to and read from.

To be clear, I don't want to delete anything yet, I just want to do some analysis on the number of used keys in the database.

alec
  • 141
  • 2
  • 11

2 Answers2

4

I'd start with the scan command to iterate through the keys, and use the object idletime command on each to collect the number of seconds since the key was last used. From there you can generate metrics however you like.

One way, using Redis, would be to use a sorted set with the idletime of the key as its score. The advantage of this over HLL is that you can then say "give me keys idle between x and y seconds ago" by using zrange and/or zrevrange. The results of that you could then use for operations such as deletion, archival, or setting a TTL. With HLL you can't do this.

Another advantage is that, unless you store the result in Redis, there is only a Redis cost when you run it. You don't have to modify your code to do additional operations when accessing keys, for example.

The accuracy of the object's idle time is around ten seconds or so if I recall. But for getting an idea of how many and which keys haven't been accessed in a given time frame it should work fine.

The Real Bill
  • 14,884
  • 8
  • 37
  • 39
0

You can analysis the data with time window, and use a hyperloglog to estimate the cardinality for each time window.

For example, you can use a hyperloglog for each day's analysis:

// for each key that has been read or written in day1
// add it to the corresponding hyperloglog
pfadd key-count-day1 a b
pfadd key-count-day1 c d e

// for each key that has been read or written in day2
// add it to the corresponding hyperloglog
pfadd key-count-day2 a
pfadd key-count-day2 c

In this case, you can get the estimated number of keys that are active in dayN with the hyperloglog whose key is key-count-dayN.

With pfcount, you can get the number of active keys for each day or several days.

// number of active keys in day2: count2
pfcount key-count-day2

// number of active keys in day1 and day2: count-total
pfcount key-count-day1 key-count-day2

With these 2 counts, you can calculate the percentage of keys that are unused since day2: (count-total - count2) / count-total

for_stack
  • 21,012
  • 4
  • 35
  • 48