My application serves a large number of domains, and some of them occasionally receive unexpectedly high traffic. To handle that traffic on the backend (a microservices architecture written in Go), I set up a Redis rate limiter alongside a cache in the gateway layer (using the go-redis library) for the API that serves the home page content for those domains. The rate limiter works as follows whenever a request for a domain arrives:
- I issue a GET on the domain's key in Redis to read the current counter value.
- If that key doesn't exist in Redis or the counter value is less than 500, I call the Redis SET command in a goroutine, with the counter incremented by one and the expiry set to 15 minutes.
- After that, I return the result from the internal APIs of the multiple microservices called from the gateway layer API.

If the counter value is greater than or equal to 500, I simply serve the content from the Redis cache. The rate limiter uses a separate key for each domain, in the format "DOMAIN:COUNTER".
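
To make the flow above concrete, here is a minimal sketch of it in Go. It is not the actual gateway code: `counterStore`, `memStore`, `gateway`, `serveHome`, and `wait` are hypothetical names, and the in-memory store is an assumption that stands in for the go-redis GET and SET calls so the example runs without a Redis server.

```go
package main

import (
	"strconv"
	"sync"
	"time"
)

const (
	limit  = 500              // requests per domain before serving from cache
	window = 15 * time.Minute // counter expiry
)

// counterStore is a hypothetical stand-in for the Redis GET and SET calls.
type counterStore interface {
	Get(key string) (value string, found bool)
	Set(key, value string, ttl time.Duration)
}

// memStore is an in-memory fake used only to keep this sketch runnable.
type memStore struct {
	mu      sync.Mutex
	values  map[string]string
	expires map[string]time.Time
}

func newMemStore() *memStore {
	return &memStore{values: map[string]string{}, expires: map[string]time.Time{}}
}

func (m *memStore) Get(key string) (string, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if exp, ok := m.expires[key]; ok && time.Now().After(exp) {
		delete(m.values, key) // emulate key expiry
		delete(m.expires, key)
	}
	v, ok := m.values[key]
	return v, ok
}

func (m *memStore) Set(key, value string, ttl time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.values[key] = value
	m.expires[key] = time.Now().Add(ttl)
}

// gateway holds the store and tracks the background writes it has started.
type gateway struct {
	store   counterStore
	pending sync.WaitGroup
}

// serveHome reports which path a request for domain takes: "backend" or "cache".
func (g *gateway) serveHome(domain string) string {
	key := domain + ":COUNTER"
	count := 0
	if raw, found := g.store.Get(key); found {
		count, _ = strconv.Atoi(raw) // a malformed value is treated as 0
	}
	if count >= limit {
		return "cache"
	}
	g.pending.Add(1)
	go func() { // one goroutine per request that is below the limit
		defer g.pending.Done()
		g.store.Set(key, strconv.Itoa(count+1), window)
	}()
	return "backend"
}

// wait blocks until all background writes have finished (for tests and demos).
func (g *gateway) wait() { g.pending.Wait() }
```

In this sketch the read and the write are two separate steps, so two concurrent requests can read the same value and both write `count+1`. It also starts one goroutine for every request that is still under the limit, regardless of which domain it belongs to.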
I then ran into a problem: the gateway layer API received requests for a large number of domains at the same time. None of the per-domain counters crossed 500, but the cumulative number of requests grew very large. Because I was incrementing the Redis counter in a goroutine, a large number of goroutines were created during that period. This caused CPU spikes on the gateway server and opened a large number of Redis connections at once, because every request went down the SET path rather than only issuing a GET.
For now, I have replaced the SET command with the Redis INCR command, and the new rate limiter logic is:
- I issue a GET on the domain's key in Redis to read the current counter value.
- If that key doesn't exist in Redis or the counter value is less than 500, I call the Redis INCR command in a goroutine, with the expiry set to 15 minutes.
- After that, I return the result from the internal APIs of the multiple microservices called from the gateway layer API.
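
The revised flow, with INCR in place of SET, might look roughly like the sketch below. As before, this is an illustration under assumptions and not the real gateway code: `counterStore`, `memStore`, `gateway`, `serveHome`, and `wait` are hypothetical names, the in-memory store replaces the go-redis calls, and its `Get` returns an integer directly for brevity. Redis INCR does not set an expiry by itself, so the sketch assumes a separate EXPIRE-style call is made when the counter is first created.

```go
package main

import (
	"sync"
	"time"
)

const (
	limit  = 500              // requests per domain before serving from cache
	window = 15 * time.Minute // counter expiry
)

// counterStore is a hypothetical stand-in for the Redis GET, INCR and EXPIRE calls.
type counterStore interface {
	Get(key string) (count int64, found bool)
	Incr(key string) int64
	Expire(key string, ttl time.Duration)
}

// memStore is an in-memory fake used only to keep this sketch runnable.
type memStore struct {
	mu      sync.Mutex
	counts  map[string]int64
	expires map[string]time.Time
}

func newMemStore() *memStore {
	return &memStore{counts: map[string]int64{}, expires: map[string]time.Time{}}
}

// dropIfExpired emulates key expiry; the caller must hold m.mu.
func (m *memStore) dropIfExpired(key string) {
	if exp, ok := m.expires[key]; ok && time.Now().After(exp) {
		delete(m.counts, key)
		delete(m.expires, key)
	}
}

func (m *memStore) Get(key string) (int64, bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.dropIfExpired(key)
	c, ok := m.counts[key]
	return c, ok
}

// Incr increments atomically; like Redis INCR, a missing key starts from 0.
func (m *memStore) Incr(key string) int64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.dropIfExpired(key)
	m.counts[key]++
	return m.counts[key]
}

func (m *memStore) Expire(key string, ttl time.Duration) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if _, ok := m.counts[key]; ok {
		m.expires[key] = time.Now().Add(ttl)
	}
}

// gateway holds the store and tracks the background writes it has started.
type gateway struct {
	store   counterStore
	pending sync.WaitGroup
}

// serveHome reports which path a request for domain takes: "backend" or "cache".
func (g *gateway) serveHome(domain string) string {
	key := domain + ":COUNTER"
	count, _ := g.store.Get(key) // a missing key reads as 0
	if count >= limit {
		return "cache"
	}
	g.pending.Add(1)
	go func() { // still one goroutine per request that is below the limit
		defer g.pending.Done()
		if g.store.Incr(key) == 1 {
			// Assumption: the first increment in a window starts the expiry.
			g.store.Expire(key, window)
		}
	}()
	return "backend"
}

// wait blocks until all background writes have finished (for tests and demos).
func (g *gateway) wait() { g.pending.Wait() }
```

Because the increment is atomic, concurrent requests no longer overwrite each other's counts. The sketch still starts one goroutine for every request under the limit, so the number of goroutines continues to scale with total traffic across all domains.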
How do I fix this issue?