I will give an abstract example of the issue that I encountered.
The user makes an HTTPS request to our server (request proxy / load balancer), and the load balancer establishes a socket connection with one of the endpoint nodes (a multi-node service). This service, in turn, performs some logic, creates a message, and sends it to a topic (the request topic). The payload of this message also contains the partitions assigned to this instance (e.g. [1, 3, 5]). The system (a black box) then processes the request and replies to another topic (the response topic), choosing which partition to send the reply to (e.g. randomly from [1, 3, 5]). The endpoint service (pod) that holds the connection to the user receives this message and replies to the user over HTTPS.
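To make the flow concrete, here is a minimal sketch in plain Python with no Kafka client involved. The names `RequestMessage`, `build_request`, and `choose_reply_partition` are made up for illustration, and I am assuming the partition list in the payload refers to the response topic.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class RequestMessage:
    """Hypothetical request payload; the field names are illustrative only."""
    request_id: str
    body: str
    # Partitions assigned to the sending pod at the moment the request is
    # sent (assumed here to be partitions of the response topic).
    reply_partitions: List[int]


def build_request(request_id: str, body: str, assigned: List[int]) -> RequestMessage:
    # Endpoint pod side: copy the current assignment into the payload.
    return RequestMessage(request_id, body, list(assigned))


def choose_reply_partition(request: RequestMessage) -> int:
    # Black-box side: pick one of the advertised partitions at random.
    return random.choice(request.reply_partitions)


request = build_request("req-1", "some payload", [1, 3, 5])
partition = choose_reply_partition(request)
print(f"reply for {request.request_id} goes to partition {partition}")
```

The important detail is that `reply_partitions` is a snapshot taken at send time, so nothing keeps it in sync with the consumer group afterwards.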
Now imagine that a rebalance happens after the endpoint service has already sent its message. As a result, a different pod of the endpoint service may receive the response, but it cannot reply to the user because it has no connection to them.
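The race can be simulated without Kafka. In this sketch the pod names, the two assignment snapshots, and the `owner_of` helper are all hypothetical; the point is only to show which advertised partitions end up with a pod that has no connection to the user.

```python
from typing import Dict, List, Optional


def owner_of(partition: int, assignment: Dict[str, List[int]]) -> Optional[str]:
    """Return the pod that currently owns the partition, or None."""
    for pod, partitions in assignment.items():
        if partition in partitions:
            return pod
    return None


# Hypothetical assignment snapshots (pod -> partitions) around a rebalance.
before = {"pod-a": [1, 3, 5], "pod-b": [0, 2, 4]}
after = {"pod-a": [0, 1, 2], "pod-b": [3, 4, 5]}

# pod-a holds the user's connection and advertised its old assignment
# in the request payload before the rebalance happened.
connection_holder = "pod-a"
advertised = before[connection_holder]

# Any advertised partition now owned by another pod is a partition where
# the response would be consumed by a pod that cannot reply to the user.
stranded = [p for p in advertised if owner_of(p, after) != connection_holder]
print("partitions that would strand the response:", stranded)
```

With these made-up snapshots, a reply sent to partition 1 still reaches the right pod, while a reply sent to partition 3 or 5 does not, which is why the failure only shows up some of the time.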
Note:
- Consumer group segregation (a separate consumer group for each pod) is not the way to go, because the messages are relatively large. I don't want each pod to receive messages that do not belong to it, since that would increase the load on the network.
- I see no point in using key partitioning (calculate hash).
I currently use workarounds to solve this problem, but I would really like to know what the established practices are for this situation when using Kafka. Thanks.