We have a test Kafka cluster on which we were experimenting with various settings. One of the changes was setting transaction.max.timeout.ms to 7 days.
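For reference, transaction.max.timeout.ms is a broker setting expressed in milliseconds, and its default is 900000 (15 minutes). The snippet below is only a small sketch of the unit conversion; the exact value that was put in the broker configuration is an assumption derived from "7 days".

```python
from datetime import timedelta

# Convert 7 days to milliseconds, the unit used by transaction.max.timeout.ms.
seven_days_ms = int(timedelta(days=7).total_seconds() * 1000)

# Line as it would presumably appear in the broker's server.properties.
config_line = f"transaction.max.timeout.ms={seven_days_ms}"
print(config_line)  # transaction.max.timeout.ms=604800000
```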
While that setting was in place, we had a network failure to one of the ZK nodes. It was brief, but long enough to trigger a broker leader election. The election wasn't clean: only 6 of the 8 brokers were registered when it came up. We manually triggered another election and everything came up cleanly.
The problem now is that we have a bunch of zombie transactions that have neither aborted nor committed.
This means that our apps that use transactions or have an isolation level of read_committed are no longer reading from certain partitions. I know this is because the Last Stable Offset (LSO) is stuck at the point where the transaction was created. I tested this by using the console consumer to read from a particular topic:partition offset, which worked fine; after adding --isolation-level read_committed, it doesn't return any records.
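To illustrate the behaviour described above, here is a minimal Python sketch of how an open transaction pins the LSO. It is a toy model, not Kafka code: the function names, the producer id 42 and the offsets are all hypothetical, and it ignores details such as the filtering of aborted records.

```python
from typing import Dict, List


def last_stable_offset(high_watermark: int, open_txn_first_offsets: Dict[int, int]) -> int:
    """LSO: the first offset of the earliest still-open transaction,
    or the high watermark when no transaction is open."""
    if not open_txn_first_offsets:
        return high_watermark
    return min(high_watermark, min(open_txn_first_offsets.values()))


def fetch(start_offset: int, high_watermark: int,
          open_txn_first_offsets: Dict[int, int], isolation_level: str) -> List[int]:
    """Return the offsets a consumer would be allowed to read in this toy model."""
    if isolation_level == "read_committed":
        upper = last_stable_offset(high_watermark, open_txn_first_offsets)
    else:
        upper = high_watermark
    return list(range(start_offset, upper))


# Hypothetical partition: offsets 0..9 exist, and producer 42 opened a
# transaction at offset 3 that was never committed or aborted.
high_watermark = 10
zombie_txns = {42: 3}

uncommitted_view = fetch(3, high_watermark, zombie_txns, "read_uncommitted")
committed_view = fetch(3, high_watermark, zombie_txns, "read_committed")

print(uncommitted_view)  # [3, 4, 5, 6, 7, 8, 9]
print(committed_view)    # []
```

In this model the read_committed view stays empty for any start offset at or beyond 3 until the transaction from producer 42 is completed, which matches what the console consumer shows.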
Is there any way to force the transaction coordinator to abort the zombie transactions, or to manually set the LSO? I've even 'purged' the topic by setting retention.ms to 100 and seen the consumer group offset record shift, but read_committed clients still won't read from the partition and the consumer group won't advance past the log rotation.
Thanks