Scenario:
I have a Kafka Streams web-sessioning application with unlimited (or years-long) retention and interactive queries (this requirement can be revisited if necessary). There are many clients, each with many users (each user belongs to exactly one client). Partitioning works like this:
Records are partitioned by a function (e.g. a hash) of (clientId, userId), modulo numberOfPartitions, where numberOfPartitions is fixed beforehand based on the cluster size. This allows sessioning to be performed per (clientId, userId), and should distribute data evenly among the nodes (no hotspots in either partition size or write load).
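To make the partitioning rule concrete, here is a minimal sketch in plain Java. The class name `SessionPartitioning`, the method `partitionFor`, and the use of `Objects.hash` are my own assumptions for illustration; this is not the Kafka partitioner API, and Kafka's default partitioner hashes the serialized key bytes with murmur2 instead.

```java
import java.util.Objects;

// Hypothetical helper, for illustration only: not part of Kafka or Kafka Streams.
public final class SessionPartitioning {
    private SessionPartitioning() {}

    /** Maps a (clientId, userId) pair to a partition in [0, numPartitions). */
    public static int partitionFor(String clientId, String userId, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result non-negative even when the hash is negative.
        return Math.floorMod(Objects.hash(clientId, userId), numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 12; // assumed value, fixed up front from the cluster size
        for (int u = 0; u < 5; u++) {
            String userId = "user-" + u;
            System.out.println("client-A/" + userId + " -> partition "
                    + partitionFor("client-A", userId, numPartitions));
        }
    }
}
```

All records of one (clientId, userId) pair land on the same partition, so sessions can be built locally, while the users of a single client spread over many partitions.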
However, queries would be by client (and time range). So I'd build an aggregated KTable from that sessions table, keyed by client, and query sessions by (client, timeStart, timeEnd). That would force all of a client's data onto one node, which could cause scalability issues if a client is too big, but since the data is already aggregated, I'd guess that is manageable.
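The query shape I have in mind can be sketched as follows, in plain Java rather than against the Kafka Streams state store API. `ClientSessionIndex`, its methods, and the choice of an event count as the aggregated value are hypothetical, chosen only to show a lookup by (client, timeStart, timeEnd).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for the per-client aggregated table.
public final class ClientSessionIndex {
    // clientId -> (session start in epoch millis -> aggregated event count)
    private final Map<String, NavigableMap<Long, Long>> byClient = new HashMap<>();

    /** Adds a session; sessions of one client with the same start time are summed. */
    public void put(String clientId, long sessionStartMs, long eventCount) {
        byClient.computeIfAbsent(clientId, k -> new TreeMap<>())
                .merge(sessionStartMs, eventCount, Long::sum);
    }

    /** Returns the client's aggregates whose start time lies in [fromMs, toMs]. */
    public SortedMap<Long, Long> query(String clientId, long fromMs, long toMs) {
        NavigableMap<Long, Long> sessions = byClient.get(clientId);
        if (sessions == null) {
            return Collections.emptySortedMap();
        }
        // Both bounds inclusive; the result is a view ordered by start time.
        return sessions.subMap(fromMs, true, toMs, true);
    }
}
```

Because everything for one client sits under a single key, a range query touches only one node, which is exactly what concentrates a large client's data in one place.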
Question:
In this scenario (variants are welcome), I'd like to be able to reprocess the data of only one client.
But that client's data would be scattered among (potentially all of) the partitions.
How can a partial reprocess be achieved in Kafka Streams with minimal impact, while keeping the (old) state queryable in the meantime?