
There's a question I have about this one, but I haven't gotten a satisfactory answer yet.

In time-series data, the order in which messages are sent is crucial. Let's say a downstream consumer runs a Python script that computes windowed statistics on time-series data. Suppose we have a topic with multiple partitions; as you know, Kafka only guarantees message order within a single partition, so there is no global order across partitions.

So, how can we make sure that a batch of messages we consume contains all the data points, with none missing? The obvious way is to use a single partition, but that means sacrificing scalability.

Would it be possible to solve such cases without limiting the scalability of the system?
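
For illustration, here is roughly what the producer side looks like in a minimal sketch (the confluent-kafka client, broker address, topic name, and series ID below are placeholders, not part of the original setup). With Kafka's default partitioner, all messages that share a key land on the same partition, so keying each record by its series ID keeps every individual series in order while the topic as a whole still spreads load across partitions.

```python
# Illustrative sketch only: broker address, topic name, and series ID are assumptions.
# pip install confluent-kafka
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def send_point(series_id: str, value: float, event_ts_ms: int) -> None:
    """Publish one data point, keyed by its series ID.

    With the default partitioner, all messages that share a key hash to the
    same partition, so each series keeps its order even though the topic has
    many partitions. There is still no total order across different series.
    """
    payload = json.dumps({"series": series_id, "value": value, "ts": event_ts_ms})
    producer.produce(
        "timeseries",                # hypothetical topic name
        key=series_id.encode(),
        value=payload.encode(),
        timestamp=event_ts_ms,       # stamp the record with the event time (ms)
    )

send_point("sensor-42", 21.5, int(time.time() * 1000))
producer.flush()
```

Note that this only gives per-series ordering; windowed statistics that mix several series still see interleaving across partitions, which is exactly the situation the question is about.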

  • Sure. Write the data to an actual TSDB that you'd query from Python rather than polling across partitions. There's no way even within one partition that you'd know what is "missing" or not. You need some other referential ID of what is (attempted to be) written vs what is actually consumed. – OneCricketeer Mar 06 '23 at 23:24
  • Yes, one option is to use secondary storage that sorts by time stamp. – sci9 Mar 07 '23 at 02:13
  • There's an idea I'm thinking about where a producer labels each message with the timestamp of each time-series data point, so that consumers can poll messages based on those labeled timestamps. Do you think it is feasible? – sci9 Mar 07 '23 at 02:14
  • No. Consumers can only poll in sequence of offset, not by timestamp. – OneCricketeer Mar 07 '23 at 16:32
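
Following the "referential ID" suggestion in the comments above, a minimal consumer-side sketch (again using the confluent-kafka client; the topic, group id, and the "series"/"seq" message fields are assumptions, and the producer is assumed to attach a monotonically increasing per-series sequence number): by comparing consecutive sequence numbers per series, the consumer can detect missing points regardless of how the records are spread across partitions.

```python
# Illustrative sketch only: topic, group id, and message fields ("series", "seq")
# are assumptions; the producer is assumed to attach a monotonically increasing
# per-series sequence number, which acts as the "referential ID" mentioned above.
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "windowed-stats",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["timeseries"])       # hypothetical topic name

last_seq = {}  # series_id -> last sequence number seen

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        point = json.loads(msg.value())
        series, seq = point["series"], point["seq"]

        prev = last_seq.get(series)
        if prev is not None and seq > prev + 1:
            # A gap in sequence numbers means points are missing (still in flight,
            # never produced, or lost); the consumer can wait, re-read, or flag it.
            print(f"missing {seq - prev - 1} point(s) for {series}")
        last_seq[series] = seq if prev is None else max(prev, seq)
finally:
    consumer.close()
```

The gap check only tells you that points are missing; whether to wait, re-read from an earlier offset, or flag the window as incomplete is a separate policy decision.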

0 Answers