
I currently have a producer writing to a Kafka topic, with each message carrying a sentiment score, similar to the below:

{
    "post_id": "post_123",
    "sentiment_score": "0.6789"
}

The Kafka cluster is hosted on Confluent Cloud and I'm using the confluent_kafka library to process these messages. However, it has now come to the point where, in order to process a message, I need the average sentiment_score of all the posts within the past 1 day.

What would be the best way to do it now? Thank you!
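For reference, here is a minimal single-process sketch of the kind of sliding-window average being asked about (class and variable names are hypothetical, and it ignores the harder problem of sharing this state across multiple consumer instances):

```python
import time
from collections import deque


class SlidingWindowAverage:
    """Keep (timestamp, value) pairs and average only those inside the window."""

    def __init__(self, window_seconds=24 * 60 * 60):
        self.window_seconds = window_seconds
        self.points = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def add(self, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.points.append((ts, value))
        self.total += value
        self._evict(ts)

    def average(self, now=None):
        self._evict(time.time() if now is None else now)
        return self.total / len(self.points) if self.points else None

    def _evict(self, now):
        # Drop points older than the window and keep the running total in sync.
        cutoff = now - self.window_seconds
        while self.points and self.points[0][0] < cutoff:
            _, old_value = self.points.popleft()
            self.total -= old_value


# Inside a confluent_kafka poll loop, this would be fed roughly like:
#   payload = json.loads(msg.value())
#   window.add(float(payload["sentiment_score"]), msg.timestamp()[1] / 1000)
#   current_avg = window.average()
```

This only works while a single process sees all partitions; with a consumer group, each instance would hold a partial window, which is why the comments below point toward ksqlDB or a stream-processing framework.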

  • I suggest writing the data into something like Prometheus or InfluxDB (a time series database), then using its average functions, not doing that in Python (you could write to _those_ systems from Python, but Kafka Connect would make more sense) – OneCricketeer Feb 07 '22 at 07:31
  • Does [this](https://stackoverflow.com/questions/50280935/kafka-real-time-average-for-last-x-minutes) answer your question? – Shanavas M Feb 07 '22 at 08:55
  • @ShanavasM thanks! It does answer my question. But I'm trying not to do it with kSQLDB, but with Python. I've been looking at Kafka Streams but it seems it's only possible with Java, not Python – knl Feb 08 '22 at 03:40
  • ksqlDB has a REST API, and there's a third-party Python library for it – OneCricketeer Feb 08 '22 at 15:36
  • Thanks @OneCricketeer, let me try to have a look. Other than that, are there any recommendations? – knl Feb 10 '22 at 02:18
  • 1
    I suppose pyspark, pyflink, or Beam might work. Basically, you need a way to maintain and distribute state (the average so far) amongst consumer instances. Out of the box, the Confluent client doesn't do that – OneCricketeer Feb 10 '22 at 15:07
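Following up on the ksqlDB REST API suggestion above: ksqlDB's `/query` endpoint accepts a JSON body with a `ksql` statement, so a plain HTTP client works from Python. A sketch of building that request (the server URL is hypothetical, and the SQL is illustrative only — it assumes a `posts` stream exists and groups by a constant key to average across all posts):

```python
import json

# Hypothetical ksqlDB server address.
KSQLDB_URL = "http://localhost:8088/query"


def build_query_payload(sql: str) -> dict:
    """Build the JSON body that ksqlDB's /query endpoint expects."""
    return {"ksql": sql, "streamsProperties": {}}


# Illustrative windowed aggregate over the past day; assumes a `posts`
# stream with a `sentiment_score` column has already been declared.
SQL = (
    "SELECT AVG(CAST(sentiment_score AS DOUBLE)) AS avg_score "
    "FROM posts WINDOW TUMBLING (SIZE 1 DAY) "
    "GROUP BY 'global' EMIT CHANGES;"
)

payload = json.dumps(build_query_payload(SQL))
# POST `payload` to KSQLDB_URL with the
# Content-Type: application/vnd.ksql.v1+json header, e.g. via
# requests.post(KSQLDB_URL, data=payload, headers=...).
```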

0 Answers