We want to keep things simple and use Python wherever possible. Thus we want to use PyFlink (latest version; we are flexible) for continuous queries.
We have written the code (it pulls live data from a Pulsar cluster with the Flink Pulsar connector). Now, a few conceptual questions about this:
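For reference, the source side of our job looks roughly like the following. This is a minimal sketch against the Flink 1.16 PyFlink Pulsar connector API; the service/admin URLs, topic, and subscription name are placeholders:

```python
from pyflink.common import SimpleStringSchema, WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.pulsar import (
    PulsarSource, PulsarDeserializationSchema, StartCursor, SubscriptionType)

env = StreamExecutionEnvironment.get_execution_environment()

# Minimal Pulsar source; subscription type Exclusive as described below.
source = PulsarSource.builder() \
    .set_service_url('pulsar://broker:6650') \
    .set_admin_url('http://broker:8080') \
    .set_topics('input-topic') \
    .set_start_cursor(StartCursor.latest()) \
    .set_deserialization_schema(
        PulsarDeserializationSchema.flink_schema(SimpleStringSchema())) \
    .set_subscription_name('flink-continuous-query') \
    .set_subscription_type(SubscriptionType.Exclusive) \
    .build()

stream = env.from_source(source, WatermarkStrategy.no_watermarks(), 'pulsar-source')
```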
How can we deploy this code to Flink TaskManagers (as a session deployment) so that we are able to scale? Assume that with every message that arrives we update the continuous query. If there are a lot of inbound messages, we would want multiple K8s workers, each running a pod with that Flink code. What is a good concept for doing so? (We would set the Pulsar subscription type to Exclusive, so that only one Flink instance consumes the latest message and then computes on it.)
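As far as we understand, on the job side scaling would mostly be a matter of setting the job's parallelism and giving the session cluster enough TaskManager slots (for an Exclusive subscription on a partitioned topic, useful parallelism is bounded by the number of partitions). A sketch of what we mean; the number is arbitrary:

```python
# How many parallel source readers / operator instances Flink runs.
# Effective scaling requires the session cluster to have at least this
# many free task slots across its TaskManager pods.
env.set_parallelism(4)
```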
And how do we deploy the Flink JobManager and let it know about all the TaskManager instances on the fly? How do you configure that?
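From what we have read, it works the other way around: each TaskManager is pointed at the JobManager via the `jobmanager.rpc.address` config key (on Kubernetes, typically a Service name) and registers itself on startup, so the JobManager needs no static list of TaskManagers. A sketch of the keys involved, assuming a standalone session setup; in a real cluster these would live in `flink-conf.yaml` (or an operator spec) rather than in job code, and `flink-jobmanager` is a hypothetical Service name:

```python
from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

# Demonstrates which knobs are involved; setting them in code like this
# only affects a local/mini-cluster run, not a remote session cluster.
conf = Configuration()
conf.set_string('jobmanager.rpc.address', 'flink-jobmanager')  # K8s Service name (assumption)
conf.set_string('taskmanager.numberOfTaskSlots', '2')          # slots per TaskManager pod
env = StreamExecutionEnvironment.get_execution_environment(conf)
```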
Assuming we understand how to deploy in a scalable way, what would be the preferred way of publishing results from this distributed approach? Pushing the continuous-query results to a sink topic in Pulsar? The goal is to send the results over SSE (server-sent events) to consumer clients.
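What we currently have in mind on the Flink side is a Pulsar sink, again sketched against the 1.16 PyFlink API with placeholder URLs and topic names:

```python
from pyflink.common import SimpleStringSchema
from pyflink.datastream.connectors import DeliveryGuarantee
from pyflink.datastream.connectors.pulsar import PulsarSink, PulsarSerializationSchema

sink = PulsarSink.builder() \
    .set_service_url('pulsar://broker:6650') \
    .set_admin_url('http://broker:8080') \
    .set_topics('results-topic') \
    .set_serialization_schema(
        PulsarSerializationSchema.flink_schema(SimpleStringSchema())) \
    .set_delivery_guarantee(DeliveryGuarantee.AT_LEAST_ONCE) \
    .build()

result_stream.sink_to(sink)  # result_stream: the stream of continuous-query results
```

And then a small bridge service that reads the results topic and forwards messages over SSE. A hypothetical sketch using the `pulsar-client` and `flask` packages; we would use a Pulsar reader per HTTP connection so that every SSE client sees every new result (rather than splitting them across clients):

```python
import pulsar
from flask import Flask, Response

app = Flask(__name__)

@app.route('/events')
def events():
    def stream():
        client = pulsar.Client('pulsar://broker:6650')
        # One reader per connection, starting at the latest message,
        # so each SSE client receives all results from now on.
        reader = client.create_reader('results-topic', pulsar.MessageId.latest)
        try:
            while True:
                msg = reader.read_next()
                # SSE wire format: "data: <payload>\n\n"
                yield f"data: {msg.data().decode('utf-8')}\n\n"
        finally:
            client.close()
    return Response(stream(), mimetype='text/event-stream')
```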
Any hints or documentation pointing to a full tutorial on a scalable PyFlink deployment with JobManager and TaskManagers would be appreciated.