My Current Flink Application
- It is based on Flink Stateful Functions 3.1.1. It reads messages from Kafka, processes them, and then sinks the results to a Kafka egress.
- The application has been deployed on K8s following the guide (Stateful Functions Deployment) and is running well.
- On top of the standard deployment, I have turned on Kubernetes HA.
My Objectives
I want to automatically scale the stateful functions up and down. I also want to know how to create more standby job managers.
My Observations about the HA
I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
I see no standby job managers in K8s.
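For context, here is a minimal sketch of the HA-related keys that Flink's Kubernetes HA services normally expect in flink-conf.yaml. The cluster id and the storage path are hypothetical placeholders, not values from my actual deployment:

```yaml
# Sketch only: cluster id and storage location are assumed placeholders.
kubernetes.cluster-id: shadow-fn-cluster
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Durable storage for job manager metadata; any supported file system should do.
high-availability.storageDir: s3://my-bucket/flink/ha
```

The job manager pods also need a service account that is allowed to create and edit ConfigMaps, since leader election is recorded there.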
Then I adjusted the replicas of the Deployment directly:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7
Standby job managers showed up. I checked the pod logs, and the leader election completed successfully. However, when I access the UI in a web browser, it says:
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
What's wrong with my approach?
My Questions about the scaling
Reactive Mode is exactly what I need. I tried it but failed; the job manager reports this error:
Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).
It seems that stateful function auto scaling shouldn't be done in this way. What's the correct way to do the auto scaling, then?
Potential Approach (Probably incorrect)
After some research, my current direction is:
- The Job Manager has nothing to do with auto scaling. It is related to HA on K8s. I just need to make sure the Job Manager has correct failover behavior.
- My stateful functions are Flink remote services, i.e., they are regular k8s services. They can be deployed in the form of a Knative service to achieve auto scaling. The replicas of these services go up only when HTTP requests come from Flink's workers.
- The most important part is Flink's worker (or Task Manager), and I have no idea how to auto scale it yet. Maybe I should use Knative to deploy the Flink worker? If that doesn't work with Knative, maybe I should completely change the Flink runtime deployment, e.g., try the original reactive demo. But I'm afraid Stateful Functions are not intended to work like that.
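To make the second point concrete, this is roughly what I have in mind for a remote function deployed as a Knative Service. It is only a sketch: the service name, image, port, and scaling bounds are hypothetical, and older Knative releases spell the scale annotations as minScale/maxScale:

```yaml
# Sketch only: name, image, port, and limits are assumed placeholders.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: shadow-fn-remote
spec:
  template:
    metadata:
      annotations:
        # Allow scale-to-zero when no requests arrive from the Flink workers.
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "10"
        # Target number of concurrent requests per replica.
        autoscaling.knative.dev/target: "50"
    spec:
      containers:
        - image: example.registry/shadow-fn:latest
          ports:
            - containerPort: 8000
```

With this, only the function containers would scale with the request load; the Flink workers that send the requests would still have a fixed replica count.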
Finally
I have read the Flink documentation and GitHub samples over and over but cannot find any more information on how to do this. Any hints, instructions, or guidelines are appreciated!