My Current Flink Application

  • It is based on Flink Stateful Functions 3.1.1. It reads messages from Kafka, processes them, and writes the results to a Kafka egress.
  • The application is deployed on K8s following the Stateful Functions Deployment guide and is running well.
  • On top of the standard deployment, I have turned on Kubernetes HA.

My Objectives

I want to automatically scale the stateful functions up and down. I also want to know how to create more standby job managers.

My Observations about the HA

I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory

I see no standby job managers in K8s.

Then I adjusted the replicas of the Deployment directly:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7

With this change, standby job managers show up. According to the pod logs, the leader election completes successfully. However, when I open the UI in a web browser, it says:

{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}

What's wrong with my approach?

My Questions about the scaling

Reactive Mode is exactly what I need. I tried it but failed; the job manager reports this error:

Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).

It seems that auto scaling for stateful functions shouldn't be done this way. What's the correct way to do it, then?

Potential Approach (Probably Incorrect)

After some research, my current direction is:

  1. The Job Manager has nothing to do with auto scaling. It is related to HA on K8s. I just need to make sure the Job Manager has the correct failover behavior.
  2. My stateful functions are Flink remote services, i.e., they are regular K8s services. They can be deployed as KNative services to achieve auto scaling, so that the number of service replicas goes up only when HTTP requests come in from Flink's workers.
  3. The most important part is Flink's worker (the Task Manager), and I have no idea yet how to auto scale it. Maybe I should use KNative to deploy the Flink worker? If that doesn't work with KNative, maybe I should change the Flink runtime deployment entirely, e.g., by trying the original reactive demo. But I'm afraid Stateful Functions are not intended to work like that.
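
As an illustration of point 2, here is a minimal sketch that uses a plain Kubernetes HorizontalPodAutoscaler instead of KNative. It assumes the remote functions run in a Deployment called shadow-fn-remote (a hypothetical name that is not part of my current manifests), that the cluster serves the autoscaling/v2 API, and that a metrics server is installed. The replica bounds and the CPU target are placeholders.

```yaml
# Sketch only: "shadow-fn-remote" and "shadow-fn-remote-hpa" are hypothetical names.
# The target Deployment is assumed to serve the remote functions over HTTP.
apiVersion: autoscaling/v2        # assumes a cluster that serves autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: shadow-fn-remote-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: shadow-fn-remote
  minReplicas: 1                  # placeholder lower bound
  maxReplicas: 5                  # placeholder upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # placeholder target; needs CPU requests on the pods
```

This would only scale the function endpoints. It does not change the parallelism of the Flink workers, which is the open question in point 3.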

Finally

I have read the Flink documentation and GitHub samples over and over but cannot find any more information on how to do this. Any hints, instructions, or guidelines are appreciated!

Yun Xing

1 Answer

Since Reactive Mode is a new, experimental feature, not all features supported by the default scheduler are also available with Reactive Mode (and its adaptive scheduler). The Flink community is working on addressing these limitations.

https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
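
For reference, the linked page describes enabling Reactive Mode on a plain Flink standalone application cluster through the scheduler-mode option. A minimal flink-conf.yaml sketch is below. The checkpoint interval is an illustrative value, and whether the StateFun 3.1.1 images can be started this way is not confirmed here.

```yaml
# Sketch for a plain Flink standalone application cluster
# (the kind started with bin/standalone-job.sh); not verified with StateFun.
scheduler-mode: reactive
# Rescaling restarts the job from the latest checkpoint, so periodic
# checkpointing is usually enabled. The interval is an illustrative value.
execution.checkpointing.interval: 10s
```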

ChangLi
  • I see reactive mode doesn't work for stateful functions. Is there any method to scale up stateful functions? – Yun Xing Feb 15 '22 at 23:48