I have set up Apache Flink on an EKS Kubernetes cluster with one Job Manager and two Task Managers.
Flink configuration ConfigMap (configmap.yaml):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: <configmap-name>
  namespace: <namespace>
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    kubernetes.cluster-id: <cluster-id>
    high-availability:
    high-availability.storageDir: s3a://<bucketname>/recovery
    jobmanager.rpc.address: jobmanager
    jobmanager.memory.process.size: 2000m
    taskmanager.memory.process.size: 2228m
    blob.server.port: 6124
    parallelism.default: 2
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 100000
    heartbeat.timeout: 300000
    kubernetes.namespace: <namespace>
    state.backend: filesystem
    state.checkpoint-storage: filesystem
    state.checkpoints.dir: s3a://<bucketname>/checkpoints/
    state.savepoints.dir: s3a://<bucketname>/savepoints/
    state.backend.incremental: true
    state.backend.fs.checkpointdir: s3a://<bucketname>/checkpoints
    classloader.resolve-order: parent-first
  log4j-console.properties: |+
    # This affects logging for both user code and Flink
    rootLogger.level = INFO
    rootLogger.appenderRef.console.ref = ConsoleAppender
    rootLogger.appenderRef.rolling.ref = RollingFileAppender

    # Uncomment this if you want to _only_ change Flink's logging
    #logger.flink.name = org.apache.flink
    #logger.flink.level = INFO

    # The following lines keep the log level of common libraries/connectors on
    # log level INFO. The root logger does not override this. You have to manually
    # change the log levels here.
    logger.akka.name = akka
    logger.akka.level = INFO
    logger.kafka.name = org.apache.kafka
    logger.kafka.level = INFO
    logger.hadoop.name = org.apache.hadoop
    logger.hadoop.level = INFO
    logger.zookeeper.name = org.apache.zookeeper
    logger.zookeeper.level = INFO

    # Log all infos to the console
    appender.console.name = ConsoleAppender
    appender.console.type = CONSOLE
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

    # Log all infos in the given rolling file
    appender.rolling.name = RollingFileAppender
    appender.rolling.type = RollingFile
    appender.rolling.append = false
    appender.rolling.fileName = ${sys:log.file}
    appender.rolling.filePattern = ${sys:log.file}.%i
    appender.rolling.layout.type = PatternLayout
    appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
    appender.rolling.policies.type = Policies
    appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
    appender.rolling.policies.size.size = 100MB
    appender.rolling.strategy.type = DefaultRolloverStrategy
    appender.rolling.strategy.max = 10

    # Suppress the irrelevant (wrong) warnings from the Netty channel handler
    logger.netty.name = org.jboss.netty.channel.DefaultChannelPipeline
    logger.netty.level = OFF
```
Job Manager Deployment YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <deployment-name>
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: flink
  replicas: 1
  template:
    metadata:
      labels:
        app: flink
        component: master
    spec:
      containers:
      - name: master
        image: {{ .Values.service.image.repository }}/{{ .Release.Name }}{{ if .Values.service.image.tag }}:{{ end }}{{ .Values.service.image.tag }}
        imagePullPolicy: {{ .Values.service.image.pull_policy }}
        resources:
          limits:
            cpu: "4000m"
            memory: "5Gi"
          requests:
            cpu: "3000m"
            memory: "4Gi"
        workingDir: /opt/flink
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob
        - containerPort: 6125
          name: query
        - containerPort: 6126
          name: ui
        readinessProbe:
          tcpSocket:
            port: 5000
          initialDelaySeconds: 300
          periodSeconds: 30
        env:
        - name: JOB_MANAGER_RPC_ADDRESS
          value: jobmanager
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf
      serviceAccountName: flink-serviceaccount
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config-name
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties
```
Additionally, I have other YAML files: rest-service, jobmanager-session, jobmanager-service, role-binding, and serviceaccount.

The JAR file for my Flink job is built in a separate repository and is published to a JFrog Artifactory repository. I have a Jenkins pipeline that downloads the JAR from Artifactory and copies it into the Kubernetes pod.
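The Jenkins stage boils down to something like the following sketch (the Artifactory URL, repository path, credential variables, and pod name are placeholders, not my actual values):

```sh
# Download the job JAR from Artifactory (hypothetical URL and credentials)
curl -fSL -u "$ARTIFACTORY_USER:$ARTIFACTORY_TOKEN" \
  -o myflink.jar \
  "https://<artifactory-host>/artifactory/<repo>/<path>/myflink.jar"

# Copy it into the running JobManager pod
kubectl cp myflink.jar <namespace>/<jobmanager-podname>:/opt/flink/lib/myflink.jar
```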
I've deployed Flink using Helm charts, and I can access the Flink UI through port forwarding:
```
kubectl port-forward <podname> -n <namespace> 8081:8081
```
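For reference, the Helm deployment itself is installed with something along these lines (the release name, chart path, and values file are placeholders):

```sh
# Install or upgrade the Flink Helm release (hypothetical release name and chart path)
helm upgrade --install <release-name> ./helm/flink \
  -n <namespace> \
  -f values.yaml
```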
I can successfully submit the Flink job JAR using the Flink UI or by using the flink run command within the pod:
```
flink run -c myClass /opt/flink/lib/myflink.jar
```
However, I want to automate the process of submitting the JAR file. To achieve this, I added an entry point to my Dockerfile:
```dockerfile
ENTRYPOINT ["sh", "-c", "./docker-entrypoint.sh"]
```
Complete Dockerfile:

```dockerfile
# Use the flink:1.16 base image
FROM flink:1.16.1 AS base

# Make directories for the plugins and copy the required JAR files
RUN mkdir -p /opt/flink/plugins/s3-fs-hadoop \
    && cp /opt/flink/opt/flink-s3-fs-hadoop-1.16.1.jar /opt/flink/plugins/s3-fs-hadoop/ \
    && mkdir -p /opt/flink/plugins/s3-fs-presto \
    && cp /opt/flink/opt/flink-s3-fs-presto-1.16.1.jar /opt/flink/plugins/s3-fs-presto/

# Copy the job JAR into the image
COPY myflink.jar /opt/flink/lib/myflink.jar
RUN chown flink:flink /opt/flink/lib/myflink.jar

COPY docker-entrypoint.sh docker-entrypoint.sh
RUN chmod +x docker-entrypoint.sh
RUN chown flink:flink docker-entrypoint.sh

EXPOSE 8081
EXPOSE 6123
EXPOSE 6122
EXPOSE 6124
EXPOSE 6125
EXPOSE 6126

ENTRYPOINT ["sh", "-c", "./docker-entrypoint.sh"]
```
docker-entrypoint.sh:

```sh
#!/bin/sh
# Run the Flink job in the background and suppress its exit code
flink run -c myClass /opt/flink/lib/myflink.jar &

# Prevent the script from exiting immediately
trap "echo 'Script is running...'; wait" SIGINT SIGTERM

# Wait for the Flink job to complete
wait
```
After adding this entry point, I encountered an error during auto-submission:
```
ERROR StatusLogger Reconfiguration failed: No configuration found for '17…' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '31…' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '6c478…' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '249…' at 'null' in 'null'
08:49:12.383 [main] ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute sql
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[myflink.jar:?]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[myflink.jar:?]

------------------------------------------------------------
The program finished with the following exception:

    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:843) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:240) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1087) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1165) ~[myflink.jar:?]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1165) [myflink.jar:?]
Caused by: org.apache.flink.table.api.TableException: Failed to execute sql
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:867) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:827) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:918) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:730) ~[myflink.jar:?]
    at myClass.main(myClass.java:157) ~[myflink.jar:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) ~[myflink.jar:?]
    ... 8 more
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'insert-into_default_catalog.default_database.Correlation'.
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203) ~[flink-dist-1.16.1.jar:1.16.1]
    at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:206) ~[myflink.jar:?]
    at org.apache.flink.table.planner.delegation.DefaultExecutor.executeAsync(DefaultExecutor.java:95) ~[?:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:850) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:827) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:918) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:730) ~[myflink.jar:?]
    at myClass.main(myClass.java:157) ~[myflink.jar:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$11(RestClusterClient.java:448) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[myflink.jar:?]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:271) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1125) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$orTimeout$12(FutureUtils.java:489) ~[myflink.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:843) ~[myflink.jar:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
    at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1125) ~[myflink.jar:?]
```
When I checked the pod status with kubectl describe, I observed the following:

```
kubectl describe pod <podname> -n <namespace>

    Ports:          6123/TCP, 6124/TCP, 6125/TCP, 6126/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:
      Finished:
    Ready:          False
```
Additionally, my attempt at port forwarding using kubectl port-forward also failed:
```
kubectl port-forward <jobmanager-podname> -n <namespace> 8081:8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
E0813 23:02:08.669886 32077 portforward.go:407] an error occurred forwarding...
```

Could you please help me troubleshoot and understand what might be causing these errors? Thank you.
Everything works fine until I add the ENTRYPOINT to my Dockerfile to submit the job. With that ENTRYPOINT in place, the pod terminates, the JAR submission fails, and port-forwarding no longer works. Please let me know what is missing here.