I have set up Apache Flink on an EKS Kubernetes cluster with one Job Manager and two Task Managers.

Flink configuration ConfigMap (configmap.yaml):

apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config-name
  namespace: <namespace>
  labels:
    app: flink
data:
  flink-conf.yaml: |+
    kubernetes.cluster-id: <cluster-id>
    high-availability: kubernetes
    high-availability.storageDir: s3a://<bucketname>/recovery
    jobmanager.rpc.address: jobmanager
    jobmanager.memory.process.size: 2000m
    taskmanager.memory.process.size: 2228m
    blob.server.port: 6124
    parallelism.default: 2  
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: 100000
    heartbeat.timeout: 300000
    kubernetes.namespace: <namespace>
    state.backend: filesystem
    state.checkpoint-storage: filesystem
    state.checkpoints.dir: s3a://<bucketname>/checkpoints/
    state.savepoints.dir: s3a://<bucketname>/savepoints/
    state.backend.incremental: true
    state.backend.fs.checkpointdir: s3a://<bucketname>/checkpoints
    classloader.resolve-order: parent-first


  log4j-console.properties: |+
    # This affects logging for both user code and Flink
    rootLogger.level = INFO
    rootLogger.appenderRef.console.ref = ConsoleAppender
    rootLogger.appenderRef.rolling.ref = RollingFileAppender

    # Uncomment this if you want to _only_ change Flink's logging
    #logger.flink.name = org.apache.flink
    #logger.flink.level = INFO

    # The following lines keep the log level of common libraries/connectors on
    # log level INFO. The root logger does not override this. You have to manually
    # change the log levels here.
    logger.akka.name = akka
    logger.akka.level = INFO
    logger.kafka.name= org.apache.kafka
    logger.kafka.level = INFO
    logger.hadoop.name = org.apache.hadoop
    logger.hadoop.level = INFO
    logger.zookeeper.name = org.apache.zookeeper
    logger.zookeeper.level = INFO

    # Log all infos to the console
    appender.console.name = ConsoleAppender
    appender.console.type = CONSOLE
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n

    # Log all infos in the given rolling file
    appender.rolling.name = RollingFileAppender
    appender.rolling.type = RollingFile
    appender.rolling.append = false
    appender.rolling.fileName = ${sys:log.file}
    appender.rolling.filePattern = ${sys:log.file}.%i
    appender.rolling.layout.type = PatternLayout
    appender.rolling.layout.pattern = %d{yyyy-MM-dd HH:mm:ss,SSS} %-5p %-60c %x - %m%n
    appender.rolling.policies.type = Policies
    appender.rolling.policies.size.type = SizeBasedTriggeringPolicy
    appender.rolling.policies.size.size=100MB
    appender.rolling.strategy.type = DefaultRolloverStrategy
    appender.rolling.strategy.max = 10

    # Suppress the irrelevant (wrong) warnings from the Netty channel handler
    logger.netty.name = org.jboss.netty.channel.DefaultChannelPipeline
    logger.netty.level = OFF


Job Manager Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: <deployment-name>
  namespace: <namespace>
spec:
  selector:
    matchLabels:
      app: flink
  replicas: 1
  template:
    metadata:
      labels:
        app: flink
        component: master
    spec:
      containers:
      - name: master
        image: {{ .Values.service.image.repository }}/{{ .Release.Name  }}{{ if .Values.service.image.tag }}:{{ end }}{{.Values.service.image.tag}}
        imagePullPolicy: {{ .Values.service.image.pull_policy }}
        resources:
          limits:
            cpu: "4000m"
            memory: "5Gi"
          requests:
            cpu: "3000m"
            memory: "4Gi"  
        workingDir: /opt/flink
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 6124
          name: blob
        - containerPort: 6125
          name: query
        - containerPort: 6126
          name: ui
        readinessProbe:
          tcpSocket:
            port: 5000
          initialDelaySeconds: 300
          periodSeconds: 30
        env:
        - name: JOB_MANAGER_RPC_ADDRESS
          value: jobmanager
        volumeMounts:
        - name: flink-config-volume
          mountPath: /opt/flink/conf

      serviceAccountName: flink-serviceaccount
      volumes:
      - name: flink-config-volume
        configMap:
          name: flink-config-name
          items:
          - key: flink-conf.yaml
            path: flink-conf.yaml
          - key: log4j-console.properties
            path: log4j-console.properties

Additionally, I have some other YAML files:

rest-service, jobmanager-session.yaml, jobmanager-service, role-binding, serviceaccount
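For context, the jobmanager-service is roughly along these lines — a minimal sketch where the name, namespace, and port names are assumptions based on the config above, not the exact manifest:

apiVersion: v1
kind: Service
metadata:
  name: jobmanager
  namespace: <namespace>
spec:
  selector:
    app: flink
    component: master
  ports:
  - name: rpc
    port: 6123
  - name: blob
    port: 6124
  - name: query
    port: 6125
  - name: ui
    port: 6126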

The JAR file for my Flink job is built in a separate repository and published to JFrog Artifactory. A Jenkins pipeline downloads the JAR from Artifactory and pushes it into the Kubernetes pod.
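The relevant pipeline step boils down to something like this (a sketch; the Artifactory URL, credential variables, and pod name are placeholders, not my real values):

# Download the job JAR from Artifactory
curl -sf -u "$ARTIFACTORY_USER:$ARTIFACTORY_TOKEN" \
  -o myflink.jar "https://<artifactory-host>/artifactory/<repo>/myflink.jar"

# Copy it into the running pod
kubectl cp myflink.jar <namespace>/<podname>:/opt/flink/lib/myflink.jar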

I've deployed Flink using Helm charts, and I can access the Flink UI through port forwarding:

kubectl port-forward <podname> -n <namespace> 8081:8081

I can successfully submit the Flink job JAR using the Flink UI or by using the flink run command within the pod:

flink run -c myClass /opt/flink/lib/myflink.jar
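For reference, the same submission should also work through the JobManager's REST API (a sketch against the port-forwarded endpoint; <jar-id> is a placeholder for the id returned by the upload call):

# Upload the JAR (the multipart field must be named "jarfile")
curl -X POST -F "jarfile=@/opt/flink/lib/myflink.jar" http://localhost:8081/jars/upload

# Run the uploaded JAR by id, specifying the entry class
curl -X POST "http://localhost:8081/jars/<jar-id>/run?entry-class=myClass"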

However, I want to automate the process of submitting the JAR file. To achieve this, I added an entry point to my Dockerfile:

Dockerfile

ENTRYPOINT ["sh", "-c", "./docker-entrypoint.sh"]

Complete Dockerfile:

# Use the flink:1.16.1 base image
FROM flink:1.16.1 AS base

# Make directories for the plugins and copy the required JAR files
RUN mkdir -p /opt/flink/plugins/s3-fs-hadoop \
    && cp /opt/flink/opt/flink-s3-fs-hadoop-1.16.1.jar /opt/flink/plugins/s3-fs-hadoop/ \
    && mkdir -p /opt/flink/plugins/s3-fs-presto \
    && cp /opt/flink/opt/flink-s3-fs-presto-1.16.1.jar /opt/flink/plugins/s3-fs-presto/

COPY myflink.jar /opt/flink/lib/myflink.jar

RUN chown flink:flink /opt/flink/lib/myflink.jar

COPY docker-entrypoint.sh docker-entrypoint.sh
RUN chmod +x docker-entrypoint.sh
RUN chown flink:flink docker-entrypoint.sh

EXPOSE 8081
EXPOSE 6123
EXPOSE 6122
EXPOSE 6124
EXPOSE 6125
EXPOSE 6126

ENTRYPOINT ["sh", "-c", "./docker-entrypoint.sh"]

docker-entrypoint.sh:

#!/bin/sh

# Run the Flink job in the background
flink run -c myClass /opt/flink/lib/myflink.jar &

# Prevent the script from immediately exiting
trap "echo 'Script is running...'; wait" SIGINT SIGTERM

# Wait for the Flink job to complete
wait
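The intent is roughly: start the JobManager, wait until it is up, then submit the job while keeping the container alive. A sketch of that intent (assuming curl is available in the image and the base image's standard /docker-entrypoint.sh still starts the JobManager — neither is verified in my setup):

#!/bin/sh

# Start the standard Flink entrypoint (JobManager) in the background
/docker-entrypoint.sh jobmanager &

# Block until the JobManager REST endpoint responds
until curl -sf http://localhost:8081/overview > /dev/null; do
  echo "Waiting for the JobManager REST endpoint..."
  sleep 5
done

# Submit the job in detached mode (-d) so this script regains control
flink run -d -c myClass /opt/flink/lib/myflink.jar

# Keep the JobManager process in the foreground of the container
wait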

After adding this entry point, I encountered an error during auto-submission:

ERROR StatusLogger Reconfiguration failed: No configuration found for '17…' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '31….' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '6c478…' at 'null' in 'null'
ERROR StatusLogger Reconfiguration failed: No configuration found for '249….' at 'null' in 'null'
08:49:12.383 [main] ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Failed to execute sql
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[myflink.jar:?]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[myflink.jar:?]
------------------------------------------------------------
 The program finished with the following exception:
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:843) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:240) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1087) ~[myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1165) ~[myflink.jar:?]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [myflink.jar:?]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1165) [myflink.jar:?]
Caused by: org.apache.flink.table.api.TableException: Failed to execute sql
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:867) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:827) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:918) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:730) ~[myflink.jar:?]
    at myClass.main(myClass.java:157) ~[myflink.jar:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[?:?]
    at java.lang.reflect.Method.invoke(Unknown Source) ~[?:?]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) ~[myflink.jar:?]
    ... 8 more
Caused by: org.apache.flink.util.FlinkException: Failed to execute job 'insert-into_default_catalog.default_database.Correlation'.
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203) ~[flink-dist-1.16.1.jar:1.16.1]
    at org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:206) ~[myflink.jar:?]
    at org.apache.flink.table.planner.delegation.DefaultExecutor.executeAsync(DefaultExecutor.java:95) ~[?:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:850) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:827) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:918) ~[myflink.jar:?]
    at org.apache.flink.table.api.internal.TableEnvironmentImpl.executeSql(TableEnvironmentImpl.java:730) ~[myflink.jar:?]
    at myClass.main(myClass.java:157) ~[myflink.jar:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$11(RestClusterClient.java:448) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.uniExceptionally(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[myflink.jar:?]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$6(FutureUtils.java:271) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:98) ~[myflink.jar:?]
    at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]
    at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1125) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[myflink.jar:?]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$orTimeout$12(FutureUtils.java:489) ~[myflink.jar:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) ~[?:?]
    at java.util.concurrent.FutureTask.run(Unknown Source) ~[?:?]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:843) ~[myflink.jar:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
    at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1125) ~[myflink.jar:?]

When I checked the pod status using kubectl describe pod, I observed the following:

kubectl describe pod <podname>  -n <namespace>
    Ports:          6123/TCP, 6124/TCP, 6125/TCP, 6126/TCP
    Host Ports:     0/TCP, 0/TCP, 0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      
      Finished:     
    Ready:          False
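For completeness, the logs of the exited container can be pulled with the --previous flag, which should show why it finished with exit code 0:

kubectl logs <jobmanager-podname> -n <namespace> --previous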

Additionally, my attempt at port forwarding using kubectl port-forward also failed:

kubectl port-forward <jobmanager-podname> -n <namespace> 8081:8081
Forwarding from 127.0.0.1:8081 -> 8081
Forwarding from [::1]:8081 -> 8081
Handling connection for 8081
E0813 23:02:08.669886   32077 portforward.go:407] an error occurred forwarding...

Could you please help me troubleshoot and understand what might be causing these errors? Thank you.

Everything works fine until I add the ENTRYPOINT to my Dockerfile to submit the job. With the ENTRYPOINT in place, the pod terminates, the JAR submission fails, and port-forward stops working. Please let me know what is missing here.
