I'm trying to checkpoint/savepoint my Flink state, running on EMR, to an S3 bucket on AWS. Please note:

  • The instances (master and core nodes) have the IAM role properly set up to access the S3 bucket and all the directories/files inside it (the AmazonS3FullAccess policy is attached to the role and nothing overrides it).
  • I can run aws s3 cp xxx s3://flink-bc/checkpoints successfully from both the master and the slave nodes to copy files to the bucket.
  • Using HDFS for savepoints/checkpoints works.
  • If I set the checkpoints to use HDFS and then try to savepoint to S3, the savepoint operation fails with this error:
org.apache.flink.util.FlinkException: Triggering a savepoint for the job 16c162c47f225cddad974056c9494b6d failed.
    at org.apache.flink.client.cli.CliFrontend.triggerSavepoint(CliFrontend.java:723)
    at org.apache.flink.client.cli.CliFrontend.lambda$savepoint$9(CliFrontend.java:701)
    at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985)
    at org.apache.flink.client.cli.CliFrontend.savepoint(CliFrontend.java:698)
    at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1065)
    at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.........
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.checkpoint.CheckpointTriggerException: Failed to trigger savepoint. Decline reason: An Exception occurred while triggering the checkpoint.
    at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)

and the JobManager logs show:

java.io.IOException: Cannot instantiate file system for URI: s3://flink-bc/savepoints
    at org.apache.flink.runtime.fs.hdfs.HadoopFsFactory.create(HadoopFsFactory.java:187)
    at org.apache.flink.core.fs.FileSystem.getUnguardedFileSystem(FileSystem.java:399)
    at org.apache.flink.core.fs.FileSystem.get(FileSystem.java:318)
    at org.apache.flink.core.fs.Path.getFileSystem(Path.java:298)
    at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.initializeLocationForSavepoint(AbstractFsCheckpointStorage.java:147)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerCheckpoint(CheckpointCoordinator.java:511)
    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.triggerSavepoint(CheckpointCoordinator.java:370)
    at org.apache.flink.runtime.jobmaster.JobMaster.triggerSavepoint(JobMaster.java:951)
Denorm

1 Answer

I faced a similar issue while using the latest Flink version (1.10.0) with S3 to store checkpoints in an S3 bucket.

I have posted a detailed working answer here.
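
For context, here is a minimal sketch of the flink-conf.yaml entries such a setup usually involves. It is not necessarily the configuration from the linked answer. The bucket and directory names are taken from the question; state.checkpoints.dir and state.savepoints.dir are standard Flink options. The sketch assumes that an S3 filesystem implementation (for example the flink-s3-fs-hadoop or flink-s3-fs-presto jar shipped in Flink's opt directory) has been made available to Flink, which the question does not confirm.

```yaml
# flink-conf.yaml (sketch; bucket and paths copied from the question)
# Assumption: an S3 filesystem implementation is available to Flink,
# e.g. the flink-s3-fs-hadoop jar copied from opt/ into its own
# subdirectory under plugins/ (the layout used by Flink 1.10).

# Default directory for checkpoint data and metadata
state.checkpoints.dir: s3://flink-bc/checkpoints

# Default target directory for savepoints
state.savepoints.dir: s3://flink-bc/savepoints
```

With state.savepoints.dir set, flink savepoint <jobId> can be run without an explicit target directory.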

Keshav Lodhi