
As I understand from the documentation, it should be possible to resume a Flink job from a checkpoint just as from a savepoint, by specifying the checkpoint path in the "Savepoint path" input box of the web UI (e.g. /path/to/my/checkpoint/chk-1, where "chk-1" contains the "_metadata" file).
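
For completeness, what I'm doing in the web UI should be equivalent to the following CLI invocation (the jar name is just a placeholder here):

    # resume from the retained checkpoint, same as pasting the path into the
    # "Savepoint path" box of the web UI (path and jar name are placeholders)
    bin/flink run -s /path/to/my/checkpoint/chk-1 my-job.jar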

I've been trying this out, but I get the following exception:

2020-09-04 10:35:11
java.lang.Exception: Exception while creating StreamOperatorStateContext.
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
    at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1006)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:449)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:461)
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for LegacyKeyedProcessOperator_632e4c67d1f4899514828b9c5059a9bb_(1/1) from any of the 1 provided restore options.
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131)
    ... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:336)
    at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:548)
    at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
    at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
    ... 11 more
Caused by: java.nio.file.NoSuchFileException: /tmp/flink-io-ee95b361-a616-4531-b402-7a21189e8ce5/job_c71cd62de3a34d90924748924e78b3f8_op_LegacyKeyedProcessOperator_632e4c67d1f4899514828b9c5059a9bb__1_1__uuid_ae7dd096-f52f-4eab-a2a3-acbfe2bc4573/336ed2fe-30a4-44b5-a419-9e485cd456a4/CURRENT
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
    at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
    at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
    at java.nio.file.Files.copy(Files.java:1274)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:483)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:218)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:194)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:168)
    at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:154)
    at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:279)
    ... 15 more 

Does anyone have an idea of what's causing this?

UPDATE: After some tests, I noticed that this behavior depends on the state backend used. In this case I'm using RocksDBStateBackend with incremental checkpointing enabled; when I switched to FsStateBackend, the error disappeared. Come to think of it, that would make sense: from what I understand, checkpoints taken with incremental checkpointing enabled only record the changes relative to the previous completed checkpoint instead of the full job state, so it would not be possible to restore the job from this kind of checkpoint.
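
For reference, this is roughly how checkpointing is set up in the job. It is a minimal sketch of the configuration described above rather than the actual code; the class name, checkpoint URI, and interval are placeholders (Flink 1.10 API):

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetupSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend; the second argument enables incremental checkpointing
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

            // take a checkpoint every 60s and retain completed checkpoints on cancellation,
            // so their path can later be passed as the "Savepoint path"
            env.enableCheckpointing(60_000L);
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // ... job topology (sources, keyed process function, sinks) goes here ...

            env.execute("checkpoint-resume-test");
        }
    }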

If that's correct, I think it would be useful to add a notice to the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint).

  • Has the job changed in any way? – David Anderson Sep 04 '20 at 10:12
  • No, the job is the exact same jar file. I think I found the problem anyway, see the updated question. – Andrea Gallina Sep 04 '20 at 10:22
  • It should be possible to start from an incremental checkpoint if you don't change anything in your job. In fact, it's the same as if the Flink application crashes and recovers. Did you change anything besides the job? Config, Flink version? Or any manual cleanup of the checkpoint dir? – Arvid Heise Sep 04 '20 at 12:56
  • No changes are made from one run to another: the job starts the first time with a clean state; after some checkpoints have completed, I cancel the job and try to re-run it from the same .jar by specifying the checkpoint path in the "Savepoint path" input, and I get this error. I'm running version 1.10. Furthermore, I just ran another test using `RocksDbStateBackend` with incremental checkpointing disabled and it works fine, but when I enable it I get that exception. – Andrea Gallina Sep 04 '20 at 13:08
  • I also found another S.O. question reporting a similar problem that went unanswered: https://stackoverflow.com/questions/59713932/flink-failed-to-recover-from-a-checkpoint As you can see from the stack trace there, the problem arises from the `RocksDBIncrementalRestoreOperation` class, which I assume means the author was using incremental checkpointing too. – Andrea Gallina Sep 04 '20 at 13:10
