Why does Kinesis Data Analytics for Flink drop state when scaled up or down?

Question

We are using Kinesis Data Analytics for Apache Flink to analyze various visitor events from multiple sources. In one of the operators, we use a MapState to calculate cumulative metrics. The Flink application was auto-scaled 4 times during one week of execution. The problem is that each time it auto-scaled, the operator state was completely dropped. There are no error messages in the logs except "RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested." from TaskManagerRunner.

The job uses the following configuration: checkpointing is enabled and uses DEFAULT mode; application auto-scaling is enabled; the application restore configuration is "Update without snapshot"; the state does not use TTL.

Is my understanding correct that, to keep state across auto-scaling, we should start the job with the RESTORE_FROM_LATEST_SNAPSHOT restore configuration? I thought this value was needed only for full application restarts. Is there anything else that could cause a similar problem?

@DavidAnderson, so will RESTORE_FROM_LATEST_SNAPSHOT help in this case? Does Kinesis take a snapshot before auto-scaling? — Yuriy Zanichkovskyy, Mar 12 '21 at 15:13
According to https://docs.aws.amazon.com/kinesisanalytics/latest/java/how-fault-snapshot.html "If SnapshotsEnabled is set to true in the ApplicationSnapshotConfiguration for the application, Kinesis Data Analytics automatically creates and uses snapshots when the application is updated, scaled, or stopped to provide exactly-once processing semantics." — David Anderson, Mar 12 '21 at 16:20
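
For reference, the two settings discussed in these comments (SnapshotsEnabled and the RESTORE_FROM_LATEST_SNAPSHOT restore type) could be put into a single update request roughly like the sketch below. It is not a confirmed fix for the state loss. It only builds the request as a plain dictionary and does not call AWS. The field names follow my reading of the Kinesis Data Analytics v2 UpdateApplication API and should be checked against the API reference. The application name "visitor-events-app", the version id 1, and the helper build_update_request are hypothetical.

```python
import json


def build_update_request(app_name: str, current_version_id: int) -> dict:
    """Build UpdateApplication arguments that enable snapshots and
    ask for a restore from the latest snapshot.

    Assumption: field names mirror the kinesisanalyticsv2
    UpdateApplication request shape; verify before use.
    """
    return {
        "ApplicationName": app_name,
        "CurrentApplicationVersionId": current_version_id,
        "ApplicationConfigurationUpdate": {
            # Corresponds to SnapshotsEnabled in ApplicationSnapshotConfiguration.
            "ApplicationSnapshotConfigurationUpdate": {
                "SnapshotsEnabledUpdate": True,
            },
        },
        "RunConfigurationUpdate": {
            # Restore type asked about in the question.
            "ApplicationRestoreConfiguration": {
                "ApplicationRestoreType": "RESTORE_FROM_LATEST_SNAPSHOT",
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical application name and version id, for illustration only.
    request = build_update_request("visitor-events-app", 1)
    print(json.dumps(request, indent=2))
```

If the shape is right, the dictionary should be usable as keyword arguments to the update_application call of a boto3 "kinesisanalyticsv2" client. Keeping the request construction separate from the call makes it easy to inspect the settings before applying them.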
