I have a Kinesis consumer whose job is to keep track of "currently active users" in a system. Every minute, each user sends a heartbeat to a Kinesis stream, and the consumer keeps a list of all unique user GUIDs it has seen along with the last time it received a heartbeat from each GUID. If a heartbeat hasn't been seen from a GUID in 2 minutes, we assume the user is no longer active and evict them from the list of "currently active users". Pretty straightforward.
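To make the bookkeeping concrete, here is a minimal sketch of that active-user list. The names (`ActiveUsers`, `heartbeat`, `evictStale`, `active`, `ACTIVE_WINDOW_MS`) are made up for illustration and are not taken from the actual application; timestamps are passed in explicitly so the logic can be exercised without a clock.

```typescript
// Hypothetical sketch of the "currently active users" bookkeeping.
// All names here are illustrative, not from the real consumer.
const ACTIVE_WINDOW_MS = 2 * 60 * 1000; // evict after 2 minutes of silence

class ActiveUsers {
    // user GUID -> time (ms since epoch) of the last heartbeat seen
    private lastSeen = new Map<string, number>();

    // Record a heartbeat for a user at the given time
    heartbeat(guid: string, nowMs: number): void {
        this.lastSeen.set(guid, nowMs);
    }

    // Drop every user whose last heartbeat is older than the window
    evictStale(nowMs: number): void {
        this.lastSeen.forEach((seenAt, guid) => {
            if (nowMs - seenAt > ACTIVE_WINDOW_MS) {
                this.lastSeen.delete(guid);
            }
        });
    }

    // GUIDs currently considered active
    active(): string[] {
        return Array.from(this.lastSeen.keys());
    }
}
```

Under these assumptions, a heartbeat that is replayed hours late would still register the user as active at processing time unless the record's own timestamp is used, which is why starting from old data is unhelpful for this consumer.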
Because this system is only concerned with currently active users, we don't need to back-process old messages. If we were to turn this consumer off for 2 hours and then turn it back on, we would want it to start processing at the "LATEST" message instead of picking up where it left off.
Finally, this has been implemented as a NodeJS application per the Amazon Kinesis Client NodeJS example using the MultiLangDaemon to communicate with the Kinesis Client Library.
Under normal use I've found that the best way to always resume from "LATEST" is to never use the checkpointing feature of KCL. For instance, at the bottom of my processRecords method I have the following:
// We don't checkpoint with kinrta, because if we crash for some reason we
// want to immediately catch back up to live instead of wasting time
// processing expired heartbeats
// processRecordsInput.checkpointer.checkpoint(sequenceNumber,
//     function(err, checkpointedSequenceNumber) {
        completeCallback();
//     }
// );
This way, whenever I kill the consumer and restart it, it looks at the *.properties file, sees that "initialPositionInStream" is "LATEST", and begins processing from there.
HOWEVER
When I re-shard my stream (split shards or merge shards) I run into an issue. After a re-shard, the checkpoint on each new shard is set to "TRIM_HORIZON" rather than "LATEST". Since I never re-checkpoint, this means that if my consumer is turned off and restarted, I end up having to process 24 hours of data.
I can manually fix this by editing the Dynamo table used by KCL to manage checkpointing, but that's obviously not a scalable solution. I've tried using the checkpointer and passing the string "LATEST" instead of a sequence number, but this throws an error that the sequence number is invalid.
How can I tell KCL that when I re-shard I want to set the checkpoint to "LATEST" on the new shards?
As a hack-y solution I've considered just using the DynamoDB SDK and fixing the checkpoint in the initialize method. It's ugly, but I think it would work (assuming Amazon doesn't change how they manage the KCL tables).
Update
Per the "hack-y solution" described, I wrote the following small helper method:
/**
 * Assumes the current shardId (available in the initialize method's
 * `initializeInput.shardId`) is stored in the global "state" object,
 * accessible via the "state" import
 */
import { Kinesis, DynamoDB } from "aws-sdk";
import state from "../state";
import logger from "./logger";

const kinesis = new Kinesis();
const ddb = new DynamoDB.DocumentClient();
const log = logger().getLogger("recordProcessor");
const appName = process.env.APP_NAME;

// Overwrite the checkpoint stored in the KCL lease table for our shard
async function setCheckpoint(tableName: string, value: string) {
    log.info(`Updating checkpoint to ${value}`);
    await ddb.update({
        TableName: tableName,
        Key: {
            leaseKey: state.shardId
        },
        UpdateExpression: "SET #checkpoint = :value",
        ExpressionAttributeNames: {
            "#checkpoint": "checkpoint"
        },
        ExpressionAttributeValues: {
            ":value": value
        }
    }).promise();
}

export default async function (startingCheckpoint: string) {
    // We can't update any Dynamo tables if we don't know which table to update
    if (!appName) return;

    // Compute the name of the shard JUST BEFORE ours
    // Because Kinesis uses an "exclusive" start ID...
    const shardIdNum = parseInt(state.shardId.split("-")[1], 10) - 1;
    // The very first shard has no predecessor, so list from the start instead
    const startShard = shardIdNum >= 0
        ? { ExclusiveStartShardId: "shardId-" + String(shardIdNum).padStart(12, "0") }
        : {};

    // Pull data about our current shard
    const kinesisResp = await kinesis.listShards({
        StreamName: process.env.KINESIS_STREAM_NAME,
        MaxResults: 1,
        ...startShard
    }).promise();
    const oldestSeqNumber = kinesisResp.Shards[0].SequenceNumberRange.StartingSequenceNumber;

    // Pull data about our current checkpoint
    const dynamoResp = await ddb.get({
        TableName: appName,
        Key: {
            leaseKey: state.shardId
        }
    }).promise();
    const prevCheckpoint = dynamoResp.Item ? dynamoResp.Item.checkpoint : undefined;
    if (typeof prevCheckpoint !== "string") {
        // No lease row (or no checkpoint) yet; don't create a partial row
        log.warn(`No checkpoint found for ${state.shardId}, leaving the table untouched`);
        return;
    }
    log.debug(`Oldest sequence number in Kinesis shard: ${oldestSeqNumber} vs checkpoint: ${prevCheckpoint}`);

    // Determine if we need to "fix" anything
    if (startingCheckpoint === "TRIM_HORIZON") {
        // If our checkpoint is before the oldest sequence number, reset it to
        // "TRIM_HORIZON" so we pull the oldest sequence number.
        // Sequence numbers are decimal strings, so compare them numerically
        // (BigInt needs Node 10.4+); sentinel values such as "LATEST" are skipped.
        const isSeqNumber = /^\d+$/.test(prevCheckpoint);
        if (isSeqNumber && BigInt(prevCheckpoint) < BigInt(oldestSeqNumber)) {
            await setCheckpoint(appName, "TRIM_HORIZON");
        }
    } else if (startingCheckpoint === "LATEST") {
        if (prevCheckpoint !== "LATEST") {
            await setCheckpoint(appName, "LATEST");
        }
    } else {
        log.warn("We can't 'fix' checkpoints that aren't TRIM_HORIZON or LATEST");
    }
}
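The two bits of string handling in that helper are the parts I'm least sure about, so here they are pulled out as pure functions that can be checked without touching AWS. This is only a sketch: `previousShardId` and `seqIsBefore` are names I made up (they are not part of KCL or the AWS SDK), and I'm assuming shard IDs always look like `shardId-` followed by 12 digits and that sequence numbers are plain decimal strings. I don't know whether sequence numbers always have the same number of digits, so the comparison doesn't rely on it.

```typescript
// Hypothetical helpers; names are illustrative, not from KCL or the AWS SDK.

// "shardId-000000000003" -> "shardId-000000000002".
// Returns undefined for the first shard, which has no predecessor,
// and for IDs that don't contain a numeric suffix.
function previousShardId(shardId: string): string | undefined {
    const n = parseInt(shardId.split("-")[1], 10) - 1;
    if (isNaN(n) || n < 0) return undefined;
    let digits = String(n);
    while (digits.length < 12) digits = "0" + digits;
    return "shardId-" + digits;
}

// True if sequence number a is numerically smaller than b.
// Both are decimal strings that may be far larger than Number can hold,
// so compare by length first and only then character by character.
function seqIsBefore(a: string, b: string): boolean {
    const x = a.replace(/^0+(?=\d)/, ""); // drop leading zeros, keep one digit
    const y = b.replace(/^0+(?=\d)/, "");
    if (x.length !== y.length) return x.length < y.length;
    return x < y;
}
```

A plain `a < b` on the strings would say that "10" comes before "9", which is the case the length check guards against.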
I tested the helper, and it correctly updates the DynamoDB table, but the consumer doesn't immediately begin pulling records from the new location. It looks like the KCL reads the checkpoint once, before calling the initialize method, and never re-reads it.
At this point I'm looking for either a way to tell KCL "start using the new checkpoint", or a way to gracefully restart the consumer so that it re-initializes everything. I have neither at my disposal yet, but I'll keep researching. Maybe I can find something in the MultiLangDaemon docs that I can write to STDOUT...