
I have a Kinesis consumer whose job is to keep track of "currently active users" in a system. Every minute users send a heartbeat to a Kinesis stream, and this system just keeps a list of every unique user GUID it has seen along with the last time it received a heartbeat from that GUID. If a heartbeat hasn't been seen in 2 minutes, we assume the user is no longer active and evict them from the list of "currently active users". Pretty straightforward.
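
To make the bookkeeping concrete, here is a minimal sketch of that logic (the names and structure are illustrative, not my actual code):

    // Illustrative sketch: track last-seen time per user GUID and evict
    // anyone who hasn't sent a heartbeat in the last 2 minutes
    const lastHeartbeat = new Map<string, number>(); // GUID -> last-seen epoch ms
    const EVICTION_WINDOW_MS = 2 * 60 * 1000;

    function recordHeartbeat(userGuid: string): void {
        lastHeartbeat.set(userGuid, Date.now());
    }

    // Called periodically (e.g. every few seconds)
    function evictInactiveUsers(): void {
        const cutoff = Date.now() - EVICTION_WINDOW_MS;
        for (const [guid, seenAt] of lastHeartbeat) {
            if (seenAt < cutoff) {
                lastHeartbeat.delete(guid);
            }
        }
    }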

Because this system is only concerned with currently active users, we don't need to back-process old messages. If we were to turn this consumer off for 2 hours then turn it back on, we want to start processing at the "LATEST" message instead of picking up where we left off.

Finally, this has been implemented as a NodeJS application per the Amazon Kinesis Client NodeJS example using the MultiLangDaemon to communicate with the Kinesis Client Library.
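
For reference, the record processor follows roughly the shape from the aws-kcl sample (sketched here with the heartbeat handling stubbed out and the input types simplified to `any`):

    import kcl from "aws-kcl";

    const recordProcessor = {
        initialize(initializeInput: any, completeCallback: () => void) {
            // initializeInput.shardId identifies the shard this worker owns
            completeCallback();
        },

        processRecords(processRecordsInput: any, completeCallback: () => void) {
            for (const record of processRecordsInput.records || []) {
                // Record payloads arrive base64-encoded
                const userGuid = Buffer.from(record.data, "base64").toString();
                // ...update the "currently active users" list here...
            }
            completeCallback();
        },

        shutdown(shutdownInput: any, completeCallback: () => void) {
            completeCallback();
        }
    };

    kcl(recordProcessor).run();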

Under normal use I've found that the best way to always resume from "LATEST" is to never use the checkpointing feature of KCL. For instance, at the bottom of my processRecords method I have the following:

    // We don't checkpoint with Kinesis, because if we crash for some reason we
    // want to immediately catch back up to live instead of wasting time
    // processing expired heartbeats
    // processRecordsInput.checkpointer.checkpoint(sequenceNumber,
      // function(err, checkpointedSequenceNumber) {

        completeCallback();

      // }
    // );

This way, whenever I kill the consumer and restart it, it looks at the *.properties file, sees that "initialPositionInStream" is "LATEST", and begins processing from there.
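
The relevant lines of that *.properties file look something like this (the stream and application names are placeholders; the other settings are omitted):

    applicationName = active-users-consumer
    streamName = user-heartbeats
    initialPositionInStream = LATEST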

HOWEVER

When I re-shard my stream (splitting or merging shards) I run into an issue: the checkpoint on each new shard is set not to "LATEST" but to "TRIM_HORIZON". Since I never re-checkpoint, this means that if my consumer is turned off and restarted I end up having to process 24 hours of data.

I can manually fix this by editing the Dynamo table used by KCL to manage checkpointing, but that's obviously not a scalable solution. I've tried using the checkpointer and passing the string "LATEST" instead of a sequence number, but this throws an error that the sequence number is invalid.

How can I tell KCL that when I re-shard I want to set the checkpoint to "LATEST" on the new shards?

As a hack-y solution I've considered just using the DynamoDB SDK and fixing the checkpoint in the initialize method. It's ugly, but I think it would work (assuming Amazon doesn't change how they manage the KCL tables).

Update

Per the "hack-y solution" described, I wrote the following small helper method:

/**
 * Assumes the current shardId (available in the initialize method's
 * `initializeInput.shardId`) is stored in the global "state" object,
 * accessible via the "state" import
 */

import { Kinesis, DynamoDB } from "aws-sdk";
import state from "../state";
import logger from "./logger";
 
const kinesis = new Kinesis();
const ddb = new DynamoDB.DocumentClient();

const log = logger().getLogger("recordProcessor");
const appName = process.env.APP_NAME;

export default async function (startingCheckpoint: string) { 
    // We can't update any Dynamo tables if we don't know which table to update
    if (!appName) return;

    // Compute the name of the shard JUST BEFORE ours
    // Because Kinesis uses an "exclusive" start ID...
    const shardIdNum = parseInt(state.shardId.split("-")[1], 10) - 1;
    const startShardId = "shardId-" + ("000000000000" + shardIdNum).slice(-12);

    // Pull data about our current shard
    const kinesisResp = await kinesis.listShards({
        StreamName: process.env.KINESIS_STREAM_NAME,
        MaxResults: 1,
        ExclusiveStartShardId: startShardId
    }).promise();
    const oldestSeqNumber = kinesisResp.Shards[0].SequenceNumberRange.StartingSequenceNumber;

    // Pull data about our current checkpoint
    const dynamoResp = await ddb.get({
        TableName: appName,
        Key: {
            leaseKey: state.shardId
        }
    }).promise();
    const prevCheckpoint = dynamoResp.Item.checkpoint;

    log.debug(`Oldest sequence number in Kinesis shard: ${oldestSeqNumber} vs checkpoint: ${prevCheckpoint}`);

    // Determine if we need to "fix" anything
    if (startingCheckpoint === "TRIM_HORIZON") {

        // If our checkpoint is before the oldest sequence number, reset it to
        // "TRIM_HORIZON" so we pull the oldest sequence number
        if (prevCheckpoint < oldestSeqNumber) {
            log.info("Updating checkpoint to TRIM_HORIZON");

            await ddb.update({
                TableName: appName,
                Key: {
                    leaseKey: state.shardId
                },
                UpdateExpression: "SET #checkpoint = :value",
                ExpressionAttributeNames: {
                    "#checkpoint": "checkpoint"
                },
                ExpressionAttributeValues: {
                    ":value": "TRIM_HORIZON"
                }
            }).promise();
        }

    } else if (startingCheckpoint === "LATEST") {

        if (prevCheckpoint !== "LATEST") {
            log.info("Updating checkpoint to LATEST");

            await ddb.update({
                TableName: appName,
                Key: {
                    leaseKey: state.shardId
                },
                UpdateExpression: "SET #checkpoint = :value",
                ExpressionAttributeNames: {
                    "#checkpoint": "checkpoint"
                },
                ExpressionAttributeValues: {
                    ":value": "LATEST"
                }
            }).promise();
        }

    } else {
        log.warn("We can't 'fix' checkpoints that aren't TRIM_HORIZON or LATEST");
    }
}

I tested this, and it properly updates the DynamoDB table, but the consumer doesn't immediately begin pulling records from the new location. It looks like the KCL reads the checkpoint once before calling the initialize method and never re-reads it.

At this point I'm looking for either a way to tell KCL "start using the new checkpoint", or a way to gracefully restart the consumer so that it re-initializes everything. I have neither at my disposal yet, but I'll keep researching. Maybe I can find something in the MultiLangDaemon docs that I can write to STDOUT...


1 Answer

After much research I've concluded that Amazon provides no way to request a graceful shutdown. You simply have to crash your consumer (process.exit()) and wait for Docker to restart it.
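
Roughly, the initialize() callback ends up wired like this. This is only a sketch: it assumes the checkpoint-fixer helper from the question lives in a hypothetical ./fixCheckpoint module and has been tweaked to resolve to true when it actually rewrote the DynamoDB row, so that we only force a restart when something changed.

    import fixCheckpoint from "./fixCheckpoint"; // hypothetical module holding the helper above
    import state from "../state";

    const recordProcessor = {
        initialize(initializeInput: any, completeCallback: () => void) {
            state.shardId = initializeInput.shardId;

            fixCheckpoint("LATEST").then((changed: boolean) => {
                if (changed) {
                    // KCL only reads the checkpoint before initialize(), so the
                    // only way to pick up the corrected value is a full restart;
                    // Docker's restart policy brings the consumer back up.
                    process.exit(1);
                }
                completeCallback();
            });
        },

        // processRecords / shutdown unchanged
    };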

However, between my hack-y "checkpoint fixer" script (which I run in the initialize() callback) and this hack-y "crash to restart" method, I now have a solution that updates my checkpoints appropriately, and Kinesis is running a lot more smoothly for me now.
