Goal
I wanted to make a proof of concept of the callback pattern. This is where a step function puts a message and a task token in an SQS queue, the queue is wired up to some arbitrary work, and when that work is done you hand the token back to the step function so it knows to continue.
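The flow can be sketched with a toy in-memory queue. Everything here is illustrative stand-ins, not AWS APIs:

```python
import uuid

# Toy stand-ins for the SQS queue and the set of tokens
# the state machine is still waiting on.
queue = []
pending_tokens = set()

def start_execution(job_guid):
    """State machine step: enqueue the job along with a fresh task token."""
    token = str(uuid.uuid4())
    pending_tokens.add(token)
    queue.append({"job_guid": job_guid, "TaskToken": token})
    return token

def worker():
    """Arbitrary work wired to the queue; reports back with the token."""
    message = queue.pop(0)
    # ... do the real work here ...
    send_task_success(message["TaskToken"])

def send_task_success(token):
    """Stand-in for states:SendTaskSuccess; the execution continues."""
    pending_tokens.remove(token)

token = start_execution("some-job-guid")
worker()
assert token not in pending_tokens  # execution resumed
```

The point is just the shape: the token travels with the message, and returning it is what unblocks the waiting execution.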
Problem
I started testing all this by manually starting an execution in the step function, and after a few failures I hit on what should have worked. send_task_success was called, but all I ever got back was this:
An error occurred (TaskTimedOut) when calling the SendTaskSuccess operation: Task Timed Out: 'Provided task does not exist anymore'
My architecture (you can skip this part)
I did this all in Terraform.
Permissions
I'm going to skip all the IAM permission details for brevity but the idea is:
- The queue has the following, with the resource being my lambda:
lambda:CreateEventSourceMapping
lambda:ListEventSourceMappings
lambda:ListFunctions
- The step function has the following, with the resource being my queue:
sqs:SendMessage
- The lambda has
AWSLambdaBasicExecutionRole
AWSLambdaSQSQueueExecutionRole
states:SendTaskSuccess
(the last with the step function as the resource)
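As an illustration, the lambda-side statement granting states:SendTaskSuccess could look roughly like this (the ARN is a placeholder, not my real one):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "states:SendTaskSuccess",
      "Resource": "arn:aws:states:us-east-1:123456789012:stateMachine:my-project"
    }
  ]
}
```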
Terraform
resource "aws_sqs_queue" "queue" {
  name_prefix = "${local.project_name}-"
  fifo_queue  = true
  # This one is required for FIFO queues for some reason
  content_based_deduplication = true
  policy = templatefile(
    "policy/queue.json",
    { lambda_arn = aws_lambda_function.run_job.arn }
  )
}

resource "aws_sfn_state_machine" "step" {
  name     = local.project_name
  role_arn = aws_iam_role.step.arn
  type     = "STANDARD"
  definition = templatefile(
    "states.json", {
      sqs_url = aws_sqs_queue.queue.url
    }
  )
}

resource "aws_lambda_function" "run_job" {
  function_name = local.project_name
  description   = "Runs a job"
  role          = aws_iam_role.lambda.arn
  architectures = ["arm64"]
  runtime       = "python3.9"
  filename      = var.zip_path
  handler       = "main.main"
}

resource "aws_lambda_event_source_mapping" "trigger_lambda" {
  event_source_arn = aws_sqs_queue.queue.arn
  enabled          = true
  function_name    = aws_lambda_function.run_job.arn
  batch_size       = 1
}
Notes:
For my use case I definitely want a FIFO queue. However, there are two funny things you have to do to make a FIFO queue work (that also make me question what the heck the implementation is doing).
- Deduplication. This can either be content-based deduplication for the whole queue, or you can set a deduplication ID on a per-message basis.
- MessageGroupId. This is on a per message basis.
I don't have to worry about the deduplication because every item I put in this queue comes with a unique guid.
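If you went the per-message route instead of content-based deduplication, the send would look roughly like this. The queue URL is a placeholder, and I've split out the kwargs-building so the shape can be shown without an AWS call:

```python
import json
import uuid

def build_send_kwargs(queue_url, job_guid, group_id="me_group"):
    """Build the arguments for sqs.send_message against a FIFO queue.
    MessageGroupId is always required on FIFO queues; MessageDeduplicationId
    is only needed when content_based_deduplication is off."""
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"job_guid": job_guid}),
        "MessageGroupId": group_id,
        "MessageDeduplicationId": str(uuid.uuid4()),
    }

kwargs = build_send_kwargs(
    "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue.fifo",
    "some-job-guid",
)
# With a real client this would be:
# boto3.client("sqs").send_message(**kwargs)
```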
State Machine
I expect this to be executed with a JSON input that includes "job": "some job guid" at the top level.
{
  "Comment": "This is a thing.",
  "StartAt": "RunJob",
  "States": {
    "RunJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
      "Parameters": {
        "QueueUrl": "${sqs_url}",
        "MessageBody": {
          "Message": {
            "job_guid.$": "$.job",
            "TaskToken.$": "$$.Task.Token"
          }
        },
        "MessageGroupId": "me_group"
      },
      "Next": "Finish"
    },
    "Finish": {
      "Type": "Succeed"
    }
  }
}
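For reference, what the lambda eventually receives as the record body is the MessageBody above serialized to a JSON string. A plausible example (the token value here is shortened and purely illustrative):

```python
import json

# A plausible record body as delivered to the lambda (illustrative values).
body = json.dumps({
    "Message": {
        "job_guid": "some-job-guid",
        "TaskToken": "AAAAKgAAAAIAAAAA...",
    }
})

message = json.loads(body)["Message"]
print(message["job_guid"])   # some-job-guid
print(message["TaskToken"])  # AAAAKgAAAAIAAAAA...
```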
Notes:
- "RunJob"'s resource is not the ARN of the queue followed by .waitForTaskToken. Seems obvious since it starts with arn:aws:states, but it threw me for a bit.
- Inside "MessageBody" I'm pretty sure you can just put whatever you want. For sure I know you can rename "TaskToken" to whatever you want.
- You need "MessageGroupId" because it's required when you are using a FIFO queue (for some reason).
Python
import boto3
from json import loads

def main(event, context):
    # batch_size is 1, so there is exactly one record per invocation
    message = loads(event["Records"][0]["body"])["Message"]
    task_token = message["TaskToken"]
    job_guid = message["job_guid"]
    print(f"{task_token=}")
    print(f"{job_guid=}")
    # Hand the token back so the state machine continues past RunJob
    client = boto3.client("stepfunctions")
    client.send_task_success(taskToken=task_token, output=event["Records"][0]["body"])
    return {"statusCode": 200, "body": "All good"}
Notes:
- event["Records"][0]["body"] is a string of JSON.
- In send_task_success, output expects a string that is JSON; basically, the output of dumps. It just so happens that event["Records"][0]["body"] is already stringified JSON, so that's why I'm returning it.
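To make the point about output concrete: it has to be a JSON-formatted string, not a dict, so if you build your own result you dump it first. A small local sketch (the send_task_success call is shown only as a comment):

```python
import json

# Some result your job produced (illustrative).
result = {"job_guid": "some-job-guid", "status": "done"}

# send_task_success's output parameter wants a str, so dumps first:
output = json.dumps(result)
assert isinstance(output, str)
assert json.loads(output) == result

# With a real client this would be:
# boto3.client("stepfunctions").send_task_success(taskToken=token, output=output)
```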