"Fork and Join" with serverless functions (e.g. AWS Lambda) / Python

Question

I'm processing relatively large images (https://registry.opendata.aws/sentinel-2/) using AWS Lambda.

In order to process these images, I split them into smaller images (~1500 "chips") which can be processed independently (the number of chips varies unpredictably depending on the content of the source image). Chips are processed in parallel using multiple invocations of a Lambda that takes in a "page" of a couple of hundred chips.

Here's where I'm stuck: when all pages have been processed, I need to combine results into a single output image, but how to know when all pages - the "variable batch of invocations" - are complete?

I've considered, for example, writing progress information to S3 or DynamoDB and invoking the combining function after every page, so that only the last invocation of that function goes ahead (when a progress check reports that everything is complete). I've seen options like futures/promises, but processing a page of chips takes on the order of 10-15 minutes, so I don't want to keep a "controller" function waiting for the futures/promises to complete: at that point it's cheaper to go with multiple invocations.
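A minimal sketch of the progress-counter idea described above, assuming a DynamoDB table keyed on a hypothetical `job_id` attribute with a hypothetical `completed_pages` counter. The `table` argument is assumed to be a boto3 `Table` resource; the function name and attribute names are illustrative only. Each page worker calls it once when it finishes, and the atomic `ADD` means only one caller sees the final count.

```python
def record_page_done(table, job_id, total_pages):
    """Atomically count one finished page; return True only for the last page.

    `table` is assumed to be a boto3 DynamoDB Table resource (hypothetical
    schema: partition key `job_id`, numeric attribute `completed_pages`).
    """
    response = table.update_item(
        Key={"job_id": job_id},
        # ADD creates the attribute if missing and increments it atomically.
        UpdateExpression="ADD completed_pages :one",
        ExpressionAttributeValues={":one": 1},
        # Ask DynamoDB to return the counter value after this update.
        ReturnValues="UPDATED_NEW",
    )
    completed = int(response["Attributes"]["completed_pages"])
    return completed == total_pages
```

The worker that gets `True` back would then invoke the combining function. Note that this sketch does not guard against a page being counted twice if Lambda retries an invocation; recording page IDs rather than a bare count would be one way to handle that.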

Is there a better solution than writing out progress information and checking it multiple times?

(NB I've seen this question: Fork and Join with Amazon Lambda)

score 2 · Answer 1 · answered Mar 18 '19 at 18:42

You could add the chips to a queue with Amazon SQS, and have workers or Lambdas pull those individual jobs off the queue. Then you can set up a CloudWatch alarm that monitors the depth of your queue, where a queue depth of zero (job completed) triggers a "completion" Lambda that pieces the individual output chips back together.

I believe CloudWatch alarms poll queue statuses at 5-minute intervals, so for your use case, where you have long processing times (~10-15 mins), polling wouldn't be the bottleneck (the maximum Lambda timeout is 15 minutes anyway, so if you set it to poll every 15 minutes, your Lambda has either failed or completed by then).

Step by step, what this would look like:

1. Upload a new file to S3.
2. The upload triggers a Lambda that breaks the file apart into "chips" within a new folder.
3. Add all chips to a new queue.
4. Lambdas pull chips off the queue.
5. When the queue is empty, trigger the conglomeration Lambda.
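As a rough sketch of the "add all chips to a queue" step, the function below sends one SQS message per chip using `send_message_batch`, which accepts at most 10 entries per call. The `sqs` argument is assumed to be a boto3 SQS client; the function name, the `job_id` field and the message layout are hypothetical choices for illustration, not part of the answer above.

```python
import json


def enqueue_chips(sqs, queue_url, job_id, chip_keys):
    """Send one message per chip in batches of up to 10; return the count sent.

    `sqs` is assumed to be a boto3 SQS client. The message body layout
    (job_id plus chip_key) is an illustrative assumption.
    """
    sent = 0
    for start in range(0, len(chip_keys), 10):
        batch = chip_keys[start:start + 10]
        entries = [
            {
                # Ids only need to be unique within a single batch request.
                "Id": str(start + offset),
                "MessageBody": json.dumps({"job_id": job_id, "chip_key": key}),
            }
            for offset, key in enumerate(batch)
        ]
        response = sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)
        failed = response.get("Failed", [])
        if failed:
            raise RuntimeError(f"{len(failed)} chip messages were rejected")
        sent += len(batch)
    return sent
```

Carrying a job identifier in every message is one possible way to tell chips from different source files apart if they ever share a queue.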

Here's another helpful answer on setting triggers based on queue status: Efficient way to check whether SQS queue is empty

I think this could work. A couple of questions come up for me. First, if two different files are requested around the same time, presumably that would need two different queues and an updated trigger for the consuming Lambda? Second, if the CloudWatch trigger happens to land just after the last chip message has been consumed, but before the chip finishes processing, I'm thinking there's a chance of missing a chip? I could see the latter being solved by a design that either waits, or counts at least two "empty queue" polls before triggering aggregation. — David Carboni, Dec 10 '19 at 10:14
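One possible way to address the second concern in the comment above is sketched below. SQS reports both waiting messages (`ApproximateNumberOfMessages`) and messages that have been received but not yet deleted (`ApproximateNumberOfMessagesNotVisible`). The function treats the queue as drained only when both are zero. The `sqs` argument is assumed to be a boto3 SQS client and the function name is hypothetical; the counts are approximate, so a real design might still require two consecutive drained checks, as the comment suggests.

```python
def queue_is_drained(sqs, queue_url):
    """Return True when no messages are waiting and none are in flight.

    `sqs` is assumed to be a boto3 SQS client. Both counts are approximate,
    so treat a single True result with some caution.
    """
    response = sqs.get_queue_attributes(
        QueueUrl=queue_url,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    attributes = response["Attributes"]  # values arrive as strings
    waiting = int(attributes["ApproximateNumberOfMessages"])
    in_flight = int(attributes["ApproximateNumberOfMessagesNotVisible"])
    return waiting == 0 and in_flight == 0
```

This relies on each worker deleting its message only after the chip has finished processing, so that a chip still being worked on is counted as in flight.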
