
We have a QLDB ingestion process that consists of a Lambda function triggered by SQS.

We want to make sure our pipeline is airtight: if a failure or error occurs during driver execution and the data fails to commit to QLDB, we don't want to lose that data.

In our testing we noticed that if there's a failure within the Lambda itself, it automatically resends the message to the queue, but if the driver fails, the data is lost.

I understand that the default behavior for the driver is to retry four times after the initial failure. My question is, if I wrap qldb_driver.execute_lambda() in a try statement, will that allow the driver to retry upon failure or will it instantly return as a failure and be handled by the except statement?

Here is how I've written the first half of the function:

import json
import boto3
import datetime
from pyqldb.driver.qldb_driver import QldbDriver
from utils import upsert, resend_to_sqs, delete_from_sqs

queue_url = 'https://sqs.XXX/'
sqs = boto3.client('sqs', region_name='us-east-1')

ledger = 'XXXXX'
table = 'XXXXX'
qldb_driver = QldbDriver(ledger_name = ledger, region_name='us-east-1')

def lambda_handler(event, context):
    # Simple counter to identify messages
    i = 0
    
    # Error flag
    error = False
    
    # Empty list to store message send status as well as body or receipt_handle
    batch_messages = []
    
    for record in event['Records']:
        payload = json.loads(record["body"])
        payload['update_ts'] = str(datetime.datetime.now())

        try:
            qldb_driver.execute_lambda(lambda executor: upsert(executor, ledger = ledger, table_name = table, data = payload))

            # If the message sends successfully, give it status 200 and add the receipt_handle to our list
            # so in case an error occurs later, we can delete this message from the queue.

            message_info = {f'message_{i}': 200, 'receiptHandle': record['receiptHandle']}
            batch_messages.append(message_info)
            
        except Exception as e:
            print(e)

            # Flip error flag to True
            error = True

            # If the commit fails, set status 400 and add the message's body to our list.
            # This will allow us to send the message back to the queue during error handling.

            message_info = {f'message_{i}': 400, 'body': record['body']}
            batch_messages.append(message_info)
        
        i += 1
    

Assuming that this try/except allows the driver to retry upon failure, I've written an additional process to record message data from our batch to delete successful commits and send failures back to the queue:

    # Begin error handling
    if error:
        count = 0
        for j in range(len(batch_messages)):
            # If a message was sent successfully delete it from the queue
            if batch_messages[j][f'message_{j}'] == 200:
                receipt_handle = batch_messages[j]['receiptHandle']
                delete_from_sqs(sqs, queue_url, receipt_handle)
            
            # If the message failed to commit to QLDB, send it back to the queue
            else:
                body = batch_messages[j]['body']
                resend_to_sqs(sqs, queue_url, body)
                count += 1
                
        print(f"ERROR(S) DETECTED - {count} MESSAGES RETURNED TO QUEUE")
                
    else:
        print("BATCH PROCESSING SUCCESSFUL")

Thank you for your insight!

1 Answer


The QLDB Python driver can be configured for more or fewer retries if you need. I'm not sure whether you wanted it to try only once, or whether you were asking if the driver will retry the transaction 4 times before your try/except is triggered. Wrapping the call in try/except does not disable the retries: the driver will still retry up to 4 times before raising the exception that your except block catches.
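To make the ordering concrete, here is a minimal simulation. It does not use pyqldb at all: `FakeDriver` and `RetriableError` are hypothetical stand-ins, and the sketch assumes the limit counts retries after the initial attempt, as the question describes. It only illustrates that the caller's except block runs once, after the retry loop inside `execute_lambda` has given up.

```python
class RetriableError(Exception):
    """Hypothetical stand-in for a retriable driver error (e.g. an OCC conflict)."""


class FakeDriver:
    """Hypothetical stand-in that retries internally, like the real driver is described to."""

    def __init__(self, retry_limit=4):
        self.retry_limit = retry_limit

    def execute_lambda(self, fn):
        attempt = 0
        while True:
            try:
                return fn(attempt)
            except RetriableError:
                # Re-raise to the caller only once the retry budget is used up.
                if attempt >= self.retry_limit:
                    raise
                attempt += 1


calls = []


def always_conflicts(attempt):
    # Record every attempt, then fail so the driver keeps retrying.
    calls.append(attempt)
    raise RetriableError("simulated conflict")


driver = FakeDriver(retry_limit=4)
try:
    driver.execute_lambda(always_conflicts)
    outcome = "committed"
except RetriableError:
    # Reached only after the initial attempt plus all retries have failed.
    outcome = "failed after retries"

print(len(calls), outcome)
```

In this simulation the function is invoked five times (one initial attempt plus four retries) before the except block runs.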

You can follow the example here to modify the retry limit. Also, note that the default retry delay is a random jitter in milliseconds, not exponential. With QLDB, you shouldn't need to wait long periods before retrying, since it uses optimistic concurrency control.
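As a rough sketch of what changing the retry limit can look like, the snippet below passes a `RetryConfig` to the driver. The helper name `build_driver` and the limit of 2 are illustrative assumptions, and the ledger name is the placeholder from the question. It assumes pyqldb 3.x, where `RetryConfig` lives in `pyqldb.config.retry_config`.

```python
def build_driver(ledger_name, region_name="us-east-1", retry_limit=2):
    """Build a QldbDriver with a custom retry limit (hypothetical helper)."""
    # Imported inside the function so this module still loads
    # in environments where pyqldb is not installed.
    from pyqldb.config.retry_config import RetryConfig
    from pyqldb.driver.qldb_driver import QldbDriver

    # retry_limit is the number of retries after the first failed attempt.
    retry_config = RetryConfig(retry_limit=retry_limit)
    return QldbDriver(
        ledger_name=ledger_name,
        region_name=region_name,
        retry_config=retry_config,
    )


# Usage, with the placeholder ledger name from the question:
# qldb_driver = build_driver("XXXXX", retry_limit=2)
```

Creating the driver once at module level, as the question already does, lets warm Lambda invocations reuse it.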

Also, with your design of sending the failed message back to the queue, you might want to consider sending it to a dead-letter queue instead. A dead-letter queue prevents problem messages from retrying indefinitely, unless that's your goal.
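One way to get this without resending messages by hand is to let SQS do the redelivery. This is an alternative to the question's approach, not something the original code does. The sketch assumes the Lambda event source mapping has `ReportBatchItemFailures` enabled and that the source queue has a redrive policy pointing at a dead-letter queue. `process_record` is a hypothetical stand-in for the QLDB upsert.

```python
import json


def process_record(payload):
    """Hypothetical stand-in for the QLDB upsert; raises when the commit fails."""
    if payload.get("fail"):
        raise RuntimeError("simulated commit failure")


def lambda_handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_record(json.loads(record["body"]))
        except Exception as exc:
            print(f"record {record['messageId']} failed: {exc}")
            # Report only this message as failed; the others are treated as
            # successful and deleted from the queue.
            failures.append({"itemIdentifier": record["messageId"]})

    # An empty list tells Lambda the whole batch succeeded.
    return {"batchItemFailures": failures}
```

With that setup, failed messages become visible in the queue again after the visibility timeout, and SQS moves them to the dead-letter queue once they exceed the redrive policy's `maxReceiveCount`.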

(Edit/addition) Note that the QLDB driver exhausts its retries before raising an exception.

bwinchester
  • Thank you for the feedback! You answered my main question, and brought up a good point about the DLQ. I would like to send messages back to the queue at least one time to retry upon failure before sending to a DLQ. Would the best way to handle that be to add an attribute to the message that the Lambda checks for before determining if it will resubmit to SQS or the DLQ? Something like a "failure_count": x, key: value pair that the Lambda checks before processing the failure. If the count is above our desired retry threshold it sends to the DLQ, else it sends back to the queue. – acswan9690 Feb 17 '23 at 15:19
  • @acswan9690, if your driver is configured like the example, which uses the default retry count of 4, the driver will have tried to insert your document 4 times before the error handler catches the exception. At that point I would DLQ the document, and not consume DLQ messages until you can get to the bottom of why the document could not be inserted or updated in 4 tries. It should be designed to work within your retry limit. Otherwise your document is highly contentious and might require a schema redesign. – bwinchester May 11 '23 at 13:33
  • Oftentimes, a DLQ will also automatically write the error response from the driver into the DLQ. This should help identify the errors causing the failure. – bwinchester May 11 '23 at 13:35
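For reference, the `failure_count` idea from the first comment could be sketched roughly as below. This is only an illustration of that proposal, not a recommendation from the answer. The function name, the `"retry"`/`"dlq"` labels, and the threshold of one retry are assumptions. The counter is kept in the message body for simplicity, so it would need to be stripped before the upsert (or moved to an SQS message attribute) to keep it out of the ledger document.

```python
import json

MAX_FAILURES = 1  # assumed threshold: allow one trip back to the queue


def route_failed_message(body, max_failures=MAX_FAILURES):
    """Decide where a failed message goes and return the updated body.

    Returns a (target, new_body) tuple, where target is "retry" or "dlq".
    """
    payload = json.loads(body)
    # Count this failure on top of any recorded by earlier attempts.
    failures = payload.get("failure_count", 0) + 1
    payload["failure_count"] = failures
    target = "dlq" if failures > max_failures else "retry"
    return target, json.dumps(payload)


# First failure goes back to the queue; the second goes to the DLQ.
first_target, first_body = route_failed_message('{"id": 1}')
second_target, second_body = route_failed_message(first_body)
print(first_target, second_target)
```

The caller would then send `new_body` to whichever queue URL corresponds to the returned target.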