
I am currently evaluating an AWS Step Functions state machine that processes a single document. Each execution would take 5-10 minutes to process one document.

{
  "Comment":"Process document",
  "StartAt": "InitialState",
  "States": {
          //the document goes through multiple states here
  }
}

The C# code invokes the state machine by passing some JSON for each document, something like:

      using Amazon;
      using Amazon.StepFunctions;
      using Amazon.StepFunctions.Model;
      using Newtonsoft.Json;

      // max 100 documents
      // (async is required because the method body uses await)
      public async Task Process(IEnumerable<Document> documents)
      {
          var amazonStepFunctionsConfig = new AmazonStepFunctionsConfig { RegionEndpoint = RegionEndpoint.USWest2 };
          // awsAccessKeyId and awsSecretAccessKey are assumed to be fields defined elsewhere in the class
          using (var amazonStepFunctionsClient = new AmazonStepFunctionsClient(awsAccessKeyId, awsSecretAccessKey, amazonStepFunctionsConfig))
          {
              // start one execution per document
              foreach (var document in documents)
              {
                  var jsonData1 = JsonConvert.SerializeObject(document);
                  var startExecutionRequest = new StartExecutionRequest
                  {
                      Input = jsonData1,
                      Name = document.Id,
                      StateMachineArn = "arn:aws:states:us-west-2:<SomeNumber>:stateMachine:ProcessDocument"
                  };
                  var taskStartExecutionResponse = await amazonStepFunctionsClient.StartExecutionAsync(startExecutionRequest);
              }
          }
      }

We process the documents in batches of 100, so the loop above handles at most 100 documents. However, we process thousands of documents weekly (25,000+).

According to the AWS documentation, the maximum execution history size is 25,000 events. If the execution history reaches this limit, the execution will fail.

Does that mean we cannot execute a single state machine more than 25,000 times? Why should the execution of a state machine depend on its history? Why can't AWS just purge the history?

I know there is a way to continue as a new execution, but I am just trying to understand the history limit and its relation to state machine execution. Is my understanding correct?

Update 1
I don't think this is a duplicate question. I am trying to find out whether my understanding of the history limit is correct. Why does history have anything to do with the number of times a state machine can execute? When a state machine executes, it creates a history record. If the history records exceed 25,000, then purge or archive them. Why would AWS stop execution of the state machine? That does not make sense.

So the question is: can a single state machine (unique ARN) execute more than 25,000 times in a loop? If I have to create a new state machine (after 25,000 executions), wouldn't that state machine have a different ARN?

Also, if I had to follow the linked SO post, where would I get the current number of executions? Also, he is looping within the step function, while I am calling the step function within a loop.

Update 2
So, just for testing, I created the following state machine:

{
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Type": "Pass",
      "Result": "Hello World!",
      "End": true
    }
  }
}

and executed it 26,000 times with NO failure:

    public static async Task Main(string[] args)
    {
        AmazonStepFunctionsClient client = new AmazonStepFunctionsClient("my key", "my secret key", Amazon.RegionEndpoint.USWest2);
        for (int i = 1; i <= 26000; i++)
        {
            var startExecutionRequest = new StartExecutionRequest
            {
                Input = JsonConvert.SerializeObject(new { }),
                Name = i.ToString(),
                StateMachineArn = "arn:aws:states:us-west-2:xxxxx:stateMachine:MySimpleStateMachine"
            };

            var response = await client.StartExecutionAsync(startExecutionRequest);
        }

        Console.WriteLine("Press any key to continue");
        Console.ReadKey();
    }

and in the AWS Console I am able to pull up the history for all 26,000 executions.

So I am not sure exactly what "Maximum execution history size is 25,000 events" means.

LP13

2 Answers


The term "Execution History" is used to describe 2 completely different things in the quota docs, which has caused your confusion (and mine until I realized this):

  • 90 day quota on execution history retention: This is the history of all executions, as you'd expect
  • 25,000 quota on execution history size: This is the history of "state events" within 1 execution, NOT across all executions in history. In other words, if your single execution runs through thousands of steps, thereby racking up 25k events (likely because of a looping structure in the workflow), it will suddenly fail and exit.

As long as every individual execution keeps its own event history LESS THAN 25k, you can execute the state machine as many times as you'd like (much more than 25k times) :)
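To see the per-execution number for yourself, here is a minimal sketch that counts the history events of one execution by paging through `GetExecutionHistory` with the AWS SDK for .NET. The class name `ExecutionHistoryCheck` and the `RemainingEvents` helper are hypothetical names made up for illustration, and the 25,000 constant is simply the quota quoted in the question.

```csharp
using System;
using System.Threading.Tasks;
using Amazon.StepFunctions;
using Amazon.StepFunctions.Model;

// Hypothetical helper class, not part of the AWS SDK.
public static class ExecutionHistoryCheck
{
    // Per-execution event quota as quoted in the question.
    public const int MaxHistoryEvents = 25000;

    // How many more events one execution can record before reaching the quota.
    public static int RemainingEvents(int eventCount) =>
        Math.Max(0, MaxHistoryEvents - eventCount);

    // Counts the history events of ONE execution (not of the state machine as a whole).
    public static async Task<int> CountEventsAsync(IAmazonStepFunctions client, string executionArn)
    {
        var total = 0;
        string nextToken = null;
        do
        {
            var response = await client.GetExecutionHistoryAsync(new GetExecutionHistoryRequest
            {
                ExecutionArn = executionArn,
                MaxResults = 1000,
                NextToken = nextToken
            });
            total += response.Events?.Count ?? 0;
            nextToken = response.NextToken; // null or empty once the last page is read
        } while (!string.IsNullOrEmpty(nextToken));
        return total;
    }
}
```

Calling `await ExecutionHistoryCheck.CountEventsAsync(client, executionArn)` with the `ExecutionArn` returned by `StartExecutionAsync` should show only a handful of events for a one-state workflow like the `HelloWorld` test, no matter how many times the state machine has been started.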

Update: As of Dec 2022, you can use Distributed Map to avoid this 25k quota. We're now using this to manage large queues of background processing via 1 state machine that would have hit this 25k limit. We're iterating in the 100k range.

lance.dolan

I don't think you've got it right. The 25,000 limit applies to the history of a single state machine execution. What you tested was 26,000 separate state machine executions. The limit on state machine executions is 1,000,000 open executions.

A single execution can run for up to 1 year, and during that time its execution history must not exceed 25,000 events.

Hope it helps.

A.Khan

Does this mean you can have no more than 25k concurrently-running step functions? I would think the execution limit of 1 million would be concurrently-running step functions. Maybe it means up to 1 million can be in progress at any given time, but only 25k of those are actually executing and the rest are waiting on the next step? – r590 Nov 04 '20 at 22:51