0

I have tried to implement an optimistic concurrency 'worker'.

Goal is to read a batch of data from the same database table (single table with no relations) with multiple parallel 'worker'. This did seem to work so far. I get optimistic concurrency exceptions here and there, catch them and retry.

So far so good, and the function to get the data works stable on my local setup. When moving the application to a test environment however, I get a strange timeout exception, which even if caught, will end the async function (breaks the while loop). Does someone see a flaw in the implementation? What could cause the timeout? What could cause the end of the async function?

public async IAsyncEnumerable<List<WorkItem>> LoadBatchedWorkload([EnumeratorCancellation] CancellationToken token, int batchSize, int runID)
{
    DataContext context = null;
    try
    {
        context = GetNewContext(); // create a new dbContext
        List<WorkItem> workItems;
        bool loadSuccessInner;
        while (true)
        {
            if (token.IsCancellationRequested) break;

            loadSuccessInner = false;

            context.Dispose();
            context = GetNewContext(); // create a new dbContext

            RunState currentRunState = context.Runs.Where(a => a.Id == runID).First().Status;

            try
            {
                // Error happens on the following line: Microsoft.Data.SqlClient.SqlException: Timeout
                workItems = context.WorkItems.Where(a => a.State == ProcessState.ToProcess).Take(batchSize).ToList();
                loadSuccessInner = true;
            }
            catch (Exception ex)
            {
                workItems = new List<WorkItem>();
            }

            if (workItems.Count == 0 && loadSuccessInner)
            {
                break;
            }

            //... update to a different RunState
            //... if set successful yield the result
            //... else cleanup and retry
        }
    }
    finally
    {
        if (context != null) context.Dispose();
    }
}
  • I verified that EntityFramework (here with MS SQL Server adapter) does full server side query, which translates to a simple query like this: SELECT TOP 10 field_1, field_2 FROM WorkItems WHERE field_2 = 0

  • The query usually takes <1ms and the timeout is left on default of 30s

  • I verified that there are no cancelation requests fired

  • This happens also when there is only a single worker and no one else is accessing the database. I'm aware that a timeout can happen when the resource is busy or blocked. But until now, I never saw a timeout on any other query yet.

Peter Pakre
  • 41
  • 1
  • 6

1 Answers1

0

(I'll update this answer whenever more information is being provided.)

Does someone see a flaw in the implementation?

Generally, your code looks fine.

What could cause the end of the async function?

Nothing in the code you showed should normally be an issue. Start by putting another try-catch block inside the loop, to ensure, that no other exceptions are getting thrown anywhere else (especially later in the not shown code):

public async IAsyncEnumerable<List<WorkItem>> LoadBatchedWorkload([EnumeratorCancellation] CancellationToken token, int batchSize, int runID)
{
    DataContext context = null;
    try
    {
        context = GetNewContext();
        List<WorkItem> workItems;
        bool loadSuccessInner;
        while (true)
        {
            try
            {
                // ... (the inner loop code)
            }
            catch (Exception e)
            {
                // TODO: Log the exception here using your favorite method.
                throw;
            }
        }
    }
    finally
    {
        if (context != null) context.Dispose();
    }
}

Take a look at your log and ensure, that the log does not show any exceptions being thrown. Then additionally log every possible exit condition (break and return) from the loop, to find out how and why the code exits the loop.

If there are no other break or return statements in your code, then the only way the code can exit from the loop is if zero workItems are successfully returned from the database.

What could cause the timeout?

Make sure, that any Task returning/async methods you call are being called using await.


To track down, where the exceptions are actually coming from, you should deploy a Debug release with pdb files to get a full stack trace with source code line references.

You can also implement a DbCommandInterceptor and trace failing commands on your own:

public class TracingCommandInterceptor : DbCommandInterceptor
{
    public override void CommandFailed(DbCommand command, CommandErrorEventData eventData)
    {
        LogException(eventData);
    }

    public override Task CommandFailedAsync(DbCommand command, CommandErrorEventData eventData, CancellationToken cancellationToken = new CancellationToken())
    {
        LogException(eventData);
        return Task.CompletedTask;
    }

    private static void LogException(CommandErrorEventData eventData)
    {
        if (eventData.Exception is SqlException sqlException)
        {
            // -2 = Timeout error
            // See https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/cc645611(v=sql.105)?redirectedfrom=MSDN
            if (sqlException.Number == -2)
            {
                var stackTrace = new StackTrace();
                var stackTraceText = stackTrace.ToString();

                // TODO: Do some logging here and output the stackTraceText
                //       and other helpful information like the command text etc.
                // -->
            }
        }
    }
}

protected override void OnConfiguring(DbContextOptionsBuilder optionsBuilder)
{
    optionsBuilder.UseLoggerFactory(LoggingFactory);
    optionsBuilder.UseSqlServer(connectionString);
    optionsBuilder.EnableSensitiveDataLogging();
    optionsBuilder.EnableDetailedErrors();

    // Add the command interceptor.
    optionsBuilder.AddInterceptors(new TracingCommandInterceptor());

    base.OnConfiguring(optionsBuilder);
}

Additionally logging the command text of the failed command in the interceptor is also a good idea.

lauxjpn
  • 4,749
  • 1
  • 20
  • 40
  • Thanks! Did wrap the whole inner while loop and tested. Result is the same. Function ends. In the orginal code there is a logger in the exception part. I get two logs, one from the db logger and one from the caught exception. First `[2020-07-22 12:14:40Z - ERRR]: Microsoft.EntityFrameworkCore.Database.Command` and the second `[2020-07-22 12:14:40Z - ERRR]: Microsoft.EntityFrameworkCore.Query` I'm happy to share the full code – Peter Pakre Jul 22 '20 at 17:59
  • 1
    Sure, if you can share the code, just upload it to github or somewhere. I'll take a look at it. – lauxjpn Jul 22 '20 at 18:20
  • Sry! It took me quite some time to clean the [project](https://github.com/mBHyuC/concurrency-test). Will add a/better readme and do some smaller formatting changes soon. Thats basically all relevant code, works locally, cannot tell if it does create the same error on the server. – Peter Pakre Jul 26 '20 at 14:06
  • 1
    I'll take a look at it. What is the full message and stack trace of the caught exception after you added the try-catch-block? – lauxjpn Jul 26 '20 at 15:02
  • Thanks, really appreciate that! Added the small log that exisits and a full session log. Please keep in mind that this is not the 100% Original Code. I did try to keep it as aligned to the original as possible. The question that really bothers me is, why the exception always occures at that simple select statement and always after processing the first batch. The closest i could come to recreate that, was by shutting down the sql server. Or maybe it's just a dump mistake /: – Peter Pakre Jul 26 '20 at 17:30
  • Maybe another important update. The printed exceptions do not come from the try and catch block! Its an ERRR message, the message from the exception would have printed was VERB. – Peter Pakre Jul 26 '20 at 17:33
  • I appended another section to the answer. – lauxjpn Jul 26 '20 at 19:10
  • Ty, .. as it's almost impossible to answer stupid questions, I'll accept your answer as it already delivers quite some value (: I might add an update in the future. – Peter Pakre Jul 26 '20 at 21:38