10

I've got an Azure webjob with several queue-triggered functions. The SDK documentation at https://learn.microsoft.com/en-us/azure/app-service-web/websites-dotnet-webjobs-sdk-storage-queues-how-to#config defines the MaxDequeueCount property as:

The maximum number of retries before a queue message is sent to a poison queue (default is 5).

but I'm not seeing this behavior. In my webjob I've got:

JobHostConfiguration config = new JobHostConfiguration();
config.Queues.MaxDequeueCount = 1;
JobHost host = new JobHost(config);
host.RunAndBlock();

and then I've got a queue-triggered function in which I throw an exception:

public void ProcessQueueMessage([QueueTrigger("azurewejobtestingqueue")] string item, TextWriter logger)
{
    if (item == "exception")
    {
        throw new Exception();
    }
}

Looking at the webjobs dashboard I see that the SDK makes 5 attempts (5 is the default as stated above):

Webjob failures shown on webjobs dashboard

and after the 5th attempt the message is moved to the poison queue. I expect to see 1 retry (or no retries?) not 5.
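
A quick way to see how many attempts the SDK actually makes is to bind the built-in dequeueCount value and log it on each invocation. This is only a diagnostic sketch (the log text is illustrative):

```csharp
// Sketch: log the dequeue count on every attempt so the dashboard shows
// exactly how many retries happen before the message is poisoned.
public void ProcessQueueMessage(
    [QueueTrigger("azurewejobtestingqueue")] string item,
    int dequeueCount,   // bound automatically by the WebJobs SDK
    TextWriter logger)
{
    logger.WriteLine($"Attempt #{dequeueCount}: {item}");
    if (item == "exception")
    {
        throw new Exception();
    }
}
```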

UPDATE: Enabled detailed logging for the web app and opted to save those logs to an Azure blob container. Found some logs relevant to my problem located in the azure-jobs-host-archive container. Here's an example showing an item with a dequeue count of 96:

{
  "Type": "FunctionCompleted",
  "EndTime": "2017-02-22T00:07:40.8133081+00:00",
  "Failure": {
    "ExceptionType": "Microsoft.Azure.WebJobs.Host.FunctionInvocationException",
    "ExceptionDetails": "Microsoft.Azure.WebJobs.Host.FunctionInvocationException: Exception while executing function: ItemProcessor.ProcessQueueMessage ---> MyApp.Exceptions.MySpecialAppExceptionType: Exception of type 'MyApp.Exceptions.MySpecialAppExceptionType' was thrown."
  },
  "ParameterLogs": {},
  "FunctionInstanceId": "1ffac7b0-1290-4343-8ee1-2af0d39ae2c9",
  "Function": {
    "Id": "MyApp.Processors.ItemProcessor.ProcessQueueMessage",
    "FullName": "MyApp.Processors.ItemProcessor.ProcessQueueMessage",
    "ShortName": "ItemProcessor.ProcessQueueMessage",
    "Parameters": [
      {
        "Type": "QueueTrigger",
        "AccountName": "MyStorageAccount",
        "QueueName": "stuff-processor",
        "Name": "sourceFeedItemQueueItem"
      },
      {
        "Type": "BindingData",
        "Name": "dequeueCount"
      },
      {
        "Type": "ParameterDescriptor",
        "Name": "logger"
      }
    ]
  },
  "Arguments": {
    "sourceFeedItemQueueItem": "{\"SourceFeedUpdateID\":437530,\"PodcastFeedID\":\"2d48D2sf2\"}",
    "dequeueCount": "96",
    "logger": null
  },
  "Reason": "AutomaticTrigger",
  "ReasonDetails": "New queue message detected on 'stuff-processor'.",
  "StartTime": "2017-02-22T00:07:40.6017341+00:00",
  "OutputBlob": {
    "ContainerName": "azure-webjobs-hosts",
    "BlobName": "output-logs/1ffd3c7b012c043438ed12af0d39ae2c9.txt"
  },
  "ParameterLogBlob": {
    "ContainerName": "azure-webjobs-hosts",
    "BlobName": "output-logs/1cf2c1b012sa0d3438ee12daf0d39ae2c9.params.txt"
  },
  "LogLevel": "Info",
  "HostInstanceId": "d1825bdb-d92a-4657-81a4-36253e01ea5e",
  "HostDisplayName": "ItemProcessor",
  "SharedQueueName": "azure-webjobs-host-490daea03c70316f8aa2509438afe8ef",
  "InstanceQueueName": "azure-webjobs-host-d18252sdbd92a4657d1a436253e01ea5e",
  "Heartbeat": {
    "SharedContainerName": "azure-webjobs-hosts",
    "SharedDirectoryName": "heartbeats/490baea03cfdfd0416f8aa25aqr438afe8ef",
    "InstanceBlobName": "zd1825bdbdsdgga465781a436q53e01ea5e",
    "ExpirationInSeconds": 45
  },
  "WebJobRunIdentifier": {
    "WebSiteName": "myappengine",
    "JobType": "Continuous",
    "JobName": "ItemProcessor",
    "RunId": ""
  }
}

What I'm further looking for, though, are logs showing the detail for a particular queue item: whether processing succeeded (and the message was removed from the queue) or failed due to an exception (and the message was placed in the poison queue). So far I haven't found any logs with that detail. The log files referenced in the output above do not contain data of this sort.

UPDATE 2: Looked at the state of my poison queue and it seems like it could be a smoking gun, but I'm too dense to put 2 and 2 together. In the screenshot of the queue below you can see the message with ID 431210 (left column) appearing many times. The fact that it appears multiple times tells me that the message in the original queue is not being handled properly when it fails.

Poison queue
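
To count how often the same payload shows up in the poison queue, you can also peek at it programmatically. A sketch assuming the WindowsAzure.Storage client and the SDK's "-poison" queue-naming convention; the connection string is a placeholder, and since each poison copy gets a new message ID, the grouping is done on the message body:

```csharp
using System;
using System.Linq;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class PoisonQueueInspector
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<your-connection-string>");
        var client = account.CreateCloudQueueClient();
        var poison = client.GetQueueReference("azurewejobtestingqueue-poison");

        // PeekMessages reads up to 32 messages without dequeuing them.
        var copies = poison.PeekMessages(32)
            .GroupBy(m => m.AsString)   // each poison copy has a new ID, so group by body
            .Where(g => g.Count() > 1);

        foreach (var group in copies)
            Console.WriteLine($"{group.Count()} copies of: {group.Key}");
    }
}
```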

Howiecamp
  • 2,981
  • 6
  • 38
  • 59
  • Are you running on Azure or running locally? – Rob Reagan Feb 15 '17 at 21:20
  • @Rob Running on Azure. – Howiecamp Feb 15 '17 at 21:25
  • We are doing the same; if you downgrade to 7.2.1 this issue doesn't happen. However, downgrading is a poor solution. For us this doesn't work as we are using Umbraco, which requires a later version as part of https://www.nuget.org/packages/UmbracoFileSystemProviders.Azure which you can see in the dependencies WindowsAzure.Storage (>= 8.7.0) https://www.nuget.org/packages/WindowsAzure.Storage/ – Sigex Sep 10 '19 at 09:58

6 Answers

6

If you are still seeking an answer, we tried some of the answers listed without success. It turns out that it was a version issue with the Storage sdk (WindowsAzure.Storage) and the Webjob sdk (Microsoft.Azure.WebJobs). To fix it, we ended up having to downgrade our version of the Storage sdk to 7.2.1 (we had recently upgraded to 8.1.1). Based on the article below, the engineers are now aware of the problems and will hopefully have it fixed soon:

https://github.com/Azure/azure-webjobs-sdk/issues/1045

Rob W.
  • 270
  • 6
  • 14
  • Nailed it! Describes it perfectly. – Howiecamp Mar 30 '17 at 19:58
  • Also facing a similar issue too but the max dequeue count works well on our side. The message gets moved properly to poison queue after x dequeue count. But after that we see one new entry for the same message in the poison queue every 10 minutes. Issue seems to only happen with WebJobs SDK 1.1.2 and Azure Storage V8.0.1 (which is our case) Issue is referenced @ https://github.com/Azure/azure-webjobs-sdk/issues/985 We may downgrade too while it is fixed... – Julien Jacobs Mar 31 '17 at 12:54
6

As mentioned by Rob W, this issue exists when using WindowsAzure.Storage > 7.1.2. The issue has apparently been fixed in issue #1141 but this has not yet made it into a release.

Contributor asifferman shared a code snippet in a comment on issue #985 that appears to resolve the problem (it worked perfectly for me).

In case of link rot, and to meet SO rules, here's the post along with the code snippet:

For those (like me) who cannot wait for the next release to get the WebJobs SDK to work with the latest releases of Azure Storage, and based on the explanations of @brettsam, you can simply write a custom IQueueProcessorFactory to create a new CloudQueueMessage in CopyMessageToPoisonQueueAsync.

namespace ConsoleApplication1
{
    using Microsoft.Azure.WebJobs.Host.Queues;
    using Microsoft.WindowsAzure.Storage.Queue;
    using System.Threading;
    using System.Threading.Tasks;

    public class CustomQueueProcessorFactory : IQueueProcessorFactory
    {
        public QueueProcessor Create(QueueProcessorFactoryContext context)
        {
            return new CustomQueueProcessor(context);
        }

        private class CustomQueueProcessor : QueueProcessor
        {
            public CustomQueueProcessor(QueueProcessorFactoryContext context)
                : base(context)
            {
            }

            protected override Task CopyMessageToPoisonQueueAsync(CloudQueueMessage message, CloudQueue poisonQueue, CancellationToken cancellationToken)
            {
                var newMessage = new CloudQueueMessage(message.Id, message.PopReceipt);
                newMessage.SetMessageContent(message.AsBytes);

                return base.CopyMessageToPoisonQueueAsync(newMessage, poisonQueue, cancellationToken);
            }
        }
    }
}

Then in your Main, you just have to set the custom queue processor factory in the job host configuration:

var config = new JobHostConfiguration();
config.Queues.QueueProcessorFactory = new CustomQueueProcessorFactory();

I could get it to work with WindowsAzure.Storage 8.1.1 and Microsoft.Azure.WebJobs 2.0.0. Hope that helps!
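
Putting the pieces together, a minimal host wiring for the workaround looks roughly like this (the MaxDequeueCount value is illustrative):

```csharp
// Sketch: complete Main wiring the custom factory into the host (WebJobs SDK 2.x).
static void Main()
{
    var config = new JobHostConfiguration();
    config.Queues.MaxDequeueCount = 5;   // retries before the message is poisoned
    config.Queues.QueueProcessorFactory = new CustomQueueProcessorFactory();

    var host = new JobHost(config);
    host.RunAndBlock();
}
```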

Paul Fleming
  • 24,238
  • 8
  • 76
  • 113
1

The MaxDequeueCount property works correctly for me if I configure it.

So it is very odd that it is not working for you. When I set config.Queues.MaxDequeueCount = 2; I get the expected result; please refer to the screenshot.

(screenshot)

We could also use the dequeueCount binding to control the retry behavior ourselves. The following demo code processes a message only on its first attempt:

public void ProcessQueueMessage([QueueTrigger("queue")] string item, int dequeueCount, TextWriter logger)
{
    if (dequeueCount == 1)
    {
        if (item == "exception")
        {
            throw new Exception();
        }
        logger.WriteLine($"NewMsge: {item}");
        Console.WriteLine($"NewMsge: {item}");
    }
}

For the log info, please refer to the screenshot:

(screenshot)

Tom Sun - MSFT
  • 24,161
  • 3
  • 30
  • 47
0

I suspect it's because you're not actually running the binaries that you think you are in Azure. This one threw me for a loop as well.

When you're running triggered WebJobs on Azure, publishing a new version of the WebJob doesn't cause the old triggered WebJob to be immediately unloaded and the new one started. If you look at your WebJob logs, I suspect you will not see a restart when you republished.

This is because Kudu by default copies all of your WebJob files to a temp directory and executes them. From the Kudu WebJob docs:

The WebJob is copied to a temporary directory under %TEMP%\jobs\{job type}\{job name}\{random name} and will run from there. This option prevents the original WebJob binaries from being locked, which might cause issues redeploying the WebJob, for example updating an .exe file that is currently running.

The only success I've had in making sure that a newly published triggered WebJob is actually running is to do the following:

  1. Log into the Kudu console. It's https://yourappname.scm.azurewebsites.net. You'll use the same credentials that you do when logging into the Azure Portal.

  2. Once logged in, click on the Process Explorer menu option at the top. Find your WebJob process that's currently running, and kill it.

  3. FTP into your Web App. Browse to the directory containing your WebJob code, and delete it. It should be under /app_data/jobs/triggered/[your webjob name].

  4. Hop over to the portal, browse to the Web App management blade that hosts the WebJob, click on the WebJobs menu option, and confirm that the old WebJob is no longer there.

  5. Publish the new WebJob from Visual Studio.

That should guarantee you that you're running the code that you publish. Hope this helps.
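
If you end up doing this often, steps 2 and 3 can also be scripted against Kudu's REST API instead of the console and FTP. A rough sketch; the endpoints, paths, and site-level deployment credentials shown are assumptions to verify against your own site:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

class KuduCleanup
{
    static async Task Main()
    {
        // Site-level deployment credentials (the user name starts with '$'
        // in the publish profile) - placeholders here.
        var creds = Convert.ToBase64String(
            Encoding.ASCII.GetBytes("$yourappname:your-deployment-password"));

        using (var http = new HttpClient())
        {
            http.BaseAddress = new Uri("https://yourappname.scm.azurewebsites.net/");
            http.DefaultRequestHeaders.Authorization =
                new AuthenticationHeaderValue("Basic", creds);

            // Step 2: list processes, find the WebJob's PID, then
            // DELETE api/processes/{pid} to kill it.
            Console.WriteLine(await http.GetStringAsync("api/processes"));

            // Step 3: delete the deployed WebJob directory recursively.
            await http.DeleteAsync(
                "api/vfs/site/wwwroot/App_Data/jobs/triggered/yourwebjobname/?recursive=true");
        }
    }
}
```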

Rob Reagan
  • 7,313
  • 3
  • 20
  • 49
  • I think I've seen this behavior previously. However I do actually periodically verify my webjob's running status immediately after publishing and in all cases the jobs restart. In any case, I've been seeing this problem for quite some time now, but coincidentally (for other reasons) the other day I went into the Azure portal, deleted the webjobs associated with the website, deleted the contents of wwwroot and then did a full redeploy... Something I just discovered - in some cases I'm getting > 5 retries. So I can't understand how the number is being determined. – Howiecamp Feb 15 '17 at 22:25
  • Are you running more than one instance of your WebJob? – Rob Reagan Feb 15 '17 at 22:31
  • No the website (and hence webjob) are single instance (and BatchSize=1). But the webjobs sdk guarantees that if multiple instances of the webjob are running that only one will pick a given item off the queue. By the way I just added some add'l logging to report the value of the MaxDequeueCount property to verify it was what I set it to and that it hadn't been somehow overridden. – Howiecamp Feb 15 '17 at 22:32
  • Is the JobHostConfiguration.MaxDequeueCount = 1 respected when you run the WebJob locally? – Rob Reagan Feb 15 '17 at 22:48
  • Great suggestion to run locally - I typically don't debug that way so I hadn't considered it. Tried and got some interesting results. In the local case, when the queue-triggered function exceptions it blows up the webjob. This is unlike the behavior in Azure where the webjobs sdk handles the exception. I expected that to be the case locally as well. I am going to make sure my configuration is correct and see if I should expect the same exception/retry behavior locally as in the cloud. – Howiecamp Feb 16 '17 at 18:15
  • Howie, I actually put together a skeleton webjob project with a QueueTrigger and MaxDequeueCount of 1. It worked as expected. Would having the source help you at all? I can push it to a public GitHub repo if you like. If it runs successfully for you, you can play the "what's the difference between these two" game. – Rob Reagan Feb 16 '17 at 18:45
  • That would be great - I'll stick it in Azure and test it. – Howiecamp Feb 16 '17 at 21:52
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/135905/discussion-between-rob-reagan-and-howiecamp). – Rob Reagan Feb 17 '17 at 00:02
  • 1
    Guys, if you can get a repro together for this please log a bug in our repo here https://github.com/Azure/azure-webjobs-sdk/issues. The dequeue count should never go beyond the max - I've not seen that happen. – mathewc Feb 25 '17 at 05:32
  • 1
    @matthewc I'd like to log the bug with as much constructive info as possible. Could you please quickly take a peek at my subsequent post at https://stackoverflow.com/questions/42381354/azure-webjob-logs-looking-for-logs-that-detail-the-sdks-handling-of-queue-tri in which I ask about webjob-related log info I'm looking for? – Howiecamp Feb 27 '17 at 20:04
  • @Rob Please see update (last paragraph) in my other question at https://stackoverflow.com/questions/42381354/azure-webjob-logs-looking-for-logs-that-detail-the-sdks-handling-of-queue-tri. – Howiecamp Feb 28 '17 at 18:45
0

I am seeing the same thing, where messages go way past the max dequeue count. I will post more details in a bit, but I am also seeing a very large number of messages end up in the poison queue. So I suspect that it is adding to the poison queue after 5 attempts but then continuing to retry, which results in hundreds of copies in the poison queue.

  • were you seeing multiple poison messages for the same original message by chance? – Howiecamp Feb 25 '17 at 19:46
  • yes. It was still creating (multiple) messages for same original message, but also putting it back in the standard queue. So it seems to be moving to poison correctly, but not deleting from original queue. – Joel Karr Feb 27 '17 at 02:19
0

For anyone using the Azure WebJobs v3.x SDK:

In v3.x, host.json does not work for WebJobs.

Instead, version 3.x uses the standard ASP.NET Core APIs, so you need to configure it using the ConfigureWebJobs method:

static async Task Main()
{
    var builder = new HostBuilder();
    builder.ConfigureWebJobs(b =>
    {
        b.AddAzureStorageCoreServices();
        b.AddAzureStorage(a => {
            a.BatchSize = 8;
            a.NewBatchThreshold = 4;
            a.MaxDequeueCount = 4;
            a.MaxPollingInterval = TimeSpan.FromSeconds(15);
        });
    });
    var host = builder.Build();
    using (host)
    {
        await host.RunAsync();
    }
}

Docs: https://learn.microsoft.com/pt-pt/azure/app-service/webjobs-sdk-how-to#queue-storage-trigger-configuration

Maicon Heck
  • 2,017
  • 1
  • 20
  • 27