
I am building a solution that implements a RESTful service for interacting with metadata related to federated identity.

I have a class that is registered with Autofac like this:

        builder.RegisterType<ExternalIdpStore>()
               .As<IExternalIdpStore>()
               .As<IStartable>()
               .SingleInstance();

I have a service class (FedApiExtIdpSvc) that is a dependency of an ASP.NET controller class, and that service class in turn takes IExternalIdpStore as a dependency. When I build and run my application from Visual Studio (in Debug mode), I get one instance of ExternalIdpStore injected; its constructor executes only once. When I initiate a controller action that ends up calling a particular method of my ExternalIdpStore class, it works just fine.

When my application is built via Azure DevOps (in Release mode) and deployed to a Kubernetes cluster running under Linux, I initially see one call to the ExternalIdpStore constructor right at application startup. When I initiate the same controller action as above, I see another call to the ExternalIdpStore constructor, and when the same method of the class is called, it fails because the data store hasn't been initialized (it's initialized from the class's Start method, which implements IStartable).
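For context, the shape of the class is roughly like this (a sketch only; the field name and initialization details are assumptions, not the actual code):

```csharp
using Autofac;

public class ExternalIdpStore : IExternalIdpStore, IStartable
{
    // The "key object reference" that turns out to be null on the
    // second instance, because Start() only ran on the first one.
    private IDataStoreClient _client;

    // IStartable.Start is invoked once by Autofac when the container
    // is built — not when the component is first resolved.
    public void Start()
    {
        _client = CreateDataStoreClient();
    }
}
```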

I have added a field to the class that is initialized to a new GUID in the constructor, so I can confirm whether I have two different instances on the cluster. I log this value in the constructor, in the Startup code, and in the method eventually called when the controller action is initiated. Logging confirms that when I run from Visual Studio under Windows, there is just one instance, and the same GUID is logged in all three places. When it runs on the cluster under Linux, logging confirms that the first two log entries reference the same GUID, but the log entry from the method called when the controller action is initiated shows a different GUID, and that a key object reference needed to access the data store is null.
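The instance-ID check looks roughly like this (a sketch; the logger usage and member names are assumptions):

```csharp
public class ExternalIdpStore : IExternalIdpStore, IStartable
{
    // Unique per instance; logged in the constructor, in Start(),
    // and in the controller-invoked method to detect duplicates.
    private readonly Guid _instanceId = Guid.NewGuid();
    private readonly ILogger<ExternalIdpStore> _logger;

    public ExternalIdpStore(ILogger<ExternalIdpStore> logger)
    {
        _logger = logger;
        _logger.LogInformation(
            "ExternalIdpStore constructed, instance {InstanceId}", _instanceId);
    }
}
```

If all three log sites report the same GUID, a single instance is in play; a second GUID means a second construction.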

One of my colleagues thought that I might have more than one registration. So I removed the explicit registration I showed above. The dependency failed to resolve when tested.

I am at a loss as to what to try next, or how I might add some additional logging to diagnose what is going on.

  • SO is a bad place to do interactive debugging and discussions, but I don't know where else you'd ask. Have you tried running the Docker container locally (not in Kubernetes)? Do you have more than one instance in Kubernetes such that you're getting info from multiple pods? Is there something you're not mentioning like a background service that also might be doing something? Tried [Autofac diagnostics](https://autofac.readthedocs.io/en/latest/advanced/debugging.html#diagnostics)? – Travis Illig Mar 09 '22 at 15:53
  • Those are good questions - I have not tried running in Kubernetes locally, I will look into that. There is definitely only one instance in our Dev Kubernetes cluster. The main thing that I find very odd is the fact that it behaves differently under Windows versus under Linux on Kubernetes. – Doug Belkofer Mar 09 '22 at 16:38
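For the Autofac diagnostics suggested in the comment above, a minimal sketch (assuming Autofac 6+ and access to the built container) would be:

```csharp
using Autofac;
using Autofac.Diagnostics;

// Trace every resolve operation so you can see where, and how many
// times, ExternalIdpStore actually gets activated.
var tracer = new DefaultDiagnosticTracer();
tracer.OperationCompleted += (sender, args) =>
    Console.WriteLine(args.TraceContent);
container.SubscribeToDiagnostics(tracer);
```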

1 Answer


So here's what was going on:

  • The reason for getting two sets of log entries was that we have two Kubernetes clusters sending log entries to Splunk. This service was deployed to both. The sets of log entries were coming from pods in different clusters.
  • My code was creating a Cosmos DB account client without setting the connection mode, so it was defaulting to Direct mode.
  • The log entries that showed successful execution were for the cluster running in Azure - in Azure Kubernetes Service (AKS). Accessing the Cosmos DB account from AKS in direct connection mode was succeeding.
  • The log entries that were failing were running in our on-prem Kubernetes cluster. Attempting to connect to the Cosmos DB account was failing because it's on our corporate network which has security restrictions that were preventing direct connection mode from working.
  • The exception thrown when attempting to connect from our on-prem cluster was essentially "lost" because it was from a process running on a background thread.
  • Modifying the logic to add a try-catch around the attempt to connect, and passing the exception back to the caller, allowed logging the exception related to direct connection mode failing.
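The connection setup described above can be sketched like this (variable names are assumptions, and Gateway mode is shown only as the usual workaround when Direct mode's TCP ports are blocked by a corporate firewall):

```csharp
using Microsoft.Azure.Cosmos;

try
{
    // Gateway mode routes all traffic over HTTPS/443, which restricted
    // networks typically allow; the .NET SDK v3 default is Direct mode.
    var client = new CosmosClient(connectionString, new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Gateway
    });
    // ... initialize database/container references here ...
}
catch (Exception ex)
{
    // Without this, a failure on a background thread is silently lost.
    logger.LogError(ex, "Failed to connect to the Cosmos DB account");
    throw; // pass the failure back to the caller so it surfaces in logs
}
```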

Biggest lesson learned: When something "strange" or "odd" or "mysterious" or "unusual" is happening, start looking at your code from the perspective of where it could be throwing an exception that isn't caught - especially if you have background processes!