Service Fabric internal DNS suddenly stops working

Question

Normally, we can ping another service by its service name using Azure Service Fabric DNS. Around 1am last night, this stopped working. No code or configuration was changed, and nothing was deployed. Now, from within a container, we cannot ping another service:

Some info: We're running Service Fabric in Azure and running a Windows cluster. All our services are running in Docker containers, using Docker for Windows.

Things we've tried: Rebooting the VMs, deleting and re-deploying all apps, restarting the Naming Service and DNS service in the cluster.

Anyone seen anything like this? I'm looking for tips on what could be wrong, or on how to debug this issue further. Again, nothing was deployed and no code or configuration was changed. It just seems Service Fabric's internal DNS suddenly went down and will not come up again. Thanks!
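
Since ping depends on both name resolution and ICMP, it may help to test name resolution on its own from inside a container. Below is a minimal sketch of such a check, assuming Python is available in the container image. The `resolve` helper and the service name in the usage line are hypothetical placeholders, not part of Service Fabric.

```python
import socket
import sys


def resolve(name, port=80):
    """Return the sorted unique IP addresses that `name` resolves to.

    Returns an empty list if the resolver reports a failure.
    """
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed (for example, the DNS server did not answer).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Names to check are passed on the command line.
    for name in sys.argv[1:]:
        addrs = resolve(name)
        print(name, "->", ", ".join(addrs) if addrs else "resolution failed")
```

Saved as `check_dns.py` (a hypothetical file name), it could be run as `python check_dns.py myservice.myapp`, where `myservice.myapp` stands in for the DNS name of one of the services. If a name fails here as well, the problem is in resolution rather than in ICMP being blocked.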

Update: Output of Get-ServiceFabricNodeHealth on one of the nodes:

NodeName              : _B_41
AggregatedHealthState : Ok
HealthEvents          :
                        SourceId              : System.FabricNode
                        Property              : Certificate_client
                        HealthState           : Ok
                        SequenceNumber        : 132078454391466815
                        SentAt                : 7/17/2019 1:57:19 PM
                        ReceivedAt            : 7/17/2019 1:57:24 PM
                        TTL                   : Infinite
                        Description           : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is
                        338:11:20:13.853, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if
                        needed, you can adjust it to fit your refresh process.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Warning->Ok = 7/13/2019 11:22:17 AM, LastError = 1/1/0001 12:00:00 AM

                        SourceId              : System.FabricNode
                        Property              : Certificate_cluster
                        HealthState           : Ok
                        SequenceNumber        : 132078386480915827
                        SentAt                : 7/17/2019 12:04:08 PM
                        ReceivedAt            : 7/17/2019 12:04:23 PM
                        TTL                   : Infinite
                        Description           : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is
                        338:13:13:24.908, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if
                        needed, you can adjust it to fit your refresh process.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Warning->Ok = 7/13/2019 7:04:12 AM, LastError = 1/1/0001 12:00:00 AM

                        SourceId              : System.FabricNode
                        Property              : Certificate_server
                        HealthState           : Ok
                        SequenceNumber        : 132078441374480374
                        SentAt                : 7/17/2019 1:35:37 PM
                        ReceivedAt            : 7/17/2019 1:35:54 PM
                        TTL                   : Infinite
                        Description           : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is
                        338:11:41:55.551, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if
                        needed, you can adjust it to fit your refresh process.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Warning->Ok = 7/13/2019 4:35:41 AM, LastError = 1/1/0001 12:00:00 AM

                        SourceId              : System.RA
                        Property              : RAStoreProvider
                        HealthState           : Ok
                        SequenceNumber        : 132072866375071389
                        SentAt                : 7/11/2019 2:43:57 AM
                        ReceivedAt            : 7/13/2019 1:15:33 PM
                        TTL                   : Infinite
                        Description           : Store provider type ESE created and opened successfully.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Warning->Ok = 7/11/2019 2:44:27 AM, LastError = 1/1/0001 12:00:00 AM

                        SourceId              : System.FM
                        Property              : State
                        HealthState           : Ok
                        SequenceNumber        : 181
                        SentAt                : 7/11/2019 2:44:15 AM
                        ReceivedAt            : 7/13/2019 1:15:33 PM
                        TTL                   : Infinite
                        Description           : Fabric node is up.
                        RemoveWhenExpired     : False
                        IsExpired             : False
                        Transitions           : Warning->Ok = 7/11/2019 2:44:44 AM, LastError = 1/1/0001 12:00:00 AM

Update 2: Network interface info from within Docker container:

Are you seeing any errors or warnings in the System Health Reports? Maybe something to point us in the general direction of where the issue is coming from? https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-understand-and-troubleshoot-with-system-health-reports — micahmckittrick, Jul 15 '19 at 19:50
There are some issues with DNS inside windows containers. Here's a workaround that might help: https://github.com/docker/for-win/issues/2760#issuecomment-430889666 — LoekD, Jul 17 '19 at 05:53
@Micah_MSFT - Nothing of interest from the health report. I've attached it above. We also have an open ticket with Microsoft (I can send you the # if you want) but we haven't gotten much traction from them yet. This issue took our production site down for over 12 hours on Sunday, so we're pretty furious at the moment. — Mike Christensen, Jul 17 '19 at 15:25
@LoekD - The link seems to indicate you might have another adapter that has a lower InterfaceMetric (such as a WiFi adapter or something). In my Docker container I have two interfaces, `vEthernet (Ethernet) 2` with a metric of 5000, and a `Loopback Pseudo-Interface 3` with a metric of 75. I haven't compared this to our working environment, but it looks right? I've attached a screen shot above as well. — Mike Christensen, Jul 17 '19 at 15:35
@LoekD Confirmed the network configuration on the working machine looks identical. So, I don't think that's it.. — Mike Christensen, Jul 17 '19 at 15:37
@MikeChristensen can you share the ticket number? I can take a look offline and jump on the internal thread — micahmckittrick, Jul 17 '19 at 19:06
