Normally, we can ping another service by its service name using Azure Service Fabric DNS. Around 1am last night, this stopped working. No code or configuration was changed, and nothing was deployed. Now, from within a container, we cannot ping another service:
Some info: We're running a Windows Service Fabric cluster in Azure. All our services run in Docker containers, using Docker for Windows.
Things we've tried: Rebooting the VMs, deleting and re-deploying all apps, restarting the Naming Service and DNS service in the cluster.
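One check that might help separate a name-resolution failure from an ICMP/reachability failure is to ask the container's resolver directly instead of using ping. The following is only a minimal sketch, assuming Python is available inside the container; `resolve` is a hypothetical helper name, and `myservice.myapp` in the comment is a placeholder, not one of our real service names.

```python
import socket


def resolve(name):
    """Return the sorted unique IP addresses the system resolver gives for
    `name`, or an empty list if the lookup fails."""
    try:
        # getaddrinfo goes through the resolver configured for this machine,
        # which is the same path a service-name lookup would take.
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


# Example (placeholder name, substitute a real service DNS name):
#   print(resolve("myservice.myapp"))
```

An empty list would point at the lookup itself failing, while a returned address combined with a failing ping would point at routing or firewalling rather than DNS.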
Has anyone seen anything like this? I'm looking for tips on what could be wrong, or on how to debug this issue further. Again, nothing was deployed and no code or configuration was changed. It seems that Service Fabric's internal DNS suddenly went down and will not come back up. Thanks!
Update: Output of Get-ServiceFabricNodeHealth on one of the nodes:
NodeName : _B_41
AggregatedHealthState : Ok
HealthEvents :
    SourceId : System.FabricNode
    Property : Certificate_client
    HealthState : Ok
    SequenceNumber : 132078454391466815
    SentAt : 7/17/2019 1:57:19 PM
    ReceivedAt : 7/17/2019 1:57:24 PM
    TTL : Infinite
    Description : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is 338:11:20:13.853, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if needed, you can adjust it to fit your refresh process.
    RemoveWhenExpired : False
    IsExpired : False
    Transitions : Warning->Ok = 7/13/2019 11:22:17 AM, LastError = 1/1/0001 12:00:00 AM

    SourceId : System.FabricNode
    Property : Certificate_cluster
    HealthState : Ok
    SequenceNumber : 132078386480915827
    SentAt : 7/17/2019 12:04:08 PM
    ReceivedAt : 7/17/2019 12:04:23 PM
    TTL : Infinite
    Description : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is 338:13:13:24.908, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if needed, you can adjust it to fit your refresh process.
    RemoveWhenExpired : False
    IsExpired : False
    Transitions : Warning->Ok = 7/13/2019 7:04:12 AM, LastError = 1/1/0001 12:00:00 AM

    SourceId : System.FabricNode
    Property : Certificate_server
    HealthState : Ok
    SequenceNumber : 132078441374480374
    SentAt : 7/17/2019 1:35:37 PM
    ReceivedAt : 7/17/2019 1:35:54 PM
    TTL : Infinite
    Description : Certificate expiration: thumbprint = adf7ae93a524d181106b0467a1f8e3375e1bf65f, expiration = 2020-06-20 01:17:33.000, remaining lifetime is 338:11:41:55.551, please refresh ahead of time to avoid catastrophic failure. Warning threshold Security/CertificateExpirySafetyMargin is configured at 30:0:00:00.000, if needed, you can adjust it to fit your refresh process.
    RemoveWhenExpired : False
    IsExpired : False
    Transitions : Warning->Ok = 7/13/2019 4:35:41 AM, LastError = 1/1/0001 12:00:00 AM

    SourceId : System.RA
    Property : RAStoreProvider
    HealthState : Ok
    SequenceNumber : 132072866375071389
    SentAt : 7/11/2019 2:43:57 AM
    ReceivedAt : 7/13/2019 1:15:33 PM
    TTL : Infinite
    Description : Store provider type ESE created and opened successfully.
    RemoveWhenExpired : False
    IsExpired : False
    Transitions : Warning->Ok = 7/11/2019 2:44:27 AM, LastError = 1/1/0001 12:00:00 AM

    SourceId : System.FM
    Property : State
    HealthState : Ok
    SequenceNumber : 181
    SentAt : 7/11/2019 2:44:15 AM
    ReceivedAt : 7/13/2019 1:15:33 PM
    TTL : Infinite
    Description : Fabric node is up.
    RemoveWhenExpired : False
    IsExpired : False
    Transitions : Warning->Ok = 7/11/2019 2:44:44 AM, LastError = 1/1/0001 12:00:00 AM
Update 2: Network interface info from within Docker container: