
To create an Azure Service Fabric (SF) test environment, I created three Azure VMs within a DevTest Lab. These are to be secured with X.509 certificates.

I used the information Here & Here

The machines are:

  • Windows Server 2016 Datacenter
  • On the same virtual network
  • All firewalls are disabled (Can ping each machine from the other)
  • All using the same administrator account

I created self-signed certificates using the certsetup.ps1 file provided by the documentation, with a single certificate for server and cluster combined, as suggested.

Running TestConfiguration.ps1 gives the following output:

LocalAdminPrivilege        : True
IsJsonValid                : True
IsCabValid                 :
RequiredPortsOpen          : True
RemoteRegistryAvailable    : True
FirewallAvailable          : True
RpcCheckPassed             : True
NoConflictingInstallations : True
FabricInstallable          : True
DataDrivesAvailable        : True
Passed                     : True

The IsCabValid field is blank, but the "Passed" field still suggests that installation is possible, so I ran the next PowerShell command to begin installation:

.\CreateServiceFabricCluster.ps1 -ClusterConfigFilePath .\ClusterConfig.X509.MultiMachine.json

After this command, the process starts and the console window fills with the following text, which suggests that inter-node communication is fine:

Creating Service Fabric Cluster...
If it's taking too long, please check in Task Manager details and see if Fabric.exe for each node is running. If not, please look at: 1. traces in DeploymentTraces directory and 2. traces in FabricLogRoot configured in ClusterConfig.json.
Trace folder already exists. Traces will be written to existing trace folder: C:\StandaloneCluster\DeploymentTraces
Running Best Practices Analyzer...
Best Practices Analyzer completed successfully.
Creating Service Fabric Cluster...
Processing and validating cluster config.
Configuring nodes.
Default installation directory chosen based on system drive of machine '10.0.0.4'.
Copying installer to all machines.
Configuring machine '10.0.0.4'.
Configuring machine '10.0.0.5'.
Configuring machine '10.0.0.6'.
Machine 10.0.0.6 configured.
Machine 10.0.0.5 configured.
Machine 10.0.0.4 configured.
Running Fabric service installation.
Successfully started FabricInstallerSvc on machine 10.0.0.4
Successfully started FabricInstallerSvc on machine 10.0.0.6
Successfully started FabricInstallerSvc on machine 10.0.0.5

After a pause of a few minutes, a timeout error is displayed with no real indication as to why. I have searched the Windows event logs on the nodes but have not been able to uncover any further information. The error displayed in the PowerShell console is as follows:

 Timed out waiting for Installer Service to complete for machine 10.0.0.4. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
Timed out waiting for Installer Service to complete for machine 10.0.0.6. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
CreateCluster Error: System.AggregateException: One or more errors occurred. ---> System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Parallel.ForWorker[TLocal](Int32 fromInclusive, Int32 toExclusive, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Func`4 bodyWithLocal, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.RunFabricServices(List`1 machines, FabricPackageType fabricPackageType)
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.<CreateClusterAsyncInternal>d__7.MoveNext()
---> (Inner Exception #0) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.5. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---

---> (Inner Exception #1) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.6. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---

---> (Inner Exception #2) System.ServiceProcess.TimeoutException: Timed out waiting for Installer Service to complete for machine 10.0.0.4. Investigation order: FabricInstallerService -> FabricSetup -> FabricDeployer -> Fabric
   at Microsoft.ServiceFabric.DeploymentManager.DeploymentManagerInternal.StartAndValidateInstallerServiceCompletion(String machineName, ServiceController installerSvc)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass17_0`1.<ForWorker>b__1()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass176_0.<ExecuteSelfReplicating>b__0(Object )<---

Trace folder already exists. Traces will be written to existing trace folder: C:\StandaloneCluster\DeploymentTraces
Cleaning up faulted installation.
Removing configuration from machine 10.0.0.5
Removing configuration from machine 10.0.0.4
Removing configuration from machine 10.0.0.6

Is there an Azure SF aficionado out there who can shed some light on the matter, or offer any suggestions as to where I am going wrong?

Hicki
  • Have you tried uninstalling the SDK as described here: https://stackoverflow.com/questions/38106961/create-on-premise-service-fabric-cluster-fails-with-exception?rq=1 – Oliver May 31 '17 at 14:48
  • @Oliver The SDK wasn't present on the machine at the time of install attempt, otherwise the TestConfiguration.ps1 would fail. – Hicki Jun 01 '17 at 11:37
  • What size are your VMs? You may need faster ones, or you may need to change the timeout on the installer (I believe there's a switch to do that) – Mardoxx Jun 01 '17 at 20:48
  • Run deployment using the -NoCleanupOnFailure flag and check the event logs under "Applications and Services Logs > Microsoft-Service Fabric > Admin". Error/Warning logs should indicate if there is an issue reading the cert, or if there is any other blocking issue. Check that the cert is ACLed to NETWORK SERVICE on each machine, as that is one of the listed requirements written in the doc. – Max Burlik Jul 12 '17 at 19:53

2 Answers


This is a generic failure pattern seen when FabricHost is failing to come up, which could happen for a number of reasons.

Since you are using raw Azure VMs instead of the SF VMSS deployment, you will also have to make sure the upstream ports set under the cluster configuration NodeType are open on each machine. To test that this is set up correctly, try deploying an unsecured cluster across these VMs first.
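As a rough aid for the port check described above, here is a minimal Python sketch that lists the endpoint ports declared per node type in a cluster config and probes one host/port pair over TCP. The key names (`properties`, `nodeTypes`, `clientConnectionEndpointPort`, and so on) are assumptions based on the standalone sample configs, and the helper names `endpoint_ports` and `probe` are made up for illustration; adjust them to match your own ClusterConfig file.

```python
import socket

# Assumed endpoint key names; verify against your own ClusterConfig JSON.
PORT_KEYS = (
    "clientConnectionEndpointPort",
    "clusterConnectionEndpointPort",
    "leaseDriverEndpointPort",
    "serviceConnectionEndpointPort",
    "httpGatewayEndpointPort",
)


def endpoint_ports(config):
    """Map each node type name to the endpoint ports it declares."""
    ports = {}
    # Assumed layout: config["properties"]["nodeTypes"] is a list of dicts.
    for node_type in config.get("properties", {}).get("nodeTypes", []):
        ports[node_type["name"]] = {
            key: int(node_type[key]) for key in PORT_KEYS if key in node_type
        }
    return ports


def probe(host, port, timeout=3.0):
    """Try a TCP connection and classify the result.

    "open"        -> something accepted the connection
    "refused"     -> host answered, but nothing is listening on the port
    "unreachable" -> timeout or other network error
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        return "unreachable"
```

To use it, load the config with `json.load`, pass the result to `endpoint_ports`, and call `probe` for each node IP and port from one of the other machines. Before the cluster is up, nothing listens on these ports, so "refused" is expected there. "unreachable" more likely points to filtering between the machines, though that interpretation is only a heuristic.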

If the unsecured deployment works, investigate the secured one by running the deployment with the -NoCleanupOnFailure flag and then checking the event logs on one of the failing machines under "Applications and Services Logs > Microsoft-Service Fabric > Admin".

Error/Warning logs should indicate if there is an issue reading the cert, or if there is any other blocking issue. Check that the cert is ACLed to NETWORK SERVICE on each machine, as that is one of the listed requirements written in the doc.

Another common failure occurs when the cert thumbprint contains invalid characters. A bug in the Windows cert management tool causes the displayed thumbprint to include hidden invalid characters, which lead to deployment issues when the thumbprint is copied straight into the config. Please use a hex editor (such as HxD) to verify that the thumbprint in the config contains only valid characters.
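If a hex editor is not to hand, the same check can be scripted. The following is a minimal Python sketch that assumes the thumbprint is a SHA-1 hash written as 40 hexadecimal characters with no spaces. It reports the position and Unicode name of every character that is not a hex digit. The function name `inspect_thumbprint` is made up for illustration.

```python
import re
import unicodedata

# Assumption: a valid thumbprint is exactly 40 hex digits (SHA-1), no spaces.
HEX_THUMBPRINT = re.compile(r"[0-9A-Fa-f]{40}")
HEX_DIGITS = set("0123456789abcdefABCDEF")


def inspect_thumbprint(raw):
    """Return (is_valid, offending).

    offending is a list of (index, unicode_name) for each non-hex character,
    so hidden characters show up by name instead of staying invisible.
    """
    offending = [
        # Fall back to the code point when a character has no Unicode name.
        (index, unicodedata.name(ch, "U+%04X" % ord(ch)))
        for index, ch in enumerate(raw)
        if ch not in HEX_DIGITS
    ]
    # fullmatch rejects leading or trailing extras, including a final newline.
    return HEX_THUMBPRINT.fullmatch(raw) is not None, offending
```

Pass in the thumbprint string exactly as read from the config file (for example after `json.load` with `encoding="utf-8"`). A clean value returns `(True, [])`; anything else lists where the stray characters sit so they can be deleted and the value retyped.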

If this doesn't provide enough information for you to figure out the issue, please run the Log Collector tool from Tools\Microsoft.Azure.ServiceFabric.WindowsServer.SupportPackage.zip contained in the Standalone package, and upload the collected logs to your choice of storage to share with our team. You can mail the link to sfsa@microsoft.com and we can help you look into this.

Max Burlik

For cluster, server, and reverseProxy certs: 1) the privilege to load their private keys needs to be ACLed to 'Network Service', and 2) their CA certs need to be added to TrustedRoot.