9

I need some advice on how to diagnose a hanging build. It’s only been happening in the last week or two and I have good reason to suspect it’s something that I’ve done recently and not just a coincidence

Setup

  • TFS 2013
  • 4 machine setup - 2 app tiers (in process of deprecating one of them), 1 sql server, 1 build server running 2 agents.
  • Build Controller is running on 2nd app tier along with the Job Agent
  • 1st App tier is serving the website (although that machine will soon be shutdown and everything will be passed to the 2nd app tier as the machine is getting old)

Symptoms

  • All executed builds (doesn’t appear to matter which build process template) never get marked as done, the last step always seems to be the same step “Copy Files to Drop Location”/“Workspace and Copy Files to Drop Location”/”Copy Binaries to drop, Reset the environment” (named differently in each build template)
  • The files appear to be getting dropped successfully in the build drop folder
  • Looking at the task manager it appears that all the build processes on the build server are exited (only TFSBuildServiceHost
  • Builds show their normal steps/logging while executing
  • Primary app tier has related warnings in the event logs (see warnings below)

Recent Changes

  • Installed Xamarin Android/iOS on the build server
  • Installed a few custom built plugins for Job Agent, Message Queue, and Web Services (been using them for years just had them disabled the last few weeks due to a app tier migration)
  • Installed Tiago’s Task Board Enhancer (again been using this for a long time, just had it disabled recently)
  • About a month ago we added the 2nd app tier and moved the sql off to another machine

What I’ve Tried

  • Rebooting both App tiers and build server
  • Uninstalling Xamarin (although I suspect some parts are still floating around as the Bonjour service appears to still be installed)
  • Removing the custom plugins
  • Turned logging diagnostics right up on one of the builds – nothing particularly of interest seems to turn up
  • Run the Best Practice Analyzer (nothing too unusual shows up)
  • Multiple build process templates (defaulttemplate, defaulttemplate.11.1, tfvctemplate.12.xaml)
  • Multiple build definitions
  • Checked the event logs of both AppTiers and Build server

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 3/02/2014 12:54:06 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130357641583538280 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 2508 Thread Id: 2504

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/XXXX/_api/_build/stop?__v=4 Method: ApiBuild.stop Parameters: uri = vstfs:///Build/Build/34064 User Agent: Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.102 Safari/537.36 Unique Id: 00000000-0000-0000-0000-000000000000

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 30/01/2014 11:10:01 p.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: 0dc282b5-59a8-4941-b541-a4f7d314cd0f Process Details: Process Name: w3wp Process Id: 70320 Thread Id: 14540

Detailed Message: A request for service host XXXX has been executing for 37 seconds, exceeding the warning threshold of 30. Request details: Request Context Details Url: /tfs/XXXX/Build/v4.0/BuildService.asmx Method: StopBuilds Parameters: uris[0] = vstfs:///Build/Build/34051 uris = Count = 1 User Agent: Team Foundation (devenv.exe, 12.0.21005.1, Premium, SKU:16) Unique Id: 4d2d3213-fd41-4c4d-8ab0-b87619c96a42

The Team Foundation service host request monitor has detected the following condition: Date (UTC): 31/01/2014 3:14:17 a.m. Machine: CODEBASE Application Domain: /LM/W3SVC/1/ROOT/tfs-1-130355232548668648 Assembly: Microsoft.TeamFoundation.Framework.Server, Version=12.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a; v4.0.30319 Service Host: Process Details: Process Name: w3wp Process Id: 70320
Thread Id: 14540

Detailed Message: There are no active requests for service host XXXX that exceed the warning threshold of 30.

A quick google suggests upping the timeout in the tfs registry (http://xavierdilipkumar.com/post/2013/07/04/TFS-event-7005-and-7006-warning.aspx) I've tried that and it doesn't appear to change anything.

Zword
  • 6,605
  • 3
  • 27
  • 52
Betty
  • 9,109
  • 2
  • 34
  • 48
  • After rereading those warnings it appears that they are related to me stopping the build from the web front end and not necessarily directly related to the issue. – Betty Feb 06 '14 at 20:08
  • kinda related but just give it a shot: http://geekswithblogs.net/kjones/Default.aspx – Isaiah4110 Feb 06 '14 at 21:58
  • Turns out the fix for us was to add an entry to the hosts file that pointed the SharePoint URL to 127.0.0.1 (the loopback address). We had already configured this for the other three SharePoint web application we had launched. We had neglected to do this with our new web application, which was only recently put into production. – Isaiah4110 Feb 06 '14 at 21:59
  • @Isaiah4110 no I'm pretty sure that's completely unrelated. – Betty Feb 06 '14 at 22:52
  • I know they are completely unrelated, but what i was talking about was the nature of the fix. Anyways, where was the build controller running before you added the 2nd app tier? Did the user running the build controller change? did it start happening after the build controller was moved to the 2 app tier – Isaiah4110 Feb 06 '14 at 23:26
  • The problem started a few weeks after moving the build controller to the 2nd app tier, it was previously working fine in this configuration. – Betty Feb 07 '14 at 02:31
  • My current feeling is that it is network related but probably something to do with what Xamarin did to the build server (installed Bonjour). I might try run wireshark to see how things are interacting – Betty Feb 07 '14 at 02:33
  • yup you are in the right path, that would be a good start to watch the network packet during that apibuild.stop call. Let me know what happens! Always fiddler though :) – Isaiah4110 Feb 07 '14 at 14:46
  • as i said earlier i'm pretty sure that api stop call is me manually killing the build from the web frontend (hence why it mentions chrome). I started with wireshark instead of fiddler because it's a pain to convince random applications (specially something like tfs with so many moving parts) to use a proxy. – Betty Feb 08 '14 at 08:00
  • Did you try disabling the plug-in to see if it makes difference. There may be a conflict there. – Nick Feb 12 '14 at 23:34

7 Answers7

3

can you look in the tfs bs logs at

Event Viewer -> Applications and Services Logs -> Microsoft -> Team Foundation Server -> Build-Services -> Operational

these timeouts generally relate to permissions. you should look for TF215106 access denied events. Although the files appear to be there, are they all the current date or are there some with different (older) dates? Also are they any alerts/steps happening when the file drop occurs?

Other than that it could be timing out because one of the dependencies is being used by another service.

Bozman
  • 477
  • 4
  • 16
  • There is an interesting error in those logs. Build machine lost connectivity to message queue tfsmq://buildservicehost-1/. Reason: TF400324: Team Foundation services are not available from server http://xxxx Technical information (for administrator): The underlying connection was closed: An unexpected error occurred on a receive. – Betty Feb 13 '14 at 01:46
  • have you tried clearing the cache for each tier in tfs? i think it's \%programfiles%\\Application Tier\Web Services\\*_data files. back them up though. – Bozman Feb 13 '14 at 15:17
2

You might fire up Sysinternals Process Monitor to see when the processes actually exit and what they were doing (Process Monitor monitors "real-time file system, Registry and process/thread activity").

Mark Leighton Fisher
  • 5,609
  • 2
  • 18
  • 29
1

The best course of action is to call Microsoft Support and open a Service Request. Make sure it gets priority A - your TFS production environment is not working - and be prepared to give them support and access.

The only hint from the log is that call to ApiBuild.stop. It suggests that the build workflow completed, so the code hosting it is calling back to the AT to mark the build completed. As you have no warnings from previous calls, it could be some problem at the database level. You may try activating SQL Tracing but it's not a trivial task, as you should be able to compare the trace with a working one.

Good luck

Giulio Vian
  • 8,248
  • 2
  • 33
  • 41
  • I logged a partner support ticket roughly the same time as posting this, unfortunately Microsoft were not a lot of help in this case (I've had great support from them before, albeit a bit slow). – Betty Feb 25 '14 at 09:22
1

I'm reluctant to mark this as an answer because I'm not entirely sure why it worked.

Suspecting something was wrong with the build machine I created a new Build Agent on a fresh install - the hanging issue still occurred.

I then added a Build Controller to that machine and noticed that new builds using that controller would complete. This suggested that there was a communication issue between either the BA and the BC, or the BA and the primary AT.

Given that our primary AT had other issues we decided to remove it from the picture, we switched the DNS to point at the second AT and disabled all services on the old primary. Instantly builds started to complete (including the ones that had been stuck for a number of days).

I still don't know which component was broken or why, especially since it worked fine in this configuration for a month prior. I can only assume there was either another change that I am not aware of, or the corruption of the primary AT was causing bigger issues.

Betty
  • 9,109
  • 2
  • 34
  • 48
1

We were having the same problem here, the builds were kept open even after successfully passing all workflow stages.

I logged into the build machine and noticed the build controller was "running 6 builds" for some reason, even though there were no builds at all showing in the queue in Visual Studio.

After restarting the controller, the next build worked the first time.

Just wanted to let this one here as a possible answer. I'm not sure yet why the controller had those stuck builds though.

julealgon
  • 7,072
  • 3
  • 32
  • 77
0

I had this issue when an activity tried to log a huge message in the build log (namely the FxCopCmd activity from the CodePlex TFS Build Extensions project).

The build agent would successfully finish the build but the controller had to chew the huge message into the build log, and it was silently crashing/hanging.

I was able to track the issue down by navigating to C:\Users\[TfsServiceAccount]\AppData\Local\Temp\BuildAgent\[AgentNumber]\Logs\[BuildNumber]\ActivityLog.xml.

The last build message was truncated and by looking at the content, I recognized the FxCop output. In my case, I just set the LogToConsole parameter to False for the FxCop activity in the build process template, and the build completed successfully.

Johann Blais
  • 9,389
  • 6
  • 45
  • 65
0

Also appears to happen if the build agent cannot connect to the build controller server on port 9191.

Easily testable with a telnet client.

Appears that my server decided it was on an unknown network and kicked the firewall into overdrive. (The second time I got this issue, not sure if this was the reason I got it the first time but it seems reasonable).

Betty
  • 9,109
  • 2
  • 34
  • 48