
I have a really odd and hard-to-diagnose issue with MSBuild / TFS. I have a solution that contains about 12 different build configurations. On the build server it takes maybe 30 minutes to build the lot, and it has worked fine for weeks, but it's now occasionally failing.

Most of the time, when it fails, it's with an error like this:

19:25:45.037 2>TestPlanDocument.cpp(1): fatal error C1093: API call 'GetAssemblyRefHash' failed '0x8007000e' : ErrorMessage: Not enough storage is available to complete this operation. [C:\Builds\1\ICCSim Card Test Controller\ICCSimCTC Release\src\CardTestController\CardTestController.vcxproj]

The error will sometimes happen on a different file. It doesn't happen for every build configuration either; it's very inconsistent, and occasionally all of them even build successfully. There's not much difference between the build configurations, mostly just a few string changes, and of course they all build locally just fine.

The API call in question is usually GetAssemblyRefHash, but not always. I don't think the call itself is the issue, as Googling for GetAssemblyRefHash specifically brings up next to nothing. I suspect there's some kind of resource issue at play, but I'm at a loss as to what: there's plenty of HDD space (hundreds of GB) and plenty of RAM (the machine originally had a 4GB minimum allocated, but it's dynamic as it's a Hyper-V guest, and it never pushed above 2.5GB; I upped this to an 8GB minimum just in case and there's been no change).

I've set the build verbosity to diagnostic and it doesn't really show anything else that's helpful, just the same error.

For reference, the build server is fully up to date on all patches. It's running Windows Server 2012 R2 and has TFS 2013 and VS 2013 installed, both on Update 4.

I'm really at a loss at this point and would appreciate any help or pointers.

EDIT: Just to keep people up to date: the compile toolchain was in 32-bit mode, but even after switching to 64-bit, the issue persists.

Kushan
  • Are you using the [64-bit toolset](http://msdn.microsoft.com/en-us/library/vstudio/x4d2c09s.aspx)? How about the [/MP](http://msdn.microsoft.com/en-us/library/bb385193.aspx) option? – Colin Robertson Dec 17 '14 at 21:21
  • Hi Colin - it's a 32-bit application that's being built, but it is using the 64-bit version of MSBuild. I've not tried tinkering with the /MP option; at the moment I believe it's using the default, so perhaps that's worth toying with. – Kushan Dec 17 '14 at 22:50
  • The solutions I suggested fix a few problems that manifest with various error codes when running into memory and file locking issues during a build, and can smooth out spiky resource usage. Anti-virus programs can cause similar problems. You've already tried my next suggestion, which is to up the minimum RAM in your VM. I'm not sure that's the issue, though. Does your project have ginormous embedded resources? – Colin Robertson Dec 18 '14 at 19:58
  • Sometimes space issues on the build server are not actually space issues. Can you go into the %temp% folder (for the build agent account) and verify how many files are actually in that folder? If there are more than 64K files (not space, quantity) some weird stuff happens. Not sure if this helps, but you have nothing to lose by deleting files in there anyway. – Etienne Dec 18 '14 at 20:55
  • @Colin Robertson: Still having no joy with this issue. I've been monitoring perfmon for things like GDI objects, handles, threads, etc. and nothing appears to be even coming close to a limit. I've upped the RAM, I've upped the pagefile and even tried moving it to another disk, but nothing. The project doesn't have any massive embedded resources, either. It has a few text files that are maybe 500 bytes apiece, but that's it. – Kushan Dec 18 '14 at 23:06
  • @Etienne: Just checked and there's literally nothing in the temp files. I've run Disk Cleanup and cleared about 3.4GB of crap though, see how that goes. – Kushan Dec 18 '14 at 23:08
  • Heisenbugs are the worst. I couldn't find any other cases of this particular bug happening in this way in my searches, just some old references to COM+ service limits. I'd suggest filing a Connect bug, but it would help to have a consistent repro. I think I'd try rearranging the project build order next, but at half an hour per test, that's slow work. – Colin Robertson Dec 18 '14 at 23:35
  • 1) Error code 0x8007000E is actually E_OUTOFMEMORY (or System.OutOfMemoryException in .NET). So, it's going to be memory related. – David Moore Dec 19 '14 at 07:29
  • 2) Try to reproduce this behaviour on your machine. One way that TFS normally differs is that it builds all projects to the same output / drop directory. This may mean that the build process is scanning more binaries than it normally would see on your machine, where every project builds to its own bin dir. To try this, pass /p:OutDir=C:\Temp\Bin\ (for example) to MSBuild, e.g.: MSBuild.exe YourSolution.sln /p:OutDir=C:\Temp\Bin\ But check your TFS build log to get the EXACT MSBuild command, and see how that runs on your dev machine, where troubleshooting is obviously easier. – David Moore Dec 19 '14 at 07:34
  • @DavidMoore: 0x8007000E isn't strictly out of memory - it's used for out of paged pool, out of handles, out of disk, depending on the specific API or program. It isn't used consistently nor precisely – antiduh Dec 19 '14 at 15:18
  • To add to this discussion, today I spun up an entirely new build server. Installed 2012 R2, installed all updates from Windows Update, installed the necessary TFS components, VS 2013 Update 4 and... the bug persists! I have managed to alleviate it somewhat by tinkering with the /MP settings. If I disable /MP it nearly doubles the build time, but it does make it stable (I think, it's not failing as much today). I'm toying with the different /maxprocessors and /MP flags to see if I can strike a balance. – Kushan Dec 19 '14 at 16:31
  • For those looking at helping, we already had some discussion on reddit - I think we eliminated paged pool exhaustion as an option: http://www.reddit.com/r/dotnet/comments/2po4zh/can_anyone_help_me_diagnose_this_msbuild_issue_i/ – antiduh Dec 19 '14 at 22:12
  • @Kushan - Theory N+1: Memory fragmentation is causing Windows to not be able to allocate memory to store the compilation unit. Try using a tool like VMMap to diagnose? http://blogs.technet.com/b/askperf/archive/2010/01/29/vmmap-a-peek-inside-virtual-memory.aspx. Also, do you know if it happens mostly during linking or just unit compilation? – antiduh Dec 19 '14 at 22:16
  • Source code for GetAssemblyRefHash: search for asmlink.cpp, eg, https://github.com/gbarnett/shared-source-cli-2.0 – antiduh Dec 19 '14 at 22:22
  • Just so that people are kept in the loop, I've still not resolved this issue. I've raised a support query with Microsoft who are looking into the compile chain. I intend to update this when they have an answer for me. – Kushan Jan 08 '15 at 16:48

2 Answers

2

I think I found the source, but I still don't know the reason.

Browsing through the Microsoft Shared Source, we can find the source for GetAssemblyRefHash():

HRESULT CAsmLink::GetAssemblyRefHash(mdToken FileToken, const void** ppvHash, DWORD* pcbHash)
{
    if (TypeFromToken(FileToken) != mdtAssemblyRef) {
        VSFAIL( "You can only get AssemblyRef hashes for assemblies!");
        return E_INVALIDARG;
    }

    HRESULT hr;
    CAssembly *file = NULL;
    if (FAILED(hr = m_pImports->GetFile( FileToken, (CFile**)&file)))
        return hr;

    return file->GetHash(ppvHash, pcbHash);
}

Only two places here to investigate: the call to m_pImports->GetFile(), where m_pImports is declared as CAssembly *m_pImports;, and the call to file->GetHash().

m_pImports->GetFile() turns out to be a dead end:

HRESULT CAssembly::GetFile(DWORD index, CFile** file)
{
    if (!file)
        return E_POINTER;

    if (RidFromToken(index) < m_Files.Count()) {
        if ((*file = m_Files.GetAt(RidFromToken(index))))
            return S_OK;
    }
    return ReportError(E_INVALIDARG);
}

That leaves file->GetHash():

HRESULT CAssembly::GetHash(const void ** ppvHash, DWORD *pcbHash)
{
    ASSERT( ppvHash && pcbHash);
    if (IsInMemory()) {
        // We can't hash an InMemory file
        *ppvHash = NULL;
        *pcbHash = 0;
        return S_FALSE;
    }

    if (!m_bDoHash || (m_cbHash && m_pbHash != NULL)) {
        *ppvHash = m_pbHash;
        *pcbHash = m_cbHash;
        return S_OK;
    }

    DWORD cchSize = 0, result;

    // AssemblyRefs ALWAYS use CALG_SHA1
    ALG_ID alg = CALG_SHA1;
    if (StrongNameHashSize( alg, &cchSize) == FALSE)
        return ReportError(StrongNameErrorInfo());

    if ((m_pbHash = new BYTE[cchSize]) == NULL)
        return ReportError(E_OUTOFMEMORY);
    m_cbHash = cchSize;

    if ((result = GetHashFromAssemblyFileW(m_Path, &alg, (BYTE*)m_pbHash, cchSize, &m_cbHash)) != 0) {
        delete [] m_pbHash;
        m_pbHash = 0;
        m_cbHash = 0;
    }
    *ppvHash = m_pbHash;
    *pcbHash = m_cbHash;

    return result == 0 ? S_OK : ReportError(HRESULT_FROM_WIN32(result));
}

We can see that about halfway down, it tries to allocate room to store the BYTE[] hash result, and when that allocation fails, it returns E_OUTOFMEMORY, which is the error code you're seeing:

if ((m_pbHash = new BYTE[cchSize]) == NULL)
    return ReportError(E_OUTOFMEMORY);
m_cbHash = cchSize;

There are other paths to consider, but this seems like the most obvious source. So it looks like the problem is that a plain memory allocation is failing, and a tiny one at that: cchSize is the size of a SHA-1 hash, just 20 bytes.
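As a quick sanity check, a tiny standalone snippet (my own sketch, nothing to do with the compiler itself) confirms that the 0x8007000E in your build log is exactly this E_OUTOFMEMORY value, i.e. the FACILITY_WIN32 wrapping of Win32 error 14, whose message is "Not enough storage is available to complete this operation.":

#include <windows.h>
#include <cstdio>

int main()
{
    // E_OUTOFMEMORY is defined as 0x8007000E, which is also what you get by
    // wrapping Win32 error 14 (ERROR_OUTOFMEMORY) into an HRESULT.
    std::printf("E_OUTOFMEMORY                         = 0x%08lX\n",
                static_cast<unsigned long>(E_OUTOFMEMORY));
    std::printf("HRESULT_FROM_WIN32(ERROR_OUTOFMEMORY) = 0x%08lX\n",
                static_cast<unsigned long>(HRESULT_FROM_WIN32(ERROR_OUTOFMEMORY)));
    return 0;
}

Both lines print 0x8007000E, which matches the ReportError(E_OUTOFMEMORY) path above.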

What could cause this?

  • Lack of free physical memory pages / swap
  • Memory fragmentation in the process
  • Inability to reserve commit space for this in the swap file
  • Lack of address space

At this point, my best guess would be memory fragmentation. Have you triple-checked that the Microsoft C++ compiler is running in 64-bit mode? Perhaps see if you can debug the compiler (Microsoft's symbol servers may be able to help you here), set a breakpoint on that line, and dump the heap when it happens.

Some specifics on diagnosing heap fragmentation: fire up Sysinternals' VMMap when the compiler breaks and look at the free list. You need three free chunks of at least 64 kB to perform an allocation; anything smaller than 64 kB won't get used, and two of the 64 kB chunks are reserved.
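If catching the right moment in VMMap proves awkward, here's a rough standalone sketch of the same idea (my own code, not an existing tool): walk a process's address space with VirtualQueryEx and report the total and largest free regions. Point it at a cl.exe PID around the time of failure; a badly fragmented process shows plenty of total free space but only small individual free regions. Build it with the same bitness as the compiler you're inspecting.

#include <windows.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv)
{
    if (argc != 2) {
        std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    DWORD pid = static_cast<DWORD>(std::atoi(argv[1]));
    HANDLE process = OpenProcess(PROCESS_QUERY_INFORMATION, FALSE, pid);
    if (process == NULL) {
        std::fprintf(stderr, "OpenProcess failed: %lu\n", GetLastError());
        return 1;
    }

    SYSTEM_INFO si;
    GetSystemInfo(&si);

    // Walk every region between the minimum and maximum application addresses
    // and tally up the free ones.
    unsigned char* address    = static_cast<unsigned char*>(si.lpMinimumApplicationAddress);
    unsigned char* maxAddress = static_cast<unsigned char*>(si.lpMaximumApplicationAddress);
    unsigned long long totalFree = 0, largestFree = 0;

    while (address < maxAddress) {
        MEMORY_BASIC_INFORMATION mbi;
        if (VirtualQueryEx(process, address, &mbi, sizeof(mbi)) == 0)
            break;
        if (mbi.State == MEM_FREE) {
            totalFree += mbi.RegionSize;
            if (mbi.RegionSize > largestFree)
                largestFree = mbi.RegionSize;
        }
        address += mbi.RegionSize;
    }

    std::printf("total free:   %llu KB\n", totalFree / 1024);
    std::printf("largest free: %llu KB\n", largestFree / 1024);

    CloseHandle(process);
    return 0;
}

The hard part is catching the right cl.exe instance before it exits; running this in a loop against every cl.exe PID, or from a debugger breakpoint, is one way around that.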

antiduh
  • Okay, I thought I was on to something. I double-checked and although MSBuild was running as a 64-bit process, the C++ compiler used was still the 32-bit version - boo. I did some digging, found the correct parameter within MSBuild to use the 64-bit compiler chain, and set off a build. I used Process Explorer to confirm that the various processes were running 64-bit, and they were. ...and the error is still happening! How on earth can a 64-bit process have trouble allocating memory? For the record, the machine itself still has [oodles and oodles of memory free](http://i.imgur.com/6T5sWNv.png). – Kushan Dec 20 '14 at 17:52
  • Okay, I'm trying to debug further but I'm struggling to pin down the process I need to debug. The problem is, the issue manifests when /MP is used, which means multiple CL.exe processes appear and disappear very quickly - how would you go about debugging this? It's impossible to know in advance which process is going to be the one that fails. – Kushan Dec 22 '14 at 21:15
0

Okay, I have an update on this! I opened a support ticket with Microsoft and have been busy working with them to figure out the issue. They went down the same paths as outlined above and came to the same conclusion: it's not a resource issue.

To cut a long story short, Microsoft has now acknowledged that this is likely a bug in the VC++ compiler, almost certainly caused by a race condition (though this is unconfirmed). There's no word on whether they'll fix it in a future release.

There is a workaround: use the /MP flag at the project level to limit the number of compiler processes spawned by MSBuild, without disabling parallel compilation entirely (which, for me, doubled build times).

To do this, go to your project properties and, under Configuration Properties -> C/C++ -> Command Line, specify the /MP flag followed by a number to limit the number of processes (for example, /MP4).

(Screenshot: Setting the /MP flag)

My build server has 8 virtual CPUs and the normal behaviour is equivalent to /MP8, but this causes the bug to sometimes appear. For me, using /MP4 seems to be enough to keep the bug at bay without causing build times to increase too much. If you're seeing a bug similar to this, you may need to experiment with other numbers such as /MP6 or /MP2.

Kushan
  • To add more information to this issue (yes, it's still ongoing with the compiler team), we have narrowed it down to being a problem with DFS file shares. If the assemblies are on a plain file share, it's absolutely fine; however, if the path goes through a DFS share (something like \\share\assemblies), then the problem will appear. – Kushan Apr 23 '15 at 13:57