
Windows Server 2012 is the primary server environment. All machines are located in a server room that services a network testing lab. There are over 400 servers in the lab, most of which are virtualized, and give us a working range of 3500-4000 machines to be used by the test lab.

Scenario: Our backup DC went down about a month ago and no one realized it since the previous admin had just left and our new infrastructure guy was learning our network on the spot. The DC had hardware failure and is a complete loss.

Our PDC started having issues replicating and eventually gave us a message that it declared itself invalid due to not completing the role of managing the DCs. This task could not be validated (duh!).

About a week ago, we ran a PowerShell script to add 400 machines to the AD domain. About 25% of the way through, the script quit recognizing the PDC as the domain controller. Ever since then, we have been unable to add machines to the domain, and the other clients are running out of cached-logon time, since they are logged into a domain with no actual PDC.

We were unable to remove the damaged DC from the PDC, and because of this we can't back up the AD and migrate the data.

What we have done so far

  • Created two new DCs from scratch.
  • The original PDC is now disconnected from the lab while we try to get the new one to take over.

Problems

  • We are unable to back up our GPOs and unable to export our AD list.
  • We can't demote the original PDC; it errors out.

Is there a way to avoid completely rebuilding our AD from scratch?

Are there options for recovering our GPOs without manually recreating them?

dcdiag data

After running dcdiag, all tests except CheckSDRefDom failed. All LDAP-dependent tests failed with LDAP Error 0x3a (58). FSMO checks succeeded. DNS failed to respond and was reported as not started even though the service was started.

I think we will take Mathias's suggestion and take this as an opportunity to redesign and learn.

Greg Mason
  • PDC/BDC? The first thing to do is get rid of NT 4.0 and update to something more modern. – Michael Hampton Nov 26 '13 at 20:53
  • This is very confusing. As Michael stated - PDC and BDCs haven't existed since Windows NT4, so it's difficult to understand what you're actually talking about. – MDMarra Nov 26 '13 at 21:00
  • Forgive my non-admin being :) Well, my understanding was that we have a single DC for our forest and one declared as a backup/failover DC. Was this not correct? Sounded fairly Primary/Backup to me but probably my need to grasp for something that made sense with what I know rather than current terms. – Greg Mason Nov 26 '13 at 21:05
  • You really need to elaborate on what exactly the errors were on your surviving DC. It seems likely that it could have been, and still could be, fixed. You say you have set up new machines. Please elaborate more on what exactly you have done. Did you set up a new domain already, or are those machines just standalone? We need a lot more specifics about what exactly you have done. – Zoredache Nov 26 '13 at 21:05
  • run dcdiag /e /c /v and post a link to the results – mfinni Nov 26 '13 at 21:05
  • @TheCleaner - I think this is a good enough problem that it doesn't matter that it's in a lab. Lab problems are usually closed because they are exclusive to the lab - something like this is totally possible for happening in real life. – Mark Henderson Nov 26 '13 at 21:08
  • @TheCleaner The 'lab' is something like 500 machines with 4 network connections each, 1/2 of which are 10GbE connections that are virtualized with 8 VMs per port. The actual lab area is just for systems under test; this is an extensive infrastructure built around the support of a large scale testing environment. All these machines are running high-end and late model hardware. I did come into this problem late; our current admin is fighting a lot of fires right now and I am attempting to help him get this resolved. I will get the dcdiag data posted asap. – Greg Mason Nov 26 '13 at 21:20
  • OK, OK...I'll delete my comment. I wasn't aware of the setup. That said, I'm actually concerned that any answer given will just not be enough. There's the whole "book answer" here. I'd be willing to attempt an answer myself about how to get your GPOs back, etc. but my concern is that it will be a huge post that would only get you part of the way there. – TheCleaner Nov 26 '13 at 21:24
  • Could be worth a quick peek just to see if the working DC has all operation masters roles, or if it needs to seize them. The failed bulk add sounds like a lost RID master. No RIDM would also hinder the adding of a new DC. Google "seizing fsmo roles" and see what it gives. – ErikE Nov 26 '13 at 21:38
  • `Well, my understanding was that we have a single DC for our forest and one declared as a backup/failover DC. Was this not correct?` No - that's not correct. Domain controllers are either in the domain or they aren't. You can't designate one as backup/failover. They are multi-master peers. – MDMarra Nov 26 '13 at 21:46
  • @MDMarra Ahh, ok. The "primary" just refers to the master role, where in all other cases the DCs are equal. Yeah, I guess a lot has changed - the Hyper-V documentation still refers to these masters as a PDC often enough that I fell into the trap of old knowledge. Thanks for the clarity. – Greg Mason Nov 26 '13 at 22:13
  • What Hyper-V documentation? Give us some links; I want to spam the author with complaints. Unless you are talking about the 'PDC emulator', which is a particular FSMO role that is responsible for the timekeeping on the domain. Which is about the only thing Hyper-V would care about. – Zoredache Nov 26 '13 at 22:32
  • Don't take this the wrong way, but the smartest move now might just be to buy even just one day of consultancy from someone who properly understands AD, Domain Controllers, and their various roles, and use that time to assess precisely what is going on with the current domain. It sounds like the DC that cratered had at least one of the operations master roles on it (at least RID master, as @greenstonewalker suggests) and possibly some others too. I'd certainly say that right now there's no definite need to go building a new domain. – Rob Moir Nov 26 '13 at 22:39
  • @Zoredache The first one that came to mind was this Technet article on [AD DS Virtualization](http://technet.microsoft.com/en-us/library/hh831734.aspx) – Greg Mason Nov 26 '13 at 22:40
  • @RobM I would have done that if it was an option. You know how things often are though: Millions of $$$ in hardware and management won't spend a dime on the needed help to make it work. I don't even work in IT which should be obvious from my questions, just an apps developer. – Greg Mason Nov 26 '13 at 22:52
  • @GregMason PDC Emulator is different than PDC. – MDMarra Nov 26 '13 at 23:24
  • @GregMason I've updated my answer to help you with the existing GPOs, hope it helps :) – Mathias R. Jessen Nov 27 '13 at 01:16
  • Greg, what kind of backups do you guys have of the DC(s) prior to the problems? – Trondh Nov 28 '13 at 10:07
  • @Trondh Old admin left during this new ramp up in the network for virtualization. The only reason we had a 2nd DC was that I threw a fit since we didn't have any redundancy and my apps depended on it. We were in the process of setting up an actual backup solution for our servers with the new tech that found himself in charge of the network when this happened. Yeah it sucks, but it really was literally on our short list for end of year. – Greg Mason Nov 29 '13 at 20:11

2 Answers

My guess is that the dead server was the RID Master.

Domain Controllers allocate unique SIDs by using a pool of Relative IDs (RIDs). One DC in a domain, the RID Master, is responsible for giving unique pools to each domain controller. When a domain controller runs out of RIDs, it can't create any more security principals.
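A quick way to test this guess might look like the sketch below. It assumes you can open an elevated PowerShell prompt on a surviving DC, that the ActiveDirectory module is installed there, and that the directory service still answers queries, which may not be the case given the LDAP errors reported in the question.

```powershell
# Sketch only: run on a surviving DC from an elevated prompt.
Import-Module ActiveDirectory

# Which DC does the directory believe holds the RID Master role?
(Get-ADDomain).RIDMaster

# dcdiag's RidManager test checks whether the RID master is reachable
# and, with /v, prints this DC's RID pool information.
dcdiag /test:RidManager /v
```

If the reported RID master is the dead server, that would be consistent with the bulk join stopping partway through.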

In this case, adding more domain controllers won't help at all.

To fix it, you need to seize the RID Master role on a working DC. Important: After you have seized the role, do not let the old RID master back online! This can lead to problems.
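As a rough sketch of what the seizure could look like from PowerShell, assuming the ActiveDirectory module is available on a working DC: `NEWDC01` is a placeholder name for that DC, not something taken from the question. The `-Force` switch is what turns a normal transfer into a seizure when the old role holder is unreachable.

```powershell
# Sketch only: NEWDC01 is a hypothetical name for a working DC in the existing domain.
Import-Module ActiveDirectory

# -Force seizes the role instead of transferring it from the (dead) current holder.
Move-ADDirectoryServerOperationMasterRole -Identity "NEWDC01" -OperationMasterRole RIDMaster -Force

# Confirm where all five FSMO roles now live.
netdom query fsmo
```

The same seizure can also be done interactively with `ntdsutil`; either way, the old role holder must stay offline afterwards.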

As an aside, the answer suggesting creating a new domain is an enormous boatload more work than it might seem. Among other things, you must now configure every service on every server with a new service account, which may mean setting service principal names and permissions. You must now recreate your messaging system, possibly requiring an export and import of every mailbox. Flattening an AD forest is an option of complete last resort, only to be done after you have exhausted other options!

Also, I suggest you wait a few days before accepting an answer here - don't just accept the first one presented.

Greenstone Walker
  • One serious problem I can't overcome though: how do I get a working DC up on the network with no working DCs? I do think your answer may be the correct answer, but I may not have the expertise to get this done in our time frame. We don't have a lot of the concerns you mentioned since this is a contained environment, for the most part, and exists separate from the corporate side. – Greg Mason Nov 26 '13 at 23:46
  • It sounds like you are at the "last resort" point, Greg. :-( Building a new AD forest sounds good; at least you'll know you are getting a good forest, not inheriting any issues from the old one. Good luck. – Greenstone Walker Nov 27 '13 at 02:38

I'm afraid to sound elitist and patronizing, but from the information you've provided and the way you've worded your question, I think the most feasible long-term solution is to

  • Create a new domain in a new forest with a new FQDN on a "new" (freshly installed) server
  • Unjoin all the clients from the existing domain
  • Reboot and join the new domain
  • Re-install the operating system on the existing Domain Controllers
  • Install AD DS on the servers and join the new domain
  • Use this opportunity to revisit your understanding of Active Directory, what it is and how it works

Microsoft has published a number of guides to approaching Active Directory design and deployment, worth mentioning is definitely:

The AD DS Design Guide on TechNet
The Active Directory Domain Services guide from Microsoft's Infrastructure and Planning Guide Series

Good luck!


Backing up your GPOs:

From your recent updates it sounds like AD DS is not currently operational, so here is a last resort GPO backup-and-recover solution, not including Links and WMI Filters.

A Group Policy Object consists of 2 parts, a Group Policy Container and a Group Policy Template.

The Container is an object in Active Directory that holds the Group Policy links that are used to apply the given GPO to a given OU - if the DSA is unavailable to you at this point, you won't be able to retrieve these without mounting and exploring an offline copy of your NTDS database (not as easy as it may sound).

The Template on the other hand, contains the meat and potatoes of the GPO, all the settings, the name, version information and so on, and is stored in the SYSVOL folder on the filesystem.
With the default configuration you'll be able to find all your GPTs in C:\Windows\SYSVOL\domain\policies\. With a file-level backup of the GPTs, you'll be able to recreate the GPOs in the new domain, preferably using PowerShell as demonstrated below:

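# Folder holding the file-level copy of the old domain's ...\policies directory
$gptBackupFilePath = "C:\backup\policies\"
$ServerName = $env:COMPUTERNAME

Import-Module GroupPolicy

# Only folders named like a GUID in braces are Group Policy Templates
$GPTs = Get-ChildItem $gptBackupFilePath -Directory |Where-Object {$_.Name -imatch "^\{([0123456789abcdef-]){36}\}$"}

foreach($GPT in $GPTs)
{
    # The two default policies already exist in the new domain, so skip them
    if("{31B2F340-016D-11D2-945F-00C04FB984F9}" -eq $GPT.Name.ToUpper())
    {
        Write-Host "Skipping Default Domain Policy "
        continue
    }

    if("{6AC1786C-016F-11D2-945F-00C04FB984F9}" -eq $GPT.Name.ToUpper())
    {
        Write-Host "Skipping Default Domain Controllers Policy "
        continue
    }

    $GPTPath = $GPT.FullName
    # Read the display name from GPT.ini; stays $null if there is no displayName line
    $GPOName = Get-Content (Join-Path $GPTPath "GPT.ini") |Where-Object {$_ -match "^displayName="} |Select-Object -First 1 |ForEach-Object {$_.Substring(12)}
    if(-not($GPOName))
    {
        Write-Warning "Unable to read GPO name from $GPTPath"
        continue
    }

    $newGPO = New-GPO -Name $GPOName -Server $ServerName
    if(-not($?))
    {
        Write-Warning "Unable to create new GPO $GPOName"
        continue
    }

    $GPOGuid = $newGPO.Id.ToString()

    # Folder of the freshly created (still empty) GPT in the new domain's SYSVOL
    $Destination = "C:\Windows\SYSVOL\domain\policies\{" + $GPOGuid + "}"
    if(-not(Test-Path -LiteralPath $Destination))
    {
        Write-Warning "Unable to access new GPT for GPO $GPOName"
        continue
    }

    try
    {
        # Recreate the folder tree and copy every file except gpt.ini into the new GPT
        Get-ChildItem -LiteralPath $GPTPath -Recurse -ErrorAction Stop |Where-Object {$_.Name -ne "gpt.ini"} |ForEach-Object {
            $Target = $_.FullName.Replace($GPTPath, $Destination)
            if($_.PSIsContainer)
            {
                New-Item -ItemType Directory -Path $Target -Force -ErrorAction Stop |Out-Null
            }
            else
            {
                Copy-Item -LiteralPath $_.FullName -Destination $Target -Force -ErrorAction Stop
            }
        }
        Write-Host "Successfully recreated GPO $GPOName as $GPOGuid"
    }
    catch
    {
        Write-Warning "Failed to copy GPT contents for GPO ${GPOName}: $_"
    }
}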

I doubt this is a supported solution, and unlike a regular GPO import with migration tables, you'll need to update UNC paths and other domain-specific references by hand.

The above example is intended to be run on a Domain Controller in your new forest, with $gptBackupFilePath changed to the folder containing the contents of [..]\policies on the old Domain Controller.
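For the file-level backup itself, something along these lines could be run on the old Domain Controller. This is only a sketch: the source path assumes the default SYSVOL location mentioned earlier, and `C:\backup\policies` is a hypothetical destination that you would later copy to the new server.

```powershell
# Sketch only: run on the old DC; adjust both paths to your environment.
$source = "C:\Windows\SYSVOL\domain\Policies"   # default location, may differ on your DC
$backup = "C:\backup\policies"                  # hypothetical destination folder

# /E copies all subfolders (including empty ones); /R and /W keep retries short.
robocopy $source $backup /E /COPY:DAT /R:1 /W:1

# Quick sanity check: each backed-up GPT should show up as a {GUID} folder.
Get-ChildItem $backup -Directory | Select-Object Name
```

Since this only reads files from SYSVOL, it should work even while the directory service itself is not responding.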


The only other current answer to this question suggests that you've lost the Domain Controller holding the RID Master FSMO role and have exhausted the current RID pool. That is in all probability entirely correct, and you may very well be able to recover the forest from its current state.

My recommendation to start from scratch is not an easy-frag default goto response, but a carefully chosen one, based on personal experience with AD Disaster Recovery, and more importantly, cleaning up after other people's disastrous disaster recovery efforts.

If you don't fully understand what to expect from a healthy Active Directory environment, and by trial-and-error do what you're told needs to be done (FSMO seizing, metadata cleanup etc.), underlying unresolved issues may still be present - but too elusive and therefore hidden to the untrained eye.

Any inconsistency introduced during the last 30 or 60 days might not manifest itself right here and now, but if and when it does - you're gonna wish you had started from scratch when you had the opportunity.

Mathias R. Jessen
  • That was exactly my fear with the unknown issues trickling by and biting us in the long run. We have already started putting the network back together from your post (and no `elitist and patronizing` involved, RTFM was a good response in this case since it has been helpful). The GPO script should be extremely helpful. – Greg Mason Nov 29 '13 at 20:21