48

It's been somewhat mysteriously reported that FB employees couldn't fix some router (BGP) misconfiguration in a timely manner

because "the people trying to figure out what this problem was couldn't even physically get into the building" to work out what had gone wrong.

It's also mentioned that "The shutdown meant ads weren't served for over six hours across its platforms."

FB may not want to say more because of embarrassment or something, but it sounds like a rather odd story. Is there any corroborating evidence that lack of physical access was the culprit for the prolonged outage?

Fizz
  • What does the status of ads have to do with the central question? – Daniel R Hicks Oct 05 '21 at 21:37
  • @DanielRHicks: the BBC only gave the downtime in those more concrete terms... for ads. They only said "several hours" for other services. – Fizz Oct 05 '21 at 22:06
  • @GordonDavisson ah yep, I didn't see that – Aaron Lavers Oct 06 '21 at 07:32
  • In a reddit thread during the outage, a user named ramenporn who claimed to be on the recovery/investigation team posted (among other things): "There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified." They later deleted the comments and their account, so it's hard to verify this. See: https://archive.is/Idsdl – Gordon Davisson Oct 06 '21 at 18:33
  • The question is a bit ambiguous: there is a difference between "their own building" in the sense of a building belonging to FaceBook, and the same in the sense of a building where those employees ordinarily work. It certainly seems like the building in question satisfies the former description, but I'm not so sure that it satisfies the latter one. The claim is colored by which of those interpretations is applied. – John Bollinger Oct 07 '21 at 14:42
  • @JohnBollinger: yeah, chances are a number of people had to go to a datacenter that they hardly ever visit otherwise... even if it's owned by FB. The way the story is phrased is as if FB is some small company with one or few buildings. – Fizz Oct 07 '21 at 14:55
  • I can't comment on how Facebook does their security... but I can tell you that, as a Google employee, if I showed up to a random data center and demanded access, security personnel would likely escort me from the premises (and I doubt my badge would work, either). Datacenters are extremely sensitive buildings with very different security to "regular" corporate offices. – Kevin Oct 08 '21 at 03:12

2 Answers

65

It took "extra time" to get onsite.

From Facebook's report about the outage:

...these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.

This doesn't break down how long it took them to get into the building vs. how long it took them to be able to modify the hardware and routers. It's not too hard to imagine how this distinction could be lost, with the whole delay attributed solely to being unable to enter the building.

We do know that getting into the building did prolong the outage to some degree, though for all we know it could have been just a few minutes.

Rob Watts
  • Something I read (can't remember where...) stated that three different groups of people needed to be involved: one group who had physical access to the servers, another who had the credentials to log in and modify settings, and a third who had the technical expertise to make the necessary fixes. Remote access was not an option (because network issue...!) so they had to get someone from each of the three groups physically present in the server room. – avid Oct 06 '21 at 09:16
  • @avid I assume most if not all of Facebook's employees possess cellphones? Not sure why it'd be necessary for 3 people to all be present so long as they have some means of communicating with each other... – Darrel Hoffman Oct 06 '21 at 13:34
  • @DarrelHoffman Not sure a cell phone really helps. It's not like they knew what was wrong; they had to diagnose the problem. Unless the cellphone camera was going to be pointed at the screen, which kind of works, but it's a painful way to work. – Richard Tingle Oct 06 '21 at 13:53
  • @DarrelHoffman Who knows? At any rate, that was what I read. – avid Oct 06 '21 at 13:57
  • @DarrelHoffman I'd assume that high security datacenters are built as a faraday cage, rendering any radio communication (including cellphones) useless. I also imagine that there are some gates requiring biometric authentication. And that's not all ... so, in short, no matter which of those measures are actually implemented there: The information that a security system that is cut off from the internet requires multiple people to physically be there is completely believable to me. – orithena Oct 06 '21 at 14:30
  • @orithena: There's not really a need to build them as Faraday cages. They usually have false floors (i.e. floor plates on a regular grid of metal), false ceilings (i.e. ceiling plates hanging from a regular grid of metal), long rows of metal racks full of metal boxes with metal piping and metal cable trussing running between them. Even without purposefully designing them that way, they are essentially giant metal boxes filled with lots of small metal boxes installed in middle-sized metal boxes, all emitting EM radiation. https://i.redd.it/pu3odmbdpqm71.jpg – Jörg W Mittag Oct 06 '21 at 19:34
  • @DarrelHoffman Pretty sure the ability to gain physical access / enter credentials / fix the problem is distributed among three groups of people for security reasons, so that not a single person can alter things on the servers by himself. So, if a person from the "fix the problem" group can simply place two cellphone calls to also get physical access and log in, it defeats the whole purpose, don't you think so? Therefore, they all needed to be physically present. – dim Oct 06 '21 at 20:33
  • @avid That information came from a Reddit user who claimed to work for Facebook, commenting on [this r/sysadmin post](https://www.reddit.com/r/sysadmin/comments/q181fv/looks_like_facebook_is_down/). The user deleted their comments and their entire account shortly afterwards. Comment was archived [here](https://archive.is/Idsdl). – MJ713 Oct 06 '21 at 20:44
  • @JörgWMittag I'll just expand my sentence: "... built as a faraday cage (whether intentional or not)" :) – orithena Oct 07 '21 at 09:07
  • As the old joke goes, security is easy. It's *letting people in* that's difficult! – Mason Wheeler Oct 07 '21 at 12:12
  • As someone who has worked for big tech for years... it is quite normal for an employee's badge not to have access to random datacenters they don't normally enter. Now if this happens out of hours (middle of the night or a weekend), well holy cow, it can be hours until you get the right person on the phone to OK things. And if they needed three different groups in the datacenter, it is likely only one of them had access, so they would have had to find 2 VPs to sign off on DC security to let them in... This could take 4-8 hours easy. – blankip Oct 08 '21 at 16:48
14

As Rob Watts' answer states, Facebook has acknowledged this was part of the problem, so we know the claim is true. ("...it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers.") Unnamed sources, in personal communication with a credentialed tech reporter, specifically claimed that card access was down, although Facebook isn't giving us that level of detail.

That answer is the only one needed to address the immediate claim.

This answer looks at some proposed mechanisms for why this might have been the case, given the larger context of the outage. (Consider it a supplement--if the original question was "Why did JFK die?" and the strictly correct answer is "He was shot," this answer is explaining how that results in death.)

As of this writing, Facebook has not given more detail; however, many social media posts have explored mechanisms for how a networking problem could impede building access--namely, that the electronic systems authorizing access were also caught up in the outage.

Outside parties like Cloudflare, a major Internet infrastructure company unrelated to Facebook, originally became aware of the issue through missing DNS records. DNS is the lookup system that converts memorable resource names--like website hostnames--into the numeric addresses currently providing the resource. Early speculation suggested that with DNS down, Facebook also could not access its own systems, including the LDAP directory system that would track which employees are allowed to access which facilities.
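
To make the DNS piece concrete, here is a minimal lookup sketch in Python. It only illustrates what outside observers were doing--asking a resolver for facebook.com--and is not anything Facebook runs:

```python
import socket

def resolve(hostname: str):
    """Ask the local resolver for the numeric addresses behind a name."""
    try:
        infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        return sorted({info[4][0] for info in infos})
    except socket.gaierror as err:
        # During the outage this branch is roughly what the whole Internet
        # saw for facebook.com: the name simply would not resolve.
        return f"lookup failed: {err}"

print(resolve("facebook.com"))
```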

However, the Facebook writeup of the outage indicates that the order of events was a little different. A routine maintenance operation (gone awry) accidentally turned off the main internal networking connections ("backbone") between Facebook data centers. As a result, none of Facebook's internal systems could communicate. Facebook's internal DNS servers--the machines that tell traffic how to get to Facebook--also lost connectivity to the data centers. Now, those systems are designed to function only if they think they can provide reliable data: if they lose connection to the actual Facebook servers, they can't do their job of telling others where to find Facebook resources. So they tell the whole Internet to stop asking them, using something called the Border Gateway Protocol, or BGP (a system which helps networking machines map the best ways to send traffic back and forth).
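
That withdraw-when-unhealthy behavior is easier to see as a tiny state machine. The sketch below is only an illustration of the logic described in the write-up; the class and names are made up, not Facebook's code:

```python
# Hedged sketch: each authoritative DNS edge node withdraws its BGP (anycast)
# announcement when its health check against the data centers fails.

class DnsEdgeNode:
    def __init__(self, name: str):
        self.name = name
        self.announced = True  # normally advertising its service prefix

    def health_check(self, backbone_up: bool) -> None:
        if backbone_up and not self.announced:
            print(f"{self.name}: backbone reachable again, re-announcing prefix via BGP")
            self.announced = True
        elif not backbone_up and self.announced:
            # "If I can't reach the data centers, my answers may be stale --
            # stop telling the Internet to send DNS queries to me."
            print(f"{self.name}: lost the backbone, withdrawing BGP announcement")
            self.announced = False

# When the backbone went down, every node made the same (individually
# sensible) decision at once, so all of Facebook's DNS vanished from the
# Internet's routing tables simultaneously.
for node in [DnsEdgeNode(f"dns-edge-{i}") for i in range(3)]:
    node.health_check(backbone_up=False)
```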

Essentially, at that point, Facebook's DNS servers all called in sick at once, and nobody could find Facebook any more. But this wasn't strictly a DNS, or even strictly a BGP, problem, as careful observers realized soon after (though the BGP-to-DNS issue caused splash damage to the whole Internet in the form of elevated DNS traffic). Connections between Facebook services' load balancers (that direct traffic from outside to specific locations inside Facebook networks) and the broader Internet still worked in some cases. The root cause was that Facebook had nuked its own internal networking.
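
One way observers could separate "the DNS records are gone" from "nothing routes at all" was to skip DNS and probe a previously known front-end address directly. A rough sketch of that check follows; the address is a documentation placeholder, not a real Facebook load balancer:

```python
import socket

def tcp_reachable(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """Try a plain TCP handshake to an address, bypassing DNS entirely."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

# If a known edge address still completed a handshake while name lookups
# failed, the route to that edge was alive and the problem was deeper inside.
print(tcp_reachable("192.0.2.10"))  # TEST-NET placeholder, not Facebook's
```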

Regardless of the exact mechanism, the impact on physical access would be a breakdown of communication between the door lock readers--which get an ID code from an employee's badge--and the directory system that confirms which employee IDs are supposed to have access to which facility. I had originally stated this was due to the DNS problem (meaning that the door readers could no longer find the location of the LDAP server), but best practice is to make directory servers accessible only on private (or virtual private) networks, not the Internet (see also here, and probably more references than I have time to track down). It's more likely that the directory server that grants access was reached over the same internal backbone connection that went down to begin with.
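
For illustration, here is how that dependency fails closed. This is a generic sketch of a networked badge check, not Facebook's access-control stack, and every name in it is hypothetical:

```python
class DirectoryUnreachable(Exception):
    pass

# Hypothetical access table normally held by the central directory service.
AUTHORIZED = {("badge-1234", "dc-east")}

def directory_allows(badge_id: str, facility: str, backbone_up: bool) -> bool:
    """Stand-in for an LDAP-style query carried over the internal backbone."""
    if not backbone_up:
        raise DirectoryUnreachable("no route to the directory server")
    return (badge_id, facility) in AUTHORIZED

def badge_swipe(badge_id: str, facility: str, backbone_up: bool) -> str:
    try:
        return "unlock" if directory_allows(badge_id, facility, backbone_up) else "deny"
    except DirectoryUnreachable:
        # Fail closed: with no way to confirm the badge, the door stays locked
        # until someone with a physical override (see below) lets people in.
        return "deny (directory unreachable)"

print(badge_swipe("badge-1234", "dc-east", backbone_up=True))   # unlock
print(badge_swipe("badge-1234", "dc-east", backbone_up=False))  # deny (directory unreachable)
```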

In any event, there's a physical override for this, with an old-fashioned key. But you don't issue a copy of that key to everybody with access to the building--they might make copies, you'd have to get them back when their roles changed, etc. etc. Instead, there's a small security team with overrides for physical access. However, to the extent that the engineering team uses Facebook internal products (e.g. Messenger) for communication, those would also have been impaired by the outage; and there would have been delays in finding other contact information due to the directory being unreachable.

Again, this is a reconstruction of the mechanism through which the physical-access problem would have arisen. We won't know for-sure-for-sure until and unless Facebook releases a more specific post-mortem, but my aim is to demonstrate the plausibility of the reported claims based on the surrounding circumstances.

Tiercelet
  • This analysis is mostly correct, but there's one issue. If you have a contact in WhatsApp, you have their actual phone number. You don't need to use LDAP at that point. Solving the rest of the issue is still a nightmare tho. – Arturo Torres Sánchez Oct 07 '21 at 15:51
  • I was under the impression (confirmed by the Cloudflare blog post, etc) that the issue was BGP related, not just DNS related. So why would facebook's internal servers not be able to communicate with each other? I would assume they are all on the same ASN, because Facebook doesn't seem to own multiple? – vikarjramun Oct 07 '21 at 16:19
  • I understand why you believe this is the most likely answer, but you haven't shown that it actually happened. – Oddthinking Oct 07 '21 at 16:45
  • @CJR If the DNS servers are reporting certain kinds of error, _that error_ will propagate and be cached. – wizzwizz4 Oct 07 '21 at 16:51
  • @CJR That's the "chain of events" in the third paragraph. I was trying not to get too into the weeds, but as I interpret [1](https://blog.cloudflare.com/october-2021-facebook-outage/) and [2](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/) the original screwup caused Facebook's DNS servers to lose connectivity with the data centers; the DNS servers responded to the error by sending BGP updates that cut them off from the Internet. – Tiercelet Oct 07 '21 at 17:56
  • According to [this blog post on engineering.fb.com](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/), the initial trigger was a routing problem on FB's internal backbone network; that led their facilities to stop advertising themselves to the outside world (via BGP) as ways to reach FB; this made the FB DNS servers unreachable, so DNS entries started expiring. So the chain of failures was internal routing -> external routing (BGP) -> DNS. – Gordon Davisson Oct 07 '21 at 18:56
  • This means the problems seen *inside* facebook's network are rather different from those seen from the outside. We lost access to their DNS because of BGP problems, but from the inside BGP is irrelevant. Depending on how their DNS servers are set up, they might even have had secondary servers at each facility, meaning DNS would keep working fine (internally) during the outage. The real problem is more likely to be that the card readers at facility A couldn't reach the LDAP server at facility B because the internal backbone was down. – Gordon Davisson Oct 07 '21 at 19:04
  • That was my interpretation as well - all the routing tables went down, data centers fell off the network, and nothing could talk to anything that wasn't the next rack over. DNS was the external failure, but nothing could route even if DNS was up so it didn't matter. It's hard to tell from press releases though, I'm interested in reading a white paper or something if Facebook ever puts out details. – CJR Oct 07 '21 at 19:09
  • @GordonDavisson I don't think we have any real evidence that there was card access problems (LDAP or otherwise), just tweets. If not-normally-onsite engineers had to be sent onsite, I doubt they even had card access in the first place. The way their [writeup](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/) is written, I interpret it as just being a slow process to allow a person into the building (potentially exacerbated *if* HR records/etc showing that this guy is a real employee were also offline) – mbrig Oct 07 '21 at 20:34
  • @GordonDavisson Hmm... my read was that card readers at A couldn't reach LDAP at B because Facebook DNS had exploded, so once the (presumably short) DNS cache TTL had expired, it couldn't find routes any more. But it probably makes more sense that they would not have exposed the LDAP to the wider Internet. – Tiercelet Oct 07 '21 at 20:39
  • @mbrig True; I was just using (hypothetical) card readers and LDAP servers as an example. My point was that the problems inside Facebook's network were different from what we saw from the outside. From the outside, we saw problems with BGP and DNS; from the inside, BGP is irrelevant and for all we know DNS may've been working fine. (Or DNS may've failed internally as well. We just can't tell from the outside.) – Gordon Davisson Oct 07 '21 at 21:35
  • In facilities that I'm familiar with, the card readers don't depend directly on the network. They have dedicated wiring to a battery-backed control box in each building. The control box maintains a local copy of the access control list in internal storage that can be updated in near real time from servers on the network, but can function independently using the last known data in the not-so-unlikely event of a network or power outage to avoid exactly this type of situation. – user46053 Oct 07 '21 at 22:28