
Facebook and Natural Disasters: Extreme Scale Data Center Resiliency 

Fans of “The Social Network,” a movie about the early days of Facebook (and one that, last we heard, Mark Zuckerberg has never seen), will remember the scene in which the Eduardo Saverin character intentionally freezes the bank account holding the funds that keep Facebook online.

The outraged Zuckerberg character tells him:

“Okay, let me tell you the difference between Facebook and everyone else. We don't crash EVER! If those servers are down for even a day, our entire reputation is irreversibly destroyed! Users are fickle, Friendster has proved that. Even a few people leaving would reverberate through the entire user base. The users are interconnected, that is the whole point. College kids are online because their friends are online, and if one domino goes, the other dominos go, don't you get that?”

Regardless of whether this happened in real life, in the real life of Facebook the imperative to remain online, to build resilient systems that can withstand power outages, is a very real and present one. The company’s extensive efforts to stave off major data center crashes in the face of natural (or man-made) disasters were the topic of the keynote address delivered by Facebook’s head of engineering and infrastructure, Jay Parikh, at the company’s @Scale conference in San Jose this week.

Taking into account Facebook’s billion-plus user base (Facebook: 1.7 billion active users; Messenger: 1 billion; Instagram: half a billion), it’s a strategic program of immense complexity, one that requires extensive planning, simulations and practice runs.

“Resiliency is such a critical part of what we actually do here,” Parikh said. “It’s the whole point of how we work here.”

Whereas in the early days of Facebook a service outage imperiled the company’s survival, the primary concern of Facebook managers today is the impact on its users.

“Our community depends on it,” Parikh said. “If we’re down or slow, a lot of things don’t work. People depend on communicating with their friends and family during natural disasters and other events. We have become a really important communications medium. And we take this very seriously.”

For Facebook, the wake-up call on how a natural disaster could take its services offline was Superstorm Sandy, the mammoth 2012 hurricane that pushed its way up the East Coast from the Caribbean and made landfall near New York City, wreaking havoc on the region’s internet infrastructure. While many companies were disrupted for days and weeks, Parikh said Facebook’s mid-Atlantic data centers came through unscathed, but barely. He said the experience made the company realize it might not be as fortunate the next time.

“We had built up enough redundancy over the years that we weathered the storm, unintended,” he said. “We really came pretty close for us. While we got through this and we didn’t see any major disruptions in our service, we asked ourselves: what would happen if we lost a data center region or a data center due to something like this storm?”

The scale at which Facebook operates obviously compounds its resiliency challenges.

“We all care about scale,” he told the @Scale audience, “we’re obsessed with running things at high volume - lots of customers, lots of people dependent on our applications and services. And we’re solving unprecedented problems, these are problems that are not being solved anywhere in the industry, generally speaking. So…every day is chock full of a lot of scalability problems.”

He said a single data center region involves tens of terabytes of bandwidth traffic and tens of megawatts of power serving thousands of servers and thousands of different software systems. “It’s pretty hard to decompose this,” Parikh said. “How are we going to build a more resilient service when you’re looking at these numbers? You see something very scary and daunting.”

As challenging a technical problem as it is for Facebook, the management and cultural issues are just as important. To start, Parikh said, the company needed to overcome the natural tendency toward inertia and avoidance of what amounts to a maintenance problem on a grand scale.

“Instead of just kind of wondering and assuming we’d probably be OK,” Parikh said, the company in 2014 created a SWAT team called Project Storm. The effort involves the entire engineering team and other groups in the company running massive-scale storm drills that simulate a major outage, with the goal of building a system that ensures that when a data center goes down, the orphaned traffic is instantly absorbed by other data centers.

The SWAT team started with a series of mini shutdown drills and development of a preliminary emergency system before eventually deciding to “pull the plug and see what happened.”

“To be honest things didn’t go all that well the first few times we did this,” Parikh said. “But because we built a lot of instrumentation tooling and preparedness ahead of time, (end users) in the community didn’t notice what happened. We learned a lot and this was exactly the goal of the drill. We wanted to force ourselves to look at what would work and what didn’t work with this massive type of drill that we did.”

The major lesson learned: traffic management and load balancing are really hard. During the initial drills, “all hell broke loose,” Parikh said, when the team started draining a large set of software systems.

[Image: Facebook load balancing chaos]

“This is a graph of load balancing gone wild.”

“This is an awful user experience when this happens and you don’t have a good control system,” he said in reference to the above image. “If you’re an engineer and you see a graph like this, three things come to mind. One: either you have bad data and you should go fix it; two: you’ve a control system that’s not working and you should go fix it; or three: you have no idea what you’re doing, and you probably should go fix that.”

The team found that the data was accurate, which meant they had to build a better control system, one that could shift traffic from one data center to another instantly. After extensive development work, the result was the image below:

[Image: Facebook load balancing after the new control system]

“This is what we got – much, much better,” Parikh said. “You can see drain on the left side and all that traffic is automatically absorbed, it kind of goes on the other services, the other capacity picks it up, you see there’s low variance here, it kind of looks pretty boring, it looks pretty nice. So we strive to make our graphs for our drill exercises to look like this graph.”
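Parikh didn’t walk through the internals of the control system, but the core idea he describes is straightforward: when one region is drained, its share of traffic has to be reapportioned across the surviving regions without overloading any of them. Below is a minimal, hypothetical sketch of that proportional-redistribution idea in Python; the function, region names and capacity figures are illustrative assumptions, not Facebook’s actual implementation.

```python
# Hypothetical sketch (not Facebook's actual system): redistribute a drained
# region's traffic share across the remaining regions in proportion to their
# spare capacity, so no survivor is pushed past its limit.

def redistribute_traffic(weights, capacity, drained):
    """Return new per-region traffic weights after draining one region.

    weights  -- dict of region -> current fraction of global traffic (sums to 1.0)
    capacity -- dict of region -> maximum fraction of global traffic it can serve
    drained  -- name of the region being taken offline
    """
    orphaned = weights[drained]
    survivors = {r: w for r, w in weights.items() if r != drained}

    # Spare headroom in each surviving region.
    headroom = {r: max(capacity[r] - w, 0.0) for r, w in survivors.items()}
    total_headroom = sum(headroom.values())
    if total_headroom < orphaned:
        raise RuntimeError("Not enough spare capacity to absorb the drained region")

    # Each survivor absorbs orphaned traffic in proportion to its headroom.
    return {r: w + orphaned * headroom[r] / total_headroom
            for r, w in survivors.items()}


if __name__ == "__main__":
    # Illustrative regions and numbers only.
    weights = {"region-a": 0.3, "region-b": 0.3, "region-c": 0.2, "region-d": 0.2}
    capacity = {"region-a": 0.45, "region-b": 0.45, "region-c": 0.35, "region-d": 0.35}
    print(redistribute_traffic(weights, capacity, drained="region-d"))
```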

In the development of the control system, Parikh said, tooling was key. And the tools the team developed have had the ancillary benefit of improving routine management of Facebook’s data centers.

“One lesson we learned is…these sophisticated tools have been very helpful for our day-to-day operations, too,” he said. “So this isn’t just about disaster scenarios, we’ve actually been able to extend and adapt these tools to do a lot of normal things on a daily basis. So this has been really powerful for us to help us move fast and help us have more confidence in our systems.”

One major obstacle still lay in the SWAT team’s path in the aftermath of shutting down a data center: getting it back online. Parikh compared it to a life lesson he learned as a kid when he was given a toy: it was easier to take it apart than put it back together.

“What we learned is that when we do a drain exercise and we take a data center or region down, that actually happens quite fast,” he said. “But trying to put it back together and get it to operate the way it did before the drain was actually much harder. We had to invest a lot in actually trying to make this (process) predictable and reliable.”

Among other things, this called for investment in a run book tool that details everything that goes into turning a data center off and on, including the hundreds of individual manual or automated tasks that have to happen to orchestrate the drill.

“The thing that I realize is that this is no longer a toy car that I got for my birthday,” he said. “This is like taking apart an aircraft carrier. It takes hundreds of people putting it back together in just a few hours. That’s the scale we’re dealing with, and that becomes really hard.”
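Parikh didn’t describe how the run book tool is built, but conceptually a run book is just an ordered list of steps, some automated and some owned by a person, executed in sequence and timed so the drill can be graded afterwards. The sketch below illustrates that shape; the class names, steps and owners are hypothetical, not Facebook’s tool.

```python
# Hypothetical sketch (not Facebook's run book tool): a run book as an ordered
# list of steps, each either automated or assigned to a person, executed in
# order with simple timing so the exercise can be reviewed afterwards.

import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    action: Optional[Callable[[], None]] = None  # None means a manual step
    owner: str = "unassigned"


@dataclass
class RunBook:
    name: str
    steps: list = field(default_factory=list)

    def execute(self):
        timings = {}
        for step in self.steps:
            start = time.time()
            if step.action is not None:
                step.action()  # automated task
            else:
                print(f"[manual] {step.name} -> waiting on {step.owner}")
            timings[step.name] = time.time() - start
        return timings


# Example: a tiny, made-up "restore the region" run book.
book = RunBook("restore-region", steps=[
    Step("power on racks", action=lambda: print("powering racks")),
    Step("verify network fabric", owner="network on-call"),
    Step("restart storage tier", action=lambda: print("restarting storage")),
    Step("ramp traffic back in", action=lambda: print("ramping traffic")),
])
print(book.execute())
```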

He said getting a data center back online is like a pit crew servicing a race car.

“We actually time ourselves and grade ourselves to make sure we meet certain time goals,” he said. “You really want to time this like a pit stop at a car race. You want to get all the tires and the gas taken care of in the shortest amount of time, and it has to be perfectly orchestrated.”

EnterpriseAI