
Facebook Tech Director Reveals Open Networking Plan 

Facebook is a classic example of an exponentially growing business with extreme scale IT needs that cannot do things in the data center the way traditional large enterprises do. It would go broke if it did. Or more precisely, it might have never gotten outside of Mark Zuckerberg's Harvard dorm room.

Facebook's software and the infrastructure on which it runs are literally the business. The company has been relentless in hacking all parts of its stack, from PHP compilers down to servers and out to data center designs. And as EnterpriseTech reports elsewhere, the social media giant has shared what it has learned about building vanity-free servers and advanced data centers through the Open Compute Project, and it has also open sourced its enhancements to PHP. Now, Facebook wants to pry open the network and hack it to make it better.

Through the Open Compute Networking project established earlier this year, Facebook is working with all of the major switch and networking chip makers as well as startups in the software-defined networking arena to bring switching into the modern era, as Najam Ahmad, director of technical operations at Facebook, explained it to EnterpriseTech this week at the company's office in New York. Facebook wants switches to be built differently so networks are easier to build and manage at the scale its business requires.

Timothy Prickett Morgan: I suspect that this Open Compute Networking effort is about more than having vanity-free switching in Facebook's data centers. What is the problem you are trying to address with this project?

Najam Ahmad: If you look at networks today, the fundamental building construct we have is an appliance. It doesn't matter whose appliance it is: you get hardware and a set of features that are vertically integrated by a vendor. So you pick the speeds and feeds and a set of protocols, and you get a command line interface to manage it. If the protocols do what you need, you are good. But if you need a change in the protocol, then you get into a little bit of a fix. Do you go to the IETF or the IEEE to get a protocol spec modified? Or do you work with the vendor's product managers so that maybe six months or a year later you get the feature you want?

TPM: Or, you have to buy a completely different switch because vendors have increasingly broad feature sets as they move up their product lines.

Najam Ahmad: I don't want to pick on any single vendor, but that is how the whole industry is. To keep track of all of the features and protocol sets in a product line is a problem, but you also get into that rip-and-replace conversation a lot. Any time you have to do physical work, it is expensive and it takes a lot more time. It is a simple physics problem at that point.

Najam Ahmad, director of technical operations at Facebook

What we want to do is bring networking to the modern age. I will use a mobile handset as an example. Ten or fifteen years ago, you used to buy a phone from Nokia or Motorola, and you had to pick between the features they had. And when you picked a phone, that was it, that was the phone you had. If you wanted another feature, you had to buy a different phone. That whole ecosystem has changed. Now we have a bunch of hardware suppliers – HTC, Samsung, LG, Apple – and you have operating systems on top – several versions of Android, iOS, Windows – and then you have a bunch of apps on top of that. With smartphones, if you don't like an app, you get a new one. And if you don't like any of the apps, you can write your own.

That is where networking needs to go. To do that, what we really have to do is disaggregate the appliance into its components, which at the top level are hardware and software.

TPM: Hence, the Open Compute Networking project.

Najam Ahmad: The specification that we are working on is essentially a switch that behaves like compute. It starts up, it has a BIOS environment to do its diagnostics and testing, and then it will look for an executable and go find an operating system. You point it to an operating system and that tells it how it will behave and what it is going to run.

In that model, you can run traditional network operating systems, or you can run Linux-style implementations, you can run OpenFlow if you want. And on top of that, you can build your protocol sets and applications.
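
As a rough illustration of that compute-style boot model, the sketch below walks through the flow in Python: run diagnostics, locate an operating system image, and hand control to whichever network OS the operator has pointed the switch at. The provisioning URL and image name are hypothetical, not details from the OCP specification.

```python
# Illustrative sketch only: a switch that boots "like compute" -- run
# diagnostics, find an operating system image at whatever location the
# operator points it at, then hand over control. The image URL and OS
# choices are hypothetical, not part of the actual OCP specification.

DIAGNOSTICS = ["memory", "ports", "switching asic"]

def run_diagnostics():
    """Stand-in for the BIOS-style self-test the spec describes."""
    for check in DIAGNOSTICS:
        print(f"self-test: {check} ... ok")

def locate_os_image(boot_url):
    """Ask the provisioning location which OS image this switch should run."""
    # A real switch would fetch this over the network (HTTP, TFTP, etc.);
    # here we simply return the URL the operator configured.
    return boot_url

def boot(boot_url):
    run_diagnostics()
    image = locate_os_image(boot_url)
    # The same hardware could load a vendor NOS, a Linux-based stack, or an
    # OpenFlow agent -- the hardware does not care which.
    print(f"booting operating system image: {image}")

if __name__ == "__main__":
    boot("http://provisioning.example.com/images/network-os.bin")
```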

TPM: Does the Open Compute Network project assume that you will have custom ASICs as we have in all of these switches today, or will it be based on an X86 engine with a bunch of ports hanging off it?

Najam Ahmad: Certain things are specialized. In your phone, for example, you have a GPS function. You can't expect the general-purpose ARM chip to do GPS. Switching is like that. When you want 32 ports running at 40 Gb/sec, and you need MPLS [Multiprotocol Label Switching] as well, there is no way for a general-purpose X86 chip to keep up.

The idea is to use commodity, network-specific chipsets, but to make them open. All of the network semiconductor guys have attended Networking project meetings. We need those ASICs, and we can marry them to X86 processors to do network functions on top of that. The spec is not tied to a particular ASIC or operating system.

That is how OCP wants to do switches and that ties into how Facebook wants to do networks.

TPM: What is your plan for adopting Open Compute networking gear? How long is this going to take?

Najam Ahmad: I don't write a spec without the idea of getting it into production. I like to use the car analogy. Every major car manufacturer has a concept vehicle, which has all of the bells and whistles they can think of in it. . . .

TPM: And nobody ever gets to buy it. . . .

Najam Ahmad: True. [Laughter] But at the same time, a bunch of the features on the concept car make it into production. We want to shoot for all of the things we want in open switches, but you can't boil the ocean. We want hardware disaggregated from software, and we want to deploy that hardware in production. We may do it with a traditional network operating system, or we may write our own. It depends on the pieces and the timing of when we want to go into production.

TPM: I know the spec is not even done yet, but when do you expect to deploy open switches?

Najam Ahmad: We haven't officially set a date. But what I can tell you is that we are far enough along that at the Open Compute Project workshop hosted by Goldman Sachs this week in New York, we had a box that we booted up and demoed passing traffic. It is a contribution from one of the big guys, and I am not allowed to say who because we are still working through the contracts for them to contribute their IP.

It is still a work in progress. But we passed packets through it and we are further along than I expected at this stage. We still have a lot of work to do.

TPM: All of this open switch work is separate from the silicon photonics work that OCP and Intel were showing off earlier this year at the Open Compute Summit. How does this all fit together?

Najam Ahmad: In some sense it is orthogonal, and in some sense it is not. The plan for the open networking project is to get that disaggregation going in the switch. When we prove that model works, we can take any box in the data center and show that you can do this.

Then you say the rack is now different, it is disaggregated. What do we need in that? That is where the silicon photonics comes in. That is a little further away because it is even more of a concept to disaggregate a server into its components.

TPM: Some people take Intel's talk about Rack Scale and the related work with OCP in the area of silicon photonics and they might walk away with the impression that this is right around the corner. Others point out that Intel has never done 100 Gb/sec networking before and that this might take a bit more time than they are expecting.

And I think it will take a particularly long time to disaggregate memory from processors in this exploded server, but everyone seems to agree that it needs to be done. Sometimes people have to buy a heavy server just because they need more memory and not more compute; conversely, if they need a lot of compute but not much memory, they end up with a bigger physical box and a lot of empty memory slots.

Najam Ahmad: That's a really hard thing to do, but we are working on it. All of the components in the server have very different lifecycles, and they need to be replaced on their own lifecycle, not on the lifecycle of some box. There are smarter people than me working on this problem.

TPM: Let's take a step back and talk about the networks at Facebook. Many of us know how Facebook has evolved from using plain-vanilla servers to custom gear made by Dell's Data Center Solutions division to creating its own servers and open sourcing them through the OCP. But what have you done with switches?

Najam Ahmad: We have all sorts of flavors: there is a bunch of OEM stuff in there, and we are also experimenting with white boxes.

Our data centers have tens of thousands of machines in them, and we have been building our networks to scale that far. For the past year and a half, we have been deploying 10 Gb/sec ports on the servers, and the network fabric is all 10 Gb/sec. We have also taken Layer 2 completely out of the environment. We are an all-Layer 3 network now. We don't do traditional VLANs, spanning tree – things of that nature – because they don't scale to our size. So we run BGP [Border Gateway Protocol] all the way to the top of the rack. A lot of the big guys are doing that now – Microsoft, Google.
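
As a rough illustration of what running BGP all the way to the top of the rack implies, the sketch below plans peering for an all-Layer 3 fabric in which each rack gets its own autonomous system number and prefix. The ASNs, addresses, and structure are invented for this example and are not Facebook's actual design.

```python
# Rough sketch of an all-Layer-3 fabric design: every top-of-rack switch is a
# BGP speaker with its own ASN and advertises its rack prefix to the fabric.
# ASNs, prefixes, and peer addresses below are made up for illustration.

from dataclasses import dataclass

@dataclass
class RackPeering:
    rack_id: int
    local_asn: int
    rack_prefix: str
    fabric_peers: list

def plan_rack(rack_id: int) -> RackPeering:
    # Assumptions for the sketch: private ASNs allocated sequentially per
    # rack, one /24 per rack, and two fabric-layer peers per top-of-rack
    # switch. No VLANs, no spanning tree -- everything is routed.
    return RackPeering(
        rack_id=rack_id,
        local_asn=64512 + rack_id,
        rack_prefix=f"10.{rack_id // 256}.{rack_id % 256}.0/24",
        fabric_peers=[f"192.168.{rack_id % 256}.{n}" for n in (1, 2)],
    )

if __name__ == "__main__":
    for rack in range(3):
        p = plan_rack(rack)
        print(f"rack {p.rack_id}: ASN {p.local_asn}, "
              f"advertises {p.rack_prefix}, peers {p.fabric_peers}")
```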

The core is built in two layers, and we have an optical network that we are building underneath. It connects to all of our data centers and to all of the points of presence that we serve traffic out of. It is not hundreds or thousands of POPs, but tens, to get closer to the end user and reduce the latency.

TPM: So you have 10 Gb/sec in the servers, and you have 10 Gb/sec in Layer 3 switching. What does the backbone run at?

Najam Ahmad: We have 40 Gb/sec in the data center core and in the backbone it is 100 Gb/sec. We have a mixed model there. In some cases we are still leasing, but we are buying dark fiber and moving away from that.

TPM: What about InfiniBand? Does Facebook have any InfiniBand in its networks?

Najam Ahmad: Even back when Ethernet did not have its current capabilities, InfiniBand was too expensive and too cumbersome. Ethernet has kept growing and improving, and Ethernet is good enough for us. At scale, we can do the same things with Ethernet, and it doesn't make sense for us to change. At a network interface level, we are not busting out of 10 Gb/sec Ethernet. There are some applications, like Hadoop, where we are pushing it.

Latency is always important, but we don't try to shave off microseconds. Milliseconds we try to shave off, but not microseconds. We are not after the ultra-low-latency stuff, where InfiniBand can help.

TPM: So now let's talk about what software-defined networking means to Facebook. You don't have virtual machines, you don't use virtual switches. All of your applications run on bare metal. You have different networking issues compared to a typical enterprise with a lot of virtualized applications.

Najam Ahmad: The central problem we have is agility. The pace at which our applications, and other things like storage or Hadoop, move is much faster than the network can move, primarily because of the environment I described: an appliance that is a very closed system, where the only interface you have into it is a command line interface.

TPM: You can't hack it, so you can't break it, and then fix it.

Najam Ahmad: You can't hack it at all, and hacking is what we do and what we do best.

We like the work to be done by robots, and the people to build the robots, which is essentially software. So we want to build software. We don't want to have people sitting in front of large monitors watching alerts, and clicking and fixing things. If we see a problem a couple of times, we automate it. If you don't have the hooks in these boxes to do that – if someone has to log into a box, do a bunch of commands, and reboot it – we are going to need an army of people at our scale. And it will be slow. And it will cause outages. Software does things much faster and more reliably.
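
That "see it a couple of times, automate it" approach can be as simple as a table of known alert signatures mapped to remediation routines, with anything unrecognized escalated to a person. The sketch below uses invented signatures and actions to show the shape of the idea.

```python
# Illustrative sketch of "if we see a problem a couple of times, we automate
# it": known alert signatures map to remediation routines, everything else
# goes to a human. The signatures and actions here are hypothetical.

def drain_port(alert):
    print(f"draining port {alert['port']} on {alert['device']}")

def restart_agent(alert):
    print(f"restarting monitoring agent on {alert['device']}")

# The playbook grows over time as recurring problems get automated away.
PLAYBOOK = {
    "link_flap": drain_port,
    "agent_unresponsive": restart_agent,
}

def handle(alert):
    action = PLAYBOOK.get(alert["signature"])
    if action:
        action(alert)                                 # robots do the work
    else:
        print(f"escalating to on-call: {alert}")      # humans build new robots

if __name__ == "__main__":
    handle({"signature": "link_flap", "device": "rack12-tor", "port": 7})
    handle({"signature": "unknown_crash", "device": "rack9-tor"})
```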

Facebook's office in New York

TPM: How do you manage your network now? Did you create your own tools because no one has all this magical SDN stuff?

Najam Ahmad: Yeah, we created our own tools. We still manage the network through software, but it is much harder to do than it needs to be. Let me give you a concrete example.

A few months ago, we were seeing some issues with Memcached in our environment, with transactions taking longer and there being a lot of retransmissions. So we were trying to debug it to find out what the heck was going on. And we just couldn't find it. And this went on for a couple of weeks. Then our switch vendor sent one of its developers out to help us troubleshoot. And the developer said, "Wait, hang on, let me log into the ASIC." This is a custom ASIC. There was a hidden command, and he could see that the ASIC was dropping packets. And we had just wasted three weeks looking for where the packets were going. They had a secret command, and the developer knew it, but the support staff didn't and it wasn't documented.

We figured this out at 5:30 in the evening, and we had to log into every damned ASIC on hundreds of boxes – and most of them have multiple ASICs per box. You run this command, it throws out text, you screen-scrape it, you get the relevant piece of data out, and then you push it into our automated systems. The next morning there were alerts everywhere. We had packet loss everywhere. And we had no clue.

And you just shake your head and ask, "How did I get here?" This is not going to work, this is not going to scale.
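
The screen-scraping stopgap Ahmad describes looks roughly like the sketch below: capture the text the hidden command prints, pull out per-ASIC drop counters, and feed them to the alerting pipeline. The output format, field names, and threshold are placeholders, since the actual command and its output were never made public.

```python
import re

# Rough sketch of the screen-scraping stopgap: parse the text a diagnostic
# command prints, extract per-ASIC drop counters, and hand them to the
# alerting pipeline. The output format and threshold are placeholders -- the
# real hidden ASIC command was never published.

SAMPLE_OUTPUT = """\
ASIC 0 status: OK
  pkts_in: 18234112
  pkts_dropped: 4312
ASIC 1 status: OK
  pkts_in: 17761003
  pkts_dropped: 0
"""

DROP_RE = re.compile(r"ASIC (\d+).*?pkts_dropped: (\d+)", re.S)

def scrape_drops(raw_text):
    """Return {asic_id: dropped_packets} parsed from the command output."""
    return {int(asic): int(drops) for asic, drops in DROP_RE.findall(raw_text)}

def emit_alerts(device, drops, threshold=0):
    for asic, count in drops.items():
        if count > threshold:
            print(f"ALERT {device} asic{asic}: {count} packets dropped")

if __name__ == "__main__":
    emit_alerts("rack12-tor", scrape_drops(SAMPLE_OUTPUT))
```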

This is the kind of thing I want to get rid of. We want complete access to what is going on, and I don't want to fix things this way. We want to run agents on the boxes that are doing a lot of health checking, aggregate their data, and send it off to an alert management system. SNMP is dead.
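
The agent model he wants instead might look roughly like this: a small process on each box collects health counters, packages them as structured data, and pushes them to an aggregation service rather than waiting to be polled over SNMP. The endpoint, metric names, and interval are assumptions for illustration.

```python
import json
import time

# Rough sketch of the agent model described above: a small process on each
# switch collects health counters and pushes structured data to an alert
# aggregation service, instead of waiting to be polled over SNMP. The
# endpoint, metric names, and interval are assumptions for illustration.

AGGREGATOR = "https://alerts.example.com/ingest"  # hypothetical endpoint

def collect_health(device):
    """Stand-in for reading real counters from the switch ASIC and kernel."""
    return {
        "device": device,
        "timestamp": int(time.time()),
        "pkts_dropped": 0,
        "link_flaps": 0,
        "cpu_util_pct": 12.5,
    }

def push(sample):
    # A production agent would POST this to the aggregator; here we print it.
    print(f"POST {AGGREGATOR} {json.dumps(sample)}")

def run_agent(device, iterations=2, interval_seconds=0):
    # A real agent would loop forever with an interval of tens of seconds.
    for _ in range(iterations):
        push(collect_health(device))
        time.sleep(interval_seconds)

if __name__ == "__main__":
    run_agent("rack12-tor")
```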
