
Airstone Clusters Drive Down Latency With Custom Network 

Some hardware and software engineers at a startup called Airstone Labs, who have built scale-out systems for Google, among other places, want to build a better cluster. And as crazy as this might sound, Airstone has been compelled by necessity to come up with its own proprietary network interconnect to build low-latency, high-bandwidth clusters for running a variety of workloads.

EnterpriseTech caught up with Kai Backman, founder and CEO of Airstone, at the STAC Summit in New York this week, and got some insight into the future Airstone Labs clusters and the Ultramarine proprietary network fabric that underpins them. The company has not provided all of the details of the system it has created, but it is a classic example of the combination of off-the-shelf compute and storage components and custom interconnects and software stacks. Backman says that the machine embodies the idea that "terabit is the new gigabit" and that its bare metal cluster is an example of "big data meeting big network."

Airstone has some very heavy hitters in datacenter and scale-out system design among its staff of a dozen people.

Backman was previously CEO of a company called Tinkercad, which was founded in 2011 and had created a Web-based computer-aided design program that borrowed from the ideas of modern cloud software like that constructed at Google. The Tinkercad program used WebGL, a new 3D standard for Web applications. Users accessed the application only through a browser and the back-end servers out in a remote datacenter did all of the heavy lifting of the application. The idea was to put this back-end software on a public cloud and the last thing that Backman wanted to do was design his own systems and interconnect. But, a year later, Tinkercad was growing by leaps and bounds and the company could not get enough capacity of the kind its software required.

"Nobody was selling what we needed, and as you know, even today, HPC resources are hard to get from cloud providers," explains Backman. "They are not really scaling this type of service. We were looking at big data and technical computing workloads and it was just hard to buy it. We had enough people on the team and we asked ourselves, "How hard can it be to build something from scratch?'"

After about six months of development on the Ultramarine networking protocols and hardware, the engineers at Tinkercad realized that this was an important technology in its own right, and they decided to commercialize it as Airstone, which was initially billed as an interactive simulation environment. Backman and his team raised $2.5 million in private equity financing from Borealis Ventures, True Ventures, and Lifeline Ventures in January 2013 and announced to the Tinkercad community the shift in focus to Airstone. In May last year, the Tinkercad software was sold off to CAD software giant Autodesk for an undisclosed amount, a deal that presumably sets up Autodesk as an initial customer for the Airstone clusters. (Backman is not saying at this point.)

Backman knows a thing or two about building scale-out systems and prior to the Tinkercad and Airstone projects, he was a software engineer at Google, working on the Go programming language, Google Docs, and cluster infrastructure. Jana van Greunen, vice president of engineering at Airstone, was in the same position at Silver Spring Networks, which created large-scale mesh networks for smart grid software sold to cities and utilities. Mark Andrews, who is vice president of products at Airstone, spent two years working on infrastructure at Microsoft and then seven and a half years at Google, where he built the global content distribution system for Google's datacenters and its YouTube service, which is called Google Fiber internally, as well as various payment and mobile systems. Colin Corbett, who is vice president of operations at Airstone, has built datacenters and networking infrastructure for YouTube, PayPal, Dropbox, and Netflix. The team includes several chip designers with specialties in networking and graphics from Apple, Sun Microsystems, and Broadcom.

The first generation of the Airstone cluster, called C1 internally, was a prototype and was not really put through its paces with beta customers. The C2 cluster was built using Intel's "Ivy Bridge" Xeon E5-2600 v2 processors, and specifically a two-socket machine with ten-core E5-2690 v2 processors running at 2.3 GHz. This cluster starts out with thirteen nodes, which have 12.8 GB of main memory per core, or 3.3 TB across the cluster. The Airstone cluster has 123.6 Gb/sec of raw network bandwidth coming out of each node, and using a custom, low-noise Linux kernel running on bare metal, it can deliver an end-to-end latency between any two nodes in the cluster of 610 nanoseconds. That is a measure of the latency from userspace in one Linux environment, out through the drivers and over the physical network, into the drivers of the other machine, and up into that machine's userspace.
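
Airstone has not published its measurement harness, but userspace-to-userspace numbers like that 610 nanoseconds are typically taken with a ping-pong microbenchmark: one node sends a tiny message, its peer bounces it back, and half of the averaged round trip is reported as the one-way latency. A minimal sketch of the idea, written against plain MPI rather than anything Airstone-specific (the message size and iteration count are illustrative):

```c
/* Illustrative ping-pong latency sketch, not Airstone's harness.
 * Rank 0 sends a small message to rank 1, which echoes it back; half of the
 * averaged round-trip time approximates one-way userspace-to-userspace latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 100000;
    char buf[8] = {0};                /* tiny payload to expose latency, not bandwidth */

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("one-way latency: %.0f ns\n", elapsed / iters / 2.0 * 1e9);

    MPI_Finalize();
    return 0;
}
```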

The interesting bit about the Ultramarine network fabric is that it has a fully connected mesh topology, using point-to-point links that hang right off the QuickPath Interconnect (QPI) ports that come off of the Xeon E5 processors. Backman says that one of the frustrating things about writing code for any cluster is that you always have to take the network topology into consideration when writing your application program and software stack. The idea with the Airstone cluster and the Ultramarine interconnect is that with a fully connected mesh, each node in the cluster is always just one hop away from all of the other nodes. Each processor socket has its own Ultramarine network interface and can talk directly out to the mesh. The C2 cluster can scale to about 100 nodes, according to Backman.
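
The price of a fully connected mesh is that the number of point-to-point links grows with the square of the node count, which is presumably part of why the C2 design tops out at around 100 nodes. A quick back-of-the-envelope calculation, assuming one physical link per pair of nodes (the actual Ultramarine wiring is not disclosed), shows the growth:

```c
/* Back-of-the-envelope link count for a fully connected mesh (illustrative).
 * Assumes one point-to-point link per node pair; a full mesh needs N-1 links
 * per node and N*(N-1)/2 links in total, so the wiring grows quadratically. */
#include <stdio.h>

int main(void)
{
    int sizes[] = {13, 45, 100};      /* node counts mentioned in the article */
    for (int i = 0; i < 3; i++) {
        long n = sizes[i];
        printf("%4ld nodes: %ld links per node, %ld links total\n",
               n, n - 1, n * (n - 1) / 2);
    }
    return 0;
}
```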

With the Airstone C3 cluster, the compute nodes stay the same but the Ultramarine network fabric is being pushed harder and scaled further. The base setup comes with 45 nodes, which have an aggregate of 11.5 TB of memory. Each node has around 200 Gb/sec of bandwidth coming off it (around 100 Gb/sec per socket), and it has an end-to-end latency (again, from userspace to userspace in the Linux environment on each node) of around 750 nanoseconds. The C3 cluster is designed to scale up to around 2,000 nodes and deliver a peak of 620 Gb/sec coming out of each two-socket node (or 310 Gb/sec per socket).
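
Those memory numbers hang together with the C2 node design: assuming the same 12.8 GB per core carries over, as the unchanged compute nodes suggest, 45 two-socket nodes with ten cores per socket work out to roughly 11.5 TB. A trivial sanity-check sketch of that arithmetic:

```c
/* Sanity-check arithmetic for the stated C3 aggregate memory (illustrative).
 * Assumes the C2 figure of 12.8 GB per core carries over to the C3 nodes. */
#include <stdio.h>

int main(void)
{
    int nodes = 45, sockets_per_node = 2, cores_per_socket = 10;
    double gb_per_core = 12.8;

    double total_gb = nodes * sockets_per_node * cores_per_socket * gb_per_core;
    printf("aggregate memory: %.0f GB (~%.1f TB)\n", total_gb, total_gb / 1000.0);
    return 0;
}
```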

With such a fast interconnect, the obvious question is whether Airstone can make a shared memory system out of the Ultramarine interconnect, with coherency across the nodes.

"We looked into that, and right now they do not," says Backman. "You use message passing just like you would in other clustered systems. It is not clear that shared memory is the right solution for the software, and we have not seen a strong demand into making it a shared memory device. We might do it – it is technically very feasible." Backman says that any time you can push down to latencies of 500 nanoseconds or lower, a cluster can start behaving like a shared memory system.

The C2 and C3 systems are currently built using a whitebox blade server design from Supermicro. The ideal configuration is to keep the machines physically close to one another, in two rows of maybe 10 or 15 racks as the compute scales up. Storage nodes link into the system using the same Ultramarine interconnect, with one port available for storage traffic.

On the 45-node Airstone C3 system, those 45 nodes have 57 Tb/sec of aggregate non-blocking network fabric bandwidth and 2.37 Tb/sec of file system bandwidth. The flat topology of the Ultramarine interconnect allows algorithms to be simpler because application programmers don't have to care about network topology. They can talk directly to the Ultramarine API stack, or they can use TCP/IP sockets or even the Message Passing Interface (MPI) on top of the Ultramarine protocol. Backman says that the Ultramarine interconnect and topology are designed to handle very heavy network loads and to maintain low latency even under congested conditions, which is something that neither Ethernet nor InfiniBand can do. On workloads where congestion is a big issue, such as graph traversal in analytics or machine learning algorithms, the Airstone clusters are going to do particularly well, according to Backman. On early benchmark tests, the Airstone C3 cluster with 45 nodes was able to load a 10 TB dataset into memory in about 40 seconds, and it was able to scan 1 PB of data in about 60 minutes.
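
In other words, the pitch is that a programmer can write the kind of all-to-all exchange that graph traversal and shuffle steps produce without mapping ranks onto a topology, and let the fabric absorb the congestion. Against MPI, which the company says can run on top of the Ultramarine protocol, that exchange is a single collective call; the payload size below is illustrative:

```c
/* Illustrative all-to-all exchange: the communication pattern behind many
 * graph and shuffle steps, and a worst case for congested fabrics. With a
 * flat one-hop topology there is no rank-to-node mapping to tune. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int per_peer = 1 << 18;        /* ints sent to every peer (illustrative) */
    int *sendbuf = malloc((size_t)size * per_peer * sizeof(int));
    int *recvbuf = malloc((size_t)size * per_peer * sizeof(int));
    for (long i = 0; i < (long)size * per_peer; i++)
        sendbuf[i] = rank;               /* tag the data with the sender's rank */

    /* every rank sends a distinct block to every other rank in one step */
    MPI_Alltoall(sendbuf, per_peer, MPI_INT,
                 recvbuf, per_peer, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```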

Airstone has opened up the beta program for testing its C3 clusters. At the moment, the system runs a customized version of Linux, and Backman says that the company will support two or three of the popular flavors of Linux – those are easy to guess – and adds that it is looking at supporting Docker containers on the bare metal as well. The Airstone cluster could also support Windows, although Backman says that there is no demand for Windows among early prospects for the system at this time.

Pricing has not been set for the Airstone clusters, but the design goal of the machines is to come in at a cost that is 70 percent lower than buying HPC-style cloudy infrastructure on Amazon Web Services and to offer 10X better bang for the buck.

If all goes well with the Airstone cluster beta program, then the product should be formally launched later this year. We'll keep you posted.
