
Applied Micro X-Gene ARM Waves The 64-Bit Banner 

A few years back, Intel made its foray into the networking arena with its Xeon and Atom processors, and several makers of specialized chips for networking are fighting back by taking their expertise and creating server variants of the 64-bit ARM architecture with embedded networking and other kinds of acceleration. Applied Micro, which has three generations of its X-Gene processors in different stages of development and production, has become the standard bearer for ARM's entry into the server market because it is first out the door with a 64-bit design.

AMD, Cavium, and Broadcom are also fielding very serious ARM server chips. All four vendors could be cooking up other interesting stuff as well, as could other entrants, including Samsung, Nvidia, and perhaps Google or Amazon, who might be messing around with ARM chip designs.

At the International Supercomputing Conference in Leipzig, Germany, this week, a number of system makers were showing off development and production machines based on the first generation X-Gene processor from Applied Micro. GPU accelerator maker Nvidia was also on hand to remind everyone that its Tesla coprocessors and CUDA parallel programming environment work on 64-bit ARM platforms (technically known as the ARMv8 architecture but often called ARM64 colloquially) just as they do on X86 chips and, soon, Power processors from IBM and its OpenPower Foundation partners. Applied Micro is turning up the volume on ARM64 and trying to convince potential customers in the hyperscale and supercomputing spaces, who are willing to entertain an alternative to the X86 architecture for compute, that its brawny implementation of the ARM architecture is more than worth the trouble of porting software. These initial users at the top end of the market tend to have their own code, and they are the ideal customers to attack first.

Applied Micro has not revealed all of the details of the three X-Gene processor designs, but it certainly will begin doing so when companies actually start putting the chips in production machines later this year and early next. Here is the basic shape of the X-Gene 1 processor:

[Figure: Applied Micro X-Gene 1 block diagram]

Applied Micro is a full licensee of the ARMv8 spec and in fact was the first such player to license the 64-bit designs from ARM Holdings that allow it to make its own custom cores. The X-Gene 1 processors are made by Taiwan Semiconductor Manufacturing Co. on its 40 nanometer process and have been sampling to partners since early 2013; after some tweaks here and there, production wafers for this initial X-Gene chip started at the end of March. The X-Gene 1 has eight cores running at 2.4 GHz and implements a high-speed, on-chip fabric to link the cores to each other and to a high-speed memory controller as well as to various peripheral controllers. The important accelerators for server workloads include four 10 Gb/sec Ethernet ports and other kinds of network function accelerators; the chip also has the ability to drive six PCI-Express 3.0 slots. With the X-Gene 2, which is just now starting to sample, the chip will shrink to a 28 nanometer process and will stay at eight cores on the system-on-chip.

The interesting twist with X-Gene 2 is that it will support the Ethernet variant of Remote Direct Memory Access, which is called RDMA over Converged Ethernet or RoCE. RDMA for InfiniBand and RoCE for Ethernet are designed to allow a server node in a cluster to directly access main memory in another server node without having to go through the entire network stack in an operating system, thus driving down latencies. RDMA is not just used for supercomputing clusters, but is now being used to link servers to storage (Windows Server can make use of RDMA now to speed up file access) and to lash parallel database servers and clustered file systems together.

[Figure: Applied Micro X-Gene roadmap]

The X-Gene 3 chip will be based on TSMC's 16 nanometer FinFET process (which is a 3D transistor design that is similar in concept to Intel's Tri-Gate transistors that were first used in its 22 nanometer chips), and all that Applied Micro has said to date about X-Gene 3 is that it will have at least sixteen cores on the SoC.

Like other ARM server chip upstarts that want to take on Intel in the datacenter – a daunting task, given the overwhelming preference to deploy Windows and Linux applications on X86 iron – Applied Micro thinks it can bring to bear some advantages. First and foremost, enterprises like to have multiple sources for their compute, even though in the time between the dot-com boom and now the X86 instruction set has pretty much come to dominate. It is simpler to have one architecture to support, and Intel has argued and will continue to argue that one architecture is cheaper to support than two. But among the Internet companies, hyperscale operators, and supercomputer centers of the world, supporting two or even more architectures is less of a big deal. They largely have their own code and they have the technical chops to do this. The rest of the enterprise market will rely predominantly on Linux software for the Web tier, for analytics, and for storage systems to drive the adoption of ARM servers – all workloads that are more I/O intensive than they are compute intensive. The thinking is to make an ARM SoC with cores that are brawny enough to do real compute work, even if it does not match the top-end Xeon E5 or E7 part that Intel can ship, and to give it enough memory and I/O bandwidth to tackle these I/O intensive jobs.

There is a general misunderstanding, Applied Micro contends, that paints all ARM chips as low-end, comparing them to Atom processors from Intel, for instance. Applied Micro is understandably hesitant to make any bold performance claims for the X-Gene processor at this point because it is early in the testing cycle, but it also wants to dispute this characterization. Sanchayan Sinha, senior product manager, tells EnterpriseTech that the X-Gene 1 has about the same level of single-threaded performance as a four-core "Haswell" Xeon E3 and about the same memory bandwidth as a "Sandy Bridge" Xeon E5. Add in very fast, on-chip networking and other network acceleration functions, and X-Gene has a fighting chance to take away some market share from X86 machines.

James Hamilton, the server guru at Amazon Web Services who used to have the same job at Microsoft, did a study showing that the cost of the server represents something on the order of 57 percent of the total cost of ownership, that power distribution to the machine and cooling it represent only 18 percent, and that the actual power to keep the server humming represents 13 percent. Server cost and power issues represent most of the TCO as far as AWS is concerned, and Applied Micro cites these numbers in saying this is where you need to attack costs.

One way to attack those costs is through integration of components onto SoCs, and Sinha says that by eliminating chipsets, external network interfaces, and other components on a typical X86 server board, an Applied Micro X-Gene system can reduce the cost of the bill of materials for a system board – including the cost of the processor in the comparison – by 50 percent.
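As a rough illustration of the arithmetic behind these two claims, the Python sketch below takes the TCO shares cited from Hamilton's study (57, 18, and 13 percent) and estimates how far total cost would fall if the server share dropped by a given fraction. The function name `tco_after_server_saving` and the saving fractions are hypothetical. Note in particular that Sinha's 50 percent figure applies to the system board bill of materials, not to the whole server, so the whole-server savings used here are assumptions for illustration only.

```python
# Shares of total cost of ownership cited from James Hamilton's study (percent).
TCO_SHARES = {
    "server": 57,
    "power_distribution_and_cooling": 18,
    "power": 13,
}
# Whatever is left over is not broken out in the article.
TCO_SHARES["other"] = 100 - sum(TCO_SHARES.values())


def tco_after_server_saving(shares, server_saving):
    """Return the new TCO as a percent of the old TCO.

    server_saving is a HYPOTHETICAL fractional cut in server cost (0.0 to 1.0);
    all other cost categories are assumed to stay unchanged.
    """
    if not 0.0 <= server_saving <= 1.0:
        raise ValueError("server_saving must be between 0 and 1")
    return sum(shares.values()) - shares["server"] * server_saving


if __name__ == "__main__":
    print("Unattributed share:", TCO_SHARES["other"], "percent")
    for saving in (0.10, 0.25, 0.50):  # hypothetical whole-server savings
        remaining = tco_after_server_saving(TCO_SHARES, saving)
        print(f"{saving:.0%} cheaper server -> TCO at {remaining:.1f}% of original")
```

Because the server is the largest single line item, even a modest cut in its cost moves total cost more than an equal percentage cut in the power bill would, which is the point Applied Micro is making with these numbers.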

Norman Fraser, who is co-founder and CEO at ARM server upstart SoftIron, says that the company is working on motherboards for a blade design that will put 18 blades based on X-Gene processors, with four memory channels per socket, in a 2U enclosure that will be available later this year. The company is looking at various accelerator and interconnect options for the system to make it appealing to both hyperscale and HPC customers alike. "We're shooting for twice the performance per watt and twice the density of Xeon E5," says Fraser.
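To put that density target in concrete terms, here is a back-of-the-envelope sketch in Python. The 18 blades per 2U enclosure and the four memory channels per socket come from Fraser, and the eight cores per socket come from the X-Gene 1 and X-Gene 2 descriptions earlier in this article. One socket per blade and a 42U rack filled entirely with such enclosures are assumptions made only for illustration, since SoftIron has not given those details here; the `Enclosure` class is likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enclosure:
    rack_units: int               # height of the enclosure in U
    blades: int                   # blades per enclosure
    sockets_per_blade: int        # ASSUMPTION: not stated in the article
    cores_per_socket: int         # eight cores on X-Gene 1 and X-Gene 2
    mem_channels_per_socket: int  # four, per SoftIron

    @property
    def cores(self) -> int:
        return self.blades * self.sockets_per_blade * self.cores_per_socket

    @property
    def memory_channels(self) -> int:
        return self.blades * self.sockets_per_blade * self.mem_channels_per_socket

    def cores_per_rack(self, usable_units: int) -> int:
        """Cores in a rack if every usable unit is filled with this enclosure."""
        return (usable_units // self.rack_units) * self.cores


softiron_2u = Enclosure(rack_units=2, blades=18, sockets_per_blade=1,
                        cores_per_socket=8, mem_channels_per_socket=4)

if __name__ == "__main__":
    print("Cores per 2U enclosure:", softiron_2u.cores)
    print("Memory channels per 2U enclosure:", softiron_2u.memory_channels)
    print("Cores per hypothetical 42U rack:", softiron_2u.cores_per_rack(42))
```

Changing `sockets_per_blade` or the usable rack height shows how sensitive the headline density figure is to details that have not yet been disclosed.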

At the moment, SoftIron has a motherboard that it is showing off at ISC'14 this week that can incorporate two variants of the X-Gene processor. The SoftIron 64-0400 server board has an X-Gene 1 variant called the APM883204, which has four cores running at 2.4 GHz for running compute jobs. The system also has four 32-bit ARM V5 cores for network and security acceleration and one Cortex M3 core for server management. The board comes in a Micro ATX form factor and has two memory slots that support up to 128 GB of 1.6 GHz DDR3 low-power memory. The board has four SATA 3.0 drive ports, one PCI-Express 3.0 x8 slot, and one 10 Gb/sec Ethernet port. There are three Gigabit Ethernet ports for server management and other uses, and two USB 3.0 ports. The SoftIron 64-0800 server board has the same features but comes with an X-Gene APM883208 processor, which has all eight cores on the SoC fired up at 2.4 GHz. Both boards are certified to run recent releases of Canonical's Ubuntu Server or Red Hat's Fedora.

The eight-core board is shipping now, and SoftIron is charging $1,950 a pop for it; the four-core variant is not yet shipping (but will soon) and will sell for a few hundred bucks less. Over time, as Applied Micro ramps up production on X-Gene, the costs for the boards, which have the processors welded onto them, will come down, says Fraser.

"We took a deep breath a few years ago when we were founded and set out to build our own board rather than use a reference design," explains Fraser. And now, everyone from all walks of the IT spectrum is trying to get one of the SoftIron boards. "There are people who want to get their hands on a 64-bit ARM server – almost any server – to see the performance and to see if virtualization and other features of the chip actually work. HPC folks are looking for attachment to GPUs and other networking options. And cloud computing companies are looking for a new story that will give them an edge. Hyperscale customers are also poking around, seeing what the possibilities might be." Fraser was not at liberty to name names, of course.

At the ISC'14 event, Eurotech previewed a new hot liquid-cooled machine in its Aurora line that will pair the X-Gene processor with multiple Nvidia Tesla GPU processors, using a motherboard of its own design that solders the X-Gene chip and its main memory to the system board to make it compact. The precise ship date for this machine has not been set. Cirrascale and E4 Computer Engineering are showing off rack-based server designs that pair Applied Micro's "Mustang" X-Gene 1 processors with a Tesla GPU coprocessor; both are intended for software development and testing and will ship in July. Nvidia's CUDA 6.5 parallel development environment can split code across ARM and Tesla processors, and it supports Fedora and Ubuntu Linuxes, too, so early adopters in HPC and hyperscale can start porting code and doing their own benchmark tests.

EnterpriseAI