Advanced Computing in the Age of AI | Thursday, April 18, 2024

Peek Into AMD’s New Atlanta Glass House 

Like many enterprises these days, chip maker AMD is looking to consolidate its myriad datacenters and smaller data closets to save operating expenses and to drive up the utilization on its machinery. As part of a massive company reorganization that has seen AMD expand into chips for game consoles and ARM server chips, the company has been working for the past year and a half to squeeze all of its gear into two datacenters, one outside of Atlanta, Georgia and the other in Cyberjaya, Malaysia.

For many years, the flagship datacenter for AMD was in Austin, Texas, explains Jake Dominguez, who came to AMD a little more than three years ago to take on the CIO role after running product development and engineering as well as managing the data warehouse and ecommerce platforms at Hewlett-Packard. Like many datacenters in major metro areas, the one in Austin was in a "makeshift building," as Dominguez put it, that was not ideal for housing IT gear. AMD sold the facility in 2013 as part of its reorganization after deciding to put its primary datacenter in the suburbs of Atlanta.

The move from Austin to Atlanta coincides with an upgrade of some old, electricity-eating, heat-spewing systems in the AMD server fleet as well as a shift in data analytics platforms to help steer the business. Dominguez tells EnterpriseTech that the Austin datacenter had 289 racks of gear and consumed 2.8 megawatts of juice. Believe it or not, AMD had lots of single-core and dual-core systems in its fleet, and like many IT shops during the Great Recession, the company extended the life of these older machines perhaps a bit longer than it might otherwise have.

Keeping those old machines around is not as strange as it might sound, particularly for a company that could get CPUs at cost if it really wanted to. Most electronic design automation (EDA) jobs used during chip design and verification run on a single core, so clock speed is the most important factor in completing a job. As long as AMD was willing to pay for the electricity, it saved money on buying new servers. This is a trade that many companies make every year. There is plenty of old iron in hosting companies and even cloud companies like Amazon Web Services, even if it is not of the same vintage as the Opterons in AMD's old Austin glass house.

The facility in the Suwanee suburb northeast of Atlanta was chosen for a number of reasons, says Dominguez, including the fact that Atlanta has always been a major hub for telecommunications and technology.

AMD CIO Jake Dominguez

The Suwanee facility has redundant power sources and network links coming into it, as well as redundant water chillers to keep machinery cool. It also has diesel generators and enough fuel to keep the datacenter running for around 36 to 48 hours, depending on the load on the systems. The datacenter has its own water supply for this cooling, which is a plus and which saves AMD money. Like many modern glass houses, it uses hot aisle containment to improve the efficiency of the cooling, and it runs higher than normal chilled water inlet temperatures to increase cooling efficiency further. The Atlanta suburbs are not a particularly good environment for ambient air cooling during the summer, so AMD is not going to be flinging the walls of the Suwanee datacenter open any time soon.

The datacenter, which is leased from Manulife/John Hancock and which has AMD as its sole tenant, has a design power usage effectiveness (PUE) rating of 1.5. (PUE is the total power consumed by the facility divided by the power consumed by the IT gear alone. It is a common measure of datacenter efficiency but by no means the only one or even necessarily the most important one.) With the current gear installed, the PUE is around 1.56. While this is not pushing the limits of PUE like Google, Microsoft, Facebook, and Amazon do, it is a big improvement over the prior facility, says Dominguez, and it will get better over time. The new Atlanta datacenter has a 30,000 square foot capacity and a 10 megawatt power envelope, but at the moment AMD has set up two modules with a total of 6,000 square feet, a 2 megawatt potential power draw, and enough room for 204 racks of gear. That is about the physical size of a top-end supercomputer these days, just to give you a reference point.
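For readers who want to see what those PUE figures imply, here is a minimal sketch of the ratio in Python. The function names and the 1,000 kilowatt IT load are hypothetical and chosen only for illustration; AMD did not break out its IT load separately from its total draw.

```python
def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power usage effectiveness: total facility power divided by IT power."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def overhead_kw(it_load_kw: float, pue_value: float) -> float:
    """Power spent on cooling, distribution losses, and other non-IT uses."""
    return it_load_kw * (pue_value - 1.0)


# Hypothetical IT load, not a figure from AMD.
it_load_kw = 1000.0

# Overhead at the measured PUE (1.56) and at the design PUE (1.5).
measured_overhead = overhead_kw(it_load_kw, 1.56)
design_overhead = overhead_kw(it_load_kw, 1.5)

print(f"Overhead at PUE 1.56: {measured_overhead:.0f} kW")
print(f"Overhead at PUE 1.50: {design_overhead:.0f} kW")
```

Under that assumed load, the gap between the measured and design ratings works out to about 60 kilowatts of non-IT overhead for every megawatt of IT gear.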

Back in Austin, all of that vintage gear plus some relatively new machinery ate up 289 racks of space and consumed 2.8 megawatts of electricity. This has been compressed down to 160 racks in the Atlanta datacenter, which is currently drawing 1.2 megawatts. This is obviously a huge reduction in electricity costs, and it also represents a 22,918 metric ton reduction in carbon emissions per year for AMD.
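The scale of that compression is easy to check with a little arithmetic. The sketch below uses only the rack and megawatt figures quoted above; the helper name is hypothetical, and the per-rack averages assume the quoted power draw is spread evenly across all racks, which the article does not confirm.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage drop from a starting value to an ending value."""
    return (before - after) / before * 100.0


# Figures quoted for the Austin and Atlanta datacenters.
austin_racks, atlanta_racks = 289, 160
austin_mw, atlanta_mw = 2.8, 1.2

rack_cut = percent_reduction(austin_racks, atlanta_racks)
power_cut = percent_reduction(austin_mw, atlanta_mw)

# Average draw per rack in kilowatts (assumes an even spread across racks).
austin_kw_per_rack = austin_mw * 1000 / austin_racks
atlanta_kw_per_rack = atlanta_mw * 1000 / atlanta_racks

print(f"Racks cut by {rack_cut:.1f} percent")
print(f"Power cut by {power_cut:.1f} percent")
print(f"Austin: {austin_kw_per_rack:.1f} kW per rack")
print(f"Atlanta: {atlanta_kw_per_rack:.1f} kW per rack")
```

On these numbers, power falls faster than rack count, so the average draw per rack drops as well.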

The systems inside the new facility use a mix of Opteron processors from the past three generations, and Dominguez says AMD will be putting its systems on a replacement cycle of three to four years. Just like rival Intel, which has already detailed the 610,000-core EDA systems and the heavily virtualized back office systems in its datacenters to EnterpriseTech, AMD runs its EDA systems on bare-metal Linux servers and has been working hard to virtualize as many of the other workloads as possible to drive up utilization.

Before the move, the Austin datacenter had 1,197 physical servers dedicated to back-office functions and somewhere around a third of these machines were virtualized. Concurrent with the shift to the Atlanta center, the back office server count was compressed by nearly a factor of five to 258 machines, and now 90 percent of the machines are virtualized – including the machines that support AMD's SAP back office systems. AMD had already virtualized the desktops for end users, and now the engineering workstations are being virtualized using a mix of Hewlett-Packard BL465 and BL685 and Dell M915 servers, which are all blade servers. All of these corporate functions and virtual desktops and workstations are running atop VMware's ESXi hypervisor, but Dominguez says that AMD is looking at using Microsoft's Hyper-V for some workloads.
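A quick back-of-the-envelope check of the back-office consolidation, sketched in Python. The server counts come from the figures above; the virtualized counts are estimates derived from "around a third" and "90 percent", so treat them as approximations rather than numbers AMD supplied.

```python
# Back-office physical server counts quoted for each datacenter.
austin_servers = 1197
atlanta_servers = 258

# Consolidation ratio: how many Austin servers each Atlanta server replaces.
consolidation_ratio = austin_servers / atlanta_servers

# Estimated virtualized hosts, derived from the quoted fractions.
austin_virtualized = round(austin_servers / 3)        # "around a third"
atlanta_virtualized = round(atlanta_servers * 0.90)   # "90 percent"

print(f"Consolidation ratio: {consolidation_ratio:.2f} to 1")
print(f"Estimated virtualized hosts in Austin: {austin_virtualized}")
print(f"Estimated virtualized hosts in Atlanta: {atlanta_virtualized}")
```

The ratio lands a little above 4.6 to 1, which squares with the "nearly a factor of five" description.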

The EDA systems dominate IT operations at AMD, as you might expect, and are used by the several thousand engineers who design and test AMD's processors and chipsets. The core EDA system in Austin had about 100,000 cores before the move, and after the consolidation of several facilities into the Atlanta datacenter is completed around the middle of this year, that EDA setup will have around 107,000 cores across "many thousands" of servers. The exact number is not yet known because the consolidation is not yet done. The EDA systems use a mix of rack and blade servers and come from multiple vendors, and they are fed by just over 4 PB of storage. (It is not clear if AMD is going to be using its own SeaMicro machines, but presumably it will where they are appropriate.)

"We are going through some upgrade planning for our EDA system to see how we can enhance its performance and capability," explains Andy Bynum, vice president of global infrastructure and operations at the chip maker. "We would like to get into a three- to four-year cycle for the grid. But the nice thing about the EDA grid is that when I look at the performance stats and the way we are using the grid, there is not always much of a performance gain by refreshing every year or two. We can meet the engineer's performance expectations with a 3.5 to 5 year cycle. We are looking at using a burst service with an external provider right now."

The main EDA cluster was in Austin, but the company also has a smaller EDA cluster about a mile down the road from its Sunnyvale, California headquarters in what is called the "Space Park" datacenter. There are smaller facilities still in a glass house in Fort Collins, Colorado, and in an engineering office in Markham, Ontario that came through the acquisition of GPU maker ATI Technologies a few years back. These facilities are currently being moved down to Atlanta, again with old servers being replaced and more current ones retained. The engineering teams in Austin, Sunnyvale, Boston, and Markham will all submit jobs remotely to one large EDA system. AMD also has engineering teams in Shanghai and Suzhou in China as well as in Israel, who will tap into the EDA cluster remotely.

AMD has chosen its datacenter in Cyberjaya, Malaysia (the name literally means Tech Town) as a backup and disaster recovery facility. This facility is much smaller, rated at only 600 kilowatts total at the moment, but will expand as more disaster recovery and local processing for Asia are moved in. This location will have disaster recovery systems for both back office and EDA workloads.

EnterpriseAI