Advanced Computing in the Age of AI | Friday, March 29, 2024

AMD Accelerates Chip Testing With Flash Storage 

Like most large enterprises, chip maker AMD has some big central systems that it uses to run its core business, and on top of that, it has a cluster of many thousands of systems with over 100,000 cores running its electronic design automation (EDA) software, which is used to create its CPU, GPU, and APU chips. The company no longer manufactures its chips, but it does do some of its own chip packaging and all of its testing for wafers and finished processors. And the systems behind these testing operations are being goosed by flash storage.

AMD spun off its fabs nearly six years ago into GlobalFoundries, which makes a number of its processors. AMD has had a manufacturing relationship with Taiwan Semiconductor Manufacturing Corp. since its acquisition of graphics chip and card maker ATI Technologies for $5.4 billion eight years ago this summer, and TSMC now makes some of AMD's processors as well as its graphics chips. A set of applications created by AMD to test wafers is still running at GlobalFoundries, according to Ross Alaspa, enterprise architect for production applications at AMD. TSMC never tested AMD chips, so it does not run the AMD code locally at its facilities, but AMD does have eight production facilities and two engineering facilities that use its Unit Level Serial Data (ULSD) chip and package testing application.

It is these clustered systems that run the ULSD application that are being upgraded with flash-enhanced storage, and the reason is simple. To a certain extent, AMD's $5.4 billion business and its chip volumes are both gated by its ability to package and test its chips and get them to its system partners and into its sales channel. AMD does not do all of its own chip packaging and farms some of it out to three companies called ASE Group, STATS ChipPAC, and STIL, which are all based in Asia. Alaspa cannot be precise about exactly what AMD's volumes are, but tells EnterpriseTech that one facility alone tests more than 1 million parts per month and that the total volume of testing is close to 100 million chips per year.
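To put those volumes in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the two figures Alaspa gave, both of which are approximate, and it assumes (hypothetically) that testing is spread evenly around the clock all year, which real facilities almost certainly do not do. The function and variable names are illustrative only.

```python
# Figures quoted in the article; both are approximations from Alaspa.
PARTS_PER_MONTH_ONE_SITE = 1_000_000   # "more than 1 million parts per month" (a lower bound)
CHIPS_PER_YEAR_TOTAL = 100_000_000     # "close to 100 million chips per year"

SECONDS_PER_YEAR = 365 * 24 * 3600


def average_rate_per_second(chips_per_year: int) -> float:
    """Average chips tested per second, assuming an even 24x7 spread (an assumption)."""
    return chips_per_year / SECONDS_PER_YEAR


def site_share(parts_per_month: int, chips_per_year: int) -> float:
    """Fraction of the yearly total handled by one site at a steady monthly rate."""
    return parts_per_month * 12 / chips_per_year


avg_rate = average_rate_per_second(CHIPS_PER_YEAR_TOTAL)
share = site_share(PARTS_PER_MONTH_ONE_SITE, CHIPS_PER_YEAR_TOTAL)

print(f"Average test rate across all sites: {avg_rate:.2f} chips per second")
print(f"One site at 1 million parts per month: at least {share:.0%} of the total")
```

Under those assumptions, the databases behind the testing application have to absorb results for a few chips every second on average, with the single facility Alaspa cited accounting for at least about an eighth of the total.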

In general, AMD tests chips while they are still part of a wafer and then does significantly more testing once they are packaged. The initial testing is just to see which cuts of the die are good enough to be used in some fashion and therefore merit the cost of being put into a ceramic package. Once a chip is packaged up and has pins, either for direct soldering to a motherboard or for plugging into a socket, it is put through more robust testing and validation, followed by "fusing the part" to lock in its clock speed, cache size, and core count, among other features. For Opteron server parts, the testing is even more rigorous. The chips are put in a motherboard for system-level tests, which involve booting up various operating systems on the test board and running diagnostics on the chip. They also get an extended burn-in to see how they perform under load and in a hot environment, which reduces infant mortality among the chips in the channel.

While the ULSD chip testing application is an important one that has a direct effect on the chip throughput at AMD, the company's IT department is, like many in the enterprise, extremely conservative and loath to change any system that is working and whose downtime would radically affect the business. The ULSD application was created about ten years ago, says Alaspa, and was initially deployed on clusters of Sun Microsystems Sun Fire Opteron-based servers running the Solaris variant of Unix. Back then, these clusters were based on a lot of four-core servers that ran the Oracle 9 database management system, which was used to store the unit test data for each chip as it rolled out of the fab or packaging facility. Several years ago, to boost the performance of the Oracle databases and of the ULSD application servers that feed into them and, to a certain extent, act as a cache and security buffer for those database servers, AMD upgraded to Opteron-based DL385 machines from Hewlett-Packard that have 24 cores each. AMD also moved from Oracle 9 to Oracle 10. Each testing cluster has a database server and redundant ULSD application servers, with data replicated between the databases in the various sites for high availability. More recently, the databases have been upgraded to Oracle 10.2 – Alaspa says AMD is hesitant to move to either Oracle 11g or Oracle 12c databases because once something is working, you only mess with it if you need a feature enough to risk downtime. The ULSD application servers – which AMD calls socket servers because they act as a communication buffer between the testing and packaging equipment and the Oracle databases – have been ported from Solaris to Red Hat Enterprise Linux, but the database servers are still running the vintage Oracle 10.2 database atop the vintage Solaris 10 Unix operating system.

The largest database in the main testing facility operated by AMD has 8 TB of data, and the two engineering facilities have databases that are about 1 TB in size.

The HP DL385 servers in the ULSD chip testing clusters were backed by disk-based CLARiiON AX4 network-attached storage arrays from EMC. All but the smallest of the testing clusters had four of these units, which had about 2 TB of usable capacity after you used RAID data protection across the drives. Despite the fact that AMD makes its own chips, the IT department has to justify every penny spent. To boost the performance of the ULSD testing application, AMD decided after some benchmarking and modeling not to move to faster processors, but to get disk arrays that were based on flash or at least had some flash caching to speed up their performance. AMD has chosen the T620 hybrid flash and disk arrays from Tintri to goose its Oracle databases. These arrays have 1.2 TB of flash memory and 13.5 TB of SATA disk capacity, and as Alaspa puts it, the flash caching "gives better than 15K RPM disk drive performance without having to pay the all-flash price."

Moving to the Tintri T620s allowed AMD to cram into one array the same storage that it took four of the CLARiiON AX4 arrays to hold, and the flash caching on the device, which keeps the hottest Oracle data in flash memory, was able to reduce test and packaging data query times by nearly 50 percent. AMD has to keep all of the data concerning all of the chips, stored by serial number, that are in the process of being tested or packaged in the facilities at any given time. It also needs to hold 18 months of data about chip testing in those facilities. It just so happens to work out that the archival data fits on the spinning disks and the current data, after being put through Tintri's de-duplication software for the flash, fits in the flash. Which just goes to show you that the real trick in hybrid arrays is getting the right ratio of disk to flash to suit the particular workload, and that boosting processor core count or speed is not always the answer – not even for a CPU maker like AMD.
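The capacity figures quoted in this story can be lined up in a few lines of Python as a rough sanity check. This is only a sketch: it assumes the "about 2 TB of usable capacity" applies to each AX4 unit rather than to all four combined, it treats the T620's 13.5 TB of disk as fully usable, and it ignores the effect of de-duplication, for which no ratio was given. The variable names are illustrative only.

```python
# Capacity figures quoted in the article, in terabytes.
AX4_UNITS = 4
AX4_USABLE_TB_EACH = 2.0   # assumption: "about 2 TB usable" is per unit
T620_FLASH_TB = 1.2
T620_DISK_TB = 13.5        # assumption: treated as usable capacity
LARGEST_DB_TB = 8.0

# Total usable capacity of the four older arrays in a typical cluster.
old_total_tb = AX4_UNITS * AX4_USABLE_TB_EACH

# Does one hybrid array cover what the four older arrays held?
one_array_is_enough = old_total_tb <= T620_DISK_TB

# How much flash there is relative to disk, and relative to the largest database.
flash_to_disk = T620_FLASH_TB / T620_DISK_TB
flash_vs_largest_db = T620_FLASH_TB / LARGEST_DB_TB

print(f"Four AX4 arrays: {old_total_tb:.1f} TB usable")
print(f"One T620 holds that on disk: {one_array_is_enough}")
print(f"Flash-to-disk ratio: {flash_to_disk:.1%}")
print(f"Flash as a share of the largest database: {flash_vs_largest_db:.0%}")
```

On these assumptions the flash tier is a bit under a tenth the size of the disk tier and about 15 percent of the largest database, which is consistent with Alaspa's point that only the current, hottest data needs to live in flash.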

EnterpriseAI