Migrating Toward Advanced Scale Computing: Overcoming Paralysis 

If it’s true that only 10 percent of U.S. manufacturers have adopted advanced scale computing, it’s also true that 100 percent of the C-suite members of those companies are aware that HPC technology can deliver significant competitive advantage in product design. They also have the gnawing awareness that some competitors are already leveraging this technology – technology that, according to a recent report from industry watcher IDC, returns an astounding $514 for every dollar invested. Yet most companies remain stuck with their traditional technology infrastructures, missing out on the cost savings, accelerated development cycles and superior design capabilities that HPC holds out for them.

Why?

It’s complicated. Adoption of advanced scale computing is challenging in myriad ways. That much-bandied-about 10 percent figure, cited above, comes from a study by the U.S. Council on Competitiveness, which also found that only 10 percent of servers worldwide are used in HPC clusters. Most companies are confined to the limited capabilities of desktop-based CAD/CAM/CAE systems, and more than half admit they have modeling and simulation needs that their current systems can’t handle.

The cost of new HPC equipment is only one barrier to broader use. An IDC report indicates that less than 10 percent of HPC TCO (total cost of ownership) stems from hardware; staffing and staff training account for nearly 70 percent. A raft of specialties and skill sets is required to successfully launch an HPC implementation: trained computational scientists with domain software expertise, performance engineers who can maximize hardware performance and job scheduling, and visualization experts, to name a few. The problem is acute for smaller companies: it’s often observed that the smaller the company, the greater the need for HPC support resources.

Yet some companies, including SMEs, are overcoming these challenges by leveraging outside resources that help lower costs and ease access not only to HPC hardware, software and networking, but also to the expertise needed to tailor the technology to each company’s unique requirements.

Brewer Science is one. The Rolla, MO, company, which develops materials, processes and equipment for the fabrication of devices used in the microelectronics industry, encountered challenges faced by many small manufacturers migrating to HPC: computing resources inadequate to the task at hand, limited simulation expertise and a need for quick turn-around.

David Martin, Industrial Outreach, Argonne National Laboratory

Brewer needed enhanced molecular modeling resources to better understand the use of adhesives for bonding substrates of integrated circuits used in their microdevices. The company turned to the industrial outreach program at Argonne National Laboratory, led by David Martin, who discussed Brewer’s partnership with Argonne at SC15 last month in Austin. Martin and his team facilitated Brewer’s access to the Argonne Leadership Computing Facility (ALCF), whose mission is to provide advanced computing resources, computational scientists and other support services to research organizations and commercial businesses.

The featured ALCF computing resource is Mira, a 10-petaflops IBM Blue Gene/Q supercomputer and one of the fastest supercomputers in the world. The ALCF normally fields proposals from applicants seeking to use Mira to advance big science challenges, so working with a company like Brewer, which had a relatively small-scale product design objective, was something different for the ALCF.

“The large companies we work with tend to look like research organizations in the Department of Energy or the National Science Foundation,” Martin said. “We work with GE Research or Boeing, and they have enough resources, long enough time lines and the skill sets, so they’re very successful at getting through our allocation process.

“But small businesses are not; small businesses have a lot of challenges,” he said. “They typically have small compute resources, just workstations or small clusters, or maybe they’ve played around with AWS (Amazon Web Services), but they tend not to have dedicated computing resources, or the ones they have are not well adapted to HPC work – they have slow Ethernet interconnects or a network of workstations that don’t have the processing capabilities to run parallel jobs.”

Martin said small companies also tend to use commercial software more heavily than ALCF’s typical user organization – applications such as MATLAB and ANSYS, which he said work well in limited environments but tend to “fall over” when used for higher resolution simulations or larger-scale workloads.

“A lot of companies understand simulation, but it’s from a canned perspective,” said Martin. “They understand how to run a simulation of an existing process or an existing physical simulation, but they don’t understand how to implement the new models.”

Another characteristic of small businesses: limited time. Large organizations are often comfortable with a six-month application process followed by a nine-month wait for compute and consulting allocations. But smaller businesses usually don’t have the funding for long-term projects of this kind.

So ALCF handled Brewer differently in a number of ways. Instead of requiring a project of major scientific or engineering proportions, they were open to Brewer’s relatively small project goals. Instead of Mira, they offered Brewer compute cycles on Jazz, a 350-node x86 cluster with InfiniBand – a system that, though modest compared to Mira, was “massively larger” than the resources Brewer had access to, according to Martin. Jazz also offered Brewer a familiar environment, including Linux and the standard types of batch job submission the company already knew.
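
To make the idea of a “standard job submission” concrete, here is a minimal sketch of how a batch job might be generated and handed to a cluster scheduler. It assumes a Slurm-managed Linux cluster with the sbatch and srun commands; the article does not say which scheduler Jazz actually ran, and the account, partition and executable names below are hypothetical placeholders.

```python
"""Minimal sketch: generate and submit a batch job on a Linux cluster.

Assumes a Slurm scheduler (sbatch/srun); the account, partition and
executable names are placeholders, not details from the article.
"""
import subprocess
import textwrap


def write_job_script(path: str, ntasks: int) -> None:
    """Write a simple MPI batch script requesting `ntasks` cores."""
    script = textwrap.dedent(f"""\
        #!/bin/bash
        # Hypothetical job name, allocation account and partition; adjust
        # these to the site's actual settings.
        #SBATCH --job-name=chem-sim
        #SBATCH --account=brewer_project
        #SBATCH --partition=compute
        #SBATCH --ntasks={ntasks}
        #SBATCH --time=02:00:00

        # Launch the simulation under MPI (placeholder executable and input).
        srun ./solver input.inp > output.log
        """)
    with open(path, "w") as f:
        f.write(script)


def submit(path: str) -> str:
    """Submit the script with sbatch and return the scheduler's reply."""
    result = subprocess.run(
        ["sbatch", path], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    write_job_script("run_sim.sh", ntasks=128)
    print(submit("run_sim.sh"))  # e.g. "Submitted batch job 123456"
```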

“This is the way we can reach out to non-traditional communities, by changing our model,” Martin said. “If we had put them through the normal ALCF process, they would have made an application, we would have rejected it and that would have been way outside the scope of what we were able to do.”

Argonne computational scientists and ALCF staff worked with Brewer to optimize their computational chemistry software portfolio, helping them migrate from Gaussian, which Martin said has scaling challenges, to GAMESS, a quantum chemistry application. “We were able to analyze what they were trying to do, and GAMESS provided that capability.” This helped accelerate Brewer’s time-to-results, which was important to the managers making the case for the project’s funding with Brewer senior management.
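
The article doesn’t detail how the team evaluated scaling, but a basic strong-scaling check – timing the same problem at increasing core counts – is a common way to quantify the kind of scaling challenges Martin mentions. The Python sketch below assumes a generic mpiexec launch; the solver and input file names are hypothetical placeholders, not ALCF’s or Brewer’s actual workflow.

```python
"""Minimal sketch of a strong-scaling check: time the same problem at
increasing MPI rank counts and report speedup and parallel efficiency.

Assumes a generic `mpiexec` launcher; `./solver` and `input.inp` are
hypothetical placeholders, not the applications named in the article.
"""
import subprocess
import time


def run_case(ranks: int) -> float:
    """Run one case at the given MPI rank count; return wall time in seconds."""
    start = time.perf_counter()
    subprocess.run(
        ["mpiexec", "-n", str(ranks), "./solver", "input.inp"],
        check=True,
    )
    return time.perf_counter() - start


if __name__ == "__main__":
    timings = {ranks: run_case(ranks) for ranks in (16, 32, 64, 128)}

    base_ranks = min(timings)
    base_time = timings[base_ranks]
    for ranks in sorted(timings):
        speedup = base_time / timings[ranks]         # relative to smallest run
        efficiency = speedup / (ranks / base_ranks)  # 1.0 means perfect scaling
        print(f"{ranks:4d} ranks: {timings[ranks]:8.1f} s  "
              f"speedup x{speedup:4.2f}  efficiency {efficiency:.0%}")
```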

In addition, ALCF connected Brewer with the Institute for Molecular Engineering at the University of Chicago, with whom the company signed a small consulting contract for molecular modeling development.

The results: In less than three months, Brewer determined they had been coating their substrates too thickly with adhesive. A thinner coating not only improved adhesion, it lowered Brewer’s costs by reducing materials use.

“We were able to help them be successful fairly quickly,” Martin said, “and now they use some of the GAMESS software that they have experimented with to start running at higher scales. We’re hopeful that will be a feeder into the ALCF application program to run larger workloads.”

The ALCF program at Argonne is one of many industrial outreach programs available to commercial companies looking to migrate to advanced scale computing. Many supercomputing centers, such as the Ohio Supercomputer Center, which partners with companies across the country, offer computing allocations to small and medium-sized companies.

The U.S. Council on Competitiveness is a policy group whose mission includes expanding the base of industrial HPC users. Its well-known NDEMC (National Digital Engineering and Manufacturing Consortia) was a pilot project that succeeded in helping 16 of 20 participating manufacturers implement advanced scale computing systems.

“You have to work very diligently with companies that may not know the ROI of HPC,” said Chris Mustain, Vice President of the Council, “that may not have the expertise in-house, have been doing CAD/CAM on the desktop, and maybe are not fully appreciating how they can move on from there. It’s a slow process that will ultimately be lubricated by market necessity, because companies will have to compete with someone else who is using this technology. If you’re a manufacturer and you’re not using it yourself, you know you need to.”

Another HPC resource center, the National Center for Supercomputing Applications (NCSA) at the University of Illinois, tends to partner with large manufacturers, such as Kodak, Procter & Gamble and Shell, to deliver HPC systems and consulting expertise.

Merle Giles, National Center for Supercomputing Applications

Merle Giles, director of NCSA’s Private Sector Program & Economic Impact, emphasizes the cultural and behavioral changes – from the R&D department up to senior management – that enterprises must undergo in order to fully exploit HPC.

“HPC in the enterprise tends to be managed by IT,” Giles said, “and IT is typically an expense line, a cost to be minimized. IT isn’t usually viewed in terms of providing an ROI. It’s typically tasked with making existing business processes more efficient, like payroll. HPC is the opposite, it’s about building new things, it’s about innovation. So IT department oversight is one barrier to HPC in the enterprise.”

Giles said this extends to companies accounting for HPC costs as a “G&A” (general and administrative) item, noting that costs are something companies try to minimize.

Another cultural challenge for HPC is complexity. The difficulty many senior managers have in grasping the technical aspects of advanced scale systems has confined HPC to the laboratory and R&D space, rather than being a boardroom conversation, Giles said.

“The primary barrier is behavior,” he said. “We see this routinely. The way many companies use HPC is often a carry-over from everything being done on a desktop. So the way in which HPC can be used, understanding how fast these things can work, is very different from how it would run on a single node. That’s what companies have not quite embraced yet. They’re smart as all get out, but if they haven’t been on a modern machine and pushed their own code in ways that address all aspects of HPC, then they’re leaving a big performance delta on the table.”

Giles said behavioral barriers extend to procurement: with HPC innovating and accelerating so rapidly, it’s imperative that enterprises purchase technology in a way – via leasing, cloud computing or outsourcing – that lets them keep pace with technological change. Yet this approach can conflict with the conventional buy-and-hold corporate procurement model.

“I look at HPC as a lifestyle,” said Giles. “We’re in this game forever. We’re 30 years in. Manufacturing companies have pushed single discipline simulations hard for a dozen years. Now they want to go multi-discipline. The complexity is going through the roof. So how do we deploy HPC in ways that don’t look like what we did a dozen years ago?”

EnterpriseAI