
On the Front Lines of Advanced Scale Computing: HPC at Boeing 

The Boeing Company has one of the largest corporate IT portfolios in the world, an organization comprising some 7,500 people, most of them in the United States. Instrumental to product design at the company’s commercial, defense and R&D units is the Enterprise HPC Service. Developed in recent years, it is a centralized advanced scale computing capability and a resource-sharing organization that was initially met with skepticism, if not downright hostility, by some Boeing engineers.

The HPC service is part of an overall strategy for delivering advanced scale computing into the hands of Boeing design engineers when they need it. The strategy isn’t just concerned with the why of using HPC, or which HPC technology to use. It’s also a methodology to maximize the value of HPC at Boeing and ensure fair distribution of HPC resources.

“We don’t just do HPC because it’s cool, even though it is,” said Joerg Gablonsky, associate technical fellow and chair of the Enterprise HPC Council, Information Technology. “We are a commercial company, so it’s all about driving business value. We need to drive future products. We need to design things faster, more efficiently, more accurately, more digitally and reduce the cost of designing.”

Boeing's Joerg Gablonsky

Gablonsky spoke last week in Austin at the SC15 “HPC Impact Showcase,” a series of presentations on real-world applications of HPC to advance business competitiveness and innovation.

Boeing is among the largest aircraft manufacturers in the world, the second-largest defense contractor and the largest U.S. exporter by dollar value. Its stock is listed in the Dow Jones Industrial Average. Yet for all that, the company – like virtually all companies – is in a perpetual, high-pressure race to innovate. Gablonsky and other IT managers at Boeing are in an ongoing quest for the most powerful and effective technology to support the incredibly complex job of designing, testing and building aircraft.

Pressure comes to bear on Boeing not only from other aircraft manufacturers but also from government regulatory requirements, from the FAA in the United States and the European Aviation Safety Agency. “If we bring out a new airplane in 2020 that has the same noise levels as current airplanes, it won’t be permitted to go into production and service,” Gablonsky said. “So we have this natural forcing function that always requires us to improve our processes and get better.”

Today, the HPC capability that supports Boeing’s commercial, defense and R&D organizations is built on a water-cooled HPE Apollo 8000 and Panasas parallel file storage. It supports 30 organizations with roughly 1,000 end users, nearly all of them design engineers. The Apollo 8000 is the latest in a series of system upgrades at Boeing that began in the 1980s with a Cray-1. By the early 2000s, Gablonsky said, Boeing was using an SGI shared memory system to support a small, decentralized HPC service made up of smaller data centers across the country wherever large Boeing facilities were located.

But management came to realize that local data centers were less cost-efficient and more constrained in their capabilities. An unconsolidated compute capability limited the top-end design and test work that individual engineers could perform.

“If I have a small cluster,” said Gablonsky, who joined Boeing in 2011, “even if I get the whole cluster, I only have a small set of nodes I can utilize; I can’t run a really big job.” So the organization moved toward a centralized HPC service – a transition about 75 percent complete – featuring large-scale strategic data centers with high availability and bigger shared systems.

Critical to the success of the service is resource elasticity. “Nobody has steady workloads,” said Gablonsky, “everybody has times when they use more, other times when they use less. When you have lots of different users from geographically diverse sites, working in all kinds of different disciplines, you have to have a way for everybody to work together.”

So Boeing created the Enterprise HPC Council, chaired by Gablonsky, that brings together HPC stakeholders: end users, software developers, software support staff, design teams running clusters and finance. Engineers who need to complete major jobs on tight deadlines, requiring more nodes than usual, petition the council for extra resources. The council convenes, the stakeholders hear the request, and those with relatively light workloads are asked to pull back on their use of HPC resources in return for more resources later, when they need them.
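The article does not describe how the council keeps track of these trades, but the bookkeeping it implies – one program temporarily borrowing capacity from groups with lighter workloads, to be repaid later – can be sketched minimally as a credit ledger. The following Python sketch is purely illustrative; the organization names, node-hour figures and the ledger mechanism itself are assumptions, not Boeing's actual process.

```python
from dataclasses import dataclass, field

@dataclass
class ShareLedger:
    """Toy ledger for temporary HPC capacity trades between organizations.

    A positive balance means an organization has lent node-hours and is
    owed capacity later; a negative balance means it has borrowed.
    """
    balances: dict[str, float] = field(default_factory=dict)

    def record_trade(self, borrower: str, lender: str, node_hours: float) -> None:
        # The lender gives up capacity now and accrues a credit for later.
        self.balances[lender] = self.balances.get(lender, 0.0) + node_hours
        self.balances[borrower] = self.balances.get(borrower, 0.0) - node_hours

# Hypothetical example: a program facing a deadline borrows 5,000 node-hours
# from two groups whose workloads are light that week.
ledger = ShareLedger()
ledger.record_trade(borrower="program_a", lender="group_b", node_hours=3000)
ledger.record_trade(borrower="program_a", lender="group_c", node_hours=2000)
print(ledger.balances)  # {'group_b': 3000.0, 'program_a': -5000.0, 'group_c': 2000.0}
```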

“We’ve built trust among the stakeholders to allow people to work together and address these spike needs, without impacting significantly anybody else,” Gablonsky said. “A lot of it is trust – building trust is very important.”

Combining a centralized Enterprise HPC Service with a council that has the authority to apportion resources was not greeted with universal applause. A Boeing design engineer told Gablonsky he never expected the scheme would work. But later, that engineer found himself needing to complete a major job under a tight schedule while another program also faced a project deadline. The council came together and a plan was drawn up to provide resources needed for both projects. Gablonsky said the skeptical engineer later conceded that the council is successfully carrying out its resource sharing mission.

With consolidation of HPC capabilities comes more compute power in the hands of engineers. While that’s attractive, it’s also true that some engineers liked having local systems in their own end-user organizations, under their own control, in the decentralized HPC arrangement that preceded the Enterprise HPC Council. “But now they are running way more than they ever would have with their local systems,” Gablonsky said.

“We really need to be a service,” he said. “While I love HPC, when it comes down to it, we need to enable the engineers to do their work to build the next gen of products. They’re the ones who bring in the money. We have to have a service perspective and attitude to really make sure that’s what we do and that’s what we drive to.”

Another aspect of the council’s strategy is to generate awareness and support for HPC among senior management through billing. Many organizations simply bill HPC usage at a corporate level; it’s paid by accounting, and that’s the end of it. But the Enterprise HPC Service has a different model, one based on confidence in the value delivered by HPC and the belief that the more awareness there is of HPC, the more value will be recognized. Instead of issuing a single corporate-level bill, the service charges individual business organizations back for their use of HPC resources. When the Boeing defense unit’s usage grew sharply in recent years and the bills went out, the charges prompted questions from defense unit executives.
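The article does not disclose the billing formula, but the chargeback idea it describes – metering each business organization's usage and billing it directly rather than absorbing the cost centrally – reduces to something like the sketch below. The per-core-hour rate and the usage numbers are made up for illustration.

```python
# Hypothetical per-core-hour rate and monthly usage by business unit;
# Boeing's actual rates and figures are not given in the article.
RATE_PER_CORE_HOUR = 0.05  # dollars, illustrative only

usage_core_hours = {
    "commercial": 1_200_000,
    "defense": 2_500_000,
    "research": 400_000,
}

# Each organization is charged back in proportion to what it consumed.
chargeback = {org: hours * RATE_PER_CORE_HOUR for org, hours in usage_core_hours.items()}

for org, charge in chargeback.items():
    print(f"{org:>10}: {charge:>12,.2f} USD")
```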

“They wanted to know, ‘What’s this HPC thing I’m paying for?’” Gablonsky said. “That’s actually great for us because it meant our engineers had to explain to the executives why HPC is so valuable and so useful.”

Still, questions about the value of HPC come up occasionally at Boeing. Some argue in favor of using public clouds – or even standard IT servers in the Boeing data center. “But we’ve done the benchmarks, and once we do the benchmarks those questions go away.”

Gablonsky said public clouds have security and cost problems.

“A lot of the data we have is very sensitive, and so it’s going to be very hard to convince anybody to have that data leave the company premises,” he said. “And there’s also the cost factor. At the scale where we are at, we are doing quite well regarding cost (being on-premises). Certainly I’m a big fan of cloud for small and mid-sized companies, but I think for the big companies that have the expertise, that understand how to run an HPC service, I think it’s going to be quite a while before cloud becomes competitive.”

Operationally, HPC resources at Boeing act like a private cloud service.

“We have resources pooled,” said Gablonsky. “That’s the whole concept of bringing all the clusters that used to be distributed all over the country into a small number of data centers that are broadly accessible from within the company. Anybody who’s in the company can utilize it; a user can get an account, log onto the head node, submit a job and be up and running in a day or two. When you talk to people (at other companies) who are used to having to wait a couple of months to get a VM, this is a big deal.”

A centralized resource requires remote access and remote graphics from the desktop, with the HPC system serving as a compute engine in the background.

“When you do CFD or anything with big grids, where you need to do pre- and post-processing, you need the ability to do that where the data is, rather than transferring big data sets back and forth,” Gablonsky said. “So we have an environment with clusters and parallel storage, and then we have remote graphics workstations that in many cases are actually better than what you can get on a desktop system, because they have the same processors we use in the HPC system, just with GPUs. So you can do full 3-D graphics with them even across pretty wide distances. It works really well.”

Gablonsky cited two examples of the Enterprise HPC Service in action. One was a ground noise assessment as part of a contract Boeing won from NASA for a supersonic vehicle. This was a computationally intense problem, involving integration of results from multidisciplinary analysis.

Gablonsky said NASA put stringent ground-level noise requirements in place based on physical tests in which people were subjected to noise levels to determine acceptable decibels.

The methodology developed by Boeing engineers would require thousands of CFD runs, so the engineers went before the Enterprise HPC Council to request more compute resources. Fully a quarter of the HPE Apollo 8000 was granted to the engineers while workloads from other groups were moved to another data center.
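The article does not say how those thousands of runs were orchestrated, but a CFD campaign of that size is typically driven by a sweep over flight conditions and design parameters. The sketch below only illustrates the idea of generating such a case list; the parameters, their ranges and the case format are hypothetical, not Boeing's actual workflow.

```python
import itertools

# Hypothetical sweep variables; the real design variables, ranges and solver
# are not described in the article. A production campaign would use much
# finer grids over many more variables, which is how the count reaches the
# thousands mentioned above.
mach_numbers = [1.4, 1.6, 1.8]
altitudes_ft = [40_000, 45_000, 50_000]
angles_of_attack_deg = [0.0, 2.0, 4.0, 6.0]

cases = [
    {"mach": m, "altitude_ft": alt, "aoa_deg": aoa}
    for m, alt, aoa in itertools.product(mach_numbers, altitudes_ft, angles_of_attack_deg)
]

print(f"{len(cases)} CFD cases to run")

for i, case in enumerate(cases):
    # In a real campaign each case definition would be handed to the batch
    # scheduler; here we only print it.
    print(f"case_{i:04d}: {case}")
```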

End result: the work was delivered to NASA successfully and on time, and Boeing hopes for follow-on work with the space agency.

Another example relates to using CFD to simulate air flow around a new wing design – a common enough, but nevertheless highly complex, job. Boeing engineers ran production codes at more than 1,000 cores per run for several weeks. This was a large-scale compute job for the Enterprise HPC Service that required major reallocation of HPC resources by the council. The engineers completed the study, worked with Georgia Tech to build a physical model and ran the wind tunnel test. The simulation and the physical test aligned.

From Gablonsky’s perspective, this was a double win: the simulation was proven out in the physical world, and the Enterprise HPC Council was able to deliver the compute power needed to complete the 1,000-plus-core runs without disrupting other work being done with HPC resources.

Gablonsky said he believes Boeing has an effective technological, operational and organizational HPC strategy in place for the future. The major variable is compute power. As more Boeing design engineers come to appreciate what HPC can do for them, they will request more powerful systems.

“I don’t foresee being in a position where the engineers say, ‘You know, we really don’t need any more computing capacity. We’re good.’ That’s never going to happen.”

EnterpriseAI