
An Urgent Plea from the Manufacturing Sector: Give Us Scalability 


Mohamad El-Zein has been on the front lines of advanced scale computing for decades, and like a veteran combat soldier who has seen one too many failed battle plans, he views the promises of next-generation HPC technology with a wary eye toward his vulnerable flank. As Manager, Advanced Materials and Mechanics at John Deere & Co., he’s a senior engineer working with design teams on heavy machinery for the farming industry and other markets.

Not that El-Zein is negative about HPC. But he is resistant to HPC hype, noting that the Linpack benchmark used to rank the world’s most powerful systems, the vaunted Top500 list, lacks relevance to the design problems he would like to solve on the clusters at John Deere. On the other hand, he is impressed with the performance improvements in hardware, networking and storage. And he is willing to adopt more HPC when it can be proven to perform tasks better, faster and at reasonable cost.

His primary technology grievance is the inability of commercial CAE, CFD and FEA applications to fully leverage the power of newer high performance systems and architectures.

John Deere's Mohamad El-Zein

Speaking at the recent EnterpriseHPC Conference, co-hosted by EnterpriseTech and HPCwire, El-Zein argued that the advances in modernizing engineering codes achieved at the national labs, advances already being adopted by big-ticket manufacturers (such as aerospace), must trickle down to mainstream industrial companies. Otherwise, HPC in the manufacturing sector will remain permanently hamstrung, and its full potential – faster design cycles and fewer costly physical models and tests – won’t be realized.

The core of the problem: John Deere’s extensive portfolio of CAE, FEA and CFD applications – “You name it, we got it,” El-Zein said – has become outmoded with the onset of advanced scale clusters.

“These codes are old and they were written for (mono-core) supercomputers,” he said. “Then there was a big transformation (to clusters). They don’t parallelize very well. In FEA, it’s very difficult to get good scalability because the problems aren’t very scalable. With finite element you have a big problem. The problem can explode on you.”
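To make “the problem can explode on you” concrete: in a 3D finite element model, halving the element size multiplies the element count by roughly eight, and solve time grows faster still. The sketch below is a back-of-the-envelope illustration only; the mesh sizes and the assumed solver-cost exponent are hypothetical, not John Deere figures.

```python
# Illustrative only: how 3D FEA problem size grows under uniform mesh refinement.
# Assumes a cube domain and a direct sparse solver whose factorization cost
# scales roughly as DOF^2; both are simplifying assumptions.

for refinement in range(5):
    elements_per_edge = 10 * 2**refinement      # halve element size each level
    num_elements = elements_per_edge**3         # 3D: element count scales cubically
    dofs = 3 * (elements_per_edge + 1)**3       # ~3 displacement DOFs per node
    relative_solver_cost = (dofs / 3_993) ** 2  # normalized to the coarsest mesh
    print(f"level {refinement}: {num_elements:>12,} elements, "
          f"{dofs:>12,} DOFs, ~{relative_solver_cost:,.0f}x solve cost")
```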

A classic problem for engineers, he said, is “weldment,” welds being the weakest point in metal machinery.

“Your car, your truck are full of welds,” El-Zein said. “We want to simulate a weld because that’s where cracks happen. If your car fails in a section you better give it away. That’s because everything fails at joints, either a bolt, or a rivet or an adhesive joint, or a weld. If they don’t fail there then that designer should be fired. That’s a rule. Cars don’t break in half. Joints are where they fail.”

But welds are more complex than non-engineers might think. Simulating all the variables involved in a weld is a major challenge for conventional CAE software running on a multi-core system. “You have to get all the right frequencies. Why? If the mass is off, the stiffness of the whole machine is off. So if the mass is off, and the stiffness is off – you’re off.”
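His mass-and-stiffness point shows up in even the simplest vibration model, where the natural frequency of a single degree-of-freedom system is √(k/m)/2π, so an error in either quantity shifts every predicted frequency. A minimal sketch, assuming made-up values (a real weldment model couples thousands of such modes):

```python
import math

def natural_frequency_hz(stiffness_n_per_m: float, mass_kg: float) -> float:
    """Natural frequency of a single degree-of-freedom spring-mass system."""
    return math.sqrt(stiffness_n_per_m / mass_kg) / (2 * math.pi)

# Illustrative values only; not from any actual machine model.
k, m = 2.0e6, 50.0
baseline = natural_frequency_hz(k, m)
mass_off = natural_frequency_hz(k, m * 1.10)  # mass estimate 10% too high

print(f"baseline: {baseline:.1f} Hz, with 10% mass error: {mass_off:.1f} Hz "
      f"({100 * (mass_off / baseline - 1):+.1f}%)")
```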

The solution, presumably, is applying more cores. But using more cores raises problems, starting with cost (software users are charged on a per-core basis).

“If I throw more at it, I’ll improve the solution, right?” El-Zein said. “The question is: what’s the cost? What’s the cost of me going into that little weld and testing with all that money? I’ll get fired. Note that I’m talking about welding – the crappiest method of joining things. But it’s very complicated. We really haven’t figured out much, trust me. We’ve figured out very little. We’ll keep improving. So when we think about HPC, it’s all relative. What do you want to get out of it? What scale do you want to go down to? So that’s why these problems can become very difficult.”

Then there are scaling limitations. El-Zein said that while it’s theoretically possible to scale out to several thousand cores, the law of diminishing returns sets in after several hundred.

“If a problem in CFD takes more than 500 or 600 cores, I don’t care what ISV code you use, you can attain a state of limits,” he said, citing problems associated with MPI. “I’m not gaining much by utilizing thousands of more cores. That’s from an industrial point of view. Maybe from an academic point of view they keep adding more cores. For me, I would just be wasting time with 3,000 cores that don’t add value. We don’t have the luxury of letting things run like that.”
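The diminishing returns he describes, compounded by per-core licensing, can be roughed out with Amdahl’s law: if even a few percent of a CFD run is serial work or MPI overhead, speedup flattens in the hundreds of cores while license cost keeps climbing linearly. In the sketch below, the 2 percent serial fraction and the per-core cost unit are assumptions for illustration, not measured values.

```python
# Rough model of speedup vs. per-core license cost. The serial fraction
# (standing in for serial work plus MPI overhead) is an assumed value.
SERIAL_FRACTION = 0.02
COST_PER_CORE = 1.0  # arbitrary license-cost unit

def amdahl_speedup(cores: int, serial_fraction: float = SERIAL_FRACTION) -> float:
    """Amdahl's law: speedup on `cores` given a fixed serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in (100, 500, 1000, 3000):
    speedup = amdahl_speedup(cores)
    cost_per_speedup = cores * COST_PER_CORE / speedup
    print(f"{cores:>5} cores: {speedup:5.1f}x speedup, "
          f"license cost per unit of speedup: {cost_per_speedup:6.1f}")
```

In this toy model, going from 500 to 3,000 cores buys less than 10 percent more speedup at six times the license cost, which is roughly the tradeoff El-Zein describes.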

Another challenge in industrial design, El-Zein said, is a shortage of software engineering experts for managing and optimizing the hardware-software clustered environment.

“The expertise is what everybody misses,” he said. “You have to have an expert looking at all these problems and if you don’t, you’re missing the boat. You need expertise in the codes themselves and in the IT side of managing clusters. There are a lot of issues in the industry that have not been resolved.”

Those few experts are expensive.

“If I were selling an $85 million combine,” he said to laughter from the audience, “I would go and get 12 PhDs in CFD. There’s an investment in people. You can say whatever you want about computers, but in the end it’s the people who count. That expertise is hard to find. I’m telling you this is the main problem. I have to have the ability to define my problem, what it is, and how I can make an investment in it.”

Yet in the public sector, El-Zein said, major progress is happening in CAE software scalability. The government labs, he said, have developed design and testing codes that scale to thousands of processors. While complexity and cost have kept companies like John Deere from adopting them, higher-end manufacturers with bigger IT budgets are hiring software engineers out of NASA and other government agencies to seed the use of advanced software in their organizations.

He cited Boeing and other aerospace companies that “seem to be using a NASA or national lab code in order to scale.”

The problem for most industrial companies, El-Zein said, is “I cannot afford to have a person or two or three or four to go get familiar with the national lab code that they wrote themselves, and then figured it out. There’s a lot of money involved from an investment point of view.”

A key barrier, he said, is that scalable codes used in government are not commercial-ready. Many existing CAE applications originally came from the government sector and were adopted by manufacturing companies only after extensive customization and GUI development.

“For many of those (new) codes, this needs to be done to it, somebody has to go and invest some time and effort in order to bring it to the public,” he said. “The government has to play a role. They have to do something. They have validated those codes to the utmost. Whether it’s discrete element analysis, FEA, CMP – they have their own codes. They were written to be scalable. The problem is I don’t have the money or the expertise on my team to go and learn those codes.

“See, in industry, we pay for GUIs. That’s very important to us. If you go to the national labs, they enjoy command lines. They find it exciting. But it’s not exciting for industry because we have to make something.”

Despite his frustrations, El-Zein said he retains a belief in the future of HPC in manufacturing.

“I’m not saying HPC is unimportant,” he said. “On the contrary, I believe HPC is the answer to many things we do today. However, we have to take care of the software issues. The hardware is no problem today. People have figured out storage and speed. But if I don’t have the right codes and the trust in them, I won’t take that step, I’ll stay on the conservative side.”
