Advanced Computing in the Age of AI | Friday, March 29, 2024

Giving Businesses Access to the World’s No. 2 Supercomputer for ‘the Intractable Problem’ 

If you’ve blown through the limitations of a cluster that can’t handle your modeling workloads, Oak Ridge National Laboratory’s Titan system is the stuff dreams are made of. Number two on the Top500 list of the world’s most powerful supercomputers, Titan is a Cray XK7 system equipped with tens of thousands of AMD Opteron CPUs and NVIDIA Kepler GPUs, total system memory of 710 TB and a theoretical peak performance of 27,000 trillion calculations per second.

An object of computing desire like this is only available to scientists at the national labs researching the inception of the universe or modeling the earth’s climate, right?

Wrong.

Titan is potentially available to any private enterprise through a program at Oak Ridge called ACCEL (Accelerating Competitiveness through Computational Excellence). In 2014, 34 industrial projects used 263 million compute hours on Titan, or 6 percent of the system’s total compute hours. The program grew in 2015. Specific numbers for the year have not yet been released, but Oak Ridge said the total number of industrial projects using Titan jumped to 43.
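
For a sense of scale, the 2014 figures above can be turned into a rough back-of-the-envelope estimate. The short Python sketch below uses only the numbers quoted in this article; because "6 percent" is a rounded figure, the implied system total is approximate, and the per-project mean is only an average (the article does not say how hours were split among projects).

```python
# Figures quoted in the article for industrial use of Titan in 2014.
industrial_hours = 263_000_000   # compute hours used by industrial projects
industrial_share = 0.06          # "6 percent" of the system total (rounded)
industrial_projects = 34

# Implied system-wide total; approximate because the share is rounded.
implied_total_hours = industrial_hours / industrial_share

# Mean usage per industrial project; actual allocations are not given.
mean_hours_per_project = industrial_hours / industrial_projects

print(f"Implied system total: {implied_total_hours / 1e9:.2f} billion compute hours")
print(f"Mean per industrial project: {mean_hours_per_project / 1e6:.2f} million compute hours")
```

By this estimate, Titan delivered on the order of 4.4 billion compute hours in 2014, with the average industrial project using a little under 8 million of them.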

Titan users range from large companies to startups: Chrysler/Fiat, Procter & Gamble, General Electric, General Motors, and United Technologies Research Center; rocket manufacturers Orbital ATK and SpaceX; software firms Ansys, Dassault Systemes Simulia, Numeca and startup Appentra.

Gaining access to Titan involves a peer-review application process, ranging from 30 days for smaller projects up to five months for larger ones, in which applicants make the case that the proposed project could deliver a critical competitive advantage. There’s no charge for Titan other than the user’s willingness to share project details publicly (or, for confidential projects, there's a cost recovery fee of $0.03196 per core hour – but no company has chosen this option in the program’s seven years).
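
To put the cost recovery fee in perspective, the sketch below prices a confidential project at the quoted rate. The rate of $0.03196 per core hour comes from this article; the 10-million-core-hour project size is a hypothetical figure chosen purely for illustration, not a number from Oak Ridge.

```python
from decimal import Decimal

# Rate quoted in the article for confidential (non-public) projects.
FEE_PER_CORE_HOUR = Decimal("0.03196")


def cost_recovery_fee(core_hours: int) -> Decimal:
    """Return the fee in dollars, rounded to the nearest cent."""
    return (FEE_PER_CORE_HOUR * core_hours).quantize(Decimal("0.01"))


# Hypothetical project size, assumed for illustration only.
hypothetical_core_hours = 10_000_000

fee = cost_recovery_fee(hypothetical_core_hours)
print(f"{hypothetical_core_hours:,} core hours would cost ${fee:,.2f}")
```

At that assumed size, keeping a project confidential would cost roughly $320,000, which may help explain why companies have so far opted to share their results instead.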

According to Suzy Tichenor, director of the industrial partnerships program in computing and computational sciences at Oak Ridge, the ACCEL program is looking for companies that want to use Titan in the right way.

“You have to be committed to scaling up,” she told EnterpriseTech. “You can’t just come to our center if you’re running out of internal compute resources and you have a big deadline coming up. Our projects generally are six to 12 months in length. They (companies) need to bring strategic problems that would really help them break out of the pack competitively. They have to be seemingly intractable problems that exceed what they can do internally. If they’re willing to put in the time and effort to scale up on a bigger system and solve that problem, that’s the kind of problem to think about taking to a national lab.”

One company with such a challenge is FM Global, a large commercial and industrial insurance company that has engaged in two research projects using Titan to simulate fires that occur in large-scale storage warehouses.

FM Global provides coverage to one in three Fortune 1000 companies, offering property insurance products and loss prevention research and engineering services to help clients prevent fires and minimize loss. Fire is the leading cause of commercial property damage in the U.S., resulting in roughly 40 percent of industrial property loss annually. Understanding how fires start and spread can save insurer and insured millions of dollars. Businesses with large storage warehouses are at particular risk because as warehouses get bigger it becomes increasingly difficult to provide adequate protection using traditional ceiling-mounted sprinkler systems.

While FM Global has an in-house cluster that has grown over the last several years to 2,000 cores, the company sought access to Titan to scale its FireFOAM code, based on open-source fluid dynamics software called OpenFOAM, to simulate in greater detail the complex physics that occur during an industrial fire. The company also has a large fire testing experimental facility for analyzing large-scale industrial warehouse fires.

Yi Wang of FM Global

But the facility can’t duplicate the scale of the new mega-warehouses, according to Yi Wang, group manager of FM Global’s Fire Dynamics Group. These industrial warehouses and distribution centers can exceed 100,000 square feet and rise from 60 to 100 feet in height. Many companies choose to use this extra height to store their commodities, packed in corrugated cardboard boxes, on wooden pallets stacked in tiers. Studying fires that could break out in warehouses of this size was beyond the capacity of FM Global’s physical and virtual testing capabilities.

FM Global worked with computer scientists at Oak Ridge to scale FireFOAM from 100 CPUs to thousands of CPUs, enabling very fine mesh simulations, with each cell calculating the processes for a very small area and sharing the data with neighboring grid points. The finer the grid, the more accurate the simulation, but also the more computationally demanding it becomes. The result: FireFOAM was scaled to simulate seven tiers (35 ft. high) of storage.
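
Why a finer grid demands so many more processors can be seen with a little arithmetic. The Python sketch below assumes a uniform three-dimensional grid in which each refinement level halves the cell edge length, so the cell count grows eightfold per level. The base resolution and core counts are hypothetical, and the function names are invented for illustration; the article does not describe how FireFOAM's meshes are actually built or partitioned.

```python
def cell_count(base_cells_per_axis: int, refinement_level: int) -> int:
    """Cells in a uniform cubic grid after halving the cell size `refinement_level` times."""
    cells_per_axis = base_cells_per_axis * 2 ** refinement_level
    return cells_per_axis ** 3


def cells_per_core(total_cells: int, cores: int) -> float:
    """Average number of cells each core handles if the grid is split evenly."""
    return total_cells / cores


BASE_CELLS_PER_AXIS = 100  # hypothetical starting resolution

for level in range(3):
    total = cell_count(BASE_CELLS_PER_AXIS, level)
    # Compare a 100-core run with a hypothetical 4,000-core run.
    print(
        f"level {level}: {total:>11,} cells | "
        f"{cells_per_core(total, 100):>9,.0f} per core on 100 cores | "
        f"{cells_per_core(total, 4_000):>7,.0f} per core on 4,000 cores"
    )
```

In this toy setup, two rounds of refinement take the grid from 1 million to 64 million cells, which is the kind of growth that quickly outstrips a 100-CPU run. The sketch counts cells only; it ignores the cost of exchanging data between neighboring partitions.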

Wang told EnterpriseTech that Titan delivers superior performance by a factor of five over his company’s internal cluster and has been used to run between 40 and 50 fire simulations, each one lasting approximately 48 to 72 hours. He said the initial research conducted on Titan helped FM Global develop a new “in-rack” sprinkler design for more effectively suppressing fires in large warehouses, a strategy it has shared publicly and with its clients.

“Fire modeling is a very challenging task,” Wang said. “We needed access to computer scientists who have access to larger hardware to push the limit of our model to solve the problem quicker and even more accurately. So we realized Oak Ridge would be the perfect match. They started to profile our code so it could be more efficient, using GPUs (in Titan) and the latest technology to help us.”

EnterpriseAI