Advanced Computing in the Age of AI | Friday, April 19, 2024

Dow Seeks Chemistry Between Clusters And Cloud 

Like many organizations, Dow Chemical Company has to provide local compute and storage for running modeling and simulation applications at multiple facilities. But it wants to run applications across those facilities in a way that drives up utilization on all of its clusters, and it is exploring how it might use public cloud capacity to supplement its on-site machines.

Dow is headquartered in Midland, Michigan, and not surprisingly, this is where one of the company's largest simulation and modeling clusters is located. It is in fact the central cluster in a hub-and-spoke setup across the company's research facilities, and one that will be upgraded in the coming year to reflect that status. The idea is to run as many jobs at the hub as possible to keep utilization high, with jobs that can't be easily run back at headquarters hosted on clusters at the remote sites.

William Edsall, senior HPC analyst at Dow, gave a presentation at the recent SC14 supercomputer conference in conjunction with cluster management software maker Adaptive Computing, and he talked a bit about the clusters Dow has and the challenges it is facing when it comes to managing workloads for its research and development arm.

Dow is one of the largest manufacturers in the world, with $57 billion in sales in 2013 and over 53,000 employees working to make the more than 6,000 different chemicals and plastics it sells to all manner of customers. The company operates five different divisions. Agricultural Sciences is a $7.2 billion division that sells seeds, pesticides, and other crop protection products. Consumer Solutions is a $4.6 billion unit that sells materials for the auto and electronics industries. The Infrastructure Solutions unit brings in $8.5 billion in sales and offers building, construction, and coating materials as well as various energy and water purification products. The Performance Materials and Chemicals division, which had sales of $14.9 billion, makes epoxies, polyurethanes, vinyl, chlor-alkali, and other chlorinated organics. And the Performance Plastics division, with $22.6 billion in revenues, makes specialty plastics and packaging for a wide range of uses in electronics, telecommunications, and consumer industries.

The Midland datacenter has a cluster linked together using 40 Gb/sec InfiniBand that has a total of around 1,700 cores. A similarly configured machine is located at the Dow chemical plant in Freeport, Texas. The Dow Agro Sciences research site in Indianapolis, Indiana, has its own cluster with around 1,700 cores but is using 10 Gb/sec Ethernet networking instead of InfiniBand. And a facility in Collegeville, Pennsylvania, (northwest of Philadelphia) has a smaller cluster with around 500 cores, lashed together with 10 Gb/sec Ethernet as well.

Dow was an early pioneer with Linux and Beowulf clusters back in the day, and its current machines are configured with CentOS Linux and run a variety of applications that are used to simulate the processes of chemical and plastic manufacturing at Dow's plants, among other things. According to Edsall, Dow is using hyperscale-class servers, which cram four server nodes into a 2U chassis, as its main cluster component these days, and it tends to buy machines from Dell and Atipa Technologies, the latter being a reseller of Supermicro iron. The machines use a mix of nodes that are based on Intel's "Nehalem" Xeon 5500 and "Sandy Bridge" Xeon E5-2600 v1 processors.

The storage that back-ends the clusters is mix and match, says Edsall. Dow has Isilon arrays from EMC for applications that have really high I/O requirements, about 750 TB in total, and another 180 TB of various other kinds of storage.

Dow mostly runs code from third-party application suppliers and does not have a lot of homegrown code. About half of the utilization on the clusters is driven by ANSYS Fluent, the popular computational fluid dynamics application, which among other things is used to simulate chemical plants and the reactions that take place inside of them. Dow also uses Star-CCM+ from CD-adapco, which combines CFD with heat transfer and stress analysis, and Gaussian 09, which is used to simulate the chemical properties of materials. The Dow application stack includes Materials Studio from Accelrys as well, which simulates all kinds of materials, including catalysts for chemical reactions.

Given that its software stack comes off the shelf, Dow's hardware requirements are driven by the scalability of that ISV code and the accelerators and other features that it supports. So to increase the throughput on the clusters, Dow has to be clever about the cluster management and job scheduling tools it chooses and how it uses them. Dow uses the open source Rocks cluster management tool, along with the Torque resource manager and the Moab HPC Suite 8 job scheduler from Adaptive Computing. Edsall calls the method of scheduling that Dow uses overflow job scheduling, and the basic idea, he explained to EnterpriseTech, is to flood the clusters with pre-emptible jobs that can be suspended at any moment and their state stored in memory. This allows them to be fired back up in an instant when a high priority job is done with the resources. By cranking up the number of low priority (but high throughput) workloads on its clusters, Dow can significantly increase the utilization on its clusters.
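
The effect of overflow scheduling on utilization can be illustrated with a toy model. The Python sketch below is not Moab or Torque configuration and does not reflect how Dow actually set up its scheduler; the function name, the 1,700-core cluster size used in the example, and the demand trace are all hypothetical and chosen only to show the idea that preemptible low priority jobs soak up every core that high priority work leaves idle.

```python
from typing import Sequence, Tuple


def simulate(total_cores: int,
             high_demand: Sequence[int],
             overflow: bool) -> Tuple[float, int]:
    """Toy model of overflow scheduling (hypothetical, not Moab's algorithm).

    high_demand lists the cores that high priority jobs want in each time step.
    Returns (average utilization between 0 and 1, cores preempted in total).
    """
    busy = 0
    preempted = 0
    prev_high = None
    for wanted in high_demand:
        high = min(wanted, total_cores)      # high priority work always runs
        if overflow:
            low = total_cores - high         # preemptible jobs fill idle cores
            if prev_high is not None:
                # Growth in high priority demand suspends that many low
                # priority cores; they resume when the demand drops again.
                preempted += max(0, high - prev_high)
        else:
            low = 0                          # idle cores stay idle
        busy += high + low
        prev_high = high
    return busy / (total_cores * len(high_demand)), preempted


# Hypothetical demand trace for a 1,700-core cluster.
trace = [850, 1275, 425, 850]
print(simulate(1700, trace, overflow=False))
print(simulate(1700, trace, overflow=True))
```

In this model the high priority jobs never wait, so the utilization gain comes entirely from work that would otherwise not have run at that moment. The model ignores the memory needed to hold suspended job state, which a real deployment would have to budget for.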

Before moving to overflow job scheduling, the Fluent and Star-CCM+ CFD applications were scheduled on the clusters using reserved resources and usage caps were set for other workloads. The average utilization on the clusters was around 50 percent of their peak compute capacity, and in the best of cases it hit as high as 75 percent. After moving to overflow job scheduling, the usage caps came off and utilization on the clusters is running near 100 percent all the time. So far in 2014, the Dow clusters have run 4.4 million jobs and consumed 21.3 million total CPU-hours of compute.
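
As a quick sanity check on those figures, the short Python calculation below works out the average size of a job and the rough throughput gain implied by the utilization numbers. It uses only the values quoted above; treating "near 100 percent" as exactly 1.0 is a simplifying assumption.

```python
# Figures quoted in the article for 2014 to date.
jobs_run = 4_400_000
cpu_hours = 21_300_000

avg_cpu_hours_per_job = cpu_hours / jobs_run
print(f"Average CPU-hours per job: {avg_cpu_hours_per_job:.2f}")

# Utilization before and after overflow scheduling.
# Assumption: "near 100 percent" is rounded to 1.0 for this estimate.
utilization_before = 0.50
utilization_after = 1.00
throughput_gain = utilization_after / utilization_before
print(f"Approximate gain in delivered compute: {throughput_gain:.1f}x")
```

The average of a little under five CPU-hours per job suggests that the workload mix includes a great many small jobs alongside the large CFD runs.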

Now that it has the utilization up on its clusters, Edsall says that Dow is dabbling with cross-cluster job scheduling and is looking to burst to the cloud where it is appropriate. The company is looking at Moab 8.1, which is now in beta, to see how to deploy work to the Amazon Web Services cloud. And because it has its clusters running at nearly 100 percent capacity, Dow is looking to upgrade those clusters, starting with the hub system in Midland.

"There is no burst for us," Edsall explained to EnterpriseTech after his presentation. "The iron is red hot."

That makes it a lot easier to ask for an upgrade to support growing workloads. The hub cluster at the Midland headquarters, which is nicknamed "Tsunami," will be getting an upgrade in 2015. The system will have up to 5,000 cores when the upgrade is finished and will be based on Intel's latest "Haswell" Xeon E5-2600 v3 processors. Dow is sticking with QDR InfiniBand (40 Gb/sec) rather than moving to 100 Gb/sec (or EDR) InfiniBand, a choice that Edsall says is driven by the requirements of its software suppliers. The Tsunami cluster will have a full fabric with 2:1 oversubscription. Importantly, Dow will be standardizing its InfiniBand networks, using consistent switches and adapters across its clusters, which will simplify interoperability and maintenance.

The Tsunami machine will be equipped with some Tesla GPU coprocessors from Nvidia as well as some Xeon Phi accelerators from Intel, but the precise number of each has not yet been decided. "We know we want some GPUs and Xeon Phis, but we don't have a strong use case yet," Edsall said. Dow has done some in-house application development on Xeon Phi coprocessors so far, and some of its ISV partners support Tesla GPU accelerators, but sometimes the acceleration is not supported on the portions of the code that Dow is using, according to Edsall.

While using public cloud resources is also something that Dow is examining, the company's internal clusters have their costs amortized over three to five years and are running at very high utilization rates. And for the workloads that Dow has examined, running the applications on the AWS cloud is about three times as expensive as running them in-house.
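
The link between utilization and that cost gap can be made explicit with a small model. The Python sketch below is an illustration under stated assumptions, not Dow's own accounting: it assumes in-house costs are entirely fixed (amortized hardware and operations), that cloud capacity is billed purely per core-hour used, and that the threefold figure was measured with the in-house clusters at roughly 100 percent utilization. The function names are hypothetical.

```python
def relative_in_house_cost(utilization: float,
                           reference_utilization: float = 1.0) -> float:
    """Cost per *used* core-hour, relative to the cost at the reference utilization.

    Assumes in-house costs are fixed, so unit cost scales as 1 / utilization.
    """
    return reference_utilization / utilization


def break_even_utilization(cloud_multiple: float,
                           reference_utilization: float = 1.0) -> float:
    """Utilization at which in-house unit cost equals the cloud's unit cost."""
    return reference_utilization / cloud_multiple


cloud_multiple = 3.0  # cloud is about three times as expensive, per the article
for u in (1.0, 0.75, 0.5):
    print(f"utilization {u:.0%}: in-house cost {relative_in_house_cost(u):.2f}x, "
          f"cloud cost {cloud_multiple:.2f}x")
print(f"break-even utilization: {break_even_utilization(cloud_multiple):.0%}")
```

Under these assumptions the cloud would only reach cost parity if in-house utilization fell to about a third, which is well below even the 50 percent average Dow saw before it changed its scheduling approach.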

The cloud presents a number of different challenges for Dow and indeed all companies that run simulation and modeling. The ISV applications that Dow is using in its research and development efforts can, in theory, run on the public cloud, but generally speaking this software was tuned to run on bare metal, not on virtualized systems and networks. (To be fair, virtualization has come a long, long way in the past decade on X86 systems.) Moreover, software vendors are just now working out their licensing for public clouds. Another issue is that for some simulations that Dow runs, the initial datasets are small, which means they could be uploaded to the cloud easily enough, but the output from the simulations is quite large. This makes it a challenge to move the data back to the internal clusters.

"It is not the inputs, but the results that are the problem," Edsall explained. This is particularly true of any time-series simulations, which can output up to 1 TB of data in a run, he added.

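To give a sense of scale for the 1 TB figure, the Python sketch below estimates how long it would take to pull one such result set back from a cloud provider. The link speeds and the efficiency factor are hypothetical, since the article does not say what wide area bandwidth Dow has, and 1 TB is taken as 10^12 bytes.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours needed to move size_tb terabytes over a link of link_gbps gigabits/sec.

    efficiency is the fraction of the raw line rate actually achieved
    (1.0 means a perfect, uncontended link, which is optimistic).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical WAN link speeds; the article gives no figures for Dow's links.
for gbps in (0.1, 1.0, 10.0):
    print(f"{gbps:>5} Gb/sec: {transfer_hours(1.0, gbps):.2f} hours per 1 TB run")
```

Even on an ideal 1 Gb/sec link, each 1 TB result set takes more than two hours to bring home, and a batch of such runs multiplies that accordingly.
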
EnterpriseAI