
Gaining Visibility into the ‘Black Box’ of At-Scale Data Storage, Usage and Performance 

There’s Big Data, and then there’s the oceanic volume of external and machine data processed to generate a weather forecast covering, say, the next 30 years. In such an at-scale environment, it’s doubly difficult to get insights about usage patterns and about which users or workloads are impacting performance and capacity. But this is the scale of volume and complexity grappled with by Vaisala, a Finnish global environmental and industrial measurement company (with U.S. headquarters in Seattle).

The extreme-scale storage technologies embraced by Vaisala’s Alternative Energy Group, which resides within the company’s weather division, are illustrative of the data storage and data management challenges faced by weather services providers. The requirements of Vaisala customers, which include solar, wind and hydro energy companies, are particularly complex – some need forecasts spanning the coming decades, others for literally the next 10 seconds.

According to Paul English, IT Manager, Energy R&D, Vaisala has one of the largest private supercomputer infrastructures dedicated to renewable energy, a cluster comprising 2,000 cores of Intel Xeon-based compute power with capacity for more than 400TB of usable data. English told EnterpriseTech that each day the system moves “tens of terabytes of data” on behalf of its clients.

Operating on this scale, English said, Vaisala ran into major data and performance management challenges. He characterized the previous system (he declined to name the vendor) as “a black box” that didn’t allow visibility into such factors as storage utilization, data load and the speed and location of changes to the system, which is critical given Vaisala’s increasing velocity of data ingest.

“Over time, as the system grew, a major challenge was managing all the metadata,” said English, such as “how many files there are in a given directory, how much storage there is, what the performance throughout the system is.” He said figuring out what storage a given piece of the file system is using, for example, took hours or days, as did reporting on workflows.

“We inevitably get questions about why a given HPC compute node is slow, and it had always been assumed to be the storage,” English said. “For every other storage system I’ve used, it was nearly impossible to determine the truth. We found ourselves constantly fussing with the system, moving data around as ‘SmartPools’ filled up — trying to understand what’s being used and what’s stored where, running reports that were out of date before they finished.”

Vaisala’s HPC architecture and file-oriented modeling put a premium on storage that scales capacity and performance linearly. That limited its options, and the company initially considered a simple refresh of the previous system. But last year, Vaisala and English began looking at alternatives, including Qumulo Core, positioned within the high-end storage market as delivering “data-aware” scale-out NAS, in which the storage system will “cease to be a passive, ever-growing dumpster for digital waste. Instead, it will become an intelligent collaborator in the storage, retrieval, management and curation of trillions of data objects.”

Vaisala uses Qumulo’s QC208, a flash-first hybrid system built on a software-only design that handles a variety of workloads and file sizes and is programmable through an interactive REST API. Its real-time analytics help users obtain answers about their data footprint by explaining usage patterns and identifying which users or workloads are impacting performance and capacity.
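The article doesn’t show what “programmable through an interactive REST API” looks like in practice, so here is a minimal sketch of pulling real-time activity data from a Qumulo-style cluster and ranking client machines by throughput. The cluster address, endpoint paths, credentials and response fields are illustrative assumptions, not documented Qumulo API details.

```python
# Illustrative sketch only: the cluster address, endpoint paths and response
# fields below are assumptions, not documented Qumulo API details.
import requests

CLUSTER = "https://qumulo.example.internal:8000"  # hypothetical cluster address


def get_token(username: str, password: str) -> str:
    # Hypothetical login endpoint returning a bearer token.
    resp = requests.post(f"{CLUSTER}/v1/session/login",
                         json={"username": username, "password": password},
                         verify=False)
    resp.raise_for_status()
    return resp.json()["bearer_token"]


def current_activity(token: str) -> list:
    # Hypothetical analytics endpoint reporting per-client activity samples.
    resp = requests.get(f"{CLUSTER}/v1/analytics/activity/current",
                        headers={"Authorization": f"Bearer {token}"},
                        verify=False)
    resp.raise_for_status()
    return resp.json().get("entries", [])


if __name__ == "__main__":
    token = get_token("admin", "password")  # placeholder credentials
    # Rank client machines by how much throughput they are generating right now.
    totals = {}
    for entry in current_activity(token):
        totals[entry["ip"]] = totals.get(entry["ip"], 0) + entry["rate"]
    for ip, rate in sorted(totals.items(), key=lambda kv: -kv[1])[:5]:
        print(f"{ip}\t{rate:,.0f} bytes/sec")
```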

“What Qumulo Core does is explain how the performance resources of the system are getting used in terms that our customers understand,” said Qumulo CEO Peter Godman. Using data visualization dashboards, the system shows “which machines on the network are consuming performance, and what individual assets, what directory structures are they working in when they’re consuming all of that performance.

“The traditional way of dealing with performance problems in enterprise data storage is you go look at every client, you figure out what they’re doing,” Godman said. “A lot of people will actually do periodic system call traces on client machines in their network to find out what’s accessing what and where the storage load is coming from.”

“But over time, as the system grew, a major challenge was managing all the metadata,” English said. “That’s where Qumulo came in. They gave us instant access to almost all the metadata about the system, such as how many files there are in a given directory, how much storage there is, what the performance throughout the system is. That was a huge value-add.”

Qumulo Core’s combination of analytics and intuitive dashboards, he said, “allows users to go from problem to solution in seconds, rather than hours.

“I can just pop into the console and say ‘Oh, it has nothing to do with the storage, you’re saturating the link on the HPC compute node itself,’” English said. “Qumulo is the only one that provides these visibility capabilities. In fact, non-technical people can find it through the GUI. And we actually have developers who use the Qumulo programming interface to get results that way as well.”

Vaisala’s data management complexity is exacerbated not only by the volume but also by the variety of data stored and processed to produce weather reports that project future solar, hydro and wind power generation conditions. Much of it is high-volume, machine-generated sensor data collected in the field and combined with massively complex forecasting models to generate meaningful assessments on the ground, and above it.

“Our simulations cover a cube of space across a rectangle of land, all the way up to the clouds – and then over decades of time,” said English. “As the resolution of those models and sensors goes up, the volume, variety and velocity of that data increase exponentially.”
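As a rough illustration of that scaling (not Vaisala’s actual model parameters), consider a gridded model in which raising the resolution refines the grid in all three spatial dimensions and proportionally shortens the time step:

```python
# Illustrative only: rough scaling of gridded model output when resolution
# increases. Assumes refinement in all three spatial dimensions plus a
# proportionally shorter time step -- not Vaisala's actual model parameters.
def output_scale_factor(resolution_multiplier: float) -> float:
    spatial = resolution_multiplier ** 3   # finer grid in x, y and z
    temporal = resolution_multiplier       # more time steps over the same period
    return spatial * temporal


print(output_scale_factor(2))  # doubling resolution -> roughly 16x more output
```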

Qumulo Core utilizes the Qumulo Scalable File System (QSFS), which provides real-time visibility into data and storage to solve the data management problems created by first-generation scale-out NAS products. QSFS was designed to leverage the price/performance of commodity server hardware coupled with flash, virtualization and cloud, and it works across all client/application layers, including UNIX/Linux, Windows and Mac.

English said his team uses Qumulo’s REST (Representational State Transfer) API, a widely used architectural style for Web services, to build custom tools and integrations that optimize data modeling and forecasting efficiency.

“Our ‘ah ha!’ moment was the realization we can integrate Qumulo’s API calls with other APIs to make the overall impact much greater,” English said. “We’re leveraging Qumulo’s calls and we can customize these insights to our specific needs.”
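As a hedged sketch of the kind of integration English describes, the snippet below joins per-directory capacity from a Qumulo-style API with a hypothetical internal job-tracker API to report how much storage each forecasting project consumes. All endpoints, paths and field names here are illustrative assumptions, not Vaisala’s actual tooling.

```python
# Illustrative sketch only: joining Qumulo-style capacity data with another
# internal API. The endpoints, paths and field names are assumptions, not
# Vaisala's actual tooling.
import requests

QUMULO = "https://qumulo.example.internal:8000"    # hypothetical cluster address
JOB_TRACKER = "https://jobs.example.internal/api"  # hypothetical internal service


def directory_capacity(token: str, path: str) -> int:
    # Hypothetical aggregate-capacity endpoint keyed by directory path.
    encoded = requests.utils.quote(path, safe="")
    resp = requests.get(f"{QUMULO}/v1/files/{encoded}/aggregates/",
                        headers={"Authorization": f"Bearer {token}"},
                        verify=False)
    resp.raise_for_status()
    return int(resp.json()["total_capacity"])


def project_for_path(path: str) -> str:
    # Hypothetical lookup mapping a storage directory to a forecasting project.
    resp = requests.get(f"{JOB_TRACKER}/projects", params={"path": path})
    resp.raise_for_status()
    return resp.json().get("project", "unknown")


if __name__ == "__main__":
    token = "..."  # obtained via the cluster's login endpoint, as sketched earlier
    for path in ["/forecasts/solar", "/forecasts/wind", "/forecasts/hydro"]:
        size_tb = directory_capacity(token, path) / 1e12
        print(f"{project_for_path(path):20s} {path:25s} {size_tb:8.1f} TB")
```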

Vaisala’s machine data processing and storage requirements will only grow, English said, due to increases in the volume of sensor data and higher resolution of government weather forecasting models. Vaisala has added another Qumulo storage node and anticipates more in the future.

“The baseline requirement is that we can add more storage and more throughput, basically more performance, as we go,” he said. “But it turns out that actually knowing what data you have and where you have it and how it’s acting is at least as important as being able to grow as our business grows. We’ve traded our time managing storage for opportunity leveraging the data. Storage management is not particularly productive for the organization; we’d rather concentrate on working with internal customers on products and services. We get way better ROI when I can talk with our teams and say ‘how can I help you?’ instead of being asked ‘what can you do for me?’”
