The ‘Extinction Event’ for Performance Analysis and Capacity Management 

The traditional way IT conducts performance analysis and capacity management for its infrastructure is, officially, a dinosaur. From the largest-scale HPC architectures to distributed environments built on commodity hardware and software to proprietary platforms, the standard approaches are increasingly irrelevant and incapable of realizing the required efficiencies and performance.

First, a quick review of what performance analysis and capacity management (planning) have always been about: provisioning just enough physical (real) resources to continuously and cost-effectively ensure service performance as measured by the only metrics that matter: response time (and its cousin, latency) and transactional throughput. How much work gets done, how quickly, and at what cost. At the end of the day, no matter the architecture, all compute work must execute on physical resources. The traditional approach for decades has been to measure IT resource utilization rates and keep them below thresholds that would lead to poor service performance.
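
As a rough illustration of why utilization served as that early-warning signal, consider the textbook M/M/1 queueing relation, where response time is service time divided by (1 minus utilization). This is a minimal sketch with hypothetical numbers, not a model of any particular system:

    # Why utilization was the traditional proxy for performance: in a simple
    # M/M/1 queueing model, response time R = S / (1 - U), where S is service
    # time and U is utilization. All numbers below are hypothetical.

    def mm1_response_time(service_time_s: float, utilization: float) -> float:
        """Approximate response time for a single queue at a given utilization."""
        if not 0 <= utilization < 1:
            raise ValueError("utilization must be in [0, 1)")
        return service_time_s / (1.0 - utilization)

    service_time = 0.010  # 10 ms of pure service time (assumed)
    for u in (0.50, 0.80, 0.90, 0.95):
        r_ms = mm1_response_time(service_time, u) * 1000
        print(f"utilization {u:.0%}: response time ~{r_ms:.0f} ms")

Response time roughly doubles from 50% to 80% utilization and doubles again by 95%, which is exactly why utilization thresholds became the traditional rule of thumb.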

The first game-changing factor is the ever-increasing pace of commoditization of IT hardware at every level of the stack: CPU, memory, storage, and network. Ever-higher-performance “components” are available at ever-lower prices, finally allowing economies of scale to be applied to high performance computing. With the fundamental “units” of computing rising in performance and falling in cost, the payoff from squeezing higher utilization out of each unit becomes a game of diminishing returns.

The second revolution involves the creation of massively parallel architectures (built from these commodity components) and, importantly, the growing availability of software languages, development environments, and applications that can scale horizontally, near-linearly, across “N” compute units. If you can simply “add more” to get more performance, the benefit of optimizing any individual “unit” begins to shrink toward zero.
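
A back-of-the-envelope sketch makes the point. Assuming near-linear horizontal scaling and hypothetical per-node numbers, a hard-won 10% per-node optimization buys no more capacity than simply adding two commodity nodes to a 20-node cluster:

    # Hypothetical comparison under near-linear horizontal scaling.
    nodes = 20
    per_node_tps = 1_000                             # transactions/sec per node (assumed)

    gain_from_tuning = nodes * per_node_tps * 0.10   # tune every node for +10%
    gain_from_scaling = 2 * per_node_tps             # or just add two more nodes

    print(f"gain from tuning every node 10%: {gain_from_tuning:,.0f} tps")
    print(f"gain from adding 2 nodes:        {gain_from_scaling:,.0f} tps")

Both paths yield the same 2,000 transactions per second, and only one of them requires engineering effort.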

Virtualization at every layer represents a third “shot amidships” to traditional approaches. From servers, now in their second decade of virtualization, and the massive adoption of runtime virtual machines such as the JVM, to storage and network virtualization, abstraction layers are being added across the stack. Then comes Docker, which adds yet another layer of complexity to the question: “Just how do we measure resource utilization, and which (virtual, and constantly changing) resources do we measure?” Billion-dollar companies are invested in trying to solve just ONE of these layers, let alone optimizing across all of them.

Taking these three factors and then adding automated provisioning as a wrapper completes the “extinction event” for traditional tools and approaches. After all, if an ever-cheaper and higher-performing set of resources can be arranged in a horizontally scalable architecture, virtualized into logical “units of compute,” and made available on demand, what is the point of worrying about individual resource utilization rates? The rapid adoption of public clouds would seem to be a testament to this growing reality.

Finally, though, not all workloads are candidates for the “nirvana” of pure public cloud environments, where someone else worries about all the “technical stuff” and all you worry about is cost. Many workloads have latency requirements that demand an on-premises solution. Many applications handle transactional data that is governed and regulated such that it cannot be exposed externally. And many applications are simply not built in a stateless fashion. Most vexingly, public cloud providers do not guarantee service performance and response time; they have the ability, and the right, to “bounce” a host at any time, and it is the subscriber’s responsibility to architect for failure.

So, with a “cloudy forecast” in everyone’s future (whether public, hybrid, or private), what should IT professionals do to optimize the age-old performance and capacity balancing equation?

A few hints:

First, for performance and capacity you must measure what matters: response time and throughput at every level of the stack, from end users, network, servers, and storage to the application code/logic itself. There is no single magic tool that can do it all, so plan on best-of-breed tooling for the foreseeable future.
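
As a minimal sketch of what that looks like at just one layer, the application code, the snippet below times a hypothetical unit of work and derives response time percentiles and throughput; every other layer of the stack needs its own equivalent instrumentation:

    import statistics
    import time

    def do_work():
        """Stand-in for a real transaction (hypothetical placeholder)."""
        time.sleep(0.005)

    response_times = []
    start = time.perf_counter()
    for _ in range(200):
        t0 = time.perf_counter()
        do_work()
        response_times.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start

    p95 = statistics.quantiles(response_times, n=20)[-1]
    print(f"median response time: {statistics.median(response_times)*1000:.1f} ms")
    print(f"95th percentile:      {p95*1000:.1f} ms")
    print(f"throughput:           {len(response_times)/elapsed:.0f} transactions/sec")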

Second, regarding cost, you must measure the elements that drive the operating expenses of your IT resources. Yes, this includes traditional CapEx and OpEx for your data center assets. But it must also include the cost factor that is driving an ever-larger share of the expense: power consumption.
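
To make the power point concrete, here is a back-of-the-envelope calculation; the server draw, PUE, and electricity rate are all assumptions for illustration, not measurements:

    # Hypothetical annual power cost for a small server fleet.
    servers = 100
    avg_draw_watts = 350      # average draw per server (assumed)
    pue = 1.6                 # data-center power usage effectiveness (assumed)
    price_per_kwh = 0.12      # USD per kWh (assumed)
    hours_per_year = 24 * 365

    kwh_per_year = servers * avg_draw_watts / 1000 * pue * hours_per_year
    print(f"estimated annual power cost: ${kwh_per_year * price_per_kwh:,.0f}")

Even this modest fleet lands near $59,000 per year in electricity alone, which is why power belongs in the same ledger as CapEx and OpEx.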

Only by marrying these new metrics and gaining a holistic view of what matters, how it performs, and what it costs can the goals of performance analysis and capacity optimization be realized in today’s complex world. Only then do you have any hope of truly optimizing performance and cost over time.

Dave Wagner is the CTO and co-founder of OpsDataStore.
