
Nutanix Goes All-Flash For Server-Storage Hybrids 

The competition in the hyperconverged system market is heating up, and companies that once worried about taking on incumbent server and storage makers with their server-storage hybrids now have to worry about – and counter the attacks of – all-flash storage array makers who are positioning themselves as the fast storage hub at the center of server clusters.

Nutanix is one of the front-runners in the hyperconverged system market, and it pioneered the idea of having a distributed storage area network that runs on the same server nodes in the cluster where applications are running inside of virtual machines. (Hyperconverged systems tend to have other components, but this is the key feature that differentiates them from plain vanilla converged systems, which simply jam server nodes, storage, and networking into a single box with a single management framework and, usually, a single virtualization substrate.)

This Nutanix Distributed File System (NDFS) predates VMware's VSAN by several years, and was created by Mohit Aron, formerly the lead designer of the Google File System that underpinned the search engine giant's MapReduce tools for analyzing unstructured data. Aron was CTO at Aster Data Systems, where he worked with Nutanix co-founder Dheeraj Pandey to create that company's parallel database (Aster Data is now owned by Teradata). Pandey also helped create the first versions of Oracle's Exadata database cluster. Suffice it to say, Nutanix understands distributed systems and clustered file systems. NDFS runs inside of virtual machine guests in a server cluster and presents itself as a single entity for accessing data across the cluster; it can scale to hundreds of server nodes, well beyond the limits of most hypervisors and their management domains. NDFS started out on VMware's ESXi hypervisor, but recent releases also support the Hyper-V and KVM hypervisors.

The Nutanix stack has lots of pieces besides the file system, which EnterpriseTech walked through back in August, and it relies on plenty of open source code even though the Nutanix stack is not itself open source. The Cassandra NoSQL data store, which Facebook created because of the limitations of the Hadoop Distributed File System, manages the metadata describing all of the data stored in NDFS; Apache ZooKeeper, a management piece of Hadoop, is used to keep track of the configuration information for all of the Nutanix nodes. Nutanix has tweaked the MapReduce parallelization method at the heart of Hadoop to create parallel implementations of data compression and data deduplication algorithms, which are important for flash storage. The MapReduce method allows compression (using Google's Snappy algorithm) and deduplication (using a homegrown algorithm) to run in post-process mode, after data has settled down on the storage drives (flash or disk) in the cluster. Customers can also run deduplication and compression inline, against data in main memory, but this eats precious compute cycles and adds latency. In many cases, customers don't mind using post-process compression and dedupe.
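
To make the post-process idea concrete, here is a minimal sketch, in Python, of how a MapReduce-style pass over settled data might fingerprint blocks for deduplication and then compress the unique ones. This is an illustration only, not Nutanix's actual code; the 16 KB block size, the SHA-1 fingerprinting, and the use of zlib as a stand-in for Snappy are all assumptions.

```python
import hashlib
import zlib
from collections import defaultdict

BLOCK_SIZE = 16 * 1024  # assumed granularity; Nutanix's real unit may differ

def map_fingerprint(extent):
    """Map phase: split an extent into fixed-size blocks, emit (fingerprint, block)."""
    for off in range(0, len(extent), BLOCK_SIZE):
        block = extent[off:off + BLOCK_SIZE]
        yield hashlib.sha1(block).hexdigest(), block

def reduce_dedupe_compress(pairs):
    """Reduce phase: store one compressed copy per unique fingerprint."""
    store = {}                # fingerprint -> compressed block
    refs = defaultdict(int)   # fingerprint -> logical reference count
    for fp, block in pairs:
        refs[fp] += 1
        if fp not in store:
            store[fp] = zlib.compress(block)  # zlib stands in for Snappy here
    return store, refs

# Toy run: three extents with heavy overlap, as in cloned VM images
extents = [b"OS image" * 4096, b"OS image" * 4096, b"user data" * 4096]
store, refs = reduce_dedupe_compress(
    p for e in extents for p in map_fingerprint(e))
logical = sum(len(e) for e in extents)
physical = sum(len(c) for c in store.values())
print(f"{logical / physical:.0f}x reduction, {len(store)} unique blocks kept")
```

Because the map and reduce phases run after writes have been acknowledged, none of this work sits in the latency path of the application, which is the whole point of post-process data reduction.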

The important thing is that Nutanix did not have to bolt deduplication and compression onto NDFS after the fact to use flash in a cost-effective manner, as many disk array makers who added flash drives to their systems had to do. Compression and dedupe are generally poor performers on disk arrays (they have gotten better in recent products) and were more checkbox features than widely used ones. But on flash arrays, with their much more expensive storage media, data reduction is a must to keep from breaking the budget.

Greg Smith, senior director of product and technical marketing at Nutanix, says that with the combination of compression and dedupe, typical customers see something on the order of a 3X reduction for typical enterprise applications and up to a 10X reduction for virtual desktops (which have a lot of common software in their virtual machine images). This gives the Nutanix cluster about the same data reduction ratios as all-flash arrays.

Smith says that while all-flash arrays can offer lots of IOPS and decent effective capacity after dedupe and compression, they are nonetheless another silo in the datacenter, and this cuts against the very principles of hyperconverged infrastructure. And so Nutanix is rolling out the high-end NX-9240, which only has flash SSD storage, in this case S3700 flash drives made by Intel. The NX-9240 has two processor nodes in a single 2U enclosure, and each node has two ten-core "Ivy Bridge" Xeon E5-2690 v2 processors, which clock at 3 GHz. Each node comes with either 256 GB or 512 GB of main memory and also has six SSDs, and customers can choose from 800 GB or 1.6 TB units, for either 4.8 TB or 9.6 TB per node. Across a cluster with 32 nodes, which is a typical pod size for Nutanix clusters among enterprise customers, using 800 GB flash drives in the nodes would yield 153.6 TB of raw flash capacity and somewhere between 460 TB and 1.5 PB of effective capacity after deduplication and compression, depending on the workload. Call it 600 TB using the conservative industry average of 4:1 data reduction for all-flash arrays. At list price, including the Nutanix software stack, that 32-node cluster would cost $3.52 million. (That is $110,000 per node.)
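
Here is the back-of-the-envelope arithmetic behind those numbers, as a quick Python sketch (the 3X, 10X, and 4:1 figures are the data reduction ratios cited above):

```python
nodes = 32
ssds_per_node = 6
ssd_tb = 0.8                              # the 800 GB drive option

raw_tb = nodes * ssds_per_node * ssd_tb   # 153.6 TB raw flash
print(raw_tb * 3, raw_tb * 10)            # 460.8 TB to 1,536 TB effective
print(raw_tb * 4)                         # 614.4 TB at the 4:1 industry average
print(nodes * 110_000)                    # $3,520,000 at list price
```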

Let's do a little comparing. SolidFire, one of the all-flash array upstarts, told EnterpriseTech this week that it costs about $200,000 to $250,000 for customers to buy an SF9010 array with 60 TB of effective capacity after 4:1 data reduction. So boosting that to around 600 TB would cost somewhere between $2 million and $2.5 million, and that is without the 32-node cluster for actually doing the compute. In this case, the flash storage nodes and the compute nodes are separate. A server node minus disks and flash, with the same processors, memory, and dual 10 Gb/sec Ethernet ports, will list at just under $10,000, so call it another $320,000. In the worst case, then, the SolidFire-plus-cluster setup costs about $2.8 million at list price.
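
The same arithmetic for the SolidFire side of the comparison, assuming ten SF9010 arrays are needed to reach roughly 600 TB of effective capacity:

```python
arrays = 600 / 60                   # ten SF9010s at 60 TB effective each
compute = 32 * 10_000               # 32 diskless server nodes at ~$10,000 each
print(arrays * 200_000 + compute)   # $2,320,000 at the low end
print(arrays * 250_000 + compute)   # $2,820,000 worst case, call it $2.8 million
```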

The Nutanix array is a bit pricier in this simplistic comparison, but after the discounting and the reckoning of the management benefits of having a single integrated cluster, for certain kinds of workloads the Nutanix all-flash approach might make more sense than a cluster feeding out into a bank of all-flash arrays. (For other workloads, the reverse might be true.) As always, it comes down to cases, and all enterprise customers should benchmark their own applications on Nutanix and any other flash array options before choosing.

Smith tells EnterpriseTech that the Nutanix customers who have pushed it to create all-flash options want the same thing that customers who buy flash arrays want: consistent and predictable I/O performance. The immediate use cases for this are the usual suspects: accelerating virtual desktop infrastructure, online transaction processing, and databases. "We think we can cover 98 percent of enterprise workloads," Smith says of clusters built using the dual-node NX-9240.

The NX-9240 machines are shipping now and have already been deployed at a number of early adopter customers. Nutanix is working on benchmark tests to show their performance compared to all-flash arrays and will circle back with EnterpriseTech once these are completed. It will be interesting to see what the aggregate I/O operations per second in a cluster and the latency for transactions are in the Nutanix setup. All that Nutanix is saying for now is that a dual-node NX-9240 can support about 230 virtual machines.

In addition to the new all-flash nodes, Nutanix is also previewing a new Metro Availability feature for the upcoming 4.1 release of its software stack. With Metro Availability, the links between two Nutanix clusters can be stretched as far as 400 kilometers (just under 250 miles), and so long as the round-trip delay in the network linking them is under 5 milliseconds, one cluster can synchronously replicate its data to the other cluster. That 5 millisecond limit is not a hard one, but rather a suggested latency ceiling to ensure sufficiently robust performance of applications running across the clusters. (With synchronous replication, data has to be committed to the second cluster before the write is acknowledged on the primary cluster, so latency matters.)

The bandwidth necessary between the two sites will depend in large part on the rate of change of the data on the cluster; the more the data changes, the more bandwidth will be needed to keep the sites in sync. Nutanix is recommending that customers have redundant physical networks taking different paths between the two sites for availability purposes. You can mix and match different Nutanix node configurations at the two locations; they do not have to be perfectly mirrored clusters in terms of physical setups. Customers can use asynchronous replication from the second site to a third one if they are truly paranoid about recovery, and because this replication is asynchronous, there is no distance, latency, or bandwidth limitation.
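
That pairing of 400 kilometers and 5 milliseconds lines up with simple speed-of-light arithmetic. The sketch below assumes light travels through optical fiber at roughly two-thirds of its vacuum speed, or about 200,000 km/sec, and ignores switching and protocol overhead:

```python
distance_km = 400          # maximum supported distance between the two sites
fiber_km_per_s = 200_000   # light in fiber moves at roughly 2/3 of c

rtt_ms = 2 * distance_km / fiber_km_per_s * 1000
print(f"{rtt_ms:.0f} ms")  # 4 ms of the 5 ms budget, before any switching delay
```

In other words, propagation alone eats about 4 milliseconds of the suggested 5 millisecond budget at the maximum distance, which is why stretching the clusters much further than 400 kilometers would make synchronous replication impractical.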

The Metro Availability option will be bundled into the top-end Ultimate Edition of the Nutanix Operating System 4.1 software stack. It will be available later this year, concurrent with the 4.1 release, and will initially run inside of ESXi virtual machines. Support for Metro Availability with Nutanix running inside Hyper-V and KVM virtual machines will come in the first quarter. It takes five clicks to set up Metro Availability, two clicks to fail over, and one click to recover. Everything is run from the Prism Central management console.

The Ultimate Edition runs on clusters of any size, has all of the bells and whistles, and, significantly, allows for 2X or 3X redundancy factors of data replication inside the cluster for data resiliency. The Pro Edition has the same basic features, but strips out multi-site disaster recovery, which includes Remote Replication and Metro Availability. The Starter Edition can only span clusters of up to a dozen nodes and only has 2X replication of data on the nodes. It also has only inline compression and dedupe, with the MapReduce-distributed, post-process compression and dedupe stripped out. The Prism Central console and access to the REST APIs in the stack are also removed from the Starter Edition. Pricing for each of these editions was not available at press time, but clearly the bulk of the price of a Nutanix system is in the software, not the hardware.
