
Hadoop Finds Its Place In The Enterprise 

The echoes of the blitzkrieg of product announcements at the Strata + Hadoop World conference in New York are fading. Now is a good time to step back and give an assessment of the role that the Hadoop platform is playing at large enterprises and how it is evolving.

Hadoop is a transplant from the hyperscale datacenter world to the enterprise, and it has come a long way as a platform in the six years since it started to go commercial. Its effect on data analytics specifically, and on modern application development and the very nature of business more broadly, cannot be overstated. But the fact remains that the actual installed base of Hadoop clusters is a lot smaller than many might expect given the amount of innovation that is going on around the platform.

We did some poking around at Strata + Hadoop World, talking to the main distributors and analysts about the state of the market and installations at enterprises. This time last year, EnterpriseTech did the rounds to try to get a sense of what the hockey stick curve of Hadoop cluster sizes looked like as companies moved from proof of concept to production clusters. The curve was basically the same shape at most companies, and in the case study sessions at this year's conference, the basic pattern still held.

The largest Hadoop clusters in the world – none of which can be identified by name because customers think of their Hadoop systems as so strategic that they can't be described in detail – have several thousand nodes. These are the exceptions, not the rule. Most customers start out with dozens of server nodes and certainly well under 100 nodes for proof of concept projects. Then, as they move into production, those Hadoop clusters grow to hundreds of nodes as the datasets they chew on expand. Over time, as companies find more data to correlate and also more application use cases for that data, the clusters continue to grow and can reach as high as 1,000 or 2,000 nodes. That would be a supercomputer-class cluster if it were focused on floating point math rather than text processing.

The growth in the clusters is mitigated to a certain extent by Moore's Law, which allows processor makers to increase the number of cores (and hence the aggregate compute capacity) inside of a server node at a fairly regular clip even though the raw performance of a core does not improve all that fast. (A little north of 3 GHz seems to be the top base clock speed for an X86 core, and this will not change much in the future.) Similarly, while disk drives are getting more capacious, they are not getting any faster, with 15K RPM basically being as fast as any disk will ever spin. Disk drive capacity and cores per chip have both doubled in the past three years, and that roughly keeps pace with the overall growth in data in the IT market at large, which is estimated at 50 to 60 percent per year depending on whom you ask. You can have several petabytes of storage in a rack of Hadoop servers these days, and the capacity that can be crammed into a single server using inexpensive disk drives is by and large the deciding factor for the cluster size.
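As a rough illustration of why per-node disk capacity ends up being the deciding factor, here is a minimal back-of-envelope sizing sketch in Python. The node configuration, headroom factor, and growth rate are illustrative assumptions rather than figures from any vendor; the three-way replication is simply HDFS's default.

```python
import math

def nodes_needed(dataset_tb, drives_per_node=12, tb_per_drive=4,
                 replication=3, headroom=0.75):
    """Estimate server count for a dataset of a given size (in TB)."""
    raw_per_node = drives_per_node * tb_per_drive             # raw TB per server
    usable_per_node = raw_per_node / replication * headroom   # after replication and scratch space
    return math.ceil(dataset_tb / usable_per_node)

# A 1 PB working set on dense 12 x 4 TB nodes needs on the order of 84 servers.
print(nodes_needed(1000))

# The same dataset growing at 60 percent per year pulls the cluster along with it.
for year in range(1, 4):
    grown_tb = 1000 * 1.6 ** year
    print(year, round(grown_tb), nodes_needed(grown_tb))
```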

In general, it is still desirable to have one core per drive, but that ratio is not held to in a religious fashion. For instance, at American Express, which has one of the largest production Hadoop clusters in the world (outside of the hyperscale datacenter operators and giant Web application providers), the server nodes tend to have 16 cores and 24 drives. A rack of Hadoop iron at American Express has just under 300 cores and nearly 1 PB of disk capacity, the company said in a recent presentation.
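A quick bit of arithmetic shows how those rack-level figures hang together. Only the per-node core and drive counts and the rack totals come from the presentation; the implied per-drive capacity is an inference, not a stated figure.

```python
# Sanity check on the American Express rack figures quoted above.
cores_per_node, drives_per_node = 16, 24
rack_cores, rack_capacity_tb = 300, 1000      # "just under 300 cores", "nearly 1 PB"

nodes_per_rack = rack_cores // cores_per_node         # 18 nodes per rack
drives_per_rack = nodes_per_rack * drives_per_node    # 432 drives per rack
print(rack_capacity_tb / drives_per_rack)             # roughly 2.3 TB per drive
```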

While the size of Hadoop clusters has not changed all that much, thanks in part to the mitigating effects of Moore's Law for chips and disks, the size of the installed base has grown, albeit from a smaller customer base than many might think given all of the hoopla around Hadoop.

Tony Baer, principal analyst for big data research at Ovum, tells EnterpriseTech that he estimates there were perhaps around 1,000 Hadoop clusters in various stages, from proof of concept to production, installed in the world in early 2014. The big distributors such as Cloudera, MapR Technologies, and Hortonworks were adding on the order of 50 to 75 new Hadoop customers per quarter, which would put the installed base of enterprise customers at somewhere around 1,500 to 2,000 clusters by the end of this year. This does not sound like a large number to some. But you have to remember that many companies are still, after all of these years of boisterousness in the data analytics arena, only getting started. Retailer LL Bean talked about its first Hadoop cluster in a presentation at the conference, and at lunch we sat with the manager of the very first – and still prototype – Hadoop system going into credit card provider Capital One. Enterprises are very conservative, unlike the hyperscale companies that are, for the most part, inventing the Hadoop stack along with academics and professional software companies hoping to get rich on Hadoop.

"Think of this as the early days of data warehousing back in 1995," Baer explained. "That's where we are right now. We are in a hockey stick of growth, there is no doubt about that."

Baer says that the proliferation and maturation of SQL query tools that ride atop the Hadoop Distributed File System and its alternatives in the Hadoop stack will be the key driver of enterprise adoption in 2015. SQL has long been the method of choice for tickling information out of relational databases, and the advent of the Drill, Impala, and HAWQ query engines for the Hadoop stack allows companies to leverage their SQL skills on the semi-structured data stored in Hadoop.
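As a minimal sketch of what that looks like in practice – with a hypothetical host, port, and table name, and Impala standing in for any of the engines named above – the same aggregate query an analyst would run against a relational warehouse can be issued over data sitting in HDFS through a standard DB-API client such as impyla:

```python
from impala.dbapi import connect

# Hypothetical coordinator host; 21050 is Impala's default HiveServer2-protocol port.
conn = connect(host='impala-coordinator.example.com', port=21050)
cur = conn.cursor()

# An ordinary SQL aggregate, executed directly over files stored in HDFS.
cur.execute("""
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM   web_orders
    WHERE  order_date >= '2014-01-01'
    GROUP  BY region
    ORDER  BY revenue DESC
""")
for row in cur.fetchall():
    print(row)
```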

Matt Aslett, director of data management and analytics at 451 Research, puts the commercial Hadoop market at roughly the same level: north of 1,000 and south of 2,000 clusters across all of the commercial distributors, Aslett reckons, and he is working to get more precise numbers right now. Excluding hardware and systems integration costs, Aslett says there was about $374 million in Hadoop subscription revenue in 2013, and that this will grow at a compound annual growth rate of 49 percent through 2018, when revenues for support and software will reach $2.7 billion, according to 451 Research's model. Incidentally, the company estimates that Cloudera and the Elastic MapReduce (EMR) service on the Amazon cloud each generated about $100 million in revenues in the past year.

If downloads are any indication, then there are many more companies playing with Hadoop than are even doing formal proofs of concept. Jack Norris, chief marketing officer at MapR Technologies, says that there have been many tens of thousands of downloads of its Community Edition, that probably thousands of those installations are actually in use, and that a few percent of them turn into paying customers.

According to the latest Big Data Analytics Survey from Wikibon, of the 110 companies polled, 51 percent said they had downloaded the Apache Hadoop code to roll their own distribution, and another 24 percent said they used a freebie version of a distribution provided by one of the commercial distributors. Only the remaining 25 percent said they were using a paid commercial Hadoop distribution. Obviously, converting those free customers to paying customers will be part of the growth for the Hadoop market, but the increasing complexity of the Hadoop platform, with its myriad extensions, will also probably drive commercial adoption. At some point, it makes sense to pay someone else to babysit the code and get on with creating applications – unless you are operating at a scale beyond what the commercial software providers can handle. Facebook broke Hadoop so long ago with its clusters – it has tens of thousands of nodes with several hundred petabytes of data dedicated to analytics – that it essentially had to create its own distribution.

A trend that MapR is seeing is that companies start out with the open source Apache Hadoop distribution or one of the freebie editions of the other distributions and then move to a commercial-grade MapR enterprise edition. MapR says that it has 500 commercial customers using its distribution, and while it cannot name many names, it does tend to have some large installations. And it has customers that are mixing and matching different generations of components.

"We have customers that have 30 applications running on one cluster, others that are using YARN and some that are using MapReduce 1.0," says Norris. "The other distributions can't run multiple versions of components in the same cluster like we can."

The other interesting statistic from MapR is that about 30 percent of its paying customers have gone with its Enterprise Database Edition, which includes the MapR-DB integrated NoSQL database. While some customers are playing around with flash storage for accelerating Hadoop and others are kicking the tires on Spark for in-memory processing, these are still largely experimental, according to Norris.

Over at Pivotal, the platform spinout of VMware that pulls together a Hadoop distribution, the GemFire in-memory database, the Greenplum data warehouse, and the Cloud Foundry platform cloud, it is still early days for Hadoop. Todd Paoletti, vice president of product marketing and operations for Pivotal, tells EnterpriseTech that "traditional enterprises are getting their acts together" when it comes to data analytics. Square, Google Wallet, and Apple Pay are putting competitive pressure on the big banks, and Nest (another division of Google now) is putting pressure on Johnson Controls, to name just a few of the many battles between upstarts and industry titans that have analytical firepower as the main weapon.

Pivotal was created by VMware and its parent, EMC, to give customers a complete analytics and application platform as well as application development services when they need help. (That's the Pivotal Labs part.) By its very nature, this presents a longer sales cycle but, according to Paoletti, a deeper engagement with customers. For instance, the Bank of New York has moved some of its techies into the Palo Alto offices at Pivotal to do co-development right there, where the product is created and enhanced.

At the moment, Pivotal has about 1,200 customers that have at least one element of the portfolio, with over 1,000 GemFire customers and on the order of 500 Greenplum data warehouse customers. The Hadoop distribution accounts for a "double digit" percentage of the Pivotal base, but it is growing fast as companies adopt the Pivotal platform. The company rolled out its Big Data Suite back in April, with flexible licensing that is priced on a per-core basis and that allows companies to switch among Hadoop and HAWQ, GemFire, or Greenplum within the same cluster as their applications change. Customers really want a suite of tools, says Paoletti, as well as the flexible licensing, and the suite is selling at a much faster pace than Pivotal expected. The key metrics that Pivotal is watching are the adoption of the Big Data Suite and Cloud Foundry – both of which are experiencing what Paoletti called "very high growth rates."

But again, what enterprises really want is an SQL interface into Hadoop data. With these tools maturing rapidly and their SQL command coverage growing – and provided the performance is not too bad compared to data warehouses with real relational databases – SQL could be the big driver for adoption of Hadoop in the enterprise. The Hadoop distributors have been careful to show benchmarks that pit the various SQL layers for Hadoop against each other, but the real test is how well SQL-on-Hadoop performs against real data warehouses using relational databases, and at what cost.

Baer says that companies will typically spend somewhere on the order of $50,000 for a proof of concept Hadoop cluster, including servers, switches, and software. The sweet spot for initial deployments is a starter cluster that costs on the order of $150,000 to $250,000. By the standards of large shared memory systems or commercial data warehouses, this is not a lot of money. A Hadoop distribution, depending on the features, costs roughly $7,000 per node for a software license, but a data warehouse with a big relational database sitting on it can cost $40,000 per TB and an in-memory database costs even more than that.
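To put those prices on a comparable footing, here is a rough sketch of software cost per usable terabyte using the figures Baer quotes, plus one assumption about node density (12 x 4 TB drives per server); hardware, integration, and operations are left out of both sides of the comparison.

```python
# Rough cost-per-terabyte comparison using the quoted software prices only.
hadoop_license_per_node = 7_000     # "roughly $7,000 per node" for a Hadoop distribution
warehouse_cost_per_tb   = 40_000    # "$40,000 per TB" for a relational data warehouse

raw_tb_per_node    = 12 * 4                # assumed node density, not a quoted figure
usable_tb_per_node = raw_tb_per_node / 3   # HDFS three-way replication

hadoop_cost_per_tb = hadoop_license_per_node / usable_tb_per_node
print(round(hadoop_cost_per_tb))                            # ~438 dollars per usable TB
print(round(warehouse_cost_per_tb / hadoop_cost_per_tb))    # ~91x gap on software alone
```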

In a world where IT budgets grow at 2 percent per year – at best – and data is growing at 60 percent per year, something has to give. And it particularly has to give in a world where more data beats better algorithms, as Google has so brilliantly illustrated. This is why the Hadoop platform is going to see rapid – but probably not explosive – growth in the coming years.
