Advanced Computing in the Age of AI | Thursday, March 28, 2024

Big Hadoop Shops Are on a Hockey Stick Growth Curve 

As the dominant supplier of commercially supported Hadoop software, Cloudera has perhaps the best view of what is going on among large enterprises as they use the platform. To take the pulse of Cloudera's largest customers, EnterpriseTech sat down with the company's chairman and chief strategy officer, Mike Olson, and its chief executive officer, Tom Reilly, at the Strata/Hadoop World conference in New York.

Cloudera's largest and most established customers have moved out of the early adopter phase and think of Hadoop as one of their key platforms. However, Cloudera's business with these shops is not hitting a plateau, but accelerating. What is going on here?

Timothy Prickett Morgan: Let's start by talking about what is happening in the Hadoop base. How big are Hadoop clusters getting and how much data are customers storing? Are customers building Hadoop silos, or are they consolidating them? Are companies goosing clusters with solid state memory or looking at in-memory processing to speed up Hadoop performance?

Mike Olson: This is going to be kind of backwards to the way you asked me, but think about the way we have developed the Hadoop platform over the years. We began with MapReduce and in 2010 we added HBase. Last year at this time, I announced Impala, which went into beta, and it went into general availability earlier this year. This year we announced Solr search, and it has now also gone GA. Hadoop is batch, and these new capabilities let you get at that same data in a variety of ways, and you just have a better toolkit.

As that happens, the platform gets more valuable. Data naturally flows to it because it is an inherently reliable place to store data, and with the proliferation of these engines – MapReduce, SQL, and search – applications are migrating there as well. Stuff you used to have to do in an enterprise data warehouse or a document store, now can move over. Not all of it, but some of it. We have been actively adding these features as well as security and compliance that businesses need. That innovation has been crucial for enterprises to roll out this platform in a new way.

One of the challenges – and this is a Bill Joy quote – is that all of the smart people work somewhere else. We don't believe that Cloudera is in a position to – or ever could – drive all of the innovation in the platform. What we need to do is embrace a much broader community, open source and otherwise, that is driving innovation on the platform. That's why we created Cloudera Connect Innovators, a new program I am running, to find open source projects or interesting proprietary companies that make the platform better.

It is funny that you bring up in-memory, which is very hot right now. One of the hottest in-memory technologies right now is the Spark project out of the University of California at Berkeley. We now have a formal relationship with the guys out of Berkeley who built Spark. We are contributing code to the Spark project, and we are working with Databricks, which has commercialized Spark, to make sure that it works with our security framework, that it can be deployed with and operate properly under Cloudera Manager, and that we provide great front-line support. We are working with them, the deep experts, to provide the backline support so our customers can roll this platform out with confidence.

We have already identified other players we want to partner with, and this is really driven by a key observation we have made in the last year as companies roll out the Cloudera platform. If you are going to put 10 petabytes onto a platform, you are making a five to ten year commitment to that platform. If customers are going to make that bet, they need to know the platform has legs, that it is going to keep getting better.

The platform started out in 2008, when we started the company, as batch-oriented processing with scale-out storage – it had no security, no real-time capabilities. But it is getting better, steadily more real-time, more secure, and more operable with the addition of data lineage and compliance and so forth. In the beginning, Hadoop was stood up to the side running sort of greenfield, net-new workloads. Companies wanted to do behavioral analytics. Increasingly, as it has gotten more capable and more real-time, it has moved to the center of the datacenter. It is now serving across lots of our customers as an enterprise data hub. It is the first place that data lands, because you can afford to store it there in complete detail, full fidelity forever.

TPM: It is the same idea behind the enterprise data warehouse based on relational databases, but it is turned on its head a little. And by the way, that hub and spoke idea for data warehousing didn't really work, either . . .

Mike Olson: That's right, and it is because it forced you to choose what was important ahead of time. And the types of questions you could ask of that architecture and the types of data that it could store were just too limited. So as we proliferate these engines down at the platform level, we think we steadily increase the number of workloads you can run.

TPM: For customers who have been at Hadoop for a while, how much have their clusters grown? What is the shape of the growth curve? The early adopters from 2008 and 2009 are no longer early adopters, so what scale are they at today?

Mike Olson: That is exactly the right question to ask, and you have touched on it a few times. When you get to an enterprise data hub, you are no longer at hundreds of nodes. You have at least 1,000 nodes, and in many cases, you are into the thousands. The very largest publicly acknowledged cluster that we can talk about is 4,000 nodes.

But here's the thing. That growth curve is relentlessly up and to the right. And bear in mind that 4,000 nodes in 2013 is way more computing than 4,000 nodes would have been in 2009. Storage is denser, processors have more cores.

Tom Reilly: The important thing to consider here is that we are seeing enterprises going mainstream with Hadoop, and the most common way we are seeing it used is as this data hub. And that is why we have integrated and packaged all of the software to create this one place in the enterprise where data goes first.

Once it goes there, it can be pushed out to other operational systems, and with the Impala and search capabilities we have added, business users have exploratory use of that data.

TPM: Well, I mean, who doesn't want the SQL querying? And who doesn't want search? So it makes sense to package it all up. Nobody in the marketing department wants to write MapReduce algorithms in Java, do they?

Mike Olson: That is precisely the point. And whatever your use case is right now, if you are making that five or ten year bet, you may as well get all of the capabilities you can possibly get.

TPM: So how fast are these new capabilities ramping?

Mike Olson: Impala went GA in June, Search went GA in August. There are 5,000 enterprises running those two things together.

TPM: How many enterprises do you have in total? What is the percentage of the base that has adopted these new features?

Mike Olson: [Laughter] The attach rate for Search is insane, like 85 to 90 percent. I mean, all you have to do is type in a text box!

TPM: Well, we both know how conservative enterprises can be, so that kind of attach rate is unusual.

What you really need is another recession and that will really drive the adoption. The Great Recession did Cloudera a big favor. The dot-com bubble bursting did the same for Linux.

Mike Olson: That's exactly right. It launched Red Hat.

TPM: I remember the recessions in the late 1980s that started the rise of commercial minicomputers in the datacenter. When was IBM's AS/400 announced? The summer of 1988. When did the stock market crash? That October. When did IBM enter the Unix market and put its seal of approval on Unix? In the spring of 1990. And when did people start ripping some or all of their mainframes out? During the recession of 1990 and 1991.

Tom Reilly: I was actually selling AS/400s back then.

TPM: So how many customers do you have overall at this point, and how much revenue are these new features driving?

Mike Olson: We don't disclose our customer count or revenues, but I can tell you that the attach rate is phenomenal. I don't believe we have yet touched bottom on demand for Hadoop. The part of our business that is growing fastest this year is sales and go-to-market – the field reps, the technical people. We are at around 480 and if you stand where we are you can spit to 500, and that is better than double this time last year.

Tom Reilly: We have been roughly doubling every year, and as we look forward, we believe we are in an accelerating market.

If you look at customers, we can measure value based on how much data is going into and through this enterprise data hub. Customers are renewing at bigger cluster sizes than they originally bought.

TPM: Is it like Red Hat where every time their big customers renew, the checks are roughly 20 percent bigger?

Tom Reilly: I don't think we have enough history, but it is analogous. Our goal is to help every customer get to an enterprise data hub where all of their data goes to one place and it is accessible to everyone in the enterprise.

Mike Olson: This is the fact that really strikes me. When we were founding the company, we had to really think about what Moore's Law meant for our per-node pricing policy. If disks are going to get denser and processor cores are going to get more numerous, and if we price by the node, are we not screwed?

It turns out that nevertheless, even though the machines are getting more capable and bigger, people are buying more of them every year as data growth has outstripped Moore's Law.

TPM: That is certainly how it has worked in the supercomputing market. These labs ride Moore's Law inside the box and then they add more boxes, too, because they have to stay ahead of Moore's Law. Everybody else can ride along with Moore's Law, but if you want to be on the cutting edge, you have to get ahead of it.

Mike Olson: That's exactly right.

TPM: So among your largest, most established customers – forget the newbies for a moment – are they doubling their nodes or doubling their petabytes every year? What is the shape of the curve?

Mike Olson: The answer really is: It depends. The sample size is still small, but doubling or better is not hard. That doesn't mean that the information streaming into the company is doubling or better, but more that they are realizing the value of the hub and are taking stuff off of tape and other sources and putting it there.

Tom Reilly: Also, the use cases are growing. And this hub has value, feeding into data warehousing and other systems, and then customers realize it can be the compliance and auditing system because all of the events and data are there and it is secure. We have data lineage and data discovery now, and you can now satisfy all of your auditor's requests. That is almost a side benefit to collecting all of your data into an archive and feeding it into operational systems.

Mike Olson: Our most established customers have four years with us, and their spending with us is literally 10X higher now.

TPM: Did you plan this?

Mike Olson: The vision, from the beginning, was to get to real-time, and that you had to be able to put absolutely everything on the platform. I will say that the philosophical move in the market has taken us aback. The market is growing faster than we imagined it would.

It wouldn't be right to say that we planned it, but it is right to say that we believed.

TPM: When will we see enterprise customers break through 10,000 nodes with Hadoop clusters? With Moore's Law, will businesses need to deploy that many nodes? Because that is a whole different level of scale.

Mike Olson: Let me put that this way. That is just a doubling of what we see now. Merely a doubling – we are such computer science bigots here. . . . Yes, it will happen. That use case will be about a really big monster store, and who knows what analytics will be running there.

It is interesting that in the past five years, the biggest clusters have not doubled in size. They have gone from maybe 3,500 to 5,000 nodes. This is the size of the largest clusters that Facebook or Yahoo are running. But with broad adoption across way more industry verticals and the acceleration of data movement into these clusters, we will see this grow. You can just attack more problems if you have got more data.

TPM: Does every business, regardless of size, end up using a platform like this?

Mike Olson: We think so. Think about small and medium business adoption if we have all of these Cloudera Connect Cloud partners.

Customers have been running our infrastructure in their datacenters for a number of years, and they stand up a whole bunch of nodes behind their firewalls and operate them. Increasingly, they have been asking us for cloud options: public cloud, managed cloud, and private cloud.

You have been able to run Cloudera's full stack on Amazon Web Services for a while, and we have reasonable deployments on public cloud and that is just getting better as we go. In addition to that effort, we have an initiative called Cloudera Connect Cloud, which is us engaging with significant enterprise cloud providers to make sure they can deliver our product as a service over the Internet. SoftLayer, T-Systems, Verizon Business, and Savvis are able to resell our products, and others are able to do the same thing.

TPM: I know you have customers running Hadoop clusters based on CDH on Amazon Web Services, but are you able to run CDH as part of the Elastic MapReduce service? As far as I know, EMR just supports plain vanilla Apache Hadoop and the M3 and M5 Hadoop distros from MapR Technologies.

Mike Olson: We have been talking to Amazon for some time about that. The issue, from our point of view, is that our platform is very differentiated. It's not just the EMR services, but Impala for real-time SQL, Solr for search, and a rich suite of additional capabilities such as Sentry, our security infrastructure, and Data Navigator, our audit logging and security policy. That stuff just doesn't have an analog in the EMR world. You can absolutely deploy our software on EC2 and scale out storage on Amazon.

TPM: Do you have any idea how many of your customers are deploying their Hadoop clusters on the Amazon public cloud?

Mike Olson: The genuine answer is that because we have not been concentrating on cloud deployment, most of our customers have stood up our software in their own datacenters – in part that is because of us, but in part it is because if you are, for instance, a bank or an insurance company, you already have a datacenter and your data is all there. And early adopters tend to put the data management infrastructure where the data was. It is really a forward-looking appetite by customers that is really driving this cloud investment from us.

TPM: Are you seeing customers who only want to run from the public cloud, or only want to run certain kinds of work on public clouds?

Mike Olson: It is some of both. The nirvana is to be able to run in the datacenter and then burst to the cloud.

Lots of our customers who run in their own datacenters nonetheless want cloudy-style services, such as virtualization and elasticity. We are working with the OpenStack project to get our full stack so it runs well in that infrastructure.

TPM: When you deploy to AWS today, you don't have a choice but to deploy to virtualized servers. But there is going to be a day with OpenStack – and it isn't that far away if you look at what Canonical has done with Metal-as-a-Service and the bare metal provisioning that is coming with the future "Icehouse" release of OpenStack – where you will be able to deploy to bare metal or to a hypervisor like KVM or Xen. People will have the option of either virtualized or bare metal, and thus far, as far as I know, most Hadoop clusters are running on raw servers, not hypervisors.

Mike Olson: That's driven more by the territory as we find it than by what customers want. If you talk to enterprises – a big bank, a big hospital – they actually want virtualization. They absolutely do. The platform was not designed to run that way, and as a result people are deploying on bare metal right now. But with the work we are doing with OpenStack and in general the way the projects are evolving, we think we are going to embrace virtualization because it matters to customers.

TPM: It is funny. There are a bunch of people who are going in the other direction, using LXC lightweight Linux containers or ZeroVM, which is now at Rackspace Hosting. People are trying to get the hypervisor out of the way and get a thinner container.

Mike Olson: I very much want to have the virtualization layer impose less of a tax on the layer that we run in, and I am glad that this work is ongoing and I think that OpenStack and others are going to continue to drive in that direction. But it is likewise incumbent on us to embrace virtualization.

TPM: So are you going to join Project Savanna, which is being spearheaded by Mirantis, Hortonworks, and Red Hat, or Project Serengeti, which is a similar effort to deploy Hadoop on virtualized infrastructure from VMware?

Mike Olson: We know the folks at Mirantis really well, and we have been talking to them and we have been working with them on Savanna. Our key focus right now is engagement with the open source project. So this is not a Cloudera-Mirantis alignment, this is a Cloudera-OpenStack alignment. I have no comment on Serengeti.
