
Data Lakes and Overcoming the Waste of ‘Data Janitor’ Duties 

Data lakes solve a lot of problems in today's big data world. When properly designed, they serve as an efficient means of storing large volumes and varieties of data. A data lake's efficiency comes from inverting the approach of traditional structured data warehouses: rather than enforcing rigid schemas upon arrival, data lakes allow data to exist, unconstrained, in its native format.

This approach, commonly referred to as 'schema-on-read,' defers the time-consuming task of data modeling until the enterprise has a clear idea of what questions it wants to ask of the data. Given the influx of new data sources that enterprises must contend with, many of them unstructured and many whose value is not yet fully understood, this approach not only promotes an agile response to new sources, it also ensures that future analysis efforts aren't limited by schema decisions made earlier.
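
To make the contrast concrete, here is a minimal schema-on-read sketch in Python. The lake path, file layout and field names are hypothetical; the point is simply that raw records are stored exactly as they arrived, and a projection is applied only when a specific question is asked.

    # Schema-on-read sketch: nothing is modeled at ingest time.
    import json
    from pathlib import Path

    RAW_ZONE = Path("lake/raw/clickstream")  # hypothetical lake location

    def iter_raw_events():
        """Yield events exactly as they were ingested; no schema is enforced."""
        for path in RAW_ZONE.glob("*.json"):
            with path.open() as fh:
                for line in fh:
                    yield json.loads(line)

    def read_with_schema(fields=("user_id", "url", "timestamp")):
        """Apply a schema only at read time, projecting just the fields the
        current analysis cares about. Records missing a field are kept with
        None rather than rejected, so ingestion never constrains later questions.
        """
        for event in iter_raw_events():
            yield {name: event.get(name) for name in fields}

Because the raw files are never altered, a later analysis can define an entirely different projection without re-ingesting anything.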

So that's the good news: data lakes hold the promise of a fundamentally better way of tackling the demands of big data. And we have the technology and processing power to do it in the cloud, at the massive scale of storage and computing that today's big data workloads require.

Sounds great, but…

Building a data lake and pouring data into it is the easy part. The hard part is managing that data so that it's useful. Without schemas or relational databases to provide context and consistency, the real challenge enterprises face is finding other ways to link and correlate the widely disparate data types stored in their data lakes — or risk them becoming a collection of siloed “data puddles.” In fact, it's been estimated that data scientists spend up to 80 percent of their time as “data janitors,” cleaning up raw data in preparation for analysis.1 Those highly trained experts should be focusing on analysis and insights, not merging and de-duplicating records.
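
As a rough illustration of what that janitorial work looks like in practice, here is a short pandas sketch (file and column names are hypothetical) that normalizes a join key, then merges and de-duplicates customer records from two sources:

    # "Data janitor" work: normalize, merge and de-duplicate raw records.
    import pandas as pd

    crm = pd.read_csv("crm_customers.csv")   # hypothetical extract
    web = pd.read_csv("web_signups.csv")     # hypothetical extract

    # Normalize the join key so near-duplicates actually match.
    for df in (crm, web):
        df["email"] = df["email"].str.strip().str.lower()

    # Combine the sources, then keep the most recently updated record per key.
    customers = (
        pd.concat([crm, web], ignore_index=True)
          .sort_values("updated_at")
          .drop_duplicates(subset="email", keep="last")
    )

Multiply that by dozens of feeds and formats and the 80 percent figure starts to look plausible.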

Not to mention, it's almost impossible for an enterprise to reliably predict its future analytics needs. That means a data lake designed for today's storage and analytics needs may be completely ill-equipped to handle tomorrow's. To integrate new data types and solutions, the enterprise would need to constantly reconfigure the lake, consuming precious time and resources with each new integration. Enterprises need a solution now — and they need it fast.

Why the urgency?

Because readying data for analysis is the most time-consuming part of any big data initiative, and certainly the most challenging part of analyzing data stored in data lakes, the whole process requires an entirely new approach. Data has become a key competitive differentiator for companies. For some it's even more valuable than their core product. Take Facebook, whose user data is orders of magnitude more valuable than its newsfeed. Uber and Airbnb have recognized the tremendous value of their data about users' travel habits, which could far exceed the value of their actual services.

Why is this so critical right now? Four reasons:

  • We’re facing a continued dramatic escalation in the volume and variety of data inflow, which legacy systems are unprepared to handle.
  • Enterprises can’t predict their future data needs, but do know they’ll need to be able to react even faster than they do now. Current systems already can’t keep up — they need far greater agility.
  • Conventional data lakes that depend on relational databases are simply too clunky. As new business questions arise or new systems are brought to bear (layering on a graph database or a search engine, or investigating a complex business question, for example), enterprises need a solution that can create just-in-time data pools, grouping specialized data sets together within the larger lake without a full extraction, something legacy systems cannot do.
  • The lines between data integration and management are blurring. This should be a symbiotic process, for which conventional data lake environments are not equipped. It calls for a solution that marries the two, allowing them to work in harmony.


dPaaS: A Future-Ready Approach to Big Data

dPaaS, or Data Platform as a Service, has emerged as a much more agile solution that unifies the closely related operations of integration, data management and data storage (i.e., data lake hosting). This new, data-centric approach is critical for creating a foundation that can accommodate all types of data formats and analytics applications, now and in the future.

Unlike other solutions, which rely on hard-coded, point-to-point connections to link the disparate databases that make up a data lake, dPaaS persists every iteration of ingested and integrated data in the central repository (i.e., the data lake), along with robust metadata that organically grows the lake. Data management functions such as cleansing, deduplication, and match-and-merge also take place at this stage. With metadata attached, the data can be organized on the way in, moving from its raw format in the lake to specialized data stores. Investigators can then comb through and transform the data for the queries needed right now, without modifying the core data, leaving it free to be transformed again in a different way to answer the next question.
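
A highly simplified sketch of that ingest-and-manage flow might look like the following; the lake_put and curate helpers are illustrative stand-ins, not an actual dPaaS API:

    # Persist every iteration in the lake, attach metadata, and write a
    # curated copy alongside the untouched raw data.
    import datetime
    import hashlib
    import json
    from pathlib import Path

    LAKE = Path("lake")

    def lake_put(zone: str, name: str, payload: bytes, **metadata) -> Path:
        """Store a payload in the lake together with descriptive metadata."""
        target = LAKE / zone / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(payload)
        meta = {
            "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "checksum": hashlib.sha256(payload).hexdigest(),
            **metadata,
        }
        (target.parent / (target.name + ".meta.json")).write_text(json.dumps(meta))
        return target

    def curate(records, key):
        """Cleanse and de-duplicate on the way in; the raw copy is never modified."""
        seen, curated = set(), []
        for rec in records:
            k = str(rec.get(key, "")).strip().lower()
            if k and k not in seen:
                seen.add(k)
                curated.append(rec)
        return curated

    raw_bytes = Path("incoming/customers.json").read_bytes()  # hypothetical feed
    lake_put("raw", "customers.json", raw_bytes, source="crm")
    cleaned = curate(json.loads(raw_bytes), key="email")
    lake_put("curated", "customers.json", json.dumps(cleaned).encode(), source="crm")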

When the dPaaS solution is delivered with a microservices approach, it enables even greater analytical flexibility and power, improving the usefulness of existing data along with the stability of the entire system. Unlike monolithic, packaged software that offers 100 features when you really only need five, microservices let enterprises customize their data lake utilization processes, adding only the functionality they need, when they need it, and easily removing functions they no longer require. Even better, this can be done on demand, with no complex integration required. That means instead of modifying their business processes to fit a vendor's application, enterprises can keep the process and use the microservices they need to get the job done.
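
The underlying idea can be sketched in a few lines: each data-management function is a small, independent step, and a pipeline is composed from only the steps a given job needs. The step names below are illustrative, not part of any particular vendor's offering.

    # Compose a pipeline from only the functions this job requires.
    from typing import Callable, Iterable

    Record = dict
    Step = Callable[[Iterable[Record]], Iterable[Record]]

    def cleanse(records):
        """Trim and lowercase string fields."""
        for r in records:
            yield {k: v.strip().lower() if isinstance(v, str) else v
                   for k, v in r.items()}

    def deduplicate(records, key="email"):
        """Keep the first record seen for each key."""
        seen = set()
        for r in records:
            if r.get(key) not in seen:
                seen.add(r.get(key))
                yield r

    def run_pipeline(records: Iterable[Record], steps: list[Step]) -> list[Record]:
        for step in steps:
            records = step(records)
        return list(records)

    # Today's job needs only cleansing and de-duplication; tomorrow's can add
    # or drop steps without touching the others.
    result = run_pipeline(
        [{"email": " A@x.com "}, {"email": "a@x.com"}],
        steps=[cleanse, deduplicate],
    )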

For even greater efficiency, these integration and data management functions can be offered as managed services, which frees data scientists and other experts at the organization to focus on the end goal: analysis. It also puts the onus on the vendor offering the managed service to handle security, compliance and maintenance, avoiding the risks of iPaaS solutions, where self-service integration makes governance harder to control and can open security gaps.

dPaaS Lets Enterprises Get to Work

With a cleansed and enriched lake of data now at the ready, and time available for data scientists to devise the queries, data analysis can be performed ad hoc using microservices and configuration-based systems. Schemas and their outputs are modeled on the fly, allowing rapid movement of data from the data lake into formats appropriate for the task at hand: a graph database one week, a time series database the next, or a relational or key/value store the week after that.
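
A small sketch of that on-the-fly modeling, using Python's built-in sqlite3 module and hypothetical record fields: the same curated records are projected into a relational table for one question and a key/value view for the next, without altering the lake itself.

    # Project the same lake records into whichever store fits this week's task.
    import sqlite3

    records = [
        {"user_id": "u1", "country": "US", "spend": 42.0},
        {"user_id": "u2", "country": "DE", "spend": 17.5},
    ]  # stand-in for records read from the curated lake zone

    # This week: a relational view for SQL-style aggregation.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE spend (user_id TEXT, country TEXT, spend REAL)")
    conn.executemany(
        "INSERT INTO spend VALUES (:user_id, :country, :spend)", records
    )
    by_country = conn.execute(
        "SELECT country, SUM(spend) FROM spend GROUP BY country"
    ).fetchall()

    # Next week: the same records as a key/value projection for fast lookups.
    kv_store = {r["user_id"]: r for r in records}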

Exciting new technological advances such as non-relational databases, distributed computing frameworks, and, yes, data lakes are all getting us closer to a data-inspired future. But, there's still no substitute for good old-fashioned data management. In fact, the need for diligent data governance is more important than ever, and the dPaaS approach to data operations makes sure this critical piece of the puzzle is not overlooked.

With dPaaS, the real promise of Big Data is within reach, giving enterprises the ability to actually use their data for maximum impact and competitive advantage.

1. The New York Times, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” August 2014

Brad Anderson is vice president of big data informatics at Liaison Technologies.
