Data Prep: Easing Data Scientists’ ‘Janitorial Work’
"Data Scientist": is there a more formidable job title in the IT industry? Not Data Manager, not Data Analyst, but Data Scientist. They command top salaries. They are in high demand. They know things, do things, discuss (among themselves) things that the rest of us can’t begin to understand or even create the false impression of understanding. When we hear the oft-repeated lament that data scientists spend 80 percent of their time on the preparation of data for analytics workloads, a task (we're told) for which they are woefully overqualified, we shake our heads knowingly. Yes, we agree, what a sad waste of their time. Although, having said that, what it is they should be doing we can’t say with a high degree of confidence.
Actually, we know in a general sort of way what they should be doing: using their arcane knowledge and rarefied skills to develop analytics applications that pull insight out of massive data lakes, preferably in real time. The insights are there in the lakes, waiting to be pulled to the surface like so many trout. Yet for too many organizations the trout elude data scientists' analytics hooks, in part because too little of their effort goes to extracting insight (i.e., “algorithmic work”) and too much is spent on menial data prep (a.k.a. “janitorial work”).
Data prep itself is the process of “wrangling” diverse datasets in diverse formats so they can be compiled within a single data warehouse and collectively analyzed for a variety of downstream uses, including business analytics and machine learning. The idea is to cleanse, shape and conform a kitchen sink of data – quickly, easily and with increasing efficiency.
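The cleanse-shape-conform cycle described above can be sketched in a few lines. The following is a minimal illustration using pandas with invented retail data; the column names, date formats and fill rules are all hypothetical, not taken from any product mentioned in this article.

```python
# A minimal sketch of typical "wrangling" steps: conform column names,
# shape dates that arrive in different formats, and cleanse missing values.
# All datasets and column names here are illustrative.
import pandas as pd
from io import StringIO

# Two sources describing the same thing in different shapes and formats.
store_a = pd.read_csv(StringIO(
    "Store,Date,units_sold\nA1,2017/03/01,120\nA2,2017/03/01,"))
store_b = pd.read_csv(StringIO(
    "shop;day;qty\nB7;01-03-2017;95"), sep=";")

# Conform: map each source's column names onto one shared schema.
store_a = store_a.rename(columns={"Store": "store", "Date": "date",
                                  "units_sold": "units"})
store_b = store_b.rename(columns={"shop": "store", "day": "date",
                                  "qty": "units"})

# Shape: parse dates despite differing formats.
store_a["date"] = pd.to_datetime(store_a["date"], format="%Y/%m/%d")
store_b["date"] = pd.to_datetime(store_b["date"], format="%d-%m-%Y")

# Cleanse: combine, then resolve missing values under an explicit rule.
combined = pd.concat([store_a, store_b], ignore_index=True)
combined["units"] = combined["units"].fillna(0).astype(int)
```

Tools in this category automate and suggest steps like these interactively rather than requiring them to be hand-coded.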
Howard Dresner, founder and chief research officer at Dresner Advisory Services, said the data wrangling market is gaining increased attention because analysis of massive and diverse datasets, the connecting of dots among them, is the essence of big data and analytics.
“We are seeing momentum continue to build around end user data preparation as a critical component to the success of companies’ business intelligence and analytics strategies,” he said. “Across three years of data, the importance of end-user data preparation remains consistently high.”
In a substantial step forward for the wrangling software category, Google and wrangling specialist Trifacta have announced Google Cloud Dataprep, which embeds Trifacta's Wrangler and Photon Compute Framework, and natively integrates with Google Cloud Dataflow for, according to the companies, “serverless, auto-scaling execution of data preparation recipes.” It enables Google Cloud users to access, explore and prepare diverse data in such services as Google Cloud Storage and Google BigQuery, for analytics and other uses.
The product, scheduled for general availability late next month, takes aim at the cloud analytics market, expected to grow to more than $20 billion by 2020.
According to Google, Cloud Dataprep includes:
- Anomaly detection: Google Cloud Dataprep detects schema, type, distributions and missing/mismatched values, and utilizes machine learning to suggest corrective data transformations.
- Drag-and-drop development: An intuitive user experience that eliminates the need for coding so that users can focus on analysis.
- Integration: Users can more securely read raw data from Google Cloud Storage or BigQuery, or upload from their local machine and write back cleaned data into BigQuery for further analysis.
- Managed infrastructure: IT resource provisioning and management are handled automatically and elastically.
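To make the anomaly-detection bullet concrete, here is a toy version of the kind of column profiling such a tool performs: inferring a column's dominant type, flagging values that don't conform, and proposing a corrective transformation. This is purely illustrative and is not Google's or Trifacta's actual implementation; the threshold and sample values are invented.

```python
# Toy column profiler: detect the dominant type of a column, flag
# mismatched values, and suggest a corrective transformation.
# Illustrative only -- not the actual Cloud Dataprep logic.
import pandas as pd

col = pd.Series(["12.5", "7", "n/a", "9.25", ""])

parsed = pd.to_numeric(col, errors="coerce")   # non-numeric -> NaN
mismatched = col[parsed.isna()]                # values that failed to parse

share_numeric = parsed.notna().mean()          # fraction parseable as numbers
suggestion = None
if share_numeric >= 0.6:                       # arbitrary illustrative threshold
    # Dominant type is numeric, so a plausible "suggested transformation"
    # is to coerce the column and treat the nonconforming rest as missing.
    suggestion = "convert to float; mark nonconforming values as missing"
```

A production tool layers machine learning over signals like these to rank which suggested transformation to surface first.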
“It’s part of a broad recognition that as you’re putting more data into cloud, and in particular into Google Cloud, you need a way to take the data from raw to refined so you can get productive with that information,” Trifacta CEO Adam Wilson told EnterpriseTech. “Whether you’re doing basic analytics, loading BigQuery or doing advanced machine learning, it’s important to have nice, clean, structured data that you can operate on, and Trifacta is a key part of delivering that solution to the Google Cloud ecosystem.”
Wilson said Trifacta’s architecture lends itself to leveraging the Google infrastructure as well as to wrangling within broad cloud deployments at scale because the company has focused on the high end of the market since its founding in 2012.
“We’ve felt that if we can wrangle data at scale – the really big messy, complicated stuff – we can always move downstream to smaller, more structured data,” Wilson said. “Our work with Google represents a size and scale that is somewhat unrivaled on the planet, so you can imagine the effort that went into making sure our product was not just going to interoperate with all the other Google Cloud services but was also able to leverage the compute infrastructure appropriately….”
In addition to the Google Cloud integration, Trifacta announced today an extension to its partnership with collaborative data company Alation to offer an integrated self-service data cataloging (aka data discovery) and wrangling solution, enabling users to access the combined features within a single interface.
"Organizations are embracing self-service analytics but struggle with the distributed nature of self-service analytic projects,” said Satyen Sangani, CEO, Alation, citing joint Alation and Trifacta customers eBay, MarketShare and Munich Re. “Analysts need the tools that allow them to work productively and in a more collaborative manner with data experts -- from finding, understanding and trusting their data, to preparing that data for analysis. Our partnership with Trifacta enables analysts to accomplish all of that within a single solution across databases and Hadoop -- making their work much more efficient."
Trifacta's wrangling technology, in both free and paid versions, is used by tens of thousands of people at some 4,500 companies worldwide, according to Wilson. As Trifacta evolves its technology partnerships and capabilities, the goal is to automate more aspects of data prep (and in the process free up data scientists to focus on still more esoteric work). Users of the free product generate anonymized data that Trifacta collects in the cloud, Wilson explained, which in turn “helps us to train our algorithms so the system gets smarter” in making interactive suggestions, guiding end users as they structure, shape, cleanse and transform data.
“It’s what we call ‘interactive discovery,’ making sure you get eyes on the data very early in the process so you can start to understand things like consistency, conformity and completeness of that data," Wilson said, "so you can do the appropriate clean-up on the information to take it from raw to refined.”
Trifacta customers include the Centers for Disease Control and Prevention, Kaiser Permanente, Nordea Bank and The Royal Bank of Scotland. PepsiCo’s CPFR (Collaboration, Planning, Forecasting, Replenishment) team uses Trifacta in support of its supply chain strategy. The team works with retailers to order the right quantities of product for their warehouses and stores. It’s a delicate balance: supply too much product and PepsiCo risks returns and wasted resources; supply too little and it loses sales and profits while driving consumers to competing products. As a result, refining sales forecast reports is an ongoing process.
Under the old system, analysts had to build a CPFR tool that combined all retailer sales data with PepsiCo supply data (along with external, unstructured data such as weather forecasts, according to Wilson), a process that took up to several months. Customers provided PepsiCo with data such as warehouse inventory, store inventory and point-of-sale inventory, which PepsiCo reconciled with its own shipment history and the amount of product on order. Each customer standardized data in its own way, which corresponded neither with other customers’ methods nor with PepsiCo’s systems, resulting in large quantities of messy data that made wrangling painful and time consuming.
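The reconciliation problem described above, each retailer identifying products its own way before inventory can be matched to shipment history, can be sketched as follows. The retailer codes, internal IDs and mapping here are all invented for illustration; in practice, building that mapping is itself part of the wrangling work.

```python
# Hedged sketch of multi-source reconciliation: conform retailer-specific
# product codes to internal IDs, then join inventory against shipments.
# All data and mappings are invented for illustration.
import pandas as pd

retailer_inv = pd.DataFrame({
    "sku": ["PEP-COLA-12", "pep_chips_06"],   # retailer-specific codes
    "on_hand": [40, 15],
})
shipments = pd.DataFrame({
    "product_id": ["C12", "H06"],             # hypothetical internal IDs
    "shipped": [100, 60],
})

# Conform the retailer's codes to the internal product IDs.
code_map = {"PEP-COLA-12": "C12", "pep_chips_06": "H06"}
retailer_inv["product_id"] = retailer_inv["sku"].map(code_map)

# Reconcile: units shipped but no longer on hand at the retailer.
recon = retailer_inv.merge(shipments, on="product_id")
recon["sold_or_in_transit"] = recon["shipped"] - recon["on_hand"]
```

Multiply this by dozens of retailers, each with its own conventions, and the months-long build times described above become easy to understand.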
PepsiCo decided to implement Hadoop as a landing and staging environment, and the CPFR team selected Trifacta to serve as an interface to access and transform this data. Under the new system, reports run directly on Hadoop, enabling analysts to directly manipulate data using Trifacta and publish the prepared data sets to Tableau for visualization and broader consumption.
PepsiCo reports it has cut end-to-end analysis run time by up to 70 percent, allowing analysts to spend most of their time analyzing and sharing the right story about sales data with retailers instead of manually piecing data together. In addition, report build times have been reduced by up to 90 percent. PepsiCo also cited Trifacta’s ability to reproduce patterns, making the building of CPFR tools more efficient and automated.
“Trifacta learns our patterns,” said Mike Reigling, PepsiCo supply chain data analyst, “which means as we build these CPFR tools it actually becomes faster because the tool knows what we’re trying to accomplish.”