Nvidia Platform Pushes GPUs into Machine Learning, High Performance Data Analytics
GPU leader Nvidia, generally associated with deep learning, autonomous vehicles and other higher-end AI-related workloads (and gaming, of course), is launching an open-source, end-to-end GPU acceleration platform and ecosystem directed at machine learning and data analytics, domains that have until now run largely on CPUs.
GPUs already operate in the analytics space, of course; examples include Kinetica’s GPU-accelerated database and OmniSci’s (formerly MapD) Core database system for big data query and visualization. Now, as it pushes GPU acceleration into ML and HPDA (high performance data analytics), Nvidia reports that its RAPIDS platform, using the XGBoost machine learning algorithm for training on an NVIDIA DGX-2 supercomputer, delivers speed-ups of 50x compared with CPU-only systems.
RAPIDS brings with it an ecosystem from the open-source community, including Databricks (a web-based platform for big data processing in the cloud using Apache Spark) and Anaconda (an open source distribution of the Python and R programming languages for data science and machine learning), and tech companies such as Hewlett Packard Enterprise, IBM and Oracle.
The RAPIDS suite of open-source libraries has been under development for the past two years by Nvidia engineers working with open-source contributors, including Apache Arrow (a data layer for in-memory analytics), Pandas and scikit-learn, and it’s designed to give scientists the tools to run the entire data science pipeline on GPUs. RAPIDS builds on popular open-source projects by adding GPU acceleration to the Python data science tool chain.
“We’re building on the community of Python users… and more recently built around… Apache Arrow and in memory data format and some other tools that allow us to scale from using just one GPU to multiple GPUs in the system, to multiple node and clusters of GPUs,” said Jeff Tseng, head of product for AI infrastructure at Nvidia, in a pre-announcement conference call. “These technologies are driving RAPIDS’ ability to integrate into today’s most popular data science workloads and accelerate them…. We’re going to be focused on business data, on tabular data, and we’re going to accelerate machine learning data prep.”
To bring additional ML libraries and capabilities to RAPIDS, the company is working with open-source contributors Anaconda, BlazingDB, Databricks, Quansight and scikit-learn, as well as Wes McKinney, head of Ursa Labs and creator of Apache Arrow and Pandas, the Python data science library.
“At Databricks, we are excited about RAPIDS’ potential to accelerate Apache Spark workloads,” said Matei Zaharia, founder of Apache Spark. “We have multiple ongoing projects to integrate Spark better with native accelerators, including Apache Arrow support and GPU scheduling with Project Hydrogen. We believe that RAPIDS is an exciting new opportunity to scale our customers' data science and AI workloads.”
Initial RAPIDS benchmarking results indicate data scientists can reduce training times from days to hours, or from hours to minutes, depending on the size of their dataset, according to Nvidia.
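To put the "days to hours" claim in concrete terms, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes the 50x speed-up Nvidia reports for XGBoost training on a DGX-2 applies as a flat factor, which will not hold for every workload, and the CPU-only training times and the helper name accelerated_time are hypothetical, chosen for illustration.

```python
from datetime import timedelta

# Assumption: the 50x figure Nvidia reports for XGBoost training on a DGX-2
# is applied here as a flat factor; real speed-ups vary by workload and data size.
REPORTED_SPEEDUP = 50


def accelerated_time(cpu_time: timedelta, speedup: float = REPORTED_SPEEDUP) -> timedelta:
    """Return the training time implied by dividing a CPU-only time by a speed-up factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return cpu_time / speedup


if __name__ == "__main__":
    # Hypothetical CPU-only training times, chosen only for illustration.
    examples = [("2 days", timedelta(days=2)), ("5 hours", timedelta(hours=5))]
    for label, cpu_time in examples:
        print(f"{label} on CPU -> {accelerated_time(cpu_time)} at {REPORTED_SPEEDUP}x")
```

Under those assumptions, a two-day CPU-only training run shrinks to just under an hour (57 minutes 36 seconds), and a five-hour run to six minutes, which is consistent with the ranges Nvidia describes.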
“Data analytics and machine learning are two of the biggest high performance computing applications that have not been accelerated – until now,” said Jensen Huang, founder and CEO of Nvidia. “The world’s largest industries use a sea of servers to study vast quantities of data to make fast, accurate predictions, so data analytics and machine learning can directly impact the bottom line. Building on CUDA and its global ecosystem, and working closely with the open-source community, we have created the RAPIDS GPU acceleration platform. It integrates seamlessly into the world’s most popular data science libraries and workflows to speed up machine learning. We are turbocharging machine learning like we have done with deep learning.”
The two aforementioned GPU players in the data analytics space, OmniSci and Kinetica, also touted RAPIDS’ potential to bring scale and end-to-end GPU capabilities to such efforts.
“Data scientists use OmniSci on NVIDIA GPUs to accelerate data exploration and feature engineering when creating machine learning models,” said Todd Mostak, CEO and co-founder, OmniSci. “Now our users can interactively query and visualize data at scale in OmniSci, and then pipe the results into RAPIDS’ open-source libraries, enabling powerful end-to-end data science workflows. Together, NVIDIA and OmniSci make it much faster to build and iterate on models, resulting in increased accuracy and quicker time to deployment.”
Nima Negahban, co-founder and CTO of Kinetica, said, “The RAPIDS suite of open-source libraries is a significant improvement in enabling data scientists to leverage the power of the GPU across their model development toolchain. RAPIDS can dramatically simplify and optimize training and improve model accuracy, without any significant logical redesign effort on the part of the data scientist. We’re excited to partner with NVIDIA in this journey to democratize AI — with NVIDIA driving model development and training and Kinetica driving operationalization and deployment of those models, enabling enterprises to gain maximum insight from their data.”
From the HPC industry, Rollin Thomas, Python data analytics lead at NERSC, the National Energy Research Scientific Computing Center, said RAPIDS is a potentially significant new scientific tool.
"NERSC supports more than 7,000 researchers at universities, national labs and in industry. They increasingly want productive, high-performance ways of interacting with their data from complex science simulations or experimental and observational facilities like particle accelerators and telescopes. We look forward to working with Nvidia to put new high-performance Python data analytics tools like RAPIDS in the hands of our users to accelerate their pace of discovery across many scientific disciplines."
Access to the RAPIDS suite of libraries is available at http://www.rapids.ai, where the code is being released under the Apache license. Containerized versions of RAPIDS are available immediately on the NVIDIA GPU Cloud container registry.
Nvidia said RAPIDS-based systems are under development at Cisco, Dell EMC, HPE, IBM, Lenovo and Pure Storage.