New Data Analytics Benchmark Puts Stopwatch to Hadoop-based Systems
Ladies and gentlemen, start your clusters.
A new data analytics and machine learning benchmark has been released by the Transaction Processing Performance Council (TPC) measuring real-world performance of Hadoop-based systems, including MapReduce, Apache Hive, and Apache Spark Machine Learning Library (MLlib).
Called the TPCx-BB benchmark and downloadable at the TPC site, it executes queries frequently performed by companies in the retail industry running customer behavior analytics.
HPE has already seized the benchmark pole position, as it were, issuing results showing that its current-generation ProLiant DL380 G9 server, using Xeon E5-v4 processers, came in with a 27 percent performance gain over the previous-generation DL380 server.
The TPCx-BB (BB stands for “Big Benchmark”) is designed to incorporate complex customer analytical requirements of retailers, according to TPC. Whereas online retailers have historically recorded only completed customer transactions, today deeper insight is needed into consumer behavior, with relatively straightforward shopping basket analysis replaced by detailed behavior modeling. According to the TPC, the benchmark compares various analytics solutions in a real-world scenario, providing performance-vs.-cost tradeoffs.
“With the advent of so many big data and analytics systems – from an array of hardware and software vendors – there is immediate demand for apples-to-apples, cross-platform comparison,” said Bhaskar Gowda, chairman of the TPCx-BB committee (and senior staff engineer at Intel’s Data Center Group).
Gowda told EnterpriseTech the benchmark tests various data management primitives – such as selects, joins and filters - and functions. “Where necessary, it utilizes procedural programs written using Java, Scala and Python. For use cases requiring machine learning data analysis techniques, the benchmark utilizes Spark MLLIB to invoke machine learning algorithms by providing an input dataset to the algorithms processed during the data management phase.”
He said the benchmark exercises the compute, I/O, memory and efficiency of various Hadoop software stacks (Hive, MapReduce, Spark, Tez) and runs tasks resembling applications developed by an end-user with a cluster deployed in a datacenter, providing realistic usage of cluster resources.
It also utilizes, when necessary, procedural programs written using Java, Scala and Python, Gowda said. For machine learning use cases, the benchmark utilizes Spark MLLIB to invoke machine learning algorithms during the data management phase.
Other phases of the benchmark include:
Load: tests how fast raw data can be read from the distributed file system, permuted by applying various optimizations, such as compression, data formats (ORC, text, Parquet).
Power: tests the system using short-running jobs with less demand on cluster resources, and long-running jobs with high demand on resources.
Throughput: tests the efficiency of cluster resources by simulating a mix of short and long-running jobs, executed in parallel.
For the record, according to HPE, the 12-node Proliant cluster used in the first test run on the benchmark had three master/management nodes and nine worker nodes with RHEL 6.x OS and CDH 5.x Hadoop Distribution. It ran a dataset of about 3TB. Comparing current- versus previous-generation Proliant servers, HPE reported a 27 percent performance gain and cost reduction of 9 percent.