Bringing Big Data and HPC Together
What does the intersection of big data and high performance computing look like? About 50 technology leaders from across industries got an exclusive glimpse during Tabor Communications’ inaugural LBD + EHPC event at the Ponte Vedra Inn & Club in Florida last week.
Readers of Datanami and EnterpriseTech will recall that Leverage Big Data and Enterprise HPC used to be separate events focused on their respective audiences. This year Tabor Communications decided to merge them, reflecting the movement of big data and HPC toward convergence.
That doesn’t mean they’re taking the same route. We see the big data camp favoring the discovery of correlations hidden in data using machine learning algorithms, while HPC traditionally uses highly detailed modeling and simulation to achieve its goals. But increasingly, both camps are aiming to use emerging techniques and technologies, such as deep learning on GPUs, to inform decision-making, often on live data with low latencies driven by real-time requirements.
This convergence is what attracted HPC directors, data scientists, enterprise architects, CTOs and other technology decision-makers to the Jacksonville, Florida, area last week. Tabor Communications, nGage Events, and the Ponte Vedra Resort played host to delegates from companies like Credit Suisse, Ford Motor Company, Gulfstream Aerospace, Erickson, Cummins, Samsung, and UnitedHealth Group.
Drinking Deep Learning Kool-Aid
There was about a 50-50 split between the big data and traditional supercomputing camps at LBD+EHPC 2017, but nearly everybody sees the disciplines coming together in light of evolving technology and changing business requirements. One of the true believers in big data-HPC convergence is Jay Boisseau of Dell EMC.
“I’m an HPC person who’s fascinated by the possibilities of augmenting intelligence with deep learning techniques,” Boisseau said during his keynote last week. “I’ve drunk the ‘deep learning Kool-Aid.’” (See HPCWire Managing Editor Tiffany Trader’s story, “Data-Hungry Algorithms and the Thirst for AI,” for more on Boisseau’s talk.)
Another company looking to use hyper-scale approaches to big data computation is Ancestry.com. During a keynote address, Tyler Folkman, a senior data scientist at Ancestry.com, told attendees how the company uses an array of technologies and techniques to tell people more about their family history and genealogy, often under tight service level agreements (SLAs).
Folkman detailed the storage, management and analytics challenges that come with having 20 billion digitized historical records, 80 million family trees, 175 million sharable photos, documents and written stories, and a total of more than 10 petabytes of data. “We’re building a data culture at Ancestry,” Folkman said.
Much of this data is stored in Ancestry.com’s “Big Tree,” which consists of a graph database that customers are free to roam across for the purpose of identifying people they might be related to. Folkman told the audience how Ancestry.com is using machine learning technologies like H2O.ai, Spark, and Scikit-learn to build predictive capabilities into its service and to make recommendations about possible relatives.
Asya Shkylyar, a senior scientific consultant for infrastructure at BioTeam, reminded the LBD+EHPC audience that details matter. Too often, advanced scale computing practitioners overlook technical aspects of computing, whether it’s properly configuring Kerberos, implementing schedulers, or synchronizing the clocks, Shkylyar said. It was a pertinent reminder that we must first master the complexity of technology before we can do great things with data.
Feed the Beast
This was followed by a presentation from Bret Costelow, vice president of worldwide sales and business development at HPC storage specialist DataDirect Networks (DDN) and a newcomer to the company. Costelow reviewed storage challenges, and DDN’s product line, as they apply to organizations using supercomputers on the Top500 list of the world’s most powerful HPC systems, along with hyperscalers, Internet companies, and telecom companies that store and manage enormous volumes of data.
Costelow emphasized the need for what others have called “balanced” high-performance systems. “Compute, fabric and storage must scale equally,” he said. “Without adequate storage performance, compute performance goes unused.” He reviewed the emergence of new technologies, such as 3D XPoint, which delivers memory that’s fast, inexpensive and non-volatile.
All of this will bear on use cases across multiple industries in which the volumes of data involved are exploding – including, for example, the rollout of video surveillance cameras in HD formats, which will generate 80,000X the amount of data produced today, he said.
Looking at the broader picture, Costelow said the center of gravity in the data center is moving from compute-centric to data-centric, with a corresponding use case emphasis on analytics and the IoT. But with computers having hundreds of TBs of memory, “feeding the beast” is a major challenge. “Moving data will become the primary job in the data center,” he said. “We need to do it quicker and faster.”
Your Datanami managing editor hosted a cross-industry panel on big data challenges that are pushing the limits of organizations’ IT infrastructures. Bob Neuerburg, senior enterprise architect at Expedia, described the need to develop two separate customer-facing data services at scale: one that enables customers to make travel arrangements in real time, and a second to serve as a trip advisor recommending places to go and things to do within the customer’s travel location based on individual preferences.
“We act like a start-up at Expedia,” Neuerburg said. “We’re always pushing hard in all directions. We’re agile, so we can fail fast and try something else new.”
Michael Chupa, an HPC engineer at Bristol-Myers Squibb, emphasized the range of his organization’s data management challenges, including privacy, security, and multi-tenancy. While pointing out these difficulties, he also said they are surmountable and noted an inspirational quote from philosopher Joseph Campbell: “The psychotic drowns in the same water the mystic swims in with delight.”
Tassos Sarbanes, a data architect at Credit Suisse, spotlighted the challenge of scaling regulatory compliance. Sakthivel Madhappen, IT infrastructure operations manager at Inova Health Systems, discussed the challenges of hunting for cures for major diseases via data analytics, and offered thoughts on which is the greater hindrance to innovation: technology limitations or staff skills.
Security, Security, Security
Day three of LBD+EHPC 2017 dawned with security in the air. Sagar Gaikwad, who manages the Big Data CyberTech group at Capital One, described how the bank is using big data tech to prevent fraud.
Like all credit card companies, Capital One must intelligently parse and manage transaction requests and weed out the security threats as they come in. However, old rules-based systems can’t keep up with the volume of transactions. The company sought to buy a shrink-wrapped machine learning system that could keep up with the volume, but found that the available products satisfied only 80% to 90% of its needs.
So instead, Capital One built its own system. Purple Rain, as the Capital One system is called, combines a variety of tools, including Hadoop, Elasticsearch, Kafka, Spark, Storm, and NiFi, to enable machine learning algorithms to detect fraudulent transactions in real time. (Look for a more in-depth Datanami review of Purple Rain in the days to come.)
A highlight of the conference was a presentation from former IDC industry analyst (now of Hyperion Research) Bob Sorensen, who spoke on “The Emerging Role of Advanced Computing in Cybersecurity.”
Sorensen began by reviewing the woeful state of security readiness at most large organizations, based on a study of 62 companies, many of which are (mis)guided by the belief that “we haven’t been breached yet, so we must be doing something right.” Sorensen said most organizations remain under-prepared even though best practices available today could help.
Although Sorensen said the use of big data cyber capabilities and tools is still in its infancy, a small contingent of well-known firms is using big data techniques, primarily for fraud detection and remediation. “In this morass,” he said, “the advantages of bringing effective advanced computing – in a number of forms – to the cyber realm are myriad.”
Outcompute to Outcompete
Larry Patterson, director of advanced computing technologies at Gulfstream Aerospace, gave us a potent reminder of the reason we engage in high performance data analytics in the first place. “It’s straight up competitive advantage,” he said during a panel discussion moderated by your EnterpriseTech managing editor. “It just makes you money. You have to out-compute to outcompete.”
Chris Mustain, the vice president of innovation policy and programs at the U.S. Council on Competitiveness, touched on the difficulty of finding highly skilled personnel, while Fred Streitz, the director of the HPC Innovation Center at Lawrence Livermore National Laboratory, predicted we could see quantum computing sooner than many people expect.
All told, attendees expressed a great deal of satisfaction with the two-and-a-half-day event. With nGage handling logistics and the five-star resort pampering LBD+EHPC attendees in a comfortable and intimate setting, our minds were free to focus on the convergence of big data and HPC.