Healthcare Conundrum Overcome: Protecting and Sharing Patient Data
The healthcare industry wrestles with a perennial dilemma: the need to protect sensitive patient data versus the need to pool and share patient datasets at scale across hospitals, so that advanced data analytics can improve medical research and treatment.
Intel’s Collaborative Cancer Cloud (CCC) is an open Platform-as-a-Service technology and workflow strategy designed to reconcile the contradictory goals of guarding and sharing patient information. Announced last year, the CCC now has three members – Dana-Farber Cancer Institute, Boston; the Knight Cancer Institute at Oregon Health & Science University, Portland; and the Ontario Institute for Cancer Research, Toronto – and intends to add more organizations contributing to a massive database of molecular and imaging data within the CCC cloud data center for analyzing the drivers of cancer. Ultimately, the goal is to provide doctors and researchers with a match between a patient’s individual genomic sequencing data and existing data, reducing the diagnosis and treatment process from weeks or months to, by the end of the decade, less than 24 hours.
Today, cancer data is siloed behind hospital firewalls, explained Ethan Cerami, Ph.D., director of the Knowledge Systems Group and Lead Scientist in the Department of Biostatistics and Computational Biology at Dana-Farber. When diagnosing a patient’s cancer, doctors typically have access to a few thousand other cancer genomes, held by their own institution, for comparative analysis. Providing access to hundreds of thousands, or millions, of other genomes would enable more effective and faster diagnoses.
The Intel platform, Cerami told EnterpriseTech, “is a way to more securely share genomic data across centers, and that’s really the innovative aspect for us, the compelling reason we joined up, because that ultimately results in the best science and the best possible patient care.
“We have an instance of the Intel cluster here that we can put data onto, but we control the data with various levels of control, and the data never leaves our premises,” Cerami said. “Rather than us taking all our data and sending it to different places to process, people can send us their code as a workflow, and they have a whole platform set up for running workflow engines or workflow pipelines. The results of that processing can be sent back (with patient identification information stripped out). It’s a way for us to control data that we’re comfortable with, while enabling others to use the data in a way that helps them develop their algorithms or insights into the particular research project they’re interested in. That’s different from taking all of our data and putting it on Google cloud or AWS in that we still control the data using our protocols required by hospital policy.”
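The code-to-data pattern Cerami describes – analysis code travels to the hospital, and only de-identified results travel back – can be sketched in a few lines of Python. All names, fields, and data here are illustrative assumptions, not the CCC’s actual interfaces:

```python
# Sketch of the code-to-data pattern: a hospital runs a visiting workflow
# against local records, then strips patient identifiers before any result
# leaves the premises. Field names are hypothetical.

PHI_FIELDS = {"patient_id", "name", "date_of_birth", "mrn"}  # identifiers to strip

def deidentify(result):
    """Remove patient identification fields from a single result record."""
    return {k: v for k, v in result.items() if k not in PHI_FIELDS}

def run_workflow_on_site(workflow, records):
    """Execute a submitted workflow on local records; de-identify the output."""
    return [deidentify(workflow(r)) for r in records]

# A visiting workflow: flag a mutation of interest in each record
def count_kras_mutations(record):
    return {
        "patient_id": record["patient_id"],          # stripped before return
        "has_kras_g12c": "KRAS G12C" in record["mutations"],
    }

local_data = [
    {"patient_id": "P001", "mutations": ["KRAS G12C", "TP53 R175H"]},
    {"patient_id": "P002", "mutations": ["EGFR L858R"]},
]
out = run_workflow_on_site(count_kras_mutations, local_data)
# `out` contains only the analysis flags; no patient_id leaves the site
```

The data never moves; only the (hypothetical) workflow function and its de-identified output cross the hospital boundary.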
Cerami said Dana-Farber has access to about 700 CCC processing cores. A sequenced tumor typically totals between 5 and 10 GB of data, he said, and the Institute has sequenced more than 10,000 patients.
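A quick back-of-envelope calculation shows the scale those figures imply, assuming one tumor sample per sequenced patient:

```python
# Scale implied by the figures above (assumption: one 5-10 GB tumor
# sample per sequenced patient, decimal units).
patients = 10_000
low_gb, high_gb = 5, 10

total_low_tb = patients * low_gb / 1_000    # 50 TB
total_high_tb = patients * high_gb / 1_000  # 100 TB
print(f"{total_low_tb:.0f}-{total_high_tb:.0f} TB")  # → 50-100 TB
```

Roughly 50 to 100 TB at Dana-Farber alone – which makes clear why shipping code to the data is more practical than shipping the data to the code.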
Ketan Paranjape, director of partnerships and collaborations in Intel’s Health Strategy and Solutions Group, told EnterpriseTech the CCC project began four years ago when Intel approached the Knight Institute to discuss “building a system where we can work on common problems but with the mindset that the data remains behind the firewalls. We’ve built a federated, scalable and secure collaborative environment where people can do things like genotyping or distributed machine learning. The idea is to have multiple institutions collaborate with each other.”
The key, Paranjape explained, is rather than sending data from one member institution to another, the institutions send each other code that performs processing on another institution’s patient data, which remains where it is. Applications and tools are containerized through Docker and stored in a central application repository, which has mirrors at each member site for application consistency. When a user uploads new application code it gets pushed to all the participating sites. Analysis workflows consist of multiple tools and multiple inputs and are specified in a domain-specific language (WDL or CWL). Doctors and researchers submit jobs at a central location where the execution engine forwards instructions to each site to execute the containerized tools (that are already stored at each site) with the appropriate parameters and dataset inputs.
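The dispatch flow Paranjape and the architecture above describe can be sketched as follows. Class and method names are assumptions for illustration, not the CCC’s actual interfaces; a real deployment would use a WDL/CWL execution engine and Docker registries rather than these in-memory stand-ins:

```python
# Illustrative sketch of the federated dispatch flow: a central engine
# pushes tools to every site's mirror, then forwards job instructions;
# each site runs the already-mirrored tool against its own local data.

from dataclasses import dataclass

@dataclass
class JobSpec:
    tool: str      # containerized tool, already mirrored at every site
    params: dict   # runtime parameters from the workflow definition
    dataset: str   # dataset held locally at the executing site

class Site:
    def __init__(self, name, app_repo_mirror, local_datasets):
        self.name = name
        self.app_repo_mirror = app_repo_mirror  # local mirror of the app repo
        self.local_datasets = local_datasets

    def execute(self, job):
        # Only the instruction arrives; the data never leaves the site.
        assert job.tool in self.app_repo_mirror, "tool not yet mirrored"
        assert job.dataset in self.local_datasets, "dataset not held here"
        return f"{self.name}: ran {job.tool} on {job.dataset}"

class CentralEngine:
    def __init__(self, sites):
        self.sites = sites

    def push_tool(self, tool):
        # New application code is pushed to every participating site's mirror.
        for site in self.sites:
            site.app_repo_mirror.add(tool)

    def submit(self, tool, params):
        # Forward instructions to each site; each runs against its own data.
        return [site.execute(JobSpec(tool, params, ds))
                for site in self.sites
                for ds in site.local_datasets]

sites = [Site("dana-farber", set(), {"dfci_genomes"}),
         Site("ohsu-knight", set(), {"knight_genomes"})]
engine = CentralEngine(sites)
engine.push_tool("gatk-haplotypecaller")
receipts = engine.submit("gatk-haplotypecaller", {"ref": "GRCh38"})
```

The two assertions in `execute` capture the article’s two invariants: tools are mirrored before jobs run, and every job runs only against data its site already holds.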
The CCC project involved code modernization work around the widely used Genome Analysis Toolkit (GATK), an open source software package for analysis of high-throughput sequencing data. GATK was optimized to enable cloud-based access while running on Intel clusters. Intel said the GATK Best Practices pipeline will be available to users of cloud service providers through a software-as-a-service (SaaS) mechanism, with a focus on variant discovery and genotyping.
“There’s real value in being able to get greater sample sizes of more patients so you can look at different trends within the data,” said Cerami. “It also enables us to compare the mutations and genomic signatures that we see within our patients to the genomic signatures of other patients. That’s just generally good for different sets of research questions; if you want to look at rare genomic events, we may see only a few of those patients at our institution, but if we combine the data across multiple centers you’re suddenly able to see a signal out of a much larger data set.”
He said CCC is a significant part of Dana-Farber’s overall precision medicine strategy, and he foresees a day when other diseases in addition to cancer will be incorporated within the CCC’s scope.
Eventually, Cerami envisions a time when the CCC will host distributed machine learning applications across multiple medical centers to help give patients access to the best possible treatment or clinical trial. “You can imagine that each of the centers has 1,000 samples of the same cancer type and each of those samples has genomic data but also has information about which patients live longer or have more aggressive phenotypes,” he said. “The idea is to build a distributed machine learning platform such that you can train your model using data from one center and then you can validate that [model] securely using data from the other centers. These are the kinds of things we’re thinking about.”
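The train-here, validate-there idea Cerami outlines can be sketched minimally: a model is fitted on one center’s samples, then another center scores it against its own never-shared data and returns only an accuracy figure. The data, features, and trivial threshold “model” below are hypothetical stand-ins for real genomic features and outcome labels:

```python
# Minimal sketch of distributed train/validate: only the fitted model and
# an aggregate metric cross center boundaries, never patient records.
# Data and the threshold classifier are hypothetical illustrations.

def train(samples):
    """Fit a one-feature threshold classifier on one center's local samples."""
    pos = [s["mutation_burden"] for s in samples if s["aggressive"]]
    neg = [s["mutation_burden"] for s in samples if not s["aggressive"]]
    threshold = (min(pos) + max(neg)) / 2  # midpoint between the classes
    return lambda s: s["mutation_burden"] >= threshold

def validate_locally(model, samples):
    """Score a visiting model on local data; only accuracy leaves the site."""
    correct = sum(model(s) == s["aggressive"] for s in samples)
    return correct / len(samples)

center_a = [{"mutation_burden": b, "aggressive": b > 50}
            for b in (10, 20, 60, 80)]
center_b = [{"mutation_burden": b, "aggressive": b > 50}
            for b in (15, 30, 55, 90)]

model = train(center_a)                    # trained at one center
acc_b = validate_locally(model, center_b)  # validated at another center
```

The same pattern generalizes: each center exposes only `validate_locally`-style endpoints, so a model can be cross-validated against every member’s cohort without any genomic record moving.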