Infrastructure Issues when Taking Genomic Information into Clinical Applications
Sponsored Content by EMC
For years, many aspects of life sciences work have required fast analysis of large datasets, and to accomplish this, most organizations installed high-performance computing and storage infrastructures.
However, the greater volumes of data produced by newer sequencing equipment, the need to run that data through more complex analyses, and the demand for faster time to results are straining the IT infrastructures in place at most organizations today. Simply put, the infrastructure is a bottleneck slowing critical analysis.
These problems will only become more pronounced with the growing interest in translational and precision medicine applications that use genomic analysis in research and clinical settings. In such settings, patient genomic information and disease biomarkers are used to make “real-time” decisions about customized treatments and therapies.
As more organizations move to these applications, they will need to rethink how to satisfy infrastructure performance and throughput requirements so that data is fed to high-performance compute nodes in a timely manner. Making infrastructure choices even more challenging, many organizations will need repeated access to, and analysis of, the sequencing data over time as different multi-disciplinary groups use it for different purposes.
Putting the issues into perspective
New sequencers and imaging equipment used in life sciences organizations are producing larger volumes of rich data, in shorter time periods, and at lower costs than ever before. The lower costs and richer data have allowed many organizations to expand the use of sequencing and imaging beyond drug discovery into clinical settings and, in some cases, personalized medicine applications. This has significant implications for infrastructure and storage strategies.
A look at the work at some industry leading organizations helps put the storage issues into perspective.
The Center for Pediatric Genomic Medicine at Children's Mercy is among the first of its kind with a pediatric focus. The center provides clinical genomic services and research, focusing on sequencing and analysis of rare inherited diseases in children. And there are plans for expansion into a translational cancer genomics program and the use of pharmacogenomics in precision medicine.
The Pediatric Genomic Medicine Facilities at Children’s Mercy make use of four Illumina HiSeq sequencers and two Illumina MiSeq sequencers. Its collection of sequencers and robots allows it to process over 2 TB of data per week. The work includes whole exome sequencing (WES) for research, which involves decoding the coding regions of all 23,000 genes in the genome, and STAT-Seq, a rapid whole genome analysis that takes only 50 hours from ordering the test to delivery of an interim report.
North Carolina Clinical Genomic Evaluation by NextGen Exome Sequencing (NCGENES) at the Carolina Center for Genome Sciences studies ways for healthcare professionals to use genome sequencing information in a clinical setting.
One of the group’s efforts is a project to perform WES on 750 University of North Carolina patients in whom there is reasonable suspicion that a discrete genetic error lies at the root of their disorder. The scientists will evaluate the use and performance of WES as a diagnostic tool while addressing the impact of diagnostic WES information on patients and families.
Institutions such as Partners Healthcare; The Broad Institute; The Genome Institute at Washington University; Human Genome Sequencing Center, Baylor College of Medicine; and others are working on similar efforts. In all of these efforts, the volumes of data keep growing, while the time to perform analyses such as WES and whole genome sequencing must be reduced.
For years, the way to handle data issues related to sequencing and data analysis in the life sciences was to simply throw raw storage capacity at the problem. But that approach no longer works. Besides dealing with capacity challenges, life sciences organizations must also deal with performance and data management issues when it comes to their choice in storage systems.
In particular, the work being conducted at leading institutions highlights the need for high-performance storage to complement HPC compute capabilities. A good example of what is needed can be found in the work done at the Hospital for Sick Children (SickKids) and the University Health Network’s (UHN) Princess Margaret Cancer Center. Working together, the groups built a computing infrastructure, dubbed HPC4HEALTH, that not only balances resources efficiently but also has sufficient processing and storage power to handle complex molecular analyses and medical imaging workloads.
According to the center, the facility grew from 1,200 to approximately 7,000 cores, while storage capacity rose dramatically, from 550 terabytes to 2 petabytes. The center moved from an Ethernet network to an InfiniBand network to meet the requirement of 80 gigabytes per second from the compute nodes to the storage. The system can perform 145 trillion calculations per second, making it one of the largest systems dedicated to health research, and the center can now process five times the work in the same amount of time.
This type of infrastructure transformation is now required in many organizations.
These are all areas where EMC Isilon can help. Isilon offers a proven scale-out network-attached storage (NAS) solution that can address life sciences workflows end to end. Isilon scale-out NAS provides a highly available and reliable single file system, OneFS, for life sciences applications and workflows.
OneFS provides multi-protocol support, including SMB, NFS, Swift, and HDFS, for a wide variety of genomics technologies, user access methods, and analysis environments. Also, a single Isilon storage cluster can host multiple node types: S, X, NL, and HD. S and X nodes are ideal for performance workflows such as mapping and alignment of NGS data, while NL and HD nodes provide high-density, low-cost storage of raw data and analysis results in a long-term archive.
The solutions are simple to deploy, maintain, and grow. EMC Isilon can scale capacity to meet an organization’s research demands, from 16 TB to 50 PB in a single file system, while delivering throughput greater than 200 GB per second. With EMC Isilon scale-out NAS, organizations can manage petabytes of data in one-tenth the administration time required by traditional storage solutions. EMC Isilon’s proven solutions achieve 48 percent greater storage productivity for mission-critical research applications and workflows, enabling immediate concurrent access to critical data.
Additionally, the solutions are well suited to the large-scale, mission-critical data analysis workflows found in most life sciences organizations, where research should never stop. To ensure continued operations, Isilon includes a number of data protection and high availability features in its solutions. For example, OneFS supports up to four simultaneous device failures (N+4) without compromising data reliability and availability. Protection levels can be modified on the fly or set by policy to match the value of the data.
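The trade-off behind a protection level is simple arithmetic: each additional parity unit tolerates one more simultaneous failure but reduces the fraction of raw capacity available for data. The sketch below illustrates that trade-off with example stripe widths; the numbers are illustrative only, not OneFS's actual on-disk layout.

```python
# Illustrative sketch: how the protection level (number of parity units
# per stripe) affects usable capacity. The stripe widths below are
# example values, not OneFS's actual erasure-coding layout.

def usable_fraction(data_units: int, parity_units: int) -> float:
    """Fraction of raw capacity left for data in a data+parity stripe."""
    return data_units / (data_units + parity_units)

# A 16-wide data stripe protected against 4 simultaneous failures (N+4)
print(f"N+4 over 16 data units: {usable_fraction(16, 4):.0%} usable")
# The same stripe with lighter N+2 protection
print(f"N+2 over 16 data units: {usable_fraction(16, 2):.0%} usable")
```

This is why setting protection by policy matters: high-value clinical data can carry N+4 overhead, while easily regenerated intermediate files can use a lighter level.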
Another factor to consider is data access and tiering. Many organizations find that different groups need access to large datasets at different times. With Isilon solutions, organizations can set up rules that keep the higher-performing storage nodes available for immediate computational access to data, while more cost-effective nodes hold everything else. Infrequently accessed data can be tiered to public cloud resources such as AWS and Azure, or to private cloud resources, using CloudPools.
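The idea behind such tiering rules can be sketched as a simple age-based policy. The code below is a minimal illustration of the concept only, not the CloudPools API; the tier names, paths, and 90-day threshold are all assumptions for the example.

```python
import time
from dataclasses import dataclass

# Minimal sketch of an age-based tiering rule -- an illustration of the
# policy idea, not the CloudPools API. Tier names, paths, and the 90-day
# threshold are hypothetical.

HOT_TIER = "performance-nodes"   # e.g. S/X nodes for active analysis
COLD_TIER = "archive-or-cloud"   # e.g. NL/HD nodes or a cloud pool

@dataclass
class FileRecord:
    path: str
    last_access: float  # POSIX timestamp of the most recent access

def choose_tier(record: FileRecord, now: float, max_age_days: int = 90) -> str:
    """Route recently used data to fast nodes, older data to cheap storage."""
    age_days = (now - record.last_access) / 86400
    return HOT_TIER if age_days <= max_age_days else COLD_TIER

now = time.time()
fresh = FileRecord("/ifs/project/run42.bam", now - 5 * 86400)
stale = FileRecord("/ifs/project/run01.bam", now - 400 * 86400)
print(choose_tier(fresh, now))  # performance-nodes
print(choose_tier(stale, now))  # archive-or-cloud
```

In a real deployment the equivalent policy would be expressed through the storage system's own policy engine rather than application code, but the routing decision it makes is the same.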
Visit Emerging Technologies for the Life Sciences online to learn more!