By Izumi Hinkson, Ph.D., NCI CBIIT Cancer Informatics Branch
In recent years, genomics has been described as a big data science on par with the likes of Twitter, YouTube, and the scientific pursuit of understanding the universe1. These comparisons are not exaggerations: genomic data are now generated at a rate approaching one zettabase per year1. More importantly, genomic data are vital for targeting actionable genetic alterations in precision medicine. Several large-scale genomic datasets have been generated through studies funded by the National Cancer Institute (NCI), such as The Cancer Genome Atlas (TCGA), Therapeutically Applicable Research to Generate Effective Therapies (TARGET), the Cancer Genome Characterization Initiative (CGCI), and the Cancer Cell Line Encyclopedia (CCLE). TCGA alone produced 2.5 petabytes of data spanning seven data types from 11,000 patients. The field of cancer genomics has clearly made its debut as a big data science, but how do we exploit this massive amount of data to lessen the burden of cancer?
Big Data Require Big Infrastructure
To maximize the value of these data, they need to be interpretable. Biomedical data originate from numerous labs and research institutions around the world. Data generated by one researcher may have little meaning to another unless they are accompanied by the necessary annotations and a Rosetta Stone to interpret them. Compounding matters, genomic and other -omic data are often collected in different file formats, analyzed against different reference databases, and stored in different repositories. Standardization of a single data type is difficult in and of itself, and expanding standardization efforts to encompass multiple data types from different experimental platforms or measurements drastically increases the level of complexity.
Despite these complications, the use of -omic data holds great promise. Combining genomic data with other patient-derived data such as epigenomics, proteomics, imaging, treatments, outcomes, and demographics facilitates a deeper understanding of molecular mechanisms that are responsible for an individual patient’s disease. Multiplying this effort across a population of cancer patients allows researchers and clinicians to interpret trends, identify specific risk factors, create better screening assays, develop more efficacious treatments, and ultimately, reduce the burden of cancer.
Cancer research in the era of big data presents a major challenge: we need to collect vast amounts of data to understand the complexities of individual cancer patients, but the more data we collect, the more we limit the number and diversity of researchers who can access and interpret them, because big data questions require big data infrastructure. Many researchers, particularly those at research institutes that lack extensive information technology capacity, are at a profound disadvantage. This hampers innovation in cancer prevention, diagnostics, and therapeutics.
Comprehensive Infrastructure for Cancer Genomic Data
In 2016, to address these needs, the NCI launched the Genomic Data Commons (GDC) and three Cancer Genomics Cloud (CGC) Pilots. The GDC has been tasked with creating a standardized data submission process, ensuring data quality, harmonizing large genomic datasets, and providing secure access to data. The Broad Institute FireCloud, the Institute for Systems Biology (ISB) CGC, and the Seven Bridges Genomics (SBG) CGC complement the GDC by harnessing Google Cloud Platform and Amazon Web Services technologies. The Cloud Pilots have co-located data harmonized by the GDC (e.g., TCGA, TARGET) with the elastic compute power of these commercial clouds, eliminating the need to download petabytes of data and the prohibitive cost and time such downloads would require. The Cloud Pilots also allow researchers to take advantage of hosted cutting-edge analysis pipelines or to bring their own tools to the cloud.
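To give a concrete sense of programmatic data access, the sketch below builds a query against the GDC's public REST API. The endpoint URL and the filter field names (`cases.project.project_id`, `data_category`, `access`) follow the public GDC API documentation, but treat them as assumptions and verify against the current docs before relying on them; the project and category values are illustrative.

```python
import json

# Sketch, not an official client: assemble a GDC API query for
# open-access TCGA-BRCA transcriptome-profiling files.
# Endpoint and field names assumed from the public GDC API docs.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

filters = {
    "op": "and",
    "content": [
        {"op": "in",
         "content": {"field": "cases.project.project_id",
                     "value": ["TCGA-BRCA"]}},
        {"op": "in",
         "content": {"field": "data_category",
                     "value": ["Transcriptome Profiling"]}},
        {"op": "in",
         "content": {"field": "access",
                     "value": ["open"]}},
    ],
}

# Query parameters: the filters object is sent as a JSON string.
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name,file_size",
    "format": "JSON",
    "size": "10",
}

# To execute the query (requires network access and the requests library):
# import requests
# hits = requests.get(GDC_FILES_ENDPOINT, params=params).json()["data"]["hits"]
print(params["filters"])
```

Because the data are co-located with cloud compute, a query like this would typically run inside a Cloud Pilot workspace, returning file identifiers that downstream pipelines consume directly rather than downloading the files locally.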
Cloud Pilots Accelerate Science and Discovery
Within a year, data analysis projects leveraging the Cloud Pilots had been initiated by scientists across the globe, including the US, Australia, India, and France. Projects on the Cloud Pilots range from technology development to understanding the molecular underpinnings of specific cancer types.
An example of the impact of the Cloud Pilots is the recently published work of the Piccolo lab at Brigham Young University in Scientific Reports2. Tatlow and Piccolo used the ISB-CGC to perform fast, cheap, and robust RNA-sequencing analyses of 12,307 samples from CCLE and TCGA. By renting pre-emptible virtual machines, the researchers reduced the cost of the CCLE RNA-sequencing analysis by about half compared with their cluster-based configuration. Tatlow and Piccolo also reported analyzing over 64 terabytes of TCGA data for only $0.09 per sample. Reflecting on their work on the ISB-CGC, they wrote, “Our experiences confirm that the cloud has potential to make it easier to apply custom workflows to sequencing data at a modest price… we believe cloud computing will play an increasingly important role in cancer research.”
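The cost arithmetic behind pre-emptible virtual machines can be sketched in a few lines. The rates, runtime, and preemption overhead below are illustrative placeholders, not figures from Tatlow and Piccolo's study: pre-emptible instances are billed at a steep discount in exchange for the risk that the cloud provider reclaims them mid-job, so some work must occasionally be redone.

```python
# Hypothetical back-of-the-envelope comparison showing why pre-emptible
# VMs cut per-sample analysis costs. All numbers are illustrative.
STANDARD_RATE = 0.38      # $/hour, illustrative on-demand rate
PREEMPTIBLE_RATE = 0.08   # $/hour, illustrative pre-emptible rate
HOURS_PER_SAMPLE = 0.5    # illustrative pipeline runtime per sample

def cost_per_sample(hourly_rate, hours, retry_overhead=1.0):
    """Estimated cloud cost to process one sample.

    retry_overhead > 1.0 models re-running work lost when a
    pre-emptible VM is reclaimed partway through a job.
    """
    return hourly_rate * hours * retry_overhead

standard = cost_per_sample(STANDARD_RATE, HOURS_PER_SAMPLE)
# Assume ~15% of pre-emptible work is redone due to preemptions.
preemptible = cost_per_sample(PREEMPTIBLE_RATE, HOURS_PER_SAMPLE, 1.15)
print(f"standard: ${standard:.3f}/sample, "
      f"pre-emptible: ${preemptible:.3f}/sample")
```

Even after padding for lost work, the pre-emptible estimate remains well below the on-demand one, which is why batch-oriented genomics pipelines, where an interrupted sample can simply be re-queued, are a natural fit for this pricing model.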
The Wong lab at the University of New South Wales has recently initiated a project on the SBG-CGC. Using data hosted on the SBG-CGC, Poulos and Wong are analyzing raw DNA sequencing files across thousands of whole cancer genomes to identify somatic mutations. The goal of this project is to determine whether or not these mutations disrupt important regulatory networks within cancer cells. Poulos and Wong’s work highlights the global reach of the CGC Pilots. Poulos has expressed that these analyses have been “made possible through access to the Cancer Genomic Cloud3.”
Another key example of how the CGC Pilots are helping to advance cancer research is the partnership between the Broad Institute and IBM Watson Health4. Through this collaboration, researchers are using FireCloud to understand why cancer patients become resistant to drugs and relapse. Other research projects on the CGC Pilots include developing novel RNA-seq analysis tools, studying the relationship between bacteria and cancer, understanding HLA typing to inform immunotherapy, and discovering new cancer biomarkers and drug targets. Collectively, these projects exemplify the mission of the CGC Pilots: bringing data and computation together to create knowledge that accelerates cancer research and enables precision medicine.
The Future: A Cancer Data Ecosystem
The GDC and CGC Pilots serve as cornerstones for the development of a Cancer Research Data Ecosystem, a key recommendation of the Blue Ribbon Panel (BRP) formed as a result of the Cancer Moonshot℠. The goals of this ecosystem are to “collect, share, and interconnect a broad array of large datasets so that researchers, clinicians, and patients will be able to both contribute and analyze data, facilitating discovery that will ultimately improve patient care and outcomes5.” Through cooperation and collaboration within and between academia, government, and private industry, the GDC along with the technology and lessons learned from the CGC Pilots will continue to enhance the democratization of cancer data and further the mission of the NCI.
1 Stephens, Z.D., et al., Big Data: Astronomical or Genomical? PLoS Biol, 2015. 13(7): p. e1002195.
2 Tatlow, P.J. and S.R. Piccolo, A cloud-based workflow to quantify transcript-expression levels in public cancer compendia. Sci Rep, 2016. 6: p. 39259.
3 Poulos, R., Personal communication regarding CGC projects. 2017.
5 Jacks, T., E. Jaffee, and D. Singer, Cancer Moonshot Blue Ribbon Panel Report 2016. 2016, National Cancer Institute: https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel. p. 1-69.