On September 4, 2013, NCI’s Clinical Proteomic Tumor Analysis Consortium (CPTAC) publicly released proteomic data produced from colorectal tumor samples previously analyzed as part of The Cancer Genome Atlas (TCGA) initiative. This is the first release of proteomic tumor data designed to complement genomic data from the same tumors.
Since its launch in 2006, the Consortium has “aimed at improving proteomic analysis platforms to reliably identify, quantify, and compare proteins and peptides in complex biological mixtures.” After five years of technology development, inter-laboratory studies, and statistical analyses of variation, the program entered a second phase, this time focused on data production. CPTAC obtained access to a subset of tumors analyzed by TCGA, with a view toward generating proteomic data sets to enable integration with the genomic data.
CPTAC consists of five teams that form a network of Proteome Characterization Centers (PCCs):
- Broad Institute; Fred Hutchinson Cancer Research Center
- Johns Hopkins University
- Pacific Northwest National Laboratory
- Vanderbilt University
- Washington University; University of North Carolina, Chapel Hill
This network has been established to deploy an array of proteomic technologies to analyze more than 300 colorectal, breast, and ovarian tumor samples from TCGA. Using mass-spectrometry and antibody-array-based platforms, the PCCs will provide a quantitative inventory of cancer proteins for each sample as well as information about post-translationally modified proteins.
Prior to embarking on analysis of samples characterized by TCGA, the CPTAC network performed a number of due-diligence studies aimed at understanding pre-analytic variables of the sample collection. In the spirit of technology assessment, each CPTAC platform slated to analyze TCGA samples first “qualified” on a xenograft-based sample for comparative reference (CompRef). Mass-spectrometry files of CompRef were scrutinized for a number of data-quality metrics to ensure a high and consistent level of quality across the CPTAC network. The CompRef material also served as an interstitial reference material to monitor instrument performance over the course of the tumor analysis.
The first data set released by CPTAC characterizes 90 colorectal tumors. Professor Daniel Liebler’s laboratory performed global proteome analysis of the tumors using a Thermo Orbitrap mass spectrometer. The data set contains 1,425 files with a total size of 737 gigabytes (GB). While the data currently available are limited to instrument output files, more generally useful files, including lists of peptides and proteins identified by the CPTAC Common Data Analysis Pipeline, will be available in the near future. Data are available via the CPTAC Data Coordinating Center.
With corresponding genomic data available through the TCGA data portal, this data set presents a substantial opportunity for the systems-biology community. At 737 GB, this proteomics data set is modest in size compared with its genomics complement. The challenges of this data set lie instead in the complexity introduced by the added layer of information. Both the number of samples and the availability of corresponding genomic information make this data set the first of its kind among proteomics data. Below are a number of questions to be asked of the data set:
- What genomic alterations are expressed at the protein level?
- How does unsupervised clustering of protein expression compare with that of gene expression?
- Are there colon cancer protein signatures, and do they differ from colon cancer gene signatures?
- Which gene-based biomarker candidates can be detected at the protein level, and potentially provide the antigen for a biomarker assay?
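As a starting point for questions like these, an analyst might pair the two data types on their shared sample identifiers and ask, gene by gene, how well transcript abundance tracks protein abundance. The sketch below is illustrative only: the gene names, sample identifiers, values, and the function `gene_protein_correlation` are hypothetical, and it assumes that per-gene abundance tables have already been derived from the TCGA and CPTAC files, which are not yet part of this release.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; None if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def gene_protein_correlation(mrna, protein, min_samples=3):
    """Correlate mRNA and protein abundance per gene over shared samples.

    Both arguments map gene -> {sample_id: abundance}. Genes or samples
    present in only one table are ignored.
    """
    result = {}
    for gene in mrna.keys() & protein.keys():
        shared = sorted(mrna[gene].keys() & protein[gene].keys())
        if len(shared) < min_samples:
            continue
        r = pearson([mrna[gene][s] for s in shared],
                    [protein[gene][s] for s in shared])
        if r is not None:
            result[gene] = r
    return result


# Hypothetical toy tables (not real TCGA or CPTAC values).
mrna = {
    "GENE_A": {"S1": 1.0, "S2": 2.0, "S3": 3.0, "S4": 4.0},
    "GENE_B": {"S1": 1.0, "S2": 2.0, "S3": 3.0},
    "GENE_C": {"S1": 5.0, "S2": 6.0, "S3": 7.0},  # no protein measurement
}
protein = {
    "GENE_A": {"S1": 10.0, "S2": 20.0, "S3": 30.0, "S5": 99.0},
    "GENE_B": {"S1": 3.0, "S2": 2.0, "S3": 1.0},
}

correlations = gene_protein_correlation(mrna, protein)
for gene in sorted(correlations):
    print(gene, round(correlations[gene], 3))
```

Restricting each comparison to samples measured on both platforms is the key design choice here; a real analysis would also need to address normalization, missing values, and multiple testing.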
Data analysis within CPTAC has thus far focused on data quality, peptide identification and quantification, protein identification, and localization of post-translational modification sites. Full proteogenomic integration will require the development of new computational tools, many of which are beyond the scope of the CPTAC program.
This data set stands as an invitation to data scientists currently outside of proteomics to consider the challenges of proteogenomic integration. To our knowledge, this is the first data set of this scale to have both genomic and proteomic characterization on common samples. But it will not be the last.
Chris Kinsinger, Ph.D., is CPTAC Program Manager, Center for Strategic Scientific Initiatives, National Cancer Institute.