Biology is increasingly joining the ranks of the data-intensive sciences through the growing use of “-omics” technologies (next-generation sequencing, expression profiling, proteomics, methylation studies, etc.) and the large-scale use of a variety of imaging technologies in research. This is great for scientists and the patients who will benefit from the research, but it poses a set of significant challenges for those who provide informatics support to scientific investigators. In fact, the demands caused by large-scale data are breaking the standard model by which researchers manage and analyze data. A new model will be essential to the continued advancement of cancer research.
The “standard model” of biomedical data management and analysis is to identify public data sets of interest, download a local copy, combine it with locally generated data, and perform analysis on locally managed computers. This works fine for smaller data sets, but consider the situation for a group that wanted to analyze the complete Cancer Genome Atlas (TCGA) data set. By the conclusion of the project in 2014, TCGA is expected to generate on the order of 2.5 petabytes (PB) of data. Using conservative cost estimates, archive-level data storage alone is a $2,000,000/year cost, which would be repeated over multiple institutions (or even repeated multiple times in the same institution), with additional costs for the computational capacity required to perform the analysis. All of this is predicated on being able to receive the data in a timely fashion. That 2.5 PB is 20,000,000 gigabits of data. Even assuming the unlikely scenario in which a research group has a dedicated, non-shared 10-gigabit network connection between the repository and themselves and achieves maximum theoretical throughput, it would take 2,000,000 seconds, or roughly 23 days, to complete the download. Sending a disk array copied on site by the US Postal Service would almost assuredly achieve a higher net throughput.
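The arithmetic above can be checked with a short back-of-envelope calculation. The sketch below uses only the figures stated in the text (2.5 PB, a 10-gigabit link); the decimal unit conventions are an assumption for illustration:

```python
# Back-of-envelope transfer-time estimate for the full TCGA data set.
# Uses decimal (SI) units: 1 PB = 1,000,000 GB; 1 byte = 8 bits.

DATASET_PB = 2.5      # total data set size, petabytes
LINK_GBPS = 10        # dedicated 10-gigabit link at full theoretical throughput

dataset_gigabits = DATASET_PB * 1_000_000 * 8   # 20,000,000 gigabits
seconds = dataset_gigabits / LINK_GBPS          # 2,000,000 seconds
days = seconds / 86_400                         # ~23 days

print(f"{dataset_gigabits:,.0f} gigabits")
print(f"{seconds:,.0f} seconds = {days:.1f} days")
```

Even this best case ignores protocol overhead, shared links, and storage-system write speeds, all of which would lengthen the transfer further.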
Given this situation, what is the alternative? There is an emerging consensus among biomedical informaticians (see, for example, Grossman and White 2012 and Haussler et al. 2012) that the answer is to create public “cancer knowledge clouds”: that is, data repositories with co-located computing resources, allowing researchers to bring their analytic tools to the data rather than trying to bring the data to their tools. Such clouds have the potential to increase the speed of discovery and democratize access to cancer genomics data, which is too often the province of organizations that can support the high cost of maintaining these enormous data sets. While there is an emerging consensus on the need for such clouds, there does not yet appear to be a consensus on the best way to implement one. For this reason, the NCI believes that the best way to proceed is to conduct a series of pilots that can be used to define a cancer knowledge cloud or clouds.
A meaningful set of pilots must solve meaningful, real-world problems currently faced by working scientists. To this end, a letter was recently sent to all NCI grantees by NCI Director Harold Varmus; the acting co-Directors of the NCI Center for Cancer Genomics, Dr. Louis Staudt and Dr. Stephen Chanock; and me, requesting information about situations where information technology limitations are inhibiting the research process, their experience with biological research clouds, and metrics that can be used to assess the success of cloud pilots. We encourage all interested parties to read the letter and respond with comments to firstname.lastname@example.org. This information, along with advice from intramural researchers, NCI staff, and NCI oversight committees, will help ensure that these pilots provide the maximum value to cancer research.
 Storage costs are meant to reflect “fully loaded costs,” including operational costs such as data center operations, data protection, staffing, and amortized acquisition costs assuming a 5-year cost recovery on capital items.
 Grossman, R.L. and White, K.P. (2012). “A vision for a biomedical cloud.” Journal of Internal Medicine, 271(2), 122–130.
 Haussler, D., Patterson, D.A., Diekhans, M., Fox, A., Jordan, M., Joseph, A.D., Ma, S., Paten, B., Shenker, S., Sittler, T., and Stoica, I. (2012). “A Million Cancer Genome Warehouse.” Tech. Rep. UCB/EECS-2012-211, EECS Department, University of California, Berkeley.
George Komatsoulis, Ph.D., is Interim Director of the Center for Biomedical Informatics and Information Technology (CBIIT) and Chief Information Officer at the National Cancer Institute (NCI). You may reach George via email at email@example.com.