By Mark A. Musen, M.D., Ph.D.
The data science community is awash with “FAIRness.” In the past few years, a consensus has emerged that scientific data should be archived in open repositories, and that the data should be Findable, Accessible, Interoperable, and Reusable.1 The campaign for FAIR data is particularly palpable at the NCI, which is responding to the report of the Blue Ribbon Panel for the Beau Biden Cancer Moonshot, which calls for the creation of a Cancer Data Ecosystem that will support “large scale data collection and integration across many sources.”2 There is a conviction that FAIR data will enable investigators to search for relevant datasets, to integrate data from diverse experiments, to perform secondary analyses, and to verify experimental results.
To meet these objectives, the NCI is supporting the development of a Genomic Data Commons,3 and it is embarking on a recently announced plan to create an even more comprehensive Cancer Research Data Commons.4 Other NIH Institutes and Centers may follow suit with commons initiatives of their own. Meanwhile, the NIH Big Data to Knowledge (BD2K) initiative is piloting its own NIH Data Commons.5 FAIRness is “in,” and so are commons.
Most discussions of data commons center on opportunities for cloud computing and on the ability of open repositories to support the FAIR principles. The problem for these new commons initiatives, however, is that current biomedical data generally are not FAIR. Merely relocating data that are not reliably findable, accessible, interoperable, or reusable to the cloud doesn’t solve any fundamental problem.
To make experimental datasets FAIR, they must be accompanied by metadata that can explain what the data are about, under what conditions the data were collected, how the data are formatted, and the provenance of the data. The challenge for existing online data repositories — and the challenge for all the data-commons initiatives — is that many online datasets are annotated with metadata that are simply terrible.
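The questions that good metadata must answer (what the data are about, how they were collected, how they are formatted, and where they came from) can be pictured as a simple structured record. The sketch below is purely illustrative: the field names and values are invented for this example and are not drawn from any repository’s actual schema.

```python
# A purely illustrative metadata record for a hypothetical dataset.
# Field names and values here are invented for this sketch; real
# repositories define their own schemas and controlled vocabularies.
metadata_record = {
    # What the data are about
    "title": "RNA-seq of melanoma cell lines under drug treatment",
    "organism": "Homo sapiens",       # ideally a controlled-vocabulary term
    "disease": "melanoma",            # ideally an ontology term, not free text
    # Under what conditions the data were collected
    "assay_type": "RNA-seq",
    "collection_date": "2017-06-01",  # ISO 8601 keeps dates machine-readable
    # How the data are formatted
    "file_format": "FASTQ",
    # Provenance
    "submitter": "Example Lab",
    "derived_from": None,             # link to a parent dataset, if any
}

def missing_fields(record, required):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in required if not record.get(f)]

required = ["title", "organism", "assay_type", "file_format"]
print(missing_fields(metadata_record, required))  # -> []
```

Even a minimal completeness check like `missing_fields` catches the most basic failure mode: a dataset archived with no indication of what it is or how it is formatted.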
Most older online repositories are notorious for containing bad metadata, largely because these repositories allow their users the “freedom” to upload metadata that include arbitrary fields filled with arbitrary values — including missing values and typos. More modern archives maintain comprehensive data dictionaries to help users structure their metadata and to constrain the values that they enter. While the goal is admirable, such data dictionaries are effective only when they are actually used. My group’s analysis of 6.6 million metadata records stored in the NCBI’s BioSample repository, for example, was disheartening: We found that some 85% of BioSample submissions avoid using one of the repository’s predefined “packages” to structure the metadata; 15% of the metadata field names are not found in the data dictionary; 68% of the metadata field values that are supposed to come from a biomedical ontology simply don’t; and 73% of the fields that are designated as Boolean are not filled in with the value true or false.6
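To make those anomaly categories concrete, here is a sketch of the kinds of checks a repository could run at submission time: unknown field names, Boolean fields without true/false values, and fields whose values are supposed to come from a controlled vocabulary but don’t. The data dictionary below is invented for illustration and does not reproduce BioSample’s actual packages.

```python
# A hypothetical data dictionary, invented for this sketch. Each entry
# names a permitted field and constrains the values it may take.
DATA_DICTIONARY = {
    "organism": {"type": "ontology", "allowed": {"Homo sapiens", "Mus musculus"}},
    "is_tumor": {"type": "boolean"},
    "tissue":   {"type": "free_text"},
}

def validate(record):
    """Return a list of human-readable problems with a metadata record."""
    problems = []
    for field, value in record.items():
        spec = DATA_DICTIONARY.get(field)
        if spec is None:
            # Field name not found in the data dictionary
            problems.append(f"unknown field name: {field!r}")
        elif spec["type"] == "boolean" and value not in ("true", "false"):
            # Boolean field not filled in with true or false
            problems.append(f"{field!r} must be 'true' or 'false', got {value!r}")
        elif spec["type"] == "ontology" and value not in spec["allowed"]:
            # Value that should come from a controlled vocabulary but doesn't
            problems.append(f"{field!r} value {value!r} is not a recognized term")
    return problems

# A submission exhibiting all three anomaly types (note the typo "tisue"):
for problem in validate({"organism": "human", "is_tumor": "yes", "tisue": "skin"}):
    print(problem)
```

Checks of this sort are only useful, of course, if repositories enforce them at submission time rather than accepting arbitrary fields and values.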
It’s not that the investigators who author metadata are intentionally doing a poor job. Creating good metadata takes considerable work, and most investigators see the primary endpoint of their research as publication in journals. Submitting the data to a public repository and annotating the data with good metadata are not usually viewed as primary research objectives. More important, although the scientific community has put in place rigorous systems of peer review for manuscripts describing its investigations, there is no established peer-review mechanism to ensure the quality of online metadata. Investigators see no incentive to take time away from activities that they perceive as central to their research to make certain that their archived metadata are consistent and complete and that their online datasets are FAIR.
This situation portends a serious problem for the new commons initiatives: The scientific community overall will suffer if the new data-commons initiatives generate cloud-based resources stuffed with data that are still not FAIR. Investigators who do not believe that they will benefit individually from the public good of high-quality metadata do not make the authoring of metadata a high priority, and they are unlikely to change their behavior if the process remains burdensome.
This is a new kind of “Tragedy of the Commons” that we must work hard to avoid. In his 1833 essay7 (brought to worldwide attention in 1968 by Garrett Hardin8), economist William Forster Lloyd imagined the devastating effects on a community of farmers when individual farmers each allow their cattle to overgraze on common land. In Lloyd’s analysis, the desire of individuals to maximize their own use of a shared resource leads to degradation and depletion of that resource through their collective actions. Without community-based interventions, we should expect analogous degradation in the value of a data commons, as the repository becomes filled with increasing numbers of datasets whose metadata are confusing, conflicting, or incomplete. There will not be depletion of the resource as in Lloyd’s thought experiment, but, when investigators act in their own self-interest, taking shortcuts to generate metadata as quickly as possible, we should expect the overall utility of the resource to decline. In the absence of extensive human curation of the metadata, the ease of finding and integrating related datasets and the ability of users to sort through, to understand, and to reuse previously archived data will all diminish.
We can avoid the potential tragedy by encouraging investigators to author more complete, more comprehensive, and more standardized metadata. The biomedical community is already making significant progress in this direction. Resources such as BioPortal9 and the Ontology Lookup Service10 make rigorously documented, standardized terms more accessible to metadata authors. Electronic data-capture tools can transparently fetch controlled terms from online servers and can integrate the terms directly into custom-tailored data forms (as with REDCap)11 and spreadsheets (as with RightField),12 thus enhancing the quality of data annotations. Domain-specific data-management tools are able to offer bespoke user interfaces that greatly ease the acquisition of high-quality metadata and that facilitate data exploration and analysis.13 14
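As an illustration of how data-capture tools can fetch controlled terms from online servers, the sketch below constructs a query against the Ontology Lookup Service’s REST search endpoint. The endpoint path and the response layout shown here are assumptions based on the OLS API and should be checked against the current OLS documentation; the response itself is canned so that the example stays self-contained.

```python
import json
from urllib.parse import urlencode

# Base URL of the EBI Ontology Lookup Service search endpoint (reference 10).
# This path reflects the OLS REST API as I understand it; verify it against
# the current OLS documentation before relying on it.
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"

def build_query(term, ontology):
    """Construct the OLS search URL for a term in a given ontology."""
    return OLS_SEARCH + "?" + urlencode({"q": term, "ontology": ontology})

# In a data-capture tool, the JSON response would be fetched over HTTP.
# Here we parse a canned response of the assumed shape, so the sketch
# runs without network access.
canned = json.loads('{"response": {"docs": '
                    '[{"label": "melanoma", "obo_id": "DOID:1909"}]}}')
best = canned["response"]["docs"][0]

print(build_query("melanoma", "doid"))
print(best["label"], best["obo_id"])
```

A tool such as REDCap or RightField performs essentially this lookup behind the scenes, so that the person filling in a form selects a recognized ontology term instead of typing free text.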
The creation of a data commons, however, requires the ability to deal with extremely varied — and often unanticipated — metadata patterns and data types. The whole idea of a commons is to integrate diverse datasets to facilitate novel opportunities for data evaluation and inquiry. There is thus a need for easy-to-use solutions that are generic (that is, not bespoke) to provide guidance over the entire life cycle of metadata — streamlining metadata creation, discovery, and access, as well as supporting metadata publication to third-party repositories. The technology under development by the Center for Expanded Data Annotation and Retrieval (CEDAR) offers an example of such an all-purpose, end-to-end solution.15 The NCI Cancer Research Data Commons team4 is looking at CEDAR and other tools to understand the effect that metadata-authoring systems can have on the FAIRness of its underlying datasets.
Technological solutions to improve the quality of metadata, combined with more explicit emphasis by funding agencies and by publishers on thorough and FAIR metadata descriptions, will make biomedical metadata more interoperable and will help us escape a new Tragedy of the Commons. We need a comprehensive approach to the authoring and management of metadata; the success of so many commons projects is at stake.
1 Wilkinson, M.D., Dumontier, M., Aalbersberg, I.J., et al. The FAIR guiding principles for scientific data management and stewardship. Nature Scientific Data 3:160018, 2016.
2 Cancer Moonshot Blue Ribbon Panel Report, 2016. Available at https://www.cancer.gov/research/key-initiatives/moonshot-cancer-initiative/blue-ribbon-panel/blue-ribbon-panel-report-2016.pdf
3 Grossman, R.L., Heath, A.P., Ferretti, V., et al. Toward a shared vision for cancer genomic data. New England Journal of Medicine 375:1109–1112, 2016.
4 Dearry, A. Towards a cancer research data commons. DataScience@NIH, Sept 21, 2017. Available at https://datascience.nih.gov/CancerResearchDataCommons
5 Bonazzi, V. NIH Data Commons pilot phase explores using the Cloud to access and share FAIR biomedical big data. DataScience@NIH, June 1, 2017. Available at https://datascience.nih.gov/BlogFAIR
6 Gonçalves, R.S., O’Connor, M.J., Martínez-Romero, M., et al. Metadata in the BioSample online repository are impaired by numerous anomalies. Proceedings of SemSci 2017: Enabling Open Semantic Science. International Semantic Web Conference, Vienna, Austria, October 2017. Available at https://arxiv.org/pdf/1708.01286.pdf
7 Lloyd, W.F. Two lectures on the checks to population. Oxford University, 1833. Available at https://archive.org/details/twolecturesonch00lloygoog
8 Hardin, G. The tragedy of the commons. Science 162(3859):1243–1248, 1968.
9 Noy, N.F., Shah, N.H., Whetzel, P.L., et al. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research 37(Suppl 2):W170–W173, 2009.
10 Coté, R.G., Jones, P., Apweiler, R., et al. The Ontology Lookup Service, a lightweight cross-platform tool for controlled vocabulary queries. BMC Bioinformatics 7:97, 2006.
11 Harris, P.A., Taylor, R., Thielke, R., et al. Research electronic data capture (REDCap)—a metadata-driven methodology and workflow process for providing translational research informatics support. Journal of Biomedical Informatics 42(2):377–381, 2009.
12 Wolstencroft, K., Owen, S., Horridge, M., et al. RightField: embedding ontology annotation in spreadsheets. Bioinformatics 27(14):2021–2022, 2011.
13 Schuler, R.E., Kesselman, C., and Czajkowski, K. Accelerating data-driven discovery with scientific asset management. Proceedings of the 12th International Conference on e-Science, Baltimore, MD, October 2016.
14 Griffiths, E., Dooley, D., Graham, M., et al. Context is everything: harmonization of critical food microbiology descriptors and metadata for improved food safety and surveillance. Frontiers in Microbiology, June 26, 2017, https://doi.org/10.3389/fmicb.2017.01068
15 Musen, M.A., Bean, C.A., Cheung, K.-H., et al. The Center for Expanded Data Annotation and Retrieval. Journal of the American Medical Informatics Association 22(6):1148–1152, 2015. See also https://metadatacenter.org