This is the fourth in a series of posts that discuss the principles underlying the three-year collaborative program “Joint Design of Advanced Computing Solutions for Cancer (JDACS4C).” Investigators from the National Cancer Institute (NCI) and the Frederick National Laboratory for Cancer Research have been working collaboratively with computational and data scientists affiliated with several national laboratories supported by the Department of Energy (DOE): principally Argonne, Los Alamos, Lawrence Livermore, and Oak Ridge. Their aim is to develop and apply large-scale computational approaches to answer challenges in cancer biology, surveillance, screening, and pre-clinical development.
The NCI Surveillance, Epidemiology, and End Results (SEER) Program is collaborating with four DOE national laboratories—Oak Ridge National Laboratory (ORNL), Argonne National Laboratory (ANL), Los Alamos National Laboratory (LANL), and Lawrence Livermore National Laboratory (LLNL)—to leverage the capabilities of high-performance computing to support implementation of a more advanced population-based cancer surveillance program and to develop an integrated framework for modeling and simulation from the individual patient to the population level.
The objective of the Population Level Pilot is to deliver working tools for automating and augmenting the work of central cancer registries. The pilot will also provide an infrastructure that will support the ongoing and sustainable development of scalable algorithms and informatics tools. This infrastructure will enhance the existing national cancer surveillance programs while also expanding the breadth of data captured to integrate more detailed data sources and variables that can be used to model and simulate cancer treatment outcomes.
This pilot has been organized into three aims. Aim 1 will develop scalable machine learning-based natural language processing (NLP) tools for deep comprehension of unstructured clinical text to enable automated and accurate capture of reportable cancer surveillance data elements. Aim 2 will build a working infrastructure that integrates heterogeneous data sources for research. Scalable graph and visual analytics tools will be used to explore cancer treatment patterns and outcomes, including recurrence and survival. Aim 3 will use the information extracted from clinical text in Aim 1 and the integrated data sources from Aim 2 to develop a data-driven modeling and simulation environment for predicting the health trajectories of cancer patients or populations.
Reading pathology reports and extracting the key clinical information in them, such as the type of cancer and the behavior of a tumor, is a time-consuming task for tumor registrars, who require significant training and experience to interpret clinical content correctly. Using deep learning methods, researchers at participating DOE laboratories have been creating robust NLP tools to automate the task of identifying and coding the cancer site, histology, laterality, grade, and behavior of tumors described in pathology reports. These models have been built in part on ORNL’s Titan and Summit supercomputers, and the resulting models have been delivered for evaluation and implementation within the SEER cancer registry pipeline. The tools under development will be packaged with an uncertainty quantification (UQ) component that scores each predicted code for each document processed. These models will be tested and reviewed in an iterative process to ensure that the highest levels of precision and accuracy are achieved. Multiple hackathons are planned to review the outcomes of tool testing and to increase engagement with the diverse datasets being made available to the pilot. To support the development of additional NLP tools, NCI has built an infrastructure for creating large, manually annotated datasets that pair coded data fields with the associated highlighted text. This infrastructure is currently being used to annotate breast cancer and recurrence data elements found in electronic pathology reports.
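To illustrate how a UQ score might gate a classifier’s output, the minimal sketch below (not the pilot’s actual implementation; the candidate codes, scoring rule, and review threshold are all hypothetical) converts a model’s softmax probabilities over candidate histology codes into a confidence score and flags low-confidence documents for manual review by a registrar:

```python
import math

# Hypothetical candidate ICD-O-3 histology codes (illustrative only).
CODES = ["8500/3", "8520/3", "8140/3"]

def predict_with_confidence(probs, codes=CODES, threshold=0.8):
    """Pick the highest-probability code and attach a confidence score.

    The score used here is 1 minus the normalized Shannon entropy of the
    softmax distribution: close to 1.0 for a near-one-hot prediction,
    0.0 for a uniform one. Documents scoring below `threshold` are
    flagged for manual review rather than auto-coded.
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    confidence = 1.0 - entropy / math.log(len(probs))
    best = max(range(len(probs)), key=lambda i: probs[i])
    return {
        "code": codes[best],
        "confidence": confidence,
        "needs_review": confidence < threshold,
    }

# A confident prediction can be accepted automatically...
print(predict_with_confidence([0.97, 0.02, 0.01]))
# ...while an ambiguous one is routed to a human registrar.
print(predict_with_confidence([0.40, 0.35, 0.25]))
```

Entropy-based scoring is only one of many UQ approaches; ensemble disagreement or calibrated per-class probabilities could serve the same gating role in such a pipeline.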
NCI continues to identify and link external data sources with SEER data to enable the expansion of longitudinal data to form patient trajectories and to support modeling efforts. To inform the incorporation of those additional sources, NCI compiled an extensive breast cancer recurrence data dictionary to identify recurrence-related data elements across multiple sources, including pathology, radiology, pharmacy, biomarkers, procedures, comorbidities, patient-generated information, and radiation oncology. The Population Level Pilot Team is also collaborating with clinical experts to construct research agendas that will be used to create disease-specific use cases for scalable predictive modeling and analytics using a variety of integrated datasets.
The NCI-DOE Population Level Pilot collaboration will enhance the SEER Program through automated data abstraction, linkage and visualization of data sources, and predictive modeling. Preliminary results indicate that deep learning methods can automate the extraction of selected registry data elements, which will enhance cancer surveillance efforts and reduce registry workload. Additionally, modeling of longitudinal patient trajectories will allow researchers to review initial breast cancer treatments and examine novel patterns. This application of data-driven modeling seeks to answer key clinical oncology questions and to inform clinician decision-support tools, which can improve patient treatment selection and health outcomes.