Dr. Isaac Kohane: Making Our Data Work for Us!

Last weekend, Isaac Kohane, MD, PhD, FACMI, Marion V. Nelson Professor of Biomedical Informatics and Chair of the Department of Biomedical Informatics at Harvard Medical School, received the 2020 Morris F. Collen Award of Excellence at the AMIA 2020 Virtual Annual Symposium. This award – the highest honor in informatics – is bestowed upon an individual whose personal commitment and dedication to medical informatics have made a lasting impression on the field.

Throughout his career, Dr. Kohane has worked to extract meaning from large sets of clinical and genomic data to improve health care. His efforts mining medical data have contributed to the identification of harmful side effects associated with drug therapy, the recognition of early warning signs of domestic abuse, and the detection of variations and patterns among people with conditions such as autism.

As lead investigator of the i2b2 (Informatics for Integrating Biology & the Bedside) project, a National Institutes of Health-funded National Center for Biomedical Computing initiative, Dr. Kohane led the creation of a comprehensive software and methodological framework that enables clinical researchers to accelerate the translation of genomic and “traditional” clinical findings into novel diagnostics, prognostics, and therapeutics.

Dr. Kohane is a visionary with a motto:  Make Our Data Work for Us! Please join me in congratulating Dr. Kohane, recipient of the 2020 Morris F. Collen Award of Excellence.

Hear more from Dr. Kohane in this video.

Video transcript (below)

The vision that has driven my research agenda is that we were not doing our patients any favors by not embracing information technology to accelerate our ability both to discover new findings in medicine and to improve the way we deliver medicine.

What does “make our data work for us” mean? It means let’s not use it just for the reason most of it is accumulated at present, which is to satisfy administrative or reimbursement processes. Let’s use it to improve health care.

Using just our claims data, we can actually predict – better than genetic tests – recurrence rates for autism. It’s the ability to show, with these same data, that the generic form of a drug used to prevent premature birth is just as effective as the brand-name version that is 40 times as expensive. It’s, as we’ve seen most recently, the ability to pull together data around pandemics within weeks, if, and only if, we understand the data that’s spun off our health care systems in the course of care.

And finally, as exemplified by work on FHIR, which was funded by the Office of the National Coordinator and then the National Library of Medicine, the ability to flow the data directly to the patient, finally giving patients access to their data in a computable format and allowing decision support for the patient without going through the long loop of the health care system.

Because the NIH and NLM have invested in real-world-sized experiments in biomedical informatics, in supporting the education of the individuals who drive those projects, and in supporting the public standards that are necessary for these projects to work and to scale, they’ve established an ecosystem that is now able to deliver true value to decision makers, to clinicians, and now to patients, as we’re seeing with SMART on FHIR implementations on smartphones.

So, for those of you — the biomedical informaticians of the future who are clinicians — I strongly recommend that you don’t wait for someone else to fix the system. You have the most powerful tools to affect medicine, information processing tools. So, don’t wait to get old. Don’t wait to be recognized. You have the tools. Get in there, help change medicine. We all depend on you!
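The SMART on FHIR access Dr. Kohane describes comes down to an ordinary, standards-based API call. The sketch below is purely illustrative: the endpoint, access token, and patient identifier are hypothetical, and a real SMART app would first complete the OAuth2 authorization flow. It simply shows what data in a computable format can look like in practice, with a patient-facing app querying a FHIR server for coded observations and applying a trivial decision-support rule.

```python
import requests

# Hypothetical FHIR endpoint and OAuth2 access token. In a real SMART on FHIR
# app, these come from the server's SMART configuration and authorization flow.
FHIR_BASE = "https://fhir.example.org/r4"
HEADERS = {
    "Authorization": "Bearer example-access-token",
    "Accept": "application/fhir+json",
}

def get_observations(patient_id: str, loinc_code: str) -> list:
    """Return a patient's observations for one LOINC-coded measurement."""
    response = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": f"http://loinc.org|{loinc_code}"},
        headers=HEADERS,
        timeout=30,
    )
    response.raise_for_status()
    bundle = response.json()  # a FHIR searchset Bundle
    return [entry["resource"] for entry in bundle.get("entry", [])]

# Example: pull hemoglobin A1c results (LOINC 4548-4) for a hypothetical patient
# and apply a trivial patient-facing decision-support rule.
for obs in get_observations("example-patient-id", "4548-4"):
    value = obs.get("valueQuantity", {}).get("value")
    if value is not None and value >= 9.0:
        print(f"An A1c of {value}% is high; consider discussing it with your care team.")
```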

Introducing the NIH Guide Notice Encouraging Researchers to Adopt U.S. Core Data for Interoperability Standard

Recently, NIH issued a guide notice (NOT-OD-20-146) encouraging NIH-supported clinical programs and researchers to adopt and use the standardized set of healthcare data classes, data elements, and associated vocabulary standards in the U.S. Core Data for Interoperability (USCDI) standard. This standard will make it easier to exchange health information for research and clinical care, and it is required under the Office of the National Coordinator for Health Information Technology (ONC) Cures Act Final Rule to support seamless and secure access, exchange, and use of electronic health information.

USCDI standardizes the health data classes and data elements that enable interoperable sharing of health information across the country, expands on data long required to be supported by certified EHRs, and incorporates established health data standards.

NLM is proud to support USCDI through continued efforts to establish and maintain clinical terminology standards within the Department of Health and Human Services.

Standardized health data classes and elements enable collaboration, make it easier to aggregate research data, and enhance the discoverability of groundbreaking research. USCDI adoption will allow care delivery and research organizations to use the same coding systems for key data elements that are part of the USCDI data classes.
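To make that concrete, here is a minimal, illustrative sketch in Python of how one USCDI data element, a laboratory result, is commonly represented as a FHIR US Core style Observation. The specific values are hypothetical; the point is that the fields describing what was measured and in what units carry shared vocabulary codes (LOINC and UCUM) rather than site-specific ones.

```python
# A minimal, illustrative laboratory-result record shaped like a FHIR US Core
# Observation -- one common way a USCDI laboratory data element is coded.
# All identifiers and values below are hypothetical.
lab_result = {
    "resourceType": "Observation",
    "status": "final",
    "category": [{
        "coding": [{
            "system": "http://terminology.hl7.org/CodeSystem/observation-category",
            "code": "laboratory",
        }]
    }],
    "code": {  # LOINC identifies *what* was measured, the same way at every site
        "coding": [{
            "system": "http://loinc.org",
            "code": "2339-0",
            "display": "Glucose [Mass/volume] in Blood",
        }]
    },
    "subject": {"reference": "Patient/example"},
    "effectiveDateTime": "2020-09-01T08:30:00Z",
    "valueQuantity": {  # units drawn from UCUM, another shared standard
        "value": 95,
        "unit": "mg/dL",
        "system": "http://unitsofmeasure.org",
        "code": "mg/dL",
    },
}

# Because every system uses the same LOINC code for blood glucose, records from
# different EHRs can be pooled for research without site-by-site re-mapping.
assert lab_result["code"]["coding"][0]["system"] == "http://loinc.org"
```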

I encourage you to read more about the new guide notice in a joint post, developed in collaboration with my NIH and ONC colleagues, titled “Leveraging Standardized Clinical Data to Advance Discovery.” And I ask you to consider: What could this notice mean for you?

Some Insights on the Roles and Uses of Generalist Repositories

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, NIH

Data repositories are a useful way for researchers to both share data and make their data more findable, accessible, interoperable, and reusable (that is, aligned with the FAIR Data Principles).

Generalist repositories can house a vast array of data. This kind of repository does not restrict data by type, format, content, or topic. Over the last year, NIH has been exploring the roles and uses of generalist repositories in our data repository landscape through three activities, described below, and has garnered valuable insights.

A pilot project with a generalist repository


Last September, I introduced Musings readers to the one-year Figshare pilot project, which was recently completed. Information about the NIH Figshare instance — and the outcomes of the project — is available on the Office of Data Science Strategy’s website. This project gave us an opportunity to uncover how NIH-funded researchers might utilize a generalist repository’s existing features. It also allowed us to test some specific options, such as a direct link to grant information, expert guidance, and metadata improvements.

There are three key takeaways from the project:

  • Generalist repositories are growing. More researchers are depositing data in, and more publications are linking to, generalist repositories.
  • Researchers need more education and guidance on where to publish data and how to effectively describe datasets using detailed metadata.
  • Better metadata enables greater discoverability. Expert metadata review proved to be one of the most impactful and unique features of the pilot instance, as measured by two key metrics: compared with data uploaded to the main Figshare repository by NIH-funded investigators, files in the NIH Figshare instance had more descriptive titles (about twice as long) and metadata descriptions that were more than three times longer. (A sketch of what such a detailed record can look like follows this list.)
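As a purely hypothetical illustration of what detailed metadata means in practice, a well-described deposit pairs a specific title with a rich description, keywords, a grant link, and a license. The field names and values below are generic examples, not a Figshare or NIH schema.

```python
# An illustrative (not repository-specific) metadata record for a dataset deposit.
# Richer titles, descriptions, and funding links like these are what the expert
# metadata review in the pilot encouraged; all values here are hypothetical.
dataset_metadata = {
    "title": "Longitudinal fasting glucose measurements, mouse diet study, 2019",
    "description": (
        "Weekly fasting glucose (mg/dL) for 40 mice over 16 weeks, comparing "
        "a high-fat diet arm with a control arm. Columns include animal ID, "
        "diet arm, collection date, and assay lot number; see README for layout."
    ),
    "keywords": ["glucose", "high-fat diet", "mouse model", "longitudinal"],
    "funding": {"funder": "NIH", "award": "R01-EXAMPLE-000000"},  # link the deposit to its grant
    "license": "CC0-1.0",
    "related_publication": "https://doi.org/10.0000/example",
}

# A quick completeness check a depositor (or a repository) might run before publishing.
required = ["title", "description", "keywords", "funding", "license"]
missing = [field for field in required if not dataset_metadata.get(field)]
if missing:
    print("Missing fields:", missing)
else:
    print("All core fields present.")
```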

The NIH Figshare instance is now an archive, but the data are still discoverable and reusable. Although this specific pilot has concluded, we encourage NIH-funded researchers to use a generalist repository that meets the White House Office of Science and Technology Policy criteria when a domain-specific or institutional repository is not available.

A community workshop on the role of generalist repositories

In February, the Office of Data Science Strategy hosted the NIH Workshop on the Role of Generalist and Institutional Repositories to Enhance Data Discoverability and Reuse, bringing together representatives of generalist and institutional repositories for a day and a half of rich discussion. The conversations centered around the concept of “coopetition,” the importance of people in the broader data ecosystem, and the importance of code. A full workshop summary is available, and our co-chairs and the workshop’s participating generalist repositories recently published a generalist repository comparison chart as one of the outcomes of this event.

We plan to keep engaging with this community to better enable coopetition among repositories while working collaboratively with repositories to ensure that researchers can share data effectively.

An independent assessment of the generalist repository landscape

We completed an independent assessment to understand the generalist repository landscape, discover where we were in tune with the community, and identify our blind spots. Key findings include the following:

  • There is a clear need for the services that generalist repositories provide.
  • Many researchers currently view generalist repository platforms as a place to deposit their own data, rather than a place to find and reuse other people’s data.
  • Repositories and researchers alike are looking to NIH to define its data sharing requirements, so each group knows what is expected of them.
  • The current lack of recognition and rewards for data sharing helps reinforce the focus on publications as the key metric of scientific output and therefore may be a disincentive to data sharing.

The pilot, workshop, and assessment provided us with a deeper understanding of the repository landscape.

We are committed to advancing progress in this important area of the data ecosystem of which we are all a part. We are currently developing ways to continue fostering coopetition among generalist repositories; strategies for increasing engagement with researchers, institutional repositories, and data librarians; and opportunities to better educate the biomedical research community on the value of effective data management and sharing.

The Office of Data Science Strategy will announce specific next steps in the near future. In the meantime, we invite you to share your ideas with us at datascience@nih.gov.

Dr. Gregurick leads the implementation of the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that make up NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.

How much does it cost to keep data?

Study to forecast long-term costs

Guest post by Elizabeth Kittrie, NLM’s Senior Planning and Evaluation Officer.

As scientific research becomes more data-intensive, scientists and their institutions are increasingly faced with complex questions about which data to retain, for how long, and at what cost.

The decision to preserve and archive research data should not be posed as a yes or no question. Instead, we should ask, “For how many years should this subset of data be preserved or archived?” (By the way, “forever” is not an acceptable response.)

Answering questions about research data preservation and archiving is neither straightforward nor uniform. Certain types of research data may derive value from their unique qualities or from the costs associated with the original data collection. Other types are relatively easy to collect at low cost, yet once collected, they are rarely reused.

To create a sustainable data ecosystem, as outlined in both the NLM Strategic Plan and the NIH Strategic Plan for Data Science, we need strategies to address fundamental questions like:

  • What is the future value of research data?
  • For how long must a dataset be preserved before it should be reviewed for long-term archiving?
  • What are the resources necessary to support persistent data storage?

We believe that economic approaches—including forecasting long-term costs, balancing economic considerations with non-monetary factors, and determining the return on public investment from data availability—can help us make preservation and archiving decisions.
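As a simple illustration of the forecasting piece, the sketch below computes the cost, in today's dollars, of retaining a dataset for a given number of years under assumed (entirely made-up) storage costs, cost decline, and discount rates. A real framework, like the one the study described below will develop, would also weigh accession and de-accession costs, data value, and non-monetary considerations.

```python
# A deliberately simple sketch of a life-cycle cost forecast for keeping a
# dataset, of the kind the NASEM study will treat far more rigorously.
# Every number below is an assumption for illustration, not an NLM estimate.

def present_value_of_storage(
    terabytes: float,
    years: int,
    cost_per_tb_year: float = 50.0,    # assumed starting storage + curation cost, $/TB/year
    annual_cost_decline: float = 0.10, # assume per-TB costs fall 10% per year
    discount_rate: float = 0.03,       # discount future spending to today's dollars
) -> float:
    """Cost, in today's dollars, of retaining a dataset for a given number of years."""
    total = 0.0
    for year in range(years):
        yearly_cost = terabytes * cost_per_tb_year * (1 - annual_cost_decline) ** year
        total += yearly_cost / (1 + discount_rate) ** year
    return total

# Example: what does keeping a 10 TB dataset cost over different horizons?
for horizon in (5, 10, 25):
    print(f"{horizon:>2} years: ${present_value_of_storage(10, horizon):,.0f}")
```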


To that end, NLM has contracted with the National Academies of Sciences, Engineering, and Medicine (NASEM) for a study on forecasting the long-term costs for preserving, archiving, and promoting access to biomedical data. For this study, NASEM will appoint an ad hoc committee that will develop and demonstrate a framework for forecasting these costs and estimating potential benefits to research. In so doing, the committee will examine and evaluate the following:

  • Economic factors to be considered when examining the life-cycle cost for data sets (e.g., data acquisition, preservation, and dissemination);
  • Cost consequences for various practices in accessioning and de-accessioning data sets;
  • Economic factors to be considered in designating data sets as high value;
  • Assumptions built into the data collection and/or modeling processes;
  • Anticipated technological disruptors and future developments in data science over a 5- to 10-year horizon; and
  • Critical factors for successful adoption of data forecasting approaches by research and program management staff.

The committee will provide a consensus report and two case studies illustrating the framework’s application to different biomedical contexts relevant to NLM’s data resources. Relevant life-cycle costs will be delineated, as will any assumptions underlying the models. To the extent practicable, NASEM will identify strategies to communicate results and gain acceptance of the applicability of these models.

As part of its information gathering, NASEM will host a two-day public workshop in late June 2019 to generate ideas and approaches for the committee to consider.  We will provide further details on the workshop and how you can participate in the coming months.

As a next step in advancing this study, we are supporting NASEM’s efforts to solicit names of committee members, as well as topics for the committee to consider.  If you have suggestions, please contact Michelle Schwalbe, Director of the Board on Mathematical Sciences and Analytics at NASEM.

Elizabeth Kittrie is NLM’s Senior Planning and Evaluation Officer. She previously served as a Senior Advisor to the Associate Director for Data Science at the National Institutes of Health and as Senior Advisor to the Chief Technology Officer of the US Department of Health and Human Services. Prior to joining HHS, she served as the first Associate Director for the Department of Biomedical Informatics at Arizona State University.

Health Disparities: Big Data to the Rescue?

Guest post by Dr. Fred Wood, Outreach and Evaluation Scientist in the Office of Health Information Programs Development.

Socially disadvantaged populations have fewer opportunities to achieve optimal health. They also experience preventable differences when facing disease or injury. These inequities, known collectively as health disparities, significantly impact personal and public health.

Despite decades of research on health disparities, we researchers, clinicians, and public health specialists have not seen the changes we were hoping for. Instead, many health disparities are proving difficult to reduce or eliminate.

With that in mind, the National Institutes of Health (NIH) National Institute on Minority Health and Health Disparities (NIMHD) launched a Science Visioning Process in 2015 with the goal of producing a scientific research plan that would spark major breakthroughs in addressing disparities in health and health care. NIMHD defines health disparity populations to include racial and ethnic minorities, gender and sexual minorities, people with low socioeconomic status, and underserved rural populations.

Through a mix of staff research and trans-NIH work groups—of which the National Library of Medicine is a part—NIMHD is gathering input on the current state of the science on minority health and health disparities.

Prompted in part by the NIH All of Us precision medicine initiative, one key visioning area—methods and measures for studying health disparities—includes big data.

We expect big data to bring significant benefits and changes to health care, but can it also play a part in reducing health disparities?

Last month the journal Ethnicity & Disease published a special issue focused on big data and its applications to health disparities research (Vol. 27, No. 2).

The issue includes a paper co-authored by the current NIMHD director, several NIH researchers (including me), and several academic partners. Titled “Big Data Science: Opportunities and Challenges to Address Minority Health and Health Disparities in the 21st Century” (PDF | 436 KB), the paper identifies three major opportunities for big data to reduce health disparities:

  1. Incorporate social determinants of health disparities information—such as race/ethnicity, socioeconomic status, and genomics—in electronic health records (EHRs) to facilitate research into the underlying causes of health disparities.
  2. Include in public health surveillance systems environmental, economic, health services, and geographic data on targeted populations to help focus public health interventions.
  3. Expand data-driven research to include genetic, exposure, health history, and other information, to better understand the etiology of health disparities and guide effective interventions.

But using big data for health disparities research has its challenges, including ethics and privacy issues, inadequate data, limited data access, and the need for a skilled, diverse workforce.

The paper offered eight recommendations to counteract those challenges:

  1. Incorporate standardized collection and input of race/ethnicity, socioeconomic status, and other social determinants of health measures in all systems that collect health data.
  2. Enhance public health surveillance by incorporating geographic variables and social determinants of health for geographically defined populations.
  3. Advance simulation modeling and systems science using big data to understand the etiology of health disparities and guide intervention development.
  4. Build trust, through sustainable, long-term community relationships, to address historical concerns and current fears of privacy loss and “big brother” surveillance.
  5. Invest in data collection on area-relevant small sample populations to address incompleteness of big data.
  6. Encourage data sharing to benefit under-resourced minority-serving institutions and underrepresented minority researchers in research intensive institutions.
  7. Promote data science in training programs for underrepresented minority scientists.
  8. Ensure that active efforts to reduce disparities are made up front, during both the planning and implementation stages of new big data resources.

Big data, it seems, is the classic double-edged sword. It offers tremendous opportunities to understand and reduce health disparities, but without deliberate and concerted action to address its inherent challenges and without the active engagement of minority communities in that process, those disparities could widen, keeping the benefits of precision medicine—including improved diagnosis, treatment, and prevention—from millions of those who need them.

How do you think big data will inform health disparities research? And what else might we do to ensure the disparities gap continues to close?