The Intangible Rewards of Engaging with Research Data

Guest post by Amanda K. Rinehart, MS, MLIS, Life Sciences Librarian and Associate Professor for the Department of Research and Education, University Libraries, at the Ohio State University. Ms. Rinehart will deliver the 2022 Joseph Leiter NLM/Medical Library Association (MLA) Lecture, “Data Communities: Room for Everyone, Roles for Librarians,” on December 6, 2022.

As I reread the OSTP Public Access Memos from 2013 and 2022, I am struck again by the premise behind openly sharing research data:

When federally funded research is available to the public, it can improve lives, provide policymakers with important evidence with which to make critical decisions, accelerate the rates of discovery and translation, and drive more equitable outcomes across every sector of society.

That’s ambitious enough but sharing research data goes a few steps further: It also uses our taxpayer funds more efficiently, increases public trust in the scientific endeavor, and facilitates research collaboration. However, if you haven’t had the opportunity to be a part of it yet, these can remain abstract motivations and may seem daunting. How and why would any librarian engage with sharing research data?

Research data management (RDM) is the organization, storage, preservation, and sharing of data. When a researcher is faced with a new RDM expectation—especially one that is often seen as a burden instead of a boon—it’s a natural fit for them to trust their librarian to help. Librarians are allies in this changing and confusing landscape. Because an inherent aspect of cutting-edge research is that it has never been done before, it often results in data that has never been previously collected or synthesized. As a result, this data doesn’t yet have ideal resources, workflows, or technologies for sharing it.

There is rarely one solution or easy answer. Librarians must ascertain what the researcher needs, whether it be awareness of new requirements, information about their options to meet these requirements, or education about better data management practices. We point to resources like shared curation training, multi-institutional partnerships and international perspectives, or appropriate data repositories. We acknowledge workflow gaps and challenges and summarize those needs across disciplines and institutions. We advocate for better resources, services, and support for managing research data. Because of this complexity, finding the combination of resources that results in appropriate sharing is more akin to building a relationship or becoming part of the research team than to a transactional interaction.

However, providing assistance that is tailored to the specific needs of the researcher takes time, effort, and knowledge. Because RDM is a burgeoning field heavily dependent on changing technology and policy, staying abreast of current practices is a heavy investment as well. Most librarians, if not all of them, face reduced staffing, longer hours, more responsibilities, and limited pay. So what would induce a librarian to engage with research data? For my part, I share the values cited in the OSTP Public Access Memos, so I have found many of my RDM interactions quite rewarding, and I suspect other librarians do as well.

For example, I participated in a meeting composed of researchers who wished to improve infant and maternal health outcomes for local lower-income communities. In our county, we have a high rate of infant and maternal mortality, with Black infants dying at 2.7 times the rate of White infants. The researchers wanted to use an app, pre-installed on free phones, to make transportation to health care providers low cost or free. But how would they manage sensitive location and appointment data? Who needs access to that data, and when? What regulations apply, and how can we go beyond those requirements to make sure we are ethical? These are difficult questions, but they can lead to heartening discussions and innovative solutions with custom databases, Data Use Agreements, de-identification, encryption, and ultimately, data destruction.

This is just one example that touches on topics that I care about: infant and maternal health, social justice and equity in health care, and effective, efficient transportation as part of city infrastructure. But I’m not a health care provider, a sociologist, or a city planner. I’m a librarian, and as such, I can contribute by meeting researchers where they are, determining their most urgent needs, guiding them to resources, identifying gaps in knowledge and services, and advocating on their behalf to have those gaps filled.

I know my work doesn’t solve these large real-world problems or even just the problem of making research data available to those who can most beneficially use it. But any improvement in RDM practice gets us one step closer. I don’t have to solve the world’s problems to help solve the world’s problems. If you care about how data can be used to fulfill the NIH mission to “enhance health, lengthen life, and reduce illness and disability,” then you can see the value in becoming engaged with research data and how librarians can help researchers meet that goal.

Prior to starting her career as a librarian in 2011, Ms. Rinehart spent eleven years as a biologist with the United States Department of Agriculture testing alternative agricultural methods to reduce the human impact on climate change. Ms. Rinehart has a Master of Library and Information Science degree from the School of Information at the University of South Florida and a Master of Science degree in Botany and Plant Pathology from Michigan State University.

NLM is Celebrating 40 Years of Biomedical Training

Guest post by Richard C. Palmer, DrPH, JD, Acting Director, Division of Extramural Programs, National Library of Medicine (NLM), National Institutes of Health (NIH).

This summer, NLM is marking its 40th year of supporting Biomedical Informatics and Data Research Training (T15). This is an amazing accomplishment, and I extend my congratulations to all the past and present institutional training grant directors, trainees, and NLM staff who have helped mature the field, grow the scientific workforce, and prepare this country for a biomedical revolution. This revolution harnesses the power of data to improve scientific exploration, clinical care, public health practice, and personal health.

Although almost 40 years have passed, NLM is more committed than ever to support career training, which is a central component of the NLM Strategic Plan. Recently, NLM released a new R25 program focused on supporting innovative educational programs and research experiences aimed at preparing talented and diverse students for future careers in biomedical informatics and data science. NLM also recently awarded 18 T15 grant awards, the largest number of awards made to date, to help ensure an available data-driven biomedical informatics and data science workforce. About 170 graduate and postdoctoral students will be trained annually by the T15 program to meet this growing workforce demand. NLM recognizes that we need to invest in training to ensure that a well-trained informatics and data science workforce exists to address the health needs of this nation.

Personally, I am amazed by just how fast the biomedical informatics and data science field has grown in the past 10 years. I entered this field with a study that aimed to build a clinical decision support tool to help manage fall risk for older adults, and I vividly remember the headache associated with the interoperability (the ability of computers or software to exchange and use data or other information) of data sources. Since then, I have witnessed rapid change, due in part to continued advances in computing, data storage, and standardization, that has allowed biomedical informatics to advance quickly. To harness this acceleration in the acquisition, storage, retrieval, and use of information in health research and for the biomedical enterprise, we need a highly skilled workforce. The demand for scientists who are trained in these areas and can apply these skills to health and biomedicine is higher than the current supply. NLM’s commitment to training is helping ensure that a workforce capable of leading innovation exists.

Since joining NLM, I have had the opportunity to learn more about NLM’s T15 training program and the impact it’s had. Forty years is a long time, so I pieced together data to identify a common trend: The majority of the T15 trainees move on to research-oriented roles in academic institutions, not-for-profit research organizations, governmental and public health agencies, pharmaceutical and software companies, and health care organizations. Those in training over the past 10 years published 2,350 articles, with nearly 23% of these publications being highly cited, and were associated with 23 patents. In addition, T15 trainees are taking on leadership roles in academia, health centers, and research organizations. Even NIH’s own Dr. Josh Denny, who leads the All of Us program, and Dr. Michael Chiang, Director of the National Eye Institute, are former T15 trainees.

Just recently, I was able to participate in my first T15 trainee conference hosted by the University at Buffalo, SUNY and saw what research T15 trainees were involved with. What impressed me was the passion these trainees had for the science and their commitment to tackling pressing biomedical issues. Trainees are conducting research in areas including basic biomedical research, health care delivery, clinical and translational research, public health surveillance, and consumer health. Given their level of engagement, there is little doubt that many current T15 trainees will build successful scientific careers that will benefit society tremendously. At NLM, we are committed to training and fostering the development of the next generation of biomedical informatics and data scientists and look forward to the scientific advances they make. They say time flies when you’re having fun, and the last four decades sure have flown by. Here’s to another 40 years of NLM-supported training!

Dr. Palmer oversees NLM’s grant programs for research, resources, workforce development, and small business related to biomedical informatics and data science. Prior to joining NLM, Dr. Palmer was a Health Scientist Administrator at the National Institute on Minority Health and Health Disparities. He has over 25 years of extramural research experience and has been an investigator on NIH and CDC funded research grants. Dr. Palmer has conducted research in health care and community-based settings aimed at addressing health disparities, understanding health care decision-making, and improving health outcomes and disease management among older adults.

Want to learn more about NLM’s support for training?

View a panel discussion on Lindberg and the Advancement of Science through Research Training held during the 2022 Lindberg-King Lecture and Scientific Symposium: Science, Society, and the Legacy of Donald A.B. Lindberg, MD on September 1. The panel addressed the leadership of Dr. Donald A.B. Lindberg, former NLM director, in the advancement of science through research training with emphasis on the field of informatics.

RADx-UP Program Addresses Data Gaps in Underrepresented Communities

Guest post by Richard J. Hodes, MD, Director, National Institute on Aging, and Eliseo Pérez-Stable, MD, Director, National Institute on Minority Health and Health Disparities, NIH.

A few months into the COVID-19 pandemic, we shared how NIH was working to speed innovation in the development, commercialization, and implementation of technologies for COVID-19 through NIH’s Rapid Acceleration of Diagnostics (RADx) initiative.

Two years later, one of the RADx programs—RADx Underserved Populations (RADx-UP)—reflects on lessons learned that have broken the mold of standard research paradigms to address health disparities.

Use of Common Data Elements

RADx-UP has presented unique challenges in terms of data collection, privacy concerns, measurement standardization, principles of data-sharing, and the opportunity to reexamine community-engaged research. Common Data Elements (CDEs) are standardized, precisely defined questions paired with a set of allowable responses and used systematically across different sites, studies, or clinical trials to ensure that the whole is greater than the sum of its parts. They are not commonly used in community-engaged research. Use of CDEs enables data harmonization, aggregation, and analysis of related data across study sites as well as the ability to investigate relationships among data in unrelated data sets. CDEs can also lend statistical power to analyses of data for small subpopulations typically underrepresented in research.
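
As a rough illustration of why a shared question-and-allowable-response definition makes pooling possible, the sketch below defines one hypothetical CDE, checks records from two sites against it, and aggregates only the records that conform. The element name, response codes, and site records are invented for illustration and are not taken from the RADx-UP CDE set.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CommonDataElement:
    """A standardized question paired with its set of allowable responses."""
    name: str
    question: str
    allowed: frozenset


# Hypothetical CDE; the wording and response codes are illustrative assumptions.
TESTED_BEFORE = CommonDataElement(
    name="covid_test_ever",
    question="Have you ever been tested for COVID-19?",
    allowed=frozenset({"yes", "no", "prefer_not_to_answer"}),
)


def validate(cde, records):
    """Split records into those with an allowable response and those without."""
    valid, invalid = [], []
    for record in records:
        (valid if record.get(cde.name) in cde.allowed else invalid).append(record)
    return valid, invalid


def pool(cde, *sites):
    """Count responses across sites, keeping only records that conform to the CDE."""
    counts = Counter()
    rejected = 0
    for records in sites:
        valid, invalid = validate(cde, records)
        counts.update(r[cde.name] for r in valid)
        rejected += len(invalid)
    return counts, rejected


site_a = [{"covid_test_ever": "yes"}, {"covid_test_ever": "no"}]
site_b = [{"covid_test_ever": "yes"}, {"covid_test_ever": "Y"}]  # "Y" is not an allowed code

counts, rejected = pool(TESTED_BEFORE, site_a, site_b)
print(dict(counts), rejected)
```

Because both sites answer the same question with the same codes, their counts can be added directly; the off-list response from the second site is flagged rather than silently merged.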

RADx-UP is a community-engaged research program that builds on years of developing partnerships between communities and scientists. RADx-UP has funded 127 research projects with sites in every state and six U.S. territories as well as a RADx-UP Coordination and Data Collection Center (CDCC). RADx-UP assesses the needs and barriers related to COVID-19 testing and increases access to COVID-19 testing in underserved and vulnerable populations experiencing the highest rates of disparities in morbidity and mortality.

The COVID-19 pandemic necessitated establishing RADx-UP and its associated CDEs with unprecedented speed, relying heavily on data elements derived from those already defined in the NIH-based PhenX Toolkit and Disaster Research Response (DR2) resources. The short time frame for this process did not allow for collaboration with and input from RADx-UP investigators and community partners as extensive as would have been ideal. Additionally, many researchers, especially community partners engaged in RADx-UP projects, were not familiar with CDE data collection practices. As a result, CDE questionnaires had to be modified as studies progressed to better suit the needs of the consortium, and investigators new to CDE collection had to be familiarized with these processes quickly. NIH program officers, NIH RADx-UP and CDCC leadership, and engagement impact teams (EITs, the staff liaisons provided by the CDCC that link RADx-UP research teams to testing, data, and community-engagement resources) helped research teams implement and adjust CDE collection, ensured alignment across consortium research teams, and assisted with other data-related issues that arose.

All RADx programs are required to collect a standardized set of CDEs, including sociodemographic, medical history, and health status elements with the intent to provide researchers rapid access to data for secondary research analyses in the RADx Data Hub, the central repository for RADx data. However, implementation of CDEs in the context of underserved communities in the rapidly evolving COVID-19 pandemic presented complex issues for consideration.

Some of these issues included data privacy, the risk of re-identification of underserved and undocumented populations, and data collection burden on participants as well as researchers. The privacy of health data is protected under federal law. The RADx-UP program instituted measures to ensure program participants’ data remain protected and de-identified, using a token-based hashing method that allows researchers to share individual-level participant data without exposing personally identifiable information. To address data collection and respondent burden concerns, projects modified questions to allow some flexibility in expanding response options to ones more appropriate to some underserved communities. The CDCC also developed COLECTIV, a digital interface for projects to enter data directly into the data repository, and included gateway questions to relieve respondent burden.
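
The post does not describe the hashing method itself, so the following is only a minimal sketch of one common way such tokens can be built: a keyed hash (HMAC-SHA256) over normalized identifiers. The choice of fields, the key handling, and the function names are assumptions made for illustration, not the RADx-UP implementation.

```python
import hashlib
import hmac


def normalize(value: str) -> str:
    """Lowercase and remove whitespace so trivial variants hash identically."""
    return "".join(value.lower().split())


def make_token(secret_key: bytes, first: str, last: str, dob: str) -> str:
    """Return a hex token derived from identifiers with HMAC-SHA256.

    The fields used and the choice of HMAC-SHA256 are illustrative assumptions.
    """
    message = "|".join(normalize(v) for v in (first, last, dob)).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


def deidentify(secret_key: bytes, record: dict) -> dict:
    """Replace identifying fields with a token, keeping the research fields."""
    token = make_token(secret_key, record["first"], record["last"], record["dob"])
    kept = {k: v for k, v in record.items() if k not in {"first", "last", "dob"}}
    return {"token": token, **kept}


KEY = b"example-key-held-by-a-trusted-party"  # hypothetical; never hard-code a real key

a = deidentify(KEY, {"first": "Ana", "last": "Diaz", "dob": "1950-01-02", "tested": "yes"})
b = deidentify(KEY, {"first": " ana ", "last": "DIAZ", "dob": "1950-01-02", "tested": "no"})
print(a["token"] == b["token"])  # same person, same token, no name or birth date shared
```

The point of the design is that two records for the same person can still be linked through the token, while the shared data set carries no name or birth date. Without the secret key, the token cannot be recomputed from guessed identifiers.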

Respect for Tribal Data Sovereignty

RADx-UP leadership and investigators recognized that additional considerations for tribal sovereignty, practices, and policies needed to be addressed for projects that include American Indian and Alaska Native (AI/AN) participants. Through consultations with the NIH Tribal Advisory Committee and the broader AI/AN community and meetings with an informal RADx-UP AI/AN project working group established by the CDCC, NIH realized that deposition of tribal data into the RADx Data Hub would not meet the cultural, governance, or sovereignty needs of AI/AN RADx research data. In response, NIH hopes to establish a RADx Tribal Data Repository (TDR) responsible for the collection, protection, and sharing of data collected in AI/AN communities with respect for the practices and policies of Tribal data sovereignty. Applications for the repository have been solicited and NIH hopes to make an award for the TDR sometime in FY23.

Rapid Data Sharing

One of the largest hurdles the RADx-UP program has faced is implementing rapid sharing of research data for secondary analyses and to inform decision-making and public health practices related to the COVID-19 pandemic. RADx-UP research teams are expected to share their data on a timely cadence before data collection ends. This is a far more stringent practice than the current standard NIH data-sharing policy, which requires data to be shared at the time of acceptance for publication of the main findings from the final data set. NIH and CDCC staff have worked together with the RADx research community to highlight the importance of rapid data-sharing and to support compliance with it. Within the first six months, a total of 69 Phase 1 projects began transmitting CDE data to the RADx-UP CDCC.

The COVID-19 pandemic posed a tremendous challenge, and NIH responded by collaborating with vulnerable and underserved communities. This collaboration has opened an unprecedented opportunity to build on a now established foundation for future research to address gaps in understanding the broader social, cultural, and structural factors that influence disparities in morbidity and mortality from COVID-19 and other diseases. Data collection and sharing efforts of the RADx-UP initiative comprise a significant contribution. Collaboration among the NIH, research investigators, and communities impacted by COVID-19 has been the catalyst. To learn more about RADx-UP, please see a recent journal article available on PubMed.

Dr. Hodes has served as NIA director since 1993, overseeing studies of the biological, clinical, behavioral, and social aspects of aging. He has devoted his tenure to the development of a strong, diverse, and balanced research program focused on the genetics and biology of aging, basic and clinical studies aimed at reducing disease and disability, and investigation of the behavioral and social aspects of aging. Ultimately, these efforts have one goal — improving the health and quality of life for older people and their families. As a leading researcher in the field of immunology, Dr. Hodes has published more than 250 peer-reviewed papers.

Dr. Pérez-Stable practiced primary care internal medicine for 37 years at the University of California, San Francisco before becoming the Director of NIMHD in 2015. His research interests have centered on improving the health of individuals from racial and ethnic minority communities through effective prevention interventions, understanding underlying causes of health disparities, and advancing patient-centered care for underserved populations. Recognized as a leader in Latino health care and disparities research, he spent 32 years leading research on smoking cessation and tobacco control in Latino populations in the United States and Latin America. Dr. Pérez-Stable has published more than 300 peer-reviewed papers.

Meet the Next Generation of Leaders Advancing Data Science and Informatics at NLM

Guest post by Virginia Meyer, PhD, Training Director for the Intramural Research Program, National Library of Medicine, National Institutes of Health.

Working at NLM means being at the forefront of innovation in the rapidly evolving fields of data science and informatics. Within that environment, the NLM Intramural Research Program (IRP) is dedicated to supporting individuals looking to develop and apply computational approaches to a broad range of problems in biomedicine, molecular biology, and health.

NLM understands that contributions from people of diverse backgrounds, cultures, and histories enable research that has the greatest impact and reaches the widest possible audience. Such a workforce is necessary to drive innovation and scientific advancement and is imperative to ensuring that computational tools and data sets are free from bias. To that end, the Diversity in Data Science and Informatics (DDSI) Summer Internship, a program of the NLM IRP now in its inaugural year, was developed to support and engage young scientists who are dedicated to careers in computational biology and biomedical informatics. It is our hope that time spent in the DDSI program and with Principal Investigators (PIs) will encourage trainees to continue along the path toward becoming leaders in their chosen fields.

Meet four of this year’s DDSI interns and learn about the work they are doing in the NLM IRP!

Will Hibbard
Graduate Student in Biomedical Informatics
University at Buffalo

PI: Olivier Bodenreider, MD, PhD, Computational Health Research Branch, Lister Hill National Center for Biomedical Communications at NLM
Research Area: Natural Language Processing

What interested you most about the DDSI program?
I found out about the program when a teacher recommended it to me out of the blue, and after looking into it, I found a lot of fun research projects I could join. The program offered an opportunity to join research projects in familiar and unfamiliar fields. Ultimately, it was pleasantly outside of my comfort zone and presented the kind of challenge that makes me love research.

What research project are you working on and why?
I ended up working with Dr. Olivier Bodenreider using neural networks to better develop natural language processing in medical databases. I applied to this project because it involved two areas in which I had less experience: ontology and data structures. I pursued this research area because it allowed me the chance to improve in fields that I did not understand well at the time.

Why might someone want to apply to the DDSI program in the future?
This is the kind of experience with challenges that allow you to grow as a person and as a professional. Whether you know the area of research well or have trouble understanding it, this program will give you an opportunity to learn through a practical research project.

What is next for you after you complete your internship?
I will be taking a gap year while I apply to medical school. I am hoping to work in my local oncology institute and medical corridor.

MG Hirsch
PhD Student in Computer Science
University of Maryland, College Park

PI: Teresa Przytycka, PhD, Computational Biology Branch, National Center for Biotechnology Information at NLM
Research Area: Evolutionary Genomics

What interested you most about the DDSI program?
Evolution of gene expression and modeling different modes of evolution is something that I had yet to explore in my PhD research. I thought a summer program would be perfect to learn about it. It also gives me the opportunity to get a feel for working at the NIH and if I would want to consider the NIH Graduate Partnerships Program.

What research project are you working on and why?
I am evaluating the possibility of different modes of gene expression evolution within a tumor. Previous work in the lab considers different models of gene expression evolution between animal species. Many models of evolution assume neutral evolution, that mutations occur and persist randomly; however, we know that mutations that change phenotypes undergo various selective pressures from the environment. Considering this, previous work, resulting in the software EvoGeneX, has fit computational models using Ornstein-Uhlenbeck processes to evaluate potential divergence of gene expression within fly species. My research project is applying this same concept to cancer tumors. After tumorigenesis, cancer cells rapidly accumulate further mutations and diversify into subclones within the same tumor. Owing to the different sets of mutations, these subclones evolve differently. We can hypothesize then that the evolution of the gene expression of subclones can be modeled using the same computational models.
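
To make the contrast between neutral and constrained evolution concrete, here is a small simulation sketch. It is not the EvoGeneX implementation, and all parameter values are invented. It simulates dX = alpha * (theta - X) dt + sigma dW with a simple Euler-Maruyama step: with alpha = 0 the process is Brownian motion (neutral drift), and with alpha > 0 it is an Ornstein-Uhlenbeck process pulled toward an optimum theta.

```python
import math
import random
import statistics


def simulate(alpha, theta, sigma, x0=0.0, t_end=10.0, steps=1000, rng=random):
    """Euler-Maruyama simulation of dX = alpha*(theta - X) dt + sigma dW.

    alpha = 0 gives Brownian motion (no pull toward an optimum);
    alpha > 0 gives an Ornstein-Uhlenbeck process pulled toward theta.
    Returns the value of X at time t_end.
    """
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += alpha * (theta - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# 500 independent lineages under each model; parameters are hypothetical.
neutral = [simulate(0.0, 0.0, 1.0, rng=rng) for _ in range(500)]
constrained = [simulate(2.0, 0.0, 1.0, rng=rng) for _ in range(500)]

var_neutral = statistics.variance(neutral)
var_constrained = statistics.variance(constrained)
print(var_neutral > var_constrained)
```

Under the neutral model the spread across lineages keeps growing with time, while under the constrained model it levels off near sigma^2 / (2 * alpha). That difference in spread is the kind of signal that lets a fitted model distinguish drift from selection toward an optimum.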

Why might someone want to apply to the DDSI program in the future?
The DDSI program offers extra speaker talks and networking opportunities.

What is next for you after you complete your internship?
I will be finishing my PhD in computer science at UMD.

Sirisha Koirala
Undergraduate Student in Computer Science
University of Maryland, College Park

PI: Zhiyong Lu, PhD, Computational Biology Branch, National Center for Biotechnology Information at NLM
Research Area: Natural Language Processing and Computational Biology

What interested you most about the DDSI program?
I was most interested in the unique ongoing research projects that students had the opportunity to participate in, which I would not have been able to find at other programs. It was very interesting to learn about the ways that artificial intelligence (AI) could be applied to medical practices, and this stood out to me as medicine and AI are two of my main interests.

What research project are you working on and why?
I am working on AI in the prediction of progression in age-related macular degeneration. In my first year of college, I was on the pre-medicine track; however, while gaining greater exposure, I realized that I have a stronger passion for computer science. Within the field of computer science, I have a particular interest in AI, and this project specifically allowed me to combine both of my interests and backgrounds.

Why might someone want to apply to the DDSI program in the future?
The DDSI program provides students who come from underrepresented backgrounds a chance to gain real hands-on experience. As a student who came from a small, all-women’s university where I did not have the availability to engage in such opportunities, this program has helped me significantly. I have been able to get the real-world experience I need to help me excel further in my career preparations, and students who are in similar positions should consider applying for this reason.

What is next for you after you complete your internship?
After I complete my internship, I will be starting my second year of college at University of Maryland, College Park where I am pursuing a major in computer science.

Tochi Oguguo
Undergraduate Student in Computer Science and Information Systems
University of Maryland, Baltimore County

PI: Sameer Antani, PhD, Computational Health Research Branch, Lister Hill National Center for Biomedical Communications at NLM
Research Area: Bias in Machine Learning

What interested you most about the DDSI program?
What interests me the most about this program is the amount of experience you gain during the summer. You leave understanding concepts at a higher level and applying lessons to your life outside of research.

What research project are you working on and why?
My research project is about bias in machine learning. By using fair active learning, we teach the machine how to give accurate responses when diagnosing or classifying a dataset or image. Bias is one of the biggest issues in machine learning, especially in health care where inaccurate judgment can be dangerous.

Why might someone want to apply to the DDSI program in the future?
DDSI is a great program to help students and interns learn more about career paths out there for them to explore and to help you become a more resilient person and scientist outside of research.

What is next for you after you complete your internship?
I plan to apply again next summer and keep working in research and machine learning! Also, I will take more classes in information science to help me become a better programmer.

Using Large Datasets to Improve Health Outcomes

Guest post by Lyn Hardy, PhD, RN, Program Officer, Division of Extramural Programs, National Library of Medicine, National Institutes of Health.

Before the advent of algorithms to determine the best way to treat and prevent heart disease, a health care provider looking for best practices for their patients may not have had the resources to find that best method. Today, health care decision-making for individuals and their health care providers is made easier by predictive and preventive models, which were developed with the goal of guiding the decision-making process. One example is the Patient Level Prediction of Clinical Outcomes and Cost-Effectiveness project led by Columbia University Health Sciences.

These models are created using computer algorithms (a set of rules for problem-solving) based on data science methods that analyze large amounts of data. While computers can analyze facts within the data, they rely on human programming to define what pieces of data or what data types are important to include in the analysis to create a valid algorithm and model. The results are translated into information that health care providers can use to understand patterns and provide methods for predicting and preventing illness. If a health care provider is looking for ways to prevent heart disease, an accurate model might describe methods—like exercise, diet, and mindfulness practices—that can achieve that goal.

Algorithms and models have benefited the world by using special data science methods and techniques to understand patterns that guide clinical decisions, but practitioners still need to scrutinize the data used in their development and remain conscious of how it shapes the results. Research has shown that algorithms and models can be misleading or biased if they do not account for population differences like gender, race, and age. These biases, the subject of work on algorithmic fairness, can adversely affect the health of underserved populations by depriving individuals and health care providers of information that is specific to them and directly addresses their diversity. An example of potential algorithm bias is an algorithm for treating hypertension that does not include varied treatments for women or consider life-related stress or the environment.
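
One simple way to look for this kind of bias is to compare a model's performance across population groups rather than only in aggregate. The sketch below computes the true positive rate per group on invented toy data. The group labels, record layout, and numbers are assumptions for illustration, and this is only one of many possible fairness checks.

```python
from collections import defaultdict


def true_positive_rate_by_group(records):
    """Return {group: TPR} over records with keys 'group', 'actual', 'predicted'."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for r in records:
        if r["actual"] == 1:
            positives[r["group"]] += 1
            hits[r["group"]] += int(r["predicted"] == 1)
    return {g: hits[g] / positives[g] for g in positives}


# Invented toy data: every person here truly has the condition.
records = (
    [{"group": "men", "actual": 1, "predicted": 1}] * 9
    + [{"group": "men", "actual": 1, "predicted": 0}] * 1
    + [{"group": "women", "actual": 1, "predicted": 1}] * 6
    + [{"group": "women", "actual": 1, "predicted": 0}] * 4
)

rates = true_positive_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, round(gap, 2))
```

In this toy example the model detects the condition in 90% of affected men but only 60% of affected women, a gap that an overall accuracy figure would hide.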

Researchers are focusing on methods to create fair and equitable algorithms and models to provide all populations with the best and most appropriate health care decisions. Researchers in our NLM Extramural Programs analyze this data through NLM funding opportunities that foster scientific inquiry so we better understand algorithmic effects on minority and marginalized populations. Some of those funding opportunities include NLM Research Grants in Biomedical Informatics and Data Science (R01 Clinical Trial Optional) and the NIH Research Project Grant (Parent R01 Clinical Trial Not Allowed).

NLM is interested in state-of-the-art methods and approaches to address problems using large health data sets and tools to analyze them. Specific areas of interest include:

  • Developing and testing computational or statistical approaches to apply to large or merged health data sets containing human and non-human data, with a focus on understanding and characterizing the gaps, errors, biases, and other limitations in the data or inferences based on the data.
  • Exploring approaches to correct these biases or compensate for missing data, including introducing debiasing techniques and policies or using synthetic data.
  • Testing new statistical algorithms or other computational approaches to strengthen research designs using specific types of biomedical and social/behavioral data.
  • Generating metadata that adequately characterizes the data, including its provenance, intended use, and processes by which it was collected and verified.
  • Improving approaches for integrating, mining, and analyzing health data in a way that preserves that data’s confidentiality, accuracy, completeness, and overall security.

These funding opportunities encourage inquiry into algorithmic fairness to improve health care for all individuals, especially those who are underserved. By using new research models that account for diverse populations, we will be able to provide data that will support the best treatment outcomes for everyone.

Dr. Hardy’s work and expertise focus on using health informatics to improve public health and health care decision-making. Dr. Hardy has held positions as a researcher and academician and is active in national informatics organizations. She has written and edited books on informatics and health care.
