Next-Generation Data Science Research Challenges

NIH-funded research is rapidly becoming more data-driven. This is true whether that research is intramural or extramural, and whether it is focused on solving concrete problems or advancing methodologies for specific domains.

Right now, NLM’s role in this data-driven research centers on developing scalable, sustainable, and generalizable methods for making biomedical data FAIR: Findable, Accessible, Interoperable, and Reusable.

Toward this end, NLM, on behalf of NIH, released last fall a Request for Information on Next-Generation Data Science Challenges in Health and Biomedicine. We sought community input on data science research initiatives that could address the key challenges researchers, clinicians, administrators, and others currently face. We invited suggestions for new data science research in six areas:

  • Data-driven Discovery
  • Data-driven Health Improvement
  • Advanced Data Management
  • Intelligent and Learning Systems for Health
  • Workforce Development and Diversity
  • New Stakeholder Partnerships

Fifty-three responses provided more than 180 pages of ideas and suggestions.

The topic “Data-driven Discovery” prompted input focused on developing methods and tools to help researchers derive insights from data. These suggestions fell into a number of areas particularly relevant to NLM, including help with natural language processing; predictive analytics to help generate hypotheses from hidden patterns; ways to extract and formalize scientific claims and causal statements from publications; and improved ontologies.

Ideas related to improving health through data recommended developing algorithms tied to patient similarity to drive comparative effectiveness research; nuanced characterizations of phenotypes (including severity, degree and certainty); and strategies to address bias in health records used for research purposes.

Suggestions about data management revealed a need to capture and curate data more effectively. These included smoothly integrating personal data from mobile devices into clinical workflows; automatically assigning standardized metadata to existing data sets and digital files; sharing open-source analytic methods; and developing technological platforms to help scientists store and analyze data.

Ideas for intelligent and learning systems ranged widely, from brain science research focused on learning and retention, to approaches for engaging users with health data, to building flexible learning modules.

Many contributors recognized the need to develop a data-skilled workforce, and their suggestions extended beyond simply increasing the number of data scientists. They called for reaching out to high school and undergrad students to equip them earlier with the foundational skills and education that would make training in data science interesting and feasible; creating core informatics and data science skills for all researchers; and infusing the PhD in health informatics with required coursework in computer science and statistics, along with health and biomedicine.

Suggestions for stakeholder partnerships included the names of specific associations, federal agencies, and companies, as well as a shout-out for working more closely with those citizen scientists interested in taking advantage of the growing supply of publicly available health data.

Clearly the research challenges of the future will need strong investments from across NIH.

The Library’s early contributions will be three-fold:

  1. Serve as an honest broker to create a trans-NIH statement of the data science and informatics skills essential for all federally supported trainees (in NIH-funded research training programs located at universities as well as career grants).
  2. Stimulate research in advanced curation and information-integration methods.
  3. Accelerate the development of scalable, reusable, and generalizable visualization tools and analytical approaches.

What else do you think belongs in our portfolio? Chime in below.

Now is your chance to shape the next-generation research agenda for data science.

Request a summary of the responses to the RFI referenced in this post.


Dr. Valerie Florance, co-author of this post, serves as Director of the NLM Division of Extramural Programs. She also coordinates NLM’s informatics training programs.

One thought on “Next-Generation Data Science Research Challenges”

  1. NLM’s extensive experience in outreach programming, including targeting students at the high school level, suggests an addition to the Library’s list of early contributions: ‘reaching out to help equip high school students with the foundational skills and education that would make training in data science interesting and feasible.’ This would include initiatives that (already) target racially and economically diverse student populations, as well as using the excellent in-place infrastructure of the NNLM, including the direct front-line access available through public libraries. Developing the needed curricula for both teachers and students could be part of that contribution, perhaps in collaboration with other stakeholders. Moreover, NLM has worked effectively in the past on trans-NIH outreach collaborations in science education for high school students, which offers another leadership opportunity.
