Building Data Science Expertise at NLM

Guest post by the Data Science @NLM Training Program team.

Regular readers of this blog probably know that NLM staff are expanding their expertise beyond library science and computer science to embrace data science. As a result, NLM—in alignment with strategic plan Goal 3 to “build a workforce for data-driven research and health”—is taking steps to improve the entire staff’s facility and fluency with this field so critical to our future.

The Library is rolling out a new Data Science @NLM Training Program that will provide targeted training to all of NLM’s 1,700 staff members. We are also inviting staff from the National Network of Libraries of Medicine (NNLM) to participate so that everyone in the expanded NLM workforce has the opportunity to become more aware of data science and how it is woven into so many NLM products and services.

For some of our staff, data science is already a part of their day-to-day activities; for others, data science may be only a concept, a phrase in the strategic plan—and that’s okay. Not everyone needs to be a data scientist, but we can all become more data savvy, learning from one another along the way and preparing to play our part in NLM’s data-driven future. (See NLM in Focus for a glimpse into how seven staff members already see themselves supporting data science.)

Over the course of this year, the data science training program will help strengthen and empower our diverse and data-centric workforce. The program will provide opportunities for all staff to participate in a variety of data science training events targeted to their specific interests and needs. These events range from the all-hands session we had in late January that helped establish a common data science vocabulary among staff to an intensive, 120-hour data science fundamentals course designed to give select NLM staff the skills and tools needed to use data to answer critical research questions. We’re also assessing staff members’ data science skill levels and creating skill development profiles that will guide staff in taking the steps necessary to build their capacity and readiness for working with data.

At the end of this process, we’ll better understand the range of data science expertise across the Library. We’ll also have a much clearer idea of what more we can do to develop staff’s facility and fluency with data science and how to better recruit new employees with the knowledge and skills needed to advance our mission.

In August, the training program will culminate with a data science open house where staff can share their data science journey, highlight group projects from the fundamentals course, and find partners with whom they can collaborate on emerging projects throughout the Library.

But that final phase of the training initiative doesn’t mean NLM’s commitment to data science is over. In fact, it will be just the beginning.

In the coming years, staff will apply their new and evolving skills and knowledge to help NLM achieve its vision of serving as a platform for biomedical discovery and data-powered health.

How are you supporting the data science development of your staff? Let’s share ideas to keep the momentum going!

Co-authored by the Data Science @NLM Training Program team:

    • Dianne Babski, Deputy Associate Director, Library Operations
    • Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information
    • Lisa Federer, Data Science and Open Science Librarian, Office of Strategic Initiatives
    • Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications

National Public Health Week 2019: How NLM Brings Together Libraries and Public Health

Guest post by Derek Johnson, MLIS, Health Professionals Outreach Specialist for the National Network of Libraries of Medicine Greater Midwest Region

Recent articles in Preventing Chronic Disease and The Nation’s Health chronicle how public libraries can complement the efforts of public health workers in community outreach and engagement. Data tell us that more Americans visit public libraries in a year (1.39 billion) than they do health care providers (990 million). Moreover, over 40% of computer-using patrons report using libraries to search for health information. However, we also know many individuals struggle with accessing and understanding the health information they encounter every day.

This challenge raises the question, “How does the National Library of Medicine (NLM) increase access to trustworthy health information to improve the health of communities across the United States?”

It’s an important question, and, as we celebrate National Public Health Week, it gives us an opportunity to reflect on the incredible work NLM is doing through its National Network of Libraries of Medicine (NNLM) to bring libraries and public health together.

Take, for example, Richland County Public Health in Ohio. Richland County is approximately 33% rural. Many rural areas have been identified as “internet deserts.” In addition, adults in the county have lower rates of high school and college-level education compared to state averages. Seeking to address these disparities, Richland County Public Health applied for a funding award from NNLM’s Greater Midwest Region to develop an Interactive Health Information Kiosk in partnership with the county public library system.

With funding in hand, Richland County Public Health loaded select NLM resources onto specially configured iPads and installed them in the nine branches of the Richland County Libraries. A health educator trained library staff, local healthcare providers, and the public on how to use those resources to access trustworthy health information. Moving forward, librarians will be able to help patrons use the health kiosks. As a result, Richland County Public Health is helping improve health literacy among adult residents and, ultimately, enabling them to make more informed decisions about their health.

Another example of a public health and public library collaboration comes from NNLM’s Middle Atlantic Region (MAR). The Philadelphia Department of Public Health recognized the need to engage individuals in neighborhoods most vulnerable to severe weather events to increase their knowledge of disaster and emergency preparedness.

With funding from MAR, the Philadelphia Department of Public Health partnered with four branches of the Free Library of Philadelphia to train both librarians and local residents on emergency preparedness. Participants learned how to make use of the NLM Disaster Information Management Research Center and where to find local resources during weather-related emergencies.

These are just two of the many projects that NNLM helps facilitate across the country through its network of more than 7,500 library, public health, community-based, and other organizational members.

And, while NNLM continues to identify partnerships for funding public health and library projects, it also engages health educators by offering continuing education credit for Certified Health Education Specialists (CHES). CHES-certified professionals work in a variety of health care and public health settings where they help community members adopt and maintain healthy lifestyles. Health educators can earn continuing education credits by attending specially designated NNLM webinars on topics such as health statistics and evidence-based public health, with more courses in the works.

As communities continue to rely on the public health workforce to sustain and build healthy environments, know that the National Library of Medicine and its National Network of Libraries of Medicine are here to support the work they do!

Derek Johnson, MLIS, is the Health Professionals Outreach Specialist for the National Network of Libraries of Medicine Greater Midwest Region. In this capacity, he conducts training and outreach to public health professionals on a variety of topics, including evidence-based public health, health disparities, and community outreach.


An Introduction to Authority-based Security

Guest post by Kurt W. Rodarmer, a software security architect in NLM’s National Center for Biotechnology Information.

NLM is working to unleash the potential of data and information to accelerate and transform biomedical discovery. Foundational to that goal lie the data themselves. We assess their value, collect and curate them, and then make them accessible.

But access has its risks. Big risks. Especially when it comes to personal medical data or hard-earned, grant-funded proprietary data. We need to find a way to deliver access while simultaneously controlling and protecting the data.

That’s where security comes in.

We’re all familiar with “identity-based security,” evolution’s primitive mechanism that predates our species. It starts by using our eyes, ears, and nose to identify someone or something and ends with an immediate risk-assessment. Not surprisingly, this mechanism was modeled in modern cybersecurity and is virtually ubiquitous across consumer and industrial-grade systems.

For all their efforts though, these systems sure seem to fail—a lot. Common wisdom suggests breaches are inevitable, but that’s not entirely true. There are other approaches.

Authority-based security is one. With that, authority, permissions, and trust are explicitly modeled, and policy decisions are made up front. We create objects that embody these ethereal concepts and make them tangible. These objects can then be stored, transmitted, accessed, sub-divided, transferred, etc. The discipline of modeling and managing authority is called Authority Management.

Identity- and authority-based approaches achieve several common goals. They each have strengths and weaknesses. Where they differ, the stronger, more effective, and more elegant of the two is nearly always authority-based.

Both approaches grant permissions based upon security policy. Authority-based security captures the result of policy evaluation as permissions in unforgeable and unmodifiable tokens. Since these tokens come from a known source of authority and are tamper-evident, the permissions they contain require no further scrutiny. They are as trustworthy as the authority that issued them. A permission token typically contains only a small subset of the overall permissions available to an individual, ideally never more than are needed within the current dynamic context.

By contrast, identity-based techniques make permission decisions based upon global attributes or provide crude static mechanisms. In most cases, they reflect zero context sensitivity. That means, for example, that if I run a program on a stock Linux system, that program executes using 100% of my permissions, even though it may need only read access to one file and write access to one directory. For all I know the program could be surreptitiously stealing my most sensitive data in the background, and I’d have no awareness or protection against it. Without my permission? That’s the point—I just gave it ALL my permissions!

In an authority-managed system, I would have given that same program permissions to access only the file and directory needed, leaving it powerless to read other sensitive files, much less phone home and exfiltrate them.
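The contrast can be sketched in a few lines of Python. This is only an illustration of the shape of the idea, not a description of any NLM system: the names `make_dir_writer` and `count_words` are hypothetical, and Python itself does not confine code, so a real authority-managed system would have to enforce what this sketch merely models. The program is handed one readable stream and one narrow "write into this directory" function, and nothing else.

```python
import io
import tempfile
from pathlib import Path
from typing import Callable, TextIO


def make_dir_writer(directory: Path) -> Callable[[str, str], None]:
    """Return a function that can only create files directly inside `directory`."""

    def save(name: str, text: str) -> None:
        # Reject empty names, "." and "..", and anything containing a path separator.
        if not name or name in {".", ".."} or Path(name).name != name:
            raise PermissionError(f"not allowed outside the granted directory: {name!r}")
        (directory / name).write_text(text)

    return save


def count_words(source: TextIO, save: Callable[[str, str], None]) -> int:
    """Hypothetical program: it receives read access to one stream and
    write access to one directory, passed in explicitly by the caller."""
    total = sum(len(line.split()) for line in source)
    save("word_count.txt", str(total))
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        writer = make_dir_writer(Path(tmp))
        print(count_words(io.StringIO("one two three\nfour five\n"), writer))
```

The design point is that the caller decides what authority to hand over, and the program's signature documents exactly what it was given.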

So, if identity-based security is so far behind the curve, what accounts for its continued use? It has one highly prized strength: its ability to revoke permissions on the spot. Since permissions are granted at the moment they are going to be exercised, any permission can be immediately denied as the result of updating policy. Since this policy update is often reactive, coming about once damage has already occurred and possibly delayed by weeks or months, the value of its immediacy is questionable. Tokens, by contrast, have a built-in timeout that makes them self-revoking, and in practice they perform similarly.

Here’s how it works. To do anything of substance in a system, you need permissions. You may have those permissions already stored on some device, such as your phone. Or, you may need to go through the process of identifying yourself to some part of the system that is storing permissions on your behalf, accessible once your identity has been authenticated. In either case, the first step is to get ahold of a token containing your set of pre-approved permissions.

The permission set you now hold represents the complete permissions you have within the system you have just entered, e.g., dbGaP, a grant administration system, etc. It is unlikely to represent all the permissions you have within every system you can access. Even so, it’s probably too permissive for what you have in mind. Your next step would typically be to subset your permissions to only those you need, which limits the potential damage should the token fall into the wrong hands.

Sometimes you need to share your permissions, such as when a grant-funded investigator delegates most of the research documentation to lab assistants. She can take her permission tokens received with the grant, subset, and delegate them to her lab as appropriate, so everyone can work.
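The steps above (obtain a token, subset it, delegate it, let it time out) can be sketched as follows. The post does not say which token mechanism it has in mind, so this sketch swaps in one well-known technique as an assumption: macaroon-style chained HMACs, where each added restriction is folded into the signature. All names (`issue`, `attenuate`, `verify`, the caveat strings, the example paths) are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Tuple


def _mac(key: bytes, message: str) -> bytes:
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


@dataclass(frozen=True)
class Token:
    identifier: str
    caveats: Tuple[str, ...]  # restrictions such as "actions=read"
    signature: bytes


def issue(root_key: bytes, identifier: str) -> Token:
    """The authority mints an unrestricted token for one holder."""
    return Token(identifier, (), _mac(root_key, identifier))


def attenuate(token: Token, caveat: str) -> Token:
    """Any holder can narrow a token without the root key.
    The old signature becomes the key for the new one, so a caveat
    cannot be stripped off again without being detected."""
    return Token(token.identifier, token.caveats + (caveat,),
                 _mac(token.signature, caveat))


def _caveat_allows(caveat: str, action: str, resource: str, now: float) -> bool:
    kind, _, value = caveat.partition("=")
    if kind == "actions":
        return action in value.split(",")
    if kind == "path":
        root = value.rstrip("/")
        return resource == root or resource.startswith(root + "/")
    if kind == "expires":
        try:
            return now < float(value)
        except ValueError:
            return False
    return False  # unknown caveats fail closed


def verify(root_key: bytes, token: Token, action: str, resource: str, now: float) -> bool:
    """The authority recomputes the signature chain, then checks every caveat."""
    expected = _mac(root_key, token.identifier)
    for caveat in token.caveats:
        expected = _mac(expected, caveat)
    if not hmac.compare_digest(expected, token.signature):
        return False  # tampered or forged
    return all(_caveat_allows(c, action, resource, now) for c in token.caveats)


if __name__ == "__main__":
    key = b"demo-root-key"  # assumption: held only by the issuing authority
    investigator = issue(key, "investigator-42")
    assistant = attenuate(attenuate(investigator, "actions=read"), "path=/grant/docs")
    assistant = attenuate(assistant, "expires=1000")
    print(verify(key, assistant, "read", "/grant/docs/notes.txt", now=10.0))
    print(verify(key, assistant, "write", "/grant/docs/notes.txt", now=10.0))
```

In this sketch the investigator narrows her own token before handing it to the lab, and the expiry caveat stands in for the built-in timeout. A production design would also need key management and a revocation story, which are out of scope here.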

What else can you do with them? Literally anything that can be done in an information system! Beyond implementing the traditional security processes of Identity and Access Management (IAM, a proper subset of Authority Management), tokens are also used to protect resources in other ways. They can be used to model spending accounts and quotas, control access to consumable or metered resources, mitigate denial-of-service (DoS) attacks, provide audit trails, and eliminate the use of passwords and multiple logins.

Because tokens carry permissions whose source of authority is irrefutable, they are the mechanism for implementing the fundamental principles of security. We can bring some of their benefits to bear right now and help lay the groundwork for secure, accessible biomedical data.

Kurt Rodarmer started work on military-grade secure operating systems over 20 years ago in Silicon Valley, working with the architect of KeyKOS, Norman Hardy. He is an expert in secure software and language design and has formalized the field of Authority Management. Kurt previously worked for Apple and Oracle and was a consultant to IBM and Sun, among others.

Data Discovery at NLM

Guest post by David Hale, Information Technology Specialist at NLM.

Did you know that each day more than four million people use NLM resources and that every hour a petabyte of data moves in or out of our computing systems?

Those mammoth numbers indicate to me how essential NLM’s array of information products and services is to scientific progress. But as we gain more experience with providing information, particularly clinical, biologic, and genetic datasets, we’re finding that how we share data is as critical as the data itself.

To fuel the insights and solutions needed to improve public health, we must ensure data flow freely to the researchers, industry innovators, patient communities, and citizen scientists who can bring new lenses to these rich repositories of knowledge.

One way we’re opening doors to our data is through an open data portal called Data Discovery. While agencies like the Centers for Disease Control and the Centers for Medicare and Medicaid Services are already utilizing the same platform with success, NLM is the first of NIH’s Institutes and Centers to adopt the platform. Our first datasets are already available, including content from such diverse resources as the Dietary Supplement Label Database, Pillbox, ToxMap, Disaster Lit, and HealthReach.

Why did NLM take this step? While many of our data resources have long been publicly available online, housing them within Data Discovery offers unconstrained access and delivers key benefits:

  • Powerful data exploration tools—By showing the dataset as a spreadsheet, the Data Discovery platform offers freedom to filter and interact with the data in novel ways.
  • Intuitive data visualizations—A picture is worth a thousand words, and nowhere is that truer than leveraging data visualizations to bring new perspectives on scientific questions.
  • Open data APIs—Open data alone isn’t enough to fuel a new generation of insights. Open APIs are critical to making the data understandable, accessible, and actionable, based on the unique needs of the user or audience.

What does this mean in practice?

Let’s look at the Office of Dietary Supplements’ (ODS) Dietary Supplement Label Database (DSLD) to illustrate the potential of leveraging Data Discovery.

More than half of all Americans take at least one dietary supplement a day. Reliable information about those supplements is critical to their appropriate use, making DSLD a timely and important dataset to make available in an open data platform. Through Data Discovery, researchers, academics, health care providers, and the public will be able to explore and derive insights from the labels of more than 85,000 dietary supplement products currently or formerly sold in the US.

Developers and technologists who support research, health, and medical organizations require APIs that are modern, interoperable, and standards-compliant. Data Discovery provides a powerful solution to these needs, supporting NLM’s role as a platform for biomedical discovery and data-powered health.

Beyond fueling scientific discovery, open access to data holds another benefit for advancing public health: contributing to the professional development of data and informatics specialists. An increasingly important part of the health care workforce, informaticists help researchers extract the most meaningful insights from data, driving new developments in the lab and better management of patients and populations.

I invite you to explore the new Data Discovery portal. It’s an exciting step forward in achieving key aspects of the NLM Strategic Plan—to advocate for open science, further democratize access to data, and support the training and development of the data science workforce.

Credit: Jacie Lee Almira Photography

David Hale is an Information Technology Specialist at the National Library of Medicine. In addition to leading Data Discovery, David is also project lead for NLM’s Pillbox, a drug identification, reference, and image resource. He received his Bachelor of Science in Physical Science from the University of Maryland.

Keeping Up with the Information Onslaught

Organizing your resources sustainably

Guest post by Helen-Ann Brown Epstein, MLS, MS, AHIP, FMLA, informationist at the Health Sciences Library Virtua in Mt Laurel, New Jersey.

I am of the generation that fondly remembers when the comedian George Carlin mused about our obsession with stuff.

“That’s all you need in life, a little place for your stuff,” he said. “That’s all your house is: a place to keep your stuff.”

And having a place for our stuff, he observes, allows us to relax, whether we’re at home or traveling.

But what about the stuff that matters to us as health information professionals? How can we sustainably organize all that while keeping up with the literature for both our customers and ourselves?

The information explosion keeps creating more and more stuff. Currently, PubMed has more than 29 million citations, but they’re not stopping. On average, NLM adds about 1.1 million citations per year to PubMed. That’s nearly 92,000 citations per month or over 21,000 citations per week. Who can keep up with that?!

Once upon a time, we used index card files of relevant citations, clustered by MeSH or our favorite terms, to organize key references. Sometimes, we ripped out relevant articles or photocopied them, building stacks of stuff we promised ourselves we’d read.

Today, online databases make it possible to retrieve smaller, more precise results sets. We’re also able to create online alerts focused on special topics or specific journals. We can then store these citations in My NCBI accounts that can be exported into bibliographic citation management software. Some of these software packages even allow us to download PDFs, add notes to them, and then share them with colleagues.

We’ve come a long way.

In my everyday life as a health sciences librarian, I work solo for a large three-hospital system. My virtual library frees me up to make house calls to help my customers set up their own current awareness alerts that will deliver the important literature and key tables of contents to their inboxes. I also use my visits to encourage them to set up their own My NCBI accounts and to leverage the power of bibliographic software to manage their citations. And I talk about how crucial it is to decide how to best organize their literature and other sources of information at the start of any project, not later, when the volume gets too big to manage.

As part of the first cohort of the Medical Library Association Research Training Institute, I’m learning from experience the benefits of that last bit of wisdom. Following the advice of our expert faculty, I have created my alerts and determined the headings for my collections of citations. Though I’m at the early stages, I expect taking these important first steps will help ensure that I’m not missing relevant articles as they come out and might even help me unearth applicable research from disciplines I had not previously considered. I also expect to find saved articles more quickly when I need them and possibly uncover connections I had not previously seen. At minimum, I know that building a collection of resources from the beginning will give me the freedom to get to articles when I’m ready for them, knowing they’ll be there waiting.

Ultimately though, by establishing now how I will manage the information, I’ve discovered that George Carlin was right. Now that I have a “house” for my stuff, I can relax. Instead of stressing out over where that stuff is going to go, I can focus on the research, knowing that I have a system in place to keep my resources organized and to keep me on track as I evaluate online journal club formats and their role in an interprofessional patient care team.

How do you keep your information stuff organized? I welcome your comments and questions.

Helen-Ann Brown Epstein, MLS, MS, AHIP, FMLA, currently serves as the informationist at the Health Sciences Library Virtua in Mt Laurel, New Jersey. She spent the previous 22 years as a clinical librarian at Weill Cornell Medical Library. Helen-Ann is active in the Medical Library Association and has authored or co-authored several articles on medical librarianship.

Technology and Data in Mental Health: Applications for Suicide Prevention

Guest post by Elizabeth Chen, PhD, Associate Director of the Center for Biomedical Informatics, Associate Professor of Medical Science, and Associate Professor of Health Services, Policy & Practice at Brown University.

Biomedical informatics as a discipline is broadly concerned with the effective use of data, information, and knowledge to improve human health. Since its origins in the 1950s, we have watched this discipline evolve with advances in health information and communications technology as well as the explosion of electronic health data. During this time, we have also seen the emergence of sub-disciplines reflecting areas of specialization. In fact, a 2015 study uncovered almost 300 different “types” of informatics! Among these was mental health informatics, which first appeared in the title of a 1995 article indexed in PubMed.

Using technology to understand and support mental health dates to the 1950s when specialized television broadcasts delivered mental health training. In the 1960s, computers analyzed data for psychological diagnoses and housed “artificial intelligence” systems that simulated communication with a psychotherapist. More recently, with the rapid adoption of electronic health record (EHR) systems that can collect longitudinal patient information such as diagnoses and medications, we are observing the increased use of EHR technology and data for improving health care, including mental health care.

Mental health remains a global crisis. In the United States alone, mental health conditions affect 1 in 5 adults and children. These conditions are among the factors that contribute to making suicide the 10th leading cause of death overall and 2nd leading cause among 10- to 34-year-olds nationally. With suicide rates having increased by nearly 30% since 1999, the National Strategy for Suicide Prevention calls for a comprehensive and coordinated approach that includes data-driven strategic planning and evidence-based programs.

There are numerous and wide-ranging applications of mental health informatics and EHRs contributing to these efforts, including the following:

  • Two independent datasets, one including EHR and biobank data from the Vanderbilt University Medical Center, have characterized the role of common genetic variants among those who have attempted suicide. These large-scale genetic analyses support a heritable component to suicide attempts and an incomplete genetic relationship with psychiatric and sleep disorders.
  • At the Parkland Health & Hospital System in Texas, a Universal Suicide Screening Program, initiated in 2012, led to implementing the Columbia-Suicide Severity Rating Scale in the EHR system for adults. The integration of this clinical decision support tool into the clinical workflow demonstrates how technology may be used to improve suicide risk recognition.
  • Researchers across the country are developing models for predicting patients’ future risk of suicidal behavior using “machine learning” techniques, state death certificates, and longitudinal EHR data from a range of health systems, including Partners Healthcare in Massachusetts [PubMed], HealthPartners in Minnesota, Henry Ford Health System in Michigan, and five different Kaiser Permanente locations [PubMed]. Implementing these predictive models as clinical decision support tools in EHR systems has the potential to improve screening, detection, and treatment of suicide risk.
  • In Connecticut, EHR data from the statewide health information exchange and five clinical partners are being used to identify patients at risk of suicide. Claims data from the All-Payer Claims Database and mortality data from the State Department of Public Health will be used to assess the outcomes and impact of the quality improvement efforts.

And these are just a few examples.

Technology and data will continue to play important roles in advancing mental health care. We have already seen the contributions of mental health informatics over the years and those of related areas such as behavioral health informatics and computational psychiatry. There is much more to come in the development of effective and innovative solutions for improving diagnosis, treatment, and prevention of mental health conditions, including those related to suicidal thoughts and behaviors.

Elizabeth S. Chen, PhD, is the founding Associate Director of the Center for Biomedical Informatics, Associate Professor of Medical Science, and Associate Professor of Health Services, Policy & Practice at Brown University. She leads the Clinical Informatics Innovation and Implementation (CI3) Laboratory that is focused on leveraging EHR technology and data to improve healthcare delivery and biomedical discovery. Dr. Chen is an elected fellow of the American College of Medical Informatics and is a member of NLM’s Biomedical Informatics, Library and Data Sciences Review Committee.


Dr. Chen will deliver the next NLM Biomedical Informatics & Data Science Lecture on Wednesday, November 14, 2018, at 2:00 pm in the Natcher Conference Center (Building 45), Balcony A. Her talk, “Knowledge Discovery in Clinical and Biomedical Data: Case Studies in Pediatrics and Mental Health,” is free and open to the public. It will also be broadcast live globally and archived via NIH Videocast.

Data in the Scholarly Communications Solar System

Guest post by Kathryn Funk, program manager for NLM’s PubMed Central.

The Library of the Future. What will it look like? The NLM Strategic Plan envisions it partly as “one of connections between and among literature, data, models, and analytical tools.” In this future, journal articles are no longer lone objects drifting in space, but, rather, each a solar system waiting to be explored. Indeed, we’re already seeing the published literature associated with datasets, clinical trials, protocols, software, earlier versions (including preprints), peer review documents, and so on through consistent identifiers and standardized publishing and archival practices.

To help researchers and the public navigate this new solar system, PubMed Central (PMC), NLM’s full-text archive of journal literature, has been collaborating with publishers and funders for the last year to support efficient ways of linking journal articles with associated data. We’re encouraging authors to cite their open datasets and publishers to archive and make available those data citations in a machine-readable format. Though data citations represent only a small percentage of how PMC articles are linked to data (supplementary material continues to be the predominant method for associating data with articles in the archival record), the growth in data citations in the last year has been promising, nearly doubling the previous year’s total (i.e., 850 articles with data citations in 2017 vs. approximately 440 in 2016). NLM is also supporting the public access policy requirements of our research funder partners by encouraging authors to deposit datasets as supporting documents via the NIH Manuscript Submission (NIHMS) system.

But solar systems, even the metaphorical kind, are meant to be explored, so we’re also working to expose each journal article solar system in a way that promotes discoverability. We want to make it easier to discover articles in PMC with associated data citations, data availability statements, and supplementary data, through improved record displays and new search facets, leveraging the data-related search filters announced earlier this year.

NLM is also looking beyond datasets to archive and expose articles’ key satellites, including, for example, comments generated during the peer review process. As the effort to expand the openness of peer review gains traction, PMC staff have been collaborating with publishers and Crossref on standardized ways to make readily available those peer review materials.

As with any exploration of new solar systems, it’s our hope that taking these steps will help generate new knowledge, and in so doing drive research that is reproducible, robust, transparent, and reusable. And as we move toward becoming the Library of the Future, how can we best support your research needs in connecting the literature with the rest of the research universe? Please let us know.

With thanks to Jeff Beck for the solar system analogy. 

Kathryn Funk is the program manager for PubMed Central. She is responsible for PMC policy as well as PMC’s role in supporting the public access policies of numerous funding agencies, including NIH. Katie received her master’s degree in library and information science from The Catholic University of America.