How much does it cost to keep data?

Study to forecast long-term costs

Guest post by Elizabeth Kittrie, NLM’s Senior Planning and Evaluation Officer.

As scientific research becomes more data-intensive, scientists and their institutions are increasingly faced with complex questions about which data to retain, for how long, and at what cost.

The decision to preserve and archive research data should not be posed as a yes or no question. Instead, we should ask, “For how many years should this subset of data be preserved or archived?” (By the way, “forever” is not an acceptable response.)

Answering questions about research data preservation and archiving is neither straightforward nor uniform. Certain types of research data may derive value from their unique qualities or because of the costs associated with the original data collection. Other types of research data are relatively easy to collect at low cost; yet once collected, they are rarely re-used.

To create a sustainable data ecosystem, as outlined in both the NLM Strategic Plan and the NIH Strategic Plan for Data Science, we need strategies to address fundamental questions like:

  • What is the future value of research data?
  • For how long must a dataset be preserved before it should be reviewed for long-term archiving?
  • What are the resources necessary to support persistent data storage?

We believe that economic approaches—including forecasting long-term costs, balancing economic considerations with non-monetary factors, and determining the return on public investment from data availability—can help us make preservation and archiving decisions.
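The committee's forecasting framework is still to be developed, but the kind of arithmetic involved can be illustrated with a minimal sketch: the present value of storing a dataset for a given number of years, assuming per-terabyte storage prices decline at a fixed annual rate and future costs are discounted back to today. The function, parameter names, and all values below are illustrative assumptions, not NLM's model.

```python
def preservation_cost(tb: float, cost_per_tb: float, years: int,
                      price_decline: float = 0.10,
                      discount_rate: float = 0.03) -> float:
    """Present value of storing `tb` terabytes for `years` years.

    Assumes the per-TB annual cost falls by `price_decline` each year
    and discounts each year's cost at `discount_rate`.
    """
    total = 0.0
    for year in range(years):
        annual = tb * cost_per_tb * (1 - price_decline) ** year
        total += annual / (1 + discount_rate) ** year
    return total

# Illustrative comparison: 50 TB at $100/TB/year, kept 10 vs. 25 years.
print(round(preservation_cost(50, 100.0, 10), 2))
print(round(preservation_cost(50, 100.0, 25), 2))
```

Even a toy model like this makes the framing question concrete: the marginal cost of each additional decade shrinks, but it never reaches zero, which is why "forever" is not an acceptable answer.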


To that end, NLM has contracted with the National Academies of Sciences, Engineering, and Medicine (NASEM) for a study on forecasting the long-term costs for preserving, archiving, and promoting access to biomedical data. For this study, NASEM will appoint an ad hoc committee that will develop and demonstrate a framework for forecasting these costs and estimating potential benefits to research. In so doing, the committee will examine and evaluate the following:

  • Economic factors to be considered when examining the life-cycle cost for data sets (e.g., data acquisition, preservation, and dissemination);
  • Cost consequences for various practices in accessioning and de-accessioning data sets;
  • Economic factors to be considered in designating data sets as high value;
  • Assumptions built into the data collection and/or modeling processes;
  • Anticipated technological disruptors and future developments in data science in a 5- to 10-year horizon; and
  • Critical factors for successful adoption of data forecasting approaches by research and program management staff.

The committee will provide a consensus report and two case studies illustrating the framework’s application to different biomedical contexts relevant to NLM’s data resources. Relevant life-cycle costs will be delineated, as will any assumptions underlying the models. To the extent practicable, NASEM will identify strategies to communicate results and gain acceptance of the applicability of these models.

As part of its information gathering, NASEM will host a two-day public workshop in late June 2019 to generate ideas and approaches for the committee to consider. We will provide further details on the workshop and how you can participate in the coming months.

As a next step in advancing this study, we are supporting NASEM’s efforts to solicit names of committee members, as well as topics for the committee to consider. If you have suggestions, please contact Michelle Schwalbe, Director of the Board on Mathematical Sciences and Analytics at NASEM.

Elizabeth Kittrie is NLM’s Senior Planning and Evaluation Officer. She previously served as a Senior Advisor to the Associate Director for Data Science at the National Institutes of Health and as Senior Advisor to the Chief Technology Officer of the US Department of Health and Human Services. Prior to joining HHS, she served as the first Associate Director for the Department of Biomedical Informatics at Arizona State University.

What is the academic health sciences library’s role in the learning health care system?

Guest post by Philip Walker, MLIS, MSHI, Director of the Annette & Irwin Eskind Biomedical Library, Vanderbilt University.

I was introduced to the concept of the learning health system or learning health care system last year, but when the topic came up again at a recent lecture, I felt compelled to know more. My basic search across Medline (PubMed), CINAHL, Embase, and Engineering Village yielded over 10,000 results [before deduplication], including both conceptual and research articles from various clinical specialties and informatics. However, a quick scan of the titles and abstracts uncovered little to no mention of the role of the health sciences or (bio)medical library in the learning health system (LHS).

That got me thinking: what might that role be?

I don’t have all the answers, but given the major part the LHS occupies in the culture of today’s academic medical center, I’m hoping this post can spark a conversation among health sciences librarians about ways we can help our institutions achieve the goals of the LHS.

Generally speaking, the learning health system can be described as a fusion of clinical and basic sciences, informatics/data sciences, and workplace culture, with the goal of continually improving the quality, safety, efficiency, and effectiveness of health care. Or, as one colleague eloquently stated, “The learning health system helps us improve how we care for patients while we are taking care of them.”

It dawned on me that this could be a key step in the evolution of evidence-based medicine/evidence-based practice.

In the LHS, we are not using the biomedical literature to change practice, but instead identifying real-time data signals (via pragmatic clinical trials) within the electronic medical record to generate new knowledge, change clinical practice, and refine institutional policies and procedures. The literature’s influence remains—as the basis for determining which research projects to pursue—but it becomes secondary to real-time data. Then, once the findings are in hand, the literature helps validate and supplement those findings prior to their dissemination and adoption.

Could this be the beginning of real-time evidence-based medicine or evidence-based practice? If so, then there is definitely a place for libraries, information scientists, and knowledge management practitioners in the learning health system.

Of course, libraries have their collections of knowledge-based information resources, literature searching (and filtering) services, and collaboration spaces to offer, but I’m thinking we can do more than that. By identifying the local LHS information architecture, i.e., the flow of information in the research, clinical, or educational context, we can discover potential roles for the library. Understanding the flow of information allows us to identify how it enters the system, interacts with users, and is packaged for adoption. That understanding can also help us—in conjunction with the published literature—pinpoint and address the information needs and gaps within the LHS. That, to me, is where the opportunities for libraries reside.

This novel use of the literature will require knowledge management and knowledge extraction practices such as filtering, summarizing, synthesizing, or curating information. These contributions go beyond the saved searches or static bibliographies libraries traditionally offer, but they fall well within the librarian skillset. While the next steps of translating and integrating the literature and newly generated data into the electronic medical record will likely fall outside the library’s purview, the overall potential for collaboration will ultimately depend upon the relationship between the library, LHS leadership, and the medical center’s informatics and/or clinical decision support unit(s).

Depending on the organization and skills of library staff, we can position ourselves as the central information hub, collaboration space, literature searchers, and, in some cases, consultants in text mining, data mining, data visualization, or data management. By partnering with our institutions to achieve the goals of the LHS, we can strengthen relationships with our constituents and help them educate current and future health care practitioners, generate new knowledge from research, and improve health delivery and outcomes.

Philip Walker, MLIS, MSHI, is the Director of Vanderbilt University’s Annette and Irwin Eskind Biomedical Library. He has been a librarian at Eskind since 2012 and served as Interim Director from 2017-2018. Walker previously worked at Tulane University’s Rudolph Matas Library of the Health Sciences, the Texas Medical Center Library, and the Meharry Medical College Library.

Solo Librarians as Information Servers

Guest post by Louise McLaughlin, MSLS, Information Specialist at Woman’s Hospital in Baton Rouge, Louisiana.

As information flows from the data collection pipeline to research, curation, and publication, hospital librarians, especially those who practice closely with health care providers, become the human face of information servers. And like those data processing units that serve numerous users, these librarians, many of whom work alone as solo librarians, must be prepared to fill requests from all quarters.

Consider, for example, the following vignettes:

The Chief Operating Officer is launching the next phase of a project to reduce perinatal mortality and preterm births. The librarian continually provides the physicians, nurses, and social workers on the project committee with research articles on emerging causes, new treatments, and community-health approaches to improving outcomes.

A pre-op nurse talks with a colleague about a practice difference they have in monitoring a patient. She wants to know what the evidence says.

A nurse educator asks for help proofreading an article about a successful quality improvement project and confirming the proper citation format for the references.

A physician teaching medical students in Mongolia about the latest updates in women’s health asks, “Can you gather research articles that would address their population on this list of topics?”

The marketing department is updating the hospital’s website. They want to know where they can find consumer-friendly health care definitions.

Sometimes, that can all happen in one day!

But answering questions is not all we do.

A 2016 survey of solo librarians garnered responses from 383 professionals who reported on their job duties. Using a pick list, respondents identified an average of nine different job duties for which they were responsible, ranging from a low of five to a high of 17. (To the best of the authors’ knowledge, no data exist regarding the total number of solo librarians in health care, so the survey results are limited.)

Judging by their selections, a job description that fairly represents a solo librarian’s qualifications might include strong literature search skills across multiple databases, managing electronic resources, experience instructing clinicians, fluency with medical terminology, and advanced budget skills. Working with researchers and rounding with clinical staff may also be required, as might serving on hospital committees, including the Institutional Review Board. Strong outreach practices are encouraged.

Even with such a diverse skill set, many of these librarians lack job security. While many solos are regarded as valuable members of their health care teams, they also know their jobs may not survive the next hospital merger or budget crisis. In fact, listserv news of hospital library closures, anticipated or unexpected, can turn that fear into an ever-present companion.

Yet being resilient may be a solo librarian’s strongest quality. We always have an eye toward future trends, both in our hospitals and in the information arena. Listen to our conversations, and you will hear us talking about ways to use data to demonstrate to our administrators our daily contributions to patient safety, improved outcomes, case management, and the hospital’s overall return on investment.

And though we call ourselves “solo librarians”—and might be managing a hospital’s library services alone or with a skeleton staff of part-timers or volunteers—we know we do not work in isolation.

Our colleagues in academia and at the National Network of Libraries of Medicine nourish us with webinars about the basics of electronic medical record data, innovative instructional methods, consumer health resources, and best uses for a variety of NLM databases. Many of them are our professional best friends, supporting us when we need clarity on best practices in running our library or offering support with a perplexing situation.

We also rely upon the National Library of Medicine, both for its resources and its vision. We view NLM’s 10-Year Strategic Plan as a roadmap to where we are headed. Solo librarians are well-prepared to support Goals 2 and 3 of the plan, whether with skills we already have or others we need to develop. Supporting biomedical and health information access and dissemination is already part of our lives; learning to identify and appreciate the capabilities of new digital products is on our must-do list. With training and guidance, we can be the link that facilitates data science proficiency within our institutions and healthy living within our communities.

But like information servers, solo librarians are most valuable when we are kept updated, valued, and used. For this, we count on those higher up the knowledge-creation ladder to share their wisdom with us, value our expertise in local health dynamics, and remind others to use us as resource partners.

Louise McLaughlin, MSLS, stepped into the role of Information Specialist at Woman’s Hospital in Baton Rouge, Louisiana, when her predecessor retired and her position as assistant librarian was eliminated. She reached out to friends in similar settings to establish a monthly Solo Chat and has served as co-convener of the Medical Library Association’s Solo Special Interest Group. Louise has authored or co-authored several articles on solo librarianship for the Journal of Hospital Librarianship, the National Network, and other association publications.

The Evolution of Data Science Training in Biomedical Informatics

Guest post by Dr. George Hripcsak, Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus.

Biomedical informatics is an exciting field that addresses information in biomedicine. At over half a century, it is older than many realize. Looking back, I am struck that in one sense, its areas of interest have remained stable. As a trainee in the 1980s, I published on artificial neural networks, clinical information systems, and clinical information standards. In 2018, I published on deep learning (neural networks), electronic health records (clinical information systems), and terminology standards. I believe this stability reflects the maturity of the field and the difficult problems we have taken on.

On the other hand, we have made enormous progress. In the 1980s we dreamed of adopting electronic health records and the widespread use of decision support fueled by computational techniques. Nowadays we celebrate and bemoan the widespread adoption of electronic health records, although we still look forward to more widespread decision support.

Data science has filled the media lately, and it has been part of biomedical informatics throughout its life. Progress here has been especially notable.

Take the Observational Health Data Sciences and Informatics (OHDSI) project as an example: a billion patient records from about 400 million unique patients, with 200 researchers from 25 countries. This scale would not have been possible in the 1980s. A combination of improved health record adoption, improved clinical data standards, more computing power and data storage, advanced data science methods (regularized regression, Bayesian approaches), and advanced communications has made it possible. For example, you can now look up any side effect of any drug on the world market, review a 17,000-hypothesis study (publication forthcoming) comparing the side effects caused by different treatments for depression, and study how three chronic diseases are actually treated around the world.

How we teach data science in biomedical informatics has also evolved. Take as an example Columbia University’s Department of Biomedical Informatics training program, which has been funded by the National Library of Medicine for about three decades. It initially focused on clinical information systems under its founding chair, Paul Clayton, and while researchers individually worked on what today would be called data science, the curriculum focused heavily on techniques related to clinical information systems. For the first decade, our data science methods were largely pulled in from computer science and statistics courses, with the department focusing on the application of those techniques. During that time, I filled a gap in my own data science knowledge by obtaining a master’s degree in biostatistics.

In the second decade, as presented well by Ted Shortliffe and Stephen Johnson in the 2002 IMIA Yearbook of Medical Informatics, the department shifted to take on a greater responsibility for teaching its own methods, including data science. Our core courses focused on data representation, information systems, formal models, information presentation, decision making, evaluation, and specialization in application tracks. The Methods in Medical Informatics course focused mainly on how to represent knowledge (using Sowa’s 1999 Knowledge Representation textbook), but it also included numeric data science components like Bayesian inference, Markov models, and machine learning algorithms, with the choice between symbolic and statistical approaches to solving problems as a recurring theme. We also relied on computer science and statistics faculty to teach data management, software engineering, and basic statistics.

In the most recent decade, the department expanded its internal focus on data science and made it more explicit, with the content from the original methods course split among three courses: computational methods, symbolic methods, and research methods. The computational methods course covered the numerical methods commonly associated with data science, and the symbolic methods course included the representational structures that support the data.

This expansion into data science continued four years ago when Noemie Elhadad created a data science track (with supplemental funding from the National Library of Medicine) that encouraged interested students to dive more deeply into data science through additional departmental and external courses. At present, all students get a foundation in data science through the computational methods class and required seminars, and those with additional interest can engage as deeply as any computer science or statistics trainee.

We encourage our students not just to apply data science methods but to develop new methods, including supplying the theoretical foundation for the work. While this may not be for every informatics trainee, we believe that our field must be as rigorous as the methodological fields we pull from. Examples include work on deep hierarchical families by Ranganath, Blei, and colleagues, and remaking survival analysis with Perotte and Elhadad.

To survive, a department must look forward. Our department invested heavily in data science and in electronic health record research in 2007. A decade later, what is on the horizon?

I believe informatics will come full circle, returning at least in part to its physiological modeling origins that predated our department. As we reach the limits of what our noisy and sparse data can provide for deep learning, we will learn to exploit pre-existing biomedical knowledge in different forms of mechanistic models. I believe these hybrid empirical-mechanistic methods can produce patient-specific recommendations and realize the dream of precision medicine. And we have begun to teach our trainees how to do it.

George Hripcsak, MD, MS, is Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus. He has more than 25 years of experience in biomedical informatics with a special interest in the clinical information stored in electronic health records and the development of next-generation health record systems. He is an elected member of the Institute of Medicine and an elected fellow of the American College of Medical Informatics and the New York Academy of Medicine. He has published more than 250 papers and previously chaired NLM’s Biomedical Library and Informatics Review Committee.

NIH Draft Strategic Plan for Data Science: Suggestions for Optimizing Value

Guest post by Dr. William Hersh, professor and chair of the Department of Medical Informatics and Clinical Epidemiology, School of Medicine, Oregon Health & Science University.

Earlier this year, the National Institutes of Health (NIH) issued a Request for Information (RFI) soliciting input on its draft Strategic Plan for Data Science. As I did for the National Library of Medicine’s (NLM) RFI concerning next-generation data science challenges in health and biomedicine, I shared my comments on the data science plan through both the formal submission mechanism and my blog. I appreciate being asked to update my comments on the draft NIH data science plan in this guest post.

The draft NIH data science plan is a well-motivated and well-written overview of the path NIH should follow to ensure that the value of data science is leveraged to maximize its benefit to biomedical research and human health. The goals of connecting all NIH and other relevant data, modernizing the ecosystem, developing tools and the workforce skills to use it, and making it sustainable are all important and articulated well in the draft plan.

However, collecting and analyzing the data, along with building tools and training the workforce to use the data, are not enough. Three additional aspects not adequately addressed in the draft are critical to achieving the value of data science in biomedical research.

The first of these is the establishment of a research agenda around data science itself. We still do not understand all the best practices and other nuances around the optimal use of data science in biomedical research and human health. What standards are needed to make the best use and re-use of data? Where are the gaps in our current standards that we can address to improve the use of data in biomedical research, especially data not originally collected for research purposes, such as clinical data from electronic health records and patient data from wearables, sensors, or direct entry?

We must also research more extensively the human factors around data use. How do we organize workflows for optimal input, extraction, and utilization of data? What are the best human-computer interfaces for such work? How do we balance personal privacy and security against the public good of learning from such data? What ethical issues must be addressed?

The second inadequately addressed aspect concerns the workforce for data science. While the draft properly notes the critical need to train specialists in data science, it does not explicitly mention the discipline that has been at the forefront of “data science” before the term came into widespread use, namely, biomedical informatics. NLM has helped train a wide spectrum of those who work in data science, from the specialists who carry out the direct work to the applied professionals who work with researchers, the public, and other implementers. NIH should acknowledge and leverage this workforce that will analyze and apply the results of data science work. The large number of biomedical (and related flavors of) informatics programs should expand their established role in translating data science from research to practice.

The final underspecified aspect concerns the organizational home for data science within NIH. Many traditional NLM grantees, including this author, have been funded under the NIH Big Data to Knowledge (BD2K) program launched several years ago. The newly released NLM Strategic Plan includes a focus on data science and goes beyond some of the limitations of the draft NIH data science plan described above, making the NLM the logical home for data science within NIH.

By addressing these concerns, the NIH data science plan can make an important contribution to realizing the potential for data science in improving human health as well as preventing and treating disease.

William Hersh, MD, FACMI, serves as professor and chair of the Department of Medical Informatics & Clinical Epidemiology, School of Medicine, Oregon Health & Science University. His current work is focused on the workforce needed to implement health information technology, especially in clinical settings, and he is active in clinical and translational research informatics.