The Evolution of Data Science Training in Biomedical Informatics

Guest post by Dr. George Hripcsak, Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus.

Biomedical informatics is an exciting field that addresses information in biomedicine. At over half a century, it is older than many realize. Looking back, I am struck that in one sense, its areas of interest have remained stable. As a trainee in the 1980s, I published on artificial neural networks, clinical information systems, and clinical information standards. In 2018, I published on deep learning (neural networks), electronic health records (clinical information systems), and terminology standards. I believe this stability reflects the maturity of the field and the difficult problems we have taken on.

On the other hand, we have made enormous progress. In the 1980s we dreamed of adopting electronic health records and the widespread use of decision support fueled by computational techniques. Nowadays we celebrate and bemoan the widespread adoption of electronic health records, although we still look forward to more widespread decision support.

Data science has filled the media lately, but it has been part of biomedical informatics throughout the field's life. Progress here has been especially notable.

Take the Observational Health Data Sciences and Informatics (OHDSI) project as an example: a billion patient records from about 400 million unique patients, with 200 researchers from 25 countries. This scale would not have been possible in the 1980s. A combination of improved health record adoption, improved clinical data standards, more computing power and data storage, advanced data science methods (regularized regression, Bayesian approaches), and advanced communications has made it possible. For example, you can now look up any side effect of any drug on the world market, review a 17,000-hypothesis study (publication forthcoming) comparing the side effects caused by different treatments for depression, and study how three chronic diseases are actually treated around the world.

How we teach data science in biomedical informatics has also evolved. Take as an example Columbia University’s Department of Biomedical Informatics training program, which has been funded by the National Library of Medicine for about three decades. It initially focused on clinical information systems under its founding chair, Paul Clayton, and while researchers individually worked on what today would be called data science, the curriculum focused heavily on techniques related to clinical information systems. For the first decade, our data science methods were largely pulled in from computer science and statistics courses, with the department focusing on the application of those techniques. During that time, I filled a gap in my own data science knowledge by obtaining a master’s degree in biostatistics.

In the second decade, as presented well by Ted Shortliffe and Stephen Johnson in the 2002 IMIA Yearbook of Medical Informatics, the department shifted to take on a greater responsibility for teaching its own methods, including data science. Our core courses focused on data representation, information systems, formal models, information presentation, decision making, evaluation, and specialization in application tracks. The Methods in Medical Informatics course focused mainly on how to represent knowledge (using Sowa’s 1999 Knowledge Representation textbook), but it also included numeric data science components like Bayesian inference, Markov models, and machine learning algorithms, with the choice between symbolic and statistical approaches to solving problems as a recurring theme. We also relied on computer science and statistics faculty to teach data management, software engineering, and basic statistics.

In the most recent decade, the department expanded its internal focus on data science and made it more explicit, with the content from the original methods course split among three courses: computational methods, symbolic methods, and research methods. The computational methods course covered the numerical methods commonly associated with data science, and the symbolic methods course included the representational structures that support the data.

This expansion into data science continued four years ago when Noemie Elhadad created a data science track (with supplemental funding from the National Library of Medicine) that encouraged interested students to dive more deeply into data science through additional departmental and external courses. At present, all students get a foundation in data science through the computational methods class and required seminars, and those with additional interest can engage as deeply as any computer science or statistics trainee.

We encourage our students not just to apply data science methods but to develop new methods, including supplying the theoretical foundation for the work. While this may not be for every informatics trainee, we believe that our field must be as rigorous as the methodological fields we pull from. Examples include work on deep hierarchical families by Ranganath, Blei, and colleagues, and remaking survival analysis with Perotte and Elhadad.

To survive, a department must look forward. Our department invested heavily in data science and in electronic health record research in 2007. A decade later, what is on the horizon?

I believe informatics will come full circle, returning at least in part to its physiological modeling origins that predated our department. As we reach the limits of what our noisy and sparse data can provide for deep learning, we will learn to exploit pre-existing biomedical knowledge in different forms of mechanistic models. I believe these hybrid empirical-mechanistic methods can produce patient-specific recommendations and realize the dream of precision medicine. And we have begun to teach our trainees how to do it.

George Hripcsak, MD, MS, is Vivian Beaumont Allen Professor and Chair of Columbia University’s Department of Biomedical Informatics and Director of Medical Informatics Services for New York-Presbyterian Hospital/Columbia Campus. He has more than 25 years of experience in biomedical informatics with a special interest in the clinical information stored in electronic health records and the development of next-generation health record systems. He is an elected member of the Institute of Medicine and an elected fellow of the American College of Medical Informatics and the New York Academy of Medicine. He has published more than 250 papers and previously chaired NLM’s Biomedical Library and Informatics Review Committee.