Models: The Third Leg in Data-Driven Discovery

[Photo: a three-legged stool isolated on a white background.]

George Box, a famous statistician, once remarked, “All models are wrong, but some are useful.”

As representations or approximations of real-world phenomena, models, when done well, can be very useful. In fact, they serve as the third leg of the stool that is data-driven discovery, joining the published literature and its underlying data to give investigators the materials necessary to explore important dynamics in health and biomedicine.

By isolating and replicating key aspects within complex phenomena, models help us better understand what’s going on and how the pieces or processes fit together.

Because of the complexity of biomedicine, health care research must employ different kinds of models, depending on what’s being studied.

Regardless of the type used, however, models take time to build, because the model builder must first understand the elements of the phenomena that must be represented. Only then can she select the appropriate modeling tools and build the model.

Tracking and storing models can help reduce that burden.

Not only would tracking models enable re-use—saving valuable time and money—but doing so would enhance the rigor and reproducibility of the research itself by giving scientists the ability to see and test the methodology behind the data.

Enter libraries.

As we’ve done for the literature, libraries can help document and preserve models and make them discoverable.

The first step in that process is identifying and collecting useful models.

Second, we’d have to apply metadata to describe the models. Among the essential elements to include in such descriptions might be model type, purpose, key underlying assumptions, referent scale, and indicators of how and when the model was used.
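
To make that concrete, here’s a rough sketch, in Python, of what one such record might look like. The field names follow the elements listed above, but both the names and the example values are illustrative placeholders, not a proposed standard.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class ModelRecord:
        """One entry in a hypothetical catalog of research models."""
        title: str
        model_type: str             # e.g., "compartmental", "agent-based", "statistical"
        purpose: str                # the question the model was built to answer
        key_assumptions: List[str]  # the main simplifications the model makes
        referent_scale: str         # e.g., "molecular", "cellular", "population"
        used_in: List[str] = field(default_factory=list)  # studies that employed the model

    # Example record; every value here is a placeholder for discussion only.
    example = ModelRecord(
        title="Placeholder influenza transmission model",
        model_type="compartmental (SEIR)",
        purpose="Estimate the effect of school closures on transmission",
        key_assumptions=["homogeneous mixing", "fixed latent period"],
        referent_scale="population",
        used_in=["PMID:00000000 (placeholder)"],
    )
    print(example.model_type, example.referent_scale)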

[Screen capture: the DOI and RRIDs highlighted in a current PubMed record.]

We’d then need to apply one or more unique identifiers to help with curation. Currently, two different schemes provide principled ways to identify models: the Digital Object Identifier (DOI) and the Research Resource Identifier (RRID). The former provides a persistent, unique code to track an item or entity at an overarching level (e.g., an article or book). The latter documents the main resources used to produce the scientific findings in that article or book (e.g., antibodies, model organisms, computational models).
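
For illustration, the two identifier types are easy to tell apart by their shape. The patterns in this short sketch are simplified approximations rather than the official validation rules, and the identifiers shown are placeholders.

    import re

    # Simplified, approximate patterns -- not the official validation rules.
    DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")      # e.g., 10.1000/example.123 (placeholder)
    RRID_PATTERN = re.compile(r"^RRID:[A-Za-z]+_\w+$")  # e.g., RRID:SCR_000000 (placeholder)

    def identifier_kind(value: str) -> str:
        """Classify a string as a DOI, an RRID, or unknown (rough check only)."""
        if DOI_PATTERN.match(value):
            return "DOI"
        if RRID_PATTERN.match(value):
            return "RRID"
        return "unknown"

    print(identifier_kind("10.1000/example.123"))  # -> DOI
    print(identifier_kind("RRID:SCR_000000"))      # -> RRID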

Just as clicking on an author’s name in PubMed can bring up all the articles he or she has written, these interoperable identifiers, once assigned to research models, make it possible to connect the studies employing those models.  Effectively, these identifiers can tie together the three components that underpin data-driven discovery—the literature, the supporting data, and the analytical tools—thus enhancing discoverability and streamlining scientific communication.
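
A toy sketch shows the linking idea: once each study records the identifier of the model it used, the publications that share a model can be pulled together. The records and identifiers below are entirely hypothetical.

    from collections import defaultdict

    # Hypothetical publication records, each carrying the identifier of the model it used.
    publications = [
        {"pmid": "PMID:11111111", "model_id": "RRID:SCR_000000"},
        {"pmid": "PMID:22222222", "model_id": "RRID:SCR_000000"},
        {"pmid": "PMID:33333333", "model_id": "RRID:SCR_999999"},
    ]

    studies_by_model = defaultdict(list)
    for record in publications:
        studies_by_model[record["model_id"]].append(record["pmid"])

    # Every study that employed the same (placeholder) model:
    print(studies_by_model["RRID:SCR_000000"])  # ['PMID:11111111', 'PMID:22222222']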

NLM’s long-standing role in collecting, organizing, and making available the biomedical literature positions us well to take on the task of tracking research models, but is that something we should do?

If so, what might that library of models look like? What else should it include? And how useful would this library of models be to you?

Photo credit (stool, top): Doug Belshaw [Flickr (CC BY 2.0) | erased text from original]

4 thoughts on “Models: The Third Leg in Data-Driven Discovery”

  1. This is an exciting idea with the potential for creating a new way of mining scientific work. It includes at least two areas in need of testing, in my opinion:
    – what sorts of models in what areas of research might be a good starting point, based on potential user perceptions and
    – what descriptors would be needed, or whether RRIDs would be sufficient. I’m thinking especially of social scientists’ need to find models analogous to what they need.

  2. Since one of the purposes of a library is to curate information and knowledge, I think it makes absolute sense that libraries, especially the NLM, do this.

  3. I’ve been meaning to comment on this blog for a while. Your take on this is spot on. I would love to see PubMed become an access point for published models. Our European and New Zealand colleagues have some experience in building model repositories, e.g., BioModels. What’s missing is being able to go from the publication, e.g., in PubMed, directly to the model. This would be incredibly useful.

    The key, however, is curation. Most of the models at these repositories are curated, meaning a check is made that the published model actually produces the published results. In addition, metadata is added to the model to assist in searching and to assist simulation tools. The scientific quality of the paper is not judged by this process; that’s for the manuscript reviewers. I think without curation the repositories would be largely full of models that didn’t work as published. For example, before the advent of standards such as SBML, the bulk of published models were non-functional. This was discovered early on by the curators of repositories.

    All the models on these repositories tend to be based on existing modeling markup languages. I don’t think the repositories store raw code such as MATLAB scripts; however, there is probably a need to store such code. In terms of preference, I would always vote for standard non-executable formats such as SBML, NeuroML, and CellML, simply because you can do a lot more with them. There will, however, be cases where the authors only have raw executable code, so some mechanism needs to be in place to support that.

    As for what models should look like, in the long term I’d like to see models published so that they include all the information necessary to reproduce, validate, and reuse the model. I think you might be thinking along similar lines, but that would be a long-term project. In the short term, a link from PubMed either to NLM’s own repository or to the CellML and BioModels repositories would be a huge benefit to the modeling community.
