Biomedical Discovery through SRA and the Cloud

Guest post by Jim Ostell, PhD, Director of the National Library of Medicine’s National Center for Biotechnology Information, National Institutes of Health.

NLM’s Sequence Read Archive (SRA) is used by more than 100,000 researchers every month, and those researchers now have a tremendous new opportunity to query this database of high-throughput sequence data in new ways for novel discovery: via the cloud. NLM has just finished moving SRA’s public data to the cloud, completing the first phase of an ongoing effort to better position these data for large-scale computing.  

To understand the importance of this move, it’s helpful to consider the analogy of how humans slowly improved their knowledge of the surface of the Earth.

The first simple maps allowed knowledge of terrain to be passed from people who had been there to those who hadn’t. Over the centuries, we learned to sail ships over the oceans and capture new knowledge in navigation charts and techniques. And we learned to fly airplanes over an area of interest and automatically capture detailed images of not only terrain, but also buildings and reservoirs, and assess the conditions of forest, field, and agricultural resources.

Today, with Earth-orbiting satellites, we no longer need to determine in advance what we want to view. We just photograph the whole Earth, all day, every day, and store all the data in a big database. Then we mine the data afterward. The significant change here is that not only can we follow, in great detail, locations or features on the Earth that we already know we’re interested in, as in aerial photography, but we can also discover new things of interest. Examples abound: for instance, noticing a change in a military base, and going back in time to see when the change began or how it developed; or seeing a decline in a forest or watershed, going back in time to see how this decline developed, and then looking geographically to see if it’s happening in other places in the world.

Scientists also can develop new algorithms to extract information from the corpus, or collection, of information. For example, archeologists looking for faint straight-line features indicative of ancient walls or foundations can apply new algorithms to the huge body of existing data to suddenly reveal ancient buildings and cities that were previously unknown.

DNA sequencing has had a similar history, starting from the laborious sequencing of tiny bits of known genomes that could be analyzed by eye (like hand-drawn maps), to the targeting of specific organism genomes to be completely sequenced and then analyzed (similar to aerial photography), to the modern practice of high-throughput sequencing, in which researchers might sequence an entire bacterial genome to study only one gene because it’s easier and cheaper to just measure the whole thing.

However, the significant difference in this analogy is that the ability to search, analyze, or develop new algorithms to explore the huge corpus of high-throughput sequence data is not yet a routine practice accessible to most scientists — as it is for Earth-orbiting satellite data.

Today, scientists expect to be able to routinely explore the entire corpus of targeted genome sequence data through tools such as NLM’s Basic Local Alignment Search Tool (BLAST); very little of the scientific work with genome data is looking for a specific GenBank record. The major scientific work is done by exploring the data in fast, meaningful ways, asking questions such as “Has anyone else seen a protein like this before?”; “What organism is most like the organism I’m working on?”; “Where else has a piece of sequence like this been collected?”; “Is anything known about the function of a piece of sequence like this?” But it has not been possible to do that for the high-throughput, unassembled sequence data, across all such sequences, because that corpus of data has been too big for all but a few places in the world to hold, or to compute across.
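
To make the first of those questions concrete, here is a minimal sketch of asking "has anyone else seen a protein like this before?" programmatically, using Biopython's qblast interface to NCBI BLAST; the query sequence is an arbitrary placeholder, not data from any study.

    # A minimal sketch of querying NCBI BLAST from Python with Biopython.
    # The protein sequence below is an arbitrary placeholder, not real data.
    from Bio.Blast import NCBIWWW, NCBIXML

    query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"

    # Submit the query against the non-redundant protein database (nr).
    result_handle = NCBIWWW.qblast("blastp", "nr", query)

    # Parse the XML results and print the top-scoring alignments.
    record = NCBIXML.read(result_handle)
    for alignment in record.alignments[:5]:
        print(alignment.title, alignment.hsps[0].expect)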

This is now changing.

With support from the National Institutes of Health (NIH) Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative, NLM’s National Center for Biotechnology Information (NCBI) has moved the publicly available high-throughput sequence data from its SRA archive onto two commercial cloud platforms, Google Cloud and Amazon Web Services. For the first time in history, it’s now possible for anyone to compute across this entire 5-petabyte corpus at will, with their own ideas and tools, opening the door to the kind of revolution that was sparked by the availability of a complete corpus of Earth-orbiting satellite images.

The public SRA data include genomes of viruses, bacteria, and nonhuman higher organisms, as well as gene expression data, metagenomes, and a small amount of human genome data that is consented to be public (from the 1000 Genomes Project). NCBI has held, and will continue to hold, codeathons to introduce small groups of scientists to exploring these data in the cloud. For example, during a recent codeathon, participants worked with a set of metagenomes to try to identify known and novel viruses. Other upcoming codeathon cloud topics include RNA-seq, pangenomics, haplotype annotation, and prokaryotic annotation.

Now that the publicly available SRA data are in the cloud, the next milestone is to make all of SRA’s controlled-access human genomic data available on both cloud platforms. Providing access to these data requires a higher level of security and oversight than is required for the nonhuman and publicly available human data, and access must be accompanied by a platform for the authentication and authorization of users, which creates a host of other issues to address. This effort is being undertaken in concert with other major NIH human genome repositories, with guidance from NIH leadership, and with international groups such as the Global Alliance for Genomics and Health (GA4GH).

But, already, the publicly available SRA data are there for biological and computational scientists to take their first dive into the new world of sequence-based “Earth-orbiting satellite photography.” More and more — in research, in clinical practice, in epidemiology, in public health surveillance, in agriculture, in ecology and species diversity — we’ve seen the movement to “just sequence the whole thing.” Now we’ve taken the first step toward the necessary corollary: to “analyze it all afterward.”

In the coming weeks and months, NLM will be making further announcements about SRA in the cloud, with tutorials and updates on the availability of controlled-access human data. If you're already familiar with operating on commercial clouds and would like a look at the SRA data in the cloud, you can get started today via the updated SRA Toolkit.
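
As a hedged illustration of that starting point, the sketch below drives two standard SRA Toolkit commands, prefetch and fasterq-dump, from Python to pull one public run; the accession shown is just a placeholder.

    # A minimal sketch of fetching one public SRA run with the SRA Toolkit.
    # Assumes prefetch and fasterq-dump (standard sra-tools commands) are on
    # PATH; SRR000001 is an arbitrary public accession used as a placeholder.
    import subprocess

    accession = "SRR000001"

    # Download the run; the toolkit resolves the best source location itself.
    subprocess.run(["prefetch", accession], check=True)

    # Convert the downloaded run into FASTQ files in the current directory.
    subprocess.run(["fasterq-dump", accession], check=True)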


Dr. Ostell has had a leadership position at NCBI since its inception in 1988. Before assuming the role of Director in 2017, he was Chief of NCBI’s Information Engineering Branch, where he was responsible for designing, developing, building, and deploying the majority of production resources at NCBI, including flagship products such as PubMed and GenBank. Dr. Ostell was inducted into the United States National Academies, Institute of Medicine, in 2007 and made an NIH Distinguished Investigator in 2011.

To stay up to date on NCBI projects and research, follow us on Twitter.


Enhancing Data Sharing, One Dataset at a Time

Guest post by Susan Gregurick, PhD, Associate Director for Data Science and Director, Office of Data Science Strategy, National Institutes of Health

Circular graphic showing Findable, Accessible, Interoperable, and Reusable aspects of the Vision of the NIH Strategic Plan for Data Science
Vision of the NIH Strategic Plan for Data Science

The National Institutes of Health (NIH) has an ambitious vision for a modernized, integrated biomedical data ecosystem. How we plan to achieve this vision is outlined in the NIH Strategic Plan for Data Science, and the long-term goal is to have NIH-funded data be findable, accessible, interoperable, and reusable (FAIR). To support this goal, we have made enhancing data access and sharing a central theme throughout the strategic plan.

While the topic of data sharing itself merits greater discussion, in this post I’m going to focus on one primary method for sharing data, which is through domain-specific and generalist repositories.

The landscape of biomedical data repositories is vast and evolving. Currently, NIH supports many repositories for sharing biomedical data. These data repositories all have a specific focus, either by data type (e.g., sequence data, protein structure, continuous physiological signals) or by biomedical research discipline (e.g., cancer, immunology, or clinical research data associated with a specific NIH institute or center), and often form a nexus of resources for their research communities. These domain-specific, open-access data-sharing repositories, whether funded by NIH or other sources, are good first choices for researchers, and NIH encourages their use.

NIH’s PubMed Central is a solution for storing and sharing datasets directly associated with publications and publication-related supplemental materials (up to 2 GB in size). On the other end of the spectrum, “big” datasets, comprising petabytes of data, are now starting to leverage cloud service providers (CSPs), including through the NIH Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative. These are still the early days of data sharing through CSPs, and we anticipate that this will be an active area of research.

There are, however, instances in which researchers are unable to find a domain-specific repository applicable to their research project. In these cases, a generalist repository that accepts data regardless of data type or discipline may be a good fit. Biomedical researchers already share data, software code, and other digital research products via many generalist repositories hosted by various institutions—often in collaboration with a library—and recommended by journals, publishers, or funders. While NIH does not have a recommended generalist repository, we are exploring the roles and uses of generalist repositories in our data repository landscape.

screenshot of NIH Figshare homepage
NIH Figshare homepage https://nih.figshare.com

For example, as part of our exploratory strategy, NIH recently launched an NIH Figshare instance, a short-term pilot project with the generalist repository Figshare. This pilot provides NIH-funded researchers with a generalist repository option for up to 100 GB of data per user. The NIH Figshare instance complies with FAIR principles; supports a wide range of data and file types; captures customized metadata; and provides persistent unique identifiers with the ability to track attention, use, and reuse.
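
To give a flavor of what programmatic access to a generalist repository looks like, here is a small sketch against Figshare's public REST API, based on Figshare's general v2 API as I understand it; the endpoint and field names are assumptions to verify against the current documentation.

    # A minimal sketch of listing public item metadata via the Figshare v2 API.
    # Endpoint and field names are assumptions to check against the docs.
    import requests

    resp = requests.get(
        "https://api.figshare.com/v2/articles",
        params={"page": 1, "page_size": 5},  # a small page of public items
    )
    resp.raise_for_status()

    for item in resp.json():
        # Each record carries a persistent DOI alongside descriptive metadata.
        print(item["id"], item["doi"], item["title"])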

NIH Figshare is just one part of our approach to understanding the role of generalist repositories in making biomedical research data more discoverable. We recognize that making data more FAIR is no small task and certainly not one that we can accomplish on our own. Through this pilot project, and other related projects associated with implementing NIH’s strategy for data science, we look forward to working with the biomedical community—researchers, librarians, publishers, and institutions, as well as other funders and stakeholders—to understand the evolving data repository ecosystem and how to best enable useful and usable data sharing.

Together we can strengthen our data repository ecosystem and ultimately, accelerate data-driven research and discovery. We invite you to join our efforts by sending your ideas and needs to datascience@nih.gov.

Susan Gregurick, PhD

Dr. Gregurick leads the NIH Strategic Plan for Data Science through scientific, technical, and operational collaboration with the institutes, centers, and offices that comprise NIH. She has substantial expertise in computational biology, high performance computing, and bioinformatics.

Defining the Path Forward for NLM’s New Office of Engagement and Training

Guest post by Amanda J. Wilson, Chief, Office of Engagement and Training, NLM.

During the NLM Board of Regents (BOR) meeting held last week, I had the distinct honor of introducing the new Office of Engagement and Training (OET). This office brings together many of the outreach, training, and capacity-building staff, programs, and services from across the Library.

Since OET was established in June 2019, our team has been occupied with moving into our new space, getting to know one another, exploring the depth and capacity of the resources we have to accomplish our goals, discussing what the future holds for our role in coordinating engagement activities, and reflecting as a team on the niche we fill for NLM. In the midst of this summer flurry of activity and, quite frankly, the more mundane tasks of figuring out the fastest way to answer the door to our offices and the mechanics of mail distribution, some themes rose to the top about what we can, and hope to, become.

Our vision for OET is a resource that serves the NLM community as a strategic connector between NLM and our audiences and, across the Library, as a trusted authority on how users experience NLM resources. We are also an incubator for new approaches to engagement.

What, exactly, does that mean?

It means we understand the broad range of both new and existing NLM users, their needs, and the most effective pathways to reach them. And it also means we are closely connected to NLM researchers, developers, information professionals, program managers, and product owners, including knowing what information is most important to them and has the greatest impact on their work.

This vision also involves knowing how all segments of NLM’s audiences respond to different types of engagement activities. That knowledge will position OET to use our expertise, capabilities, and connections to bring NLM’s trusted resources to communities when and where those resources are needed most. And, considering our unique position, it means we can be a catalyst for exploring novel, effective ways to connect, build, and enhance opportunities for all audiences to engage with NLM.

But that’s not all.

As we started working toward these goals and aspirations, we asked the BOR for advice and thoughts to guide us. For some activities that we currently engage in, such as surveys, webinars, meetings, and exhibits, the BOR provided encouragement for us to continue. The BOR also challenged OET to explore new strategies for engagement, such as working with U.S. Public Health Service Commissioned Corps officers who are part of the Prevention through Active Community Engagement (PACE) program in the Office of the Surgeon General. Another suggestion was to engage in community theater productions to help convey our message.

The possibilities that BOR members provided, as well as input from our colleagues at NLM and other partners, have given OET much to consider as we chart our path forward.

What does this vision of OET mean to you?

I’ve been called corny by one of my colleagues (said with a smile) for my obvious enthusiasm about the future of OET. But I absolutely embrace that sentiment! I’m enthusiastic because I have an opportunity to lead a wonderful team of experienced, knowledgeable colleagues dedicated to our mission. I’m also enthusiastic because OET has the support of NLM leadership and the BOR to continue creating an office that supports NLM’s goals with evidence-based engagement and training, built on collaboration and inclusivity and with an eye to the future.

This is an exciting time, and I look forward to all that we can do together! I invite you to join us along the way.

Photo of Amanda Wilson, Chief of the Office of Engagement and Training.

Amanda J. Wilson is Chief of the NLM Office of Engagement and Training (OET), bringing together general engagement, training, and outreach staff from across NLM to focus on the Library’s presence across the U.S. and internationally. OET is also home to the Environmental Health Information Partnership for NLM and coordinates the National Network of Libraries of Medicine. Wilson first came to NLM when appointed Head, National Network Coordinating Office, in January 2017.

NLM Scientists Contribute to AI for Medical Image Interpretation

Guest post by Sameer Antani, PhD, Staff Scientist, Acting Branch Chief for the Communications Engineering Branch and Computer Science Branch at the National Library of Medicine’s Lister Hill National Center for Biomedical Communications, National Institutes of Health.

Artificial intelligence (AI) has become one of the hottest fields of the 21st century. But AI isn’t a new concept. It’s older than I am!

AI—or, more specifically, machine learning-based automated intelligent decision support—is making inroads in applications that we could only dream about just a few decades ago, such as automated check recognition, movie and video recommendations, and self-driving vehicles.

And in the near term, AI's role may be in computer-based applications that use data-derived knowledge to support or advance human activities that are tedious, repetitive, and relatively deterministic, especially where expert resources are lacking. In other words, AI may not only help solve budget issues, it may also help reduce boredom.

The idea of an artificial brain was initially promoted by a handful of scientists from different fields, resulting in the founding of AI research as an academic discipline in 1956. After some initial discoveries, and a clearer understanding of the challenges involved, the field lost steam during the last decades of the 20th century. However, advances continued in the form of various statistical pattern recognition and machine learning techniques.

Then, in 2012, a breakthrough in deep learning was published. The image-classification error rate had been cut in half for the ImageNet dataset in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). By 2017 the best AI algorithms were detecting and recognizing objects in photographic images at an impressive accuracy rate of more than 97%, surpassing human performance.

Since then, AI in imaging has become a relatively mature field. But the use of AI in medical imaging continues to challenge us. We need to recognize that much of the field’s success in medical imaging has been within a narrow focus on specific tasks for which AI has been trained, and that this success depends on the data to which AI has been exposed.

Here at NLM, we’ve been working on image informatics research and advancing computational science techniques and information retrieval using traditional machine-learning methods for many years, even before the advent of deep learning. 

Some of AI’s most exciting applications are happening in underserved and under-resourced regions, and imaging-based AI can help fill the gaps where medical expertise may be limited. My fellow NLM scientists and I have applied and contributed to advancing AI techniques to predict tuberculosis and other pulmonary diseases in digital chest X-ray images, screen for malaria parasites in microscopic blood smear images, and detect age-related eye diseases. A recent landmark paper showed that an AI algorithm was superior to human experts in identifying cervical precancer in women.

These findings are consistent with other AI advances in medical imaging reported in the scientific literature, including reading CT scans for lung-cancer screening, detecting brain tumors, screening for diabetic retinopathy, digital pathology applications for precision oncology, and performing radiologist-level pneumonia detection on chest X-rays. While many of these exemplify amazing advances in medical imaging AI, some are built on or have humble beginnings in outcomes from ImageNet’s object localization and recognition challenge.
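
To illustrate those humble beginnings, here is a minimal sketch of the common transfer-learning recipe: start from an ImageNet-pretrained network and train only a small task-specific head. The backbone choice, input size, and two-class setup are illustrative assumptions, not a description of NLM's actual models.

    # A minimal transfer-learning sketch: reuse ImageNet features for a
    # two-class medical imaging task. Illustrative only, not NLM's models.
    import tensorflow as tf

    # ImageNet-pretrained backbone with its classification head removed.
    backbone = tf.keras.applications.DenseNet121(
        weights="imagenet",
        include_top=False,
        input_shape=(224, 224, 3),
        pooling="avg",
    )
    backbone.trainable = False  # freeze the pretrained features

    # A small task-specific head to be trained on the medical images.
    model = tf.keras.Sequential([
        backbone,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
    # model.fit(train_images, train_labels, ...) would then train only the head.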

NLM’s strategic plan for building a platform for data-driven discovery and health guides our research efforts. We’re developing novel AI algorithms; gaining a deeper understanding of AI decision-making (also known as explainable AI); measuring the impact of data variety, volume, and quality; and identifying more ways to address gaps in translating technical advances to have a positive impact on biomedical research and clinical care.

Our research interests also include intelligent ensembles of deep learning networks, where each type of network learns something different from the data, and the learned knowledge is then transferred and fused into other sets. This effort is particularly important for rare diseases, where the number of samples in the population tends to be small. Unlike humans, AI finds it challenging to learn key patterns from only a few samples. But we're trying to develop this capacity.
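
In its simplest baseline form, fusing an ensemble amounts to averaging the member networks' predicted probabilities, as in this sketch; the actual research involves richer forms of knowledge transfer than this.

    # A minimal sketch of ensemble fusion by averaging predicted probabilities.
    # `models` would hold separately trained networks (illustrative only).
    import numpy as np

    def ensemble_predict(models, images):
        # Each member scores the same inputs; predictions are fused by averaging.
        member_probs = [m.predict(images) for m in models]
        return np.mean(member_probs, axis=0)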

Breakthroughs in modern AI techniques in medical imaging are empowering, but these are still early days. For now, AI appears far closer to achieving its potential as a smart assistive technology than to replacing human expertise.

I continue to dream of a future in which AI makes our lives healthier and our health care delivery more effective.

 

Photo of Sameer Antani, PhD

Dr. Antani is a versatile lead researcher advancing the role of computational sciences and automated decision-making in biomedical research, education, and clinical care. His research interests include topics in medical imaging and informatics, machine learning, data science, artificial intelligence, and global health. His primary areas of research and development include cervical cancer, HIV/TB, and visual information retrieval, among others.

Taking Flight: NLM’s Data Science Journey

Guest post by the Data Science @NLM Training Program team.

Data science at NLM is ready to soar!

In 2018, we embarked on a journey to build a workforce ready to take on the challenges of data-driven research and health, and earlier this year we shared our plans for accelerating data science expertise at NLM. Now, it’s time to reflect on our progress and recognize our accomplishments.

Our Data Science @NLM Training Program Open House, held last week, showcased some of the great data science work happening across the Library. We learned from each other and discovered new opportunities to strengthen the Library’s proficiencies in working with data and using analytic tools, furthering NLM’s research practices and services.

Data Science @NLM Poster Gallery

A poster gallery featuring 77 research posters and data visualizations provided a snapshot of the many ways that NLM staff apply data science to their work. It was great to see so many NLM staff sharing their work and engaging in stimulating conversations about innovation.

Three “lightning” presentations gave a glimpse of how NLM staff use data science. NLM Data Science and Open Science Librarian Lisa Federer, PhD, MLIS, talked about building a librarian workforce to engage with researchers on open science and data science. NLM’s Rezarta Islamaj, PhD, and Donald Comeau, PhD, presented their perspectives on enriching gene and chemical links in PubMed and PubMed Central and on evaluating Medical Subject Headings (MeSH) indexing for literature retrieval in PubMed.

The open house was also an opportunity for NLM staff who participated in an intensive 120-hour data science fundamentals course to share what they learned and how they’re applying their new skills.  

But this event was more than a celebration of accomplishments. It provided space to reflect on lessons learned, how to use what we’ve learned on a daily basis, and hopes for the future of data science at NLM. Dina Demner-Fushman, MD, PhD, of NLM dove into data science methodologies in her discussion of the Biomedical Citation Selector (BmCS), a high-recall machine-learning system that identifies which articles from MEDLINE’s selectively indexed journals require indexing.
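
The "high-recall" requirement reflects a general pattern: instead of a default 0.5 cutoff, choose the decision threshold that meets a target recall on held-out data. The sketch below illustrates that idea with scikit-learn; it is not the actual BmCS implementation, and the classifier and target value are placeholders.

    # A minimal sketch of tuning a decision threshold for high recall,
    # the general pattern behind systems like BmCS (not its actual code).
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve

    def high_recall_threshold(X_train, y_train, X_val, y_val, target_recall=0.99):
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        probs = clf.predict_proba(X_val)[:, 1]
        _, recall, thresholds = precision_recall_curve(y_val, probs)
        # recall[:-1] aligns with thresholds; keep the highest threshold
        # that still meets the recall target (favoring precision).
        meets = recall[:-1] >= target_recall
        threshold = thresholds[meets].max() if meets.any() else thresholds.min()
        return clf, threshold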

Data Science @NLM Ideas Booth

NLM staff brainstormed over 60 ideas to bring data science solutions to new and ongoing projects and talked with data science experts at the open house “ideas booth.” Staff also shared how they will learn, or continue to use, data science in support of their individual career goals.

We were delighted to see over 300 NLM staff participating in the open house, which is just one of the ways that NLM is working to achieve goal 3 of the NLM strategic plan to “build a workforce for data-driven research and health.”

The Data Science @NLM Training Program has helped increase NLM staff awareness of and expertise in data science. NLM staff are now better prepared than ever to demonstrate the Library’s commitment to accelerating biomedical discovery and data-powered health.   

Our data science journey continues, as does the growth of the data science community at NLM. For a recap of the day, follow the experience at #datareadynlm.

We’re taking off!


Photo of the Data Science @NLM Training Program team: Dianne Babski, Peter Cooper, Lisa Federer, and Anna Ripple
Data Science @NLM Training Program team (left to right):
Dianne Babski, Deputy Associate Director, Library Operations
Peter Cooper, Strategic Communications Team Lead, National Center for Biotechnology Information
Lisa Federer, Data Science and Open Science Librarian, Office of Strategic Initiatives
Anna Ripple, Information Research Specialist, Lister Hill National Center for Biomedical Communications

Engaging Users to Support the Modernization of ClinicalTrials.gov

Guest post by Rebecca Williams, PharmD, MPH, acting director of ClinicalTrials.gov at the National Library of Medicine, National Institutes of Health.

ClinicalTrials.gov is the largest public clinical research registry and results database in the world – providing patients, health care providers, and researchers with information on more than 300,000 clinical studies of a wide range of diseases and conditions. More than 145,000 unique visitors use the public website daily to find and learn about clinical studies, resulting in an average of 215 million pageviews each month.

Recognizing the value of ClinicalTrials.gov to millions of users, the Board of Regents of the National Library of Medicine (NLM) described in the 2017-2027 strategic plan the importance of ensuring the long-term sustainability of this resource. NLM is committed to this goal and aims to modernize ClinicalTrials.gov to deliver a modern user experience on a flexible, extensible, scalable, and sustainable platform that will accommodate growth and enhance efficiency.

We are undertaking this effort to make ClinicalTrials.gov an even more valuable resource with a renewed commitment to engage with and serve the people who rely on it.

These users include the sponsors and investigators who submit clinical trial information for inclusion on the site through the submission portal. They also include patients, health care providers, and researchers who access listed information on ClinicalTrials.gov, whether directly or indirectly through other sites and services that use the ClinicalTrials.gov application programming interface.
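
For readers who work with that interface programmatically, here is a minimal sketch of one such query in Python; it targets the current v2 REST endpoint (which postdates this post), and the condition term and field names should be checked against the live API documentation.

    # A minimal sketch of querying the ClinicalTrials.gov v2 API from Python.
    # The condition term is a placeholder; verify fields against the live docs.
    import requests

    resp = requests.get(
        "https://clinicaltrials.gov/api/v2/studies",
        params={"query.cond": "diabetes", "pageSize": 5},
    )
    resp.raise_for_status()

    for study in resp.json()["studies"]:
        ident = study["protocolSection"]["identificationModule"]
        print(ident["nctId"], ident.get("briefTitle"))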

Over the past several years, we have conducted testing with users and have already made some improvements in response to this feedback. With modernization, we will continue to support key functions identified by users of ClinicalTrials.gov while also seeking ways to make it an even more valuable resource.

To continue the modernization process, we are now seeking broader engagement with users to further help us determine how to evolve ClinicalTrials.gov. We are spending this summer looking inward by engaging our fellow National Institutes of Health Institutes and Centers to understand how ClinicalTrials.gov could better help fulfill NIH’s goals of clinical trial stewardship and transparency.

This fall, we plan to expand our reach outward and are proposing to establish a working group of the NLM Board of Regents to focus on the modernization of ClinicalTrials.gov. This working group will provide a transparent forum for communicating and receiving input about efforts to enrich and modernize ClinicalTrials.gov. We want to ensure that we understand and consider changing needs while simultaneously maximizing the value of the growing amount of available information and preserving the integrity of ClinicalTrials.gov as a trusted resource.

We’ve already taken some steps to be more proactive in communicating with our users. We just launched “Hot Off the PRS!” (sign up to receive email announcements), a new informational bulletin for users of the ClinicalTrials.gov Protocol Registration and Results System (PRS). These updates provide timely announcements about new PRS features, relevant regulations (42 CFR Part 11) and policies, and information about other offerings such as the PRS Guided Tutorials (BETA), a new training resource with step-by-step instructions for submitting results information.

We’re excited about how greater user engagement will enrich and modernize ClinicalTrials.gov, improving its value for everyone throughout the clinical research lifecycle.

Please let us know what else we can do to make ClinicalTrials.gov the best it can be.

Photo of Rebecca Williams, PharmD, MPH

Rebecca Williams, PharmD, MPH, oversees the technical, scientific, policy, regulatory and outreach activities related to the operation of ClinicalTrials.gov. Her research interests relate to improving the quality of reporting of clinical research and evaluating the clinical research enterprise.

On the Ethics of Using Social Media Data for Health Research

Guest post by Dr. Graciela Gonzalez-Hernandez, associate professor of informatics at the Perelman School of Medicine, University of Pennsylvania.

Social media has grown in popularity for health-related research as it has become evident that it can be a good source of patient insights. Be it Twitter, Reddit, Instagram, Facebook, Amazon reviews, or health forums, researchers have collected and processed user comments and published countless papers on different uses of social media data.

Using these data can be a perfectly acceptable research practice, provided they are used ethically and the research approach is solid. I will not discuss solid scientific principles and statistically sound methods for social media data use here, though. Instead, I will focus on the much-debated ethical principles that should guide observational studies done with social media data.

To help frame our discussion, let’s consider why the ethics of social media data use is called into question. Almost invariably when I present my work in this area or submit a proposal or paper, someone raises the question of ethics, often despite my efforts to address it upfront. I believe this reticence or discomfort comes from the idea that the data can be traced back to specific people and the fear that using the data could result in harm. Some research with social media data might seem innocuous enough. One might think no harm could possibly come from making available the collected data or specific tweets on topics like smoking cessation and the strategies people find effective or not. But consider data focusing on topics such as illegal substance use, addiction recovery, mental health, prescription medication abuse, or pregnancy. Black and white can quickly turn to gray.

Before going further, it is important to understand the fundamental rules for this type of research in an academic setting. In general, researchers who want to use social media data apply to their institutional review board (IRB) for review. Research activities involving human subjects and limited to one or more of the exempt categories defined by federal regulations receive an “exempt determination” rather than “IRB approval.” In the case of social media data, the exemption for existing data, documents, records, and specimens detailed in 45 CFR 46.101(b)(4) generally applies, as long as you don’t contact individual users as part of the research protocol and the data to be studied are openly and publicly available. If you will be contacting individual users, the study becomes more like a clinical trial, needing “informed consent” and full IRB review. (See the National Institutes of Health’s published guidelines for this case.)

Furthermore, exempt studies are so named because they are exempt from some of the federal regulations that apply to human-subjects research. They are not exempt from state laws, institutional policies, or the requirements for ethical research. Most of all, they are not exempt from plain old common sense.

But when it comes to the existing-data exemption, which data are “openly and publicly available” is open to question. To be safe, use only data available to all users of the platform without any extra permissions or approvals. No data from closed forums or groups that would require one to “join” within the platform should be considered “openly and publicly available.” After all, members of such groups generally expect their discussions are “private,” even if the group is large.

Beyond that, when deciding how to use the data or whether to publish the data directly, ask yourself whether revealing the information in a context other than where it was originally posted could result in harm to the people who posted it, either now or later. For example, you could include specific social media posts as examples in a scientific paper, but, if the topic was delicate, you might choose not to publish a post verbatim, instead changing the wording so a search of the host platform would not lead someone to the user. In the case of platforms like Reddit that are built around anonymity, this language modification would not be necessary. If possible, use aggregate data (e.g., counts or topics discussed) rather than individual social media posts.
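
As a concrete instance of the aggregate-data suggestion, this small sketch reduces a set of posts to keyword counts so results can be reported without quoting anyone; the keyword list is purely illustrative.

    # A minimal sketch of reporting aggregate keyword counts instead of raw posts.
    # The keywords are purely illustrative; real studies would use vetted lexicons.
    from collections import Counter

    def keyword_counts(posts, keywords=("quit", "patch", "craving")):
        counts = Counter()
        for post in posts:
            text = post.lower()
            for kw in keywords:
                if kw in text:
                    counts[kw] += 1
        return counts  # safe to report without quoting any individual user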

However you approach your research, datasets used for automatic language processing experiments need to be shared for the results to be reproducible. What form that sharing takes depends on the data source, but reproducibility does not take a back seat just because the data come from social media. To help you further consider the question of how to use or share these data, check out the guidelines published by the Association of Internet Researchers. These guidelines include a comprehensive set of practical questions to help you decide on an ethical approach, and I highly recommend them. In their study of the ethics of social media use, Moreno et al. also address some practical considerations and offer a good summary of the issues.

We are now ready to consider what constitutes ethical research. Ethics, or principles of right conduct, apply to institutions that conduct research, whether in academia or industry. Although ethics is sometimes used interchangeably with morals, what constitutes ethical behavior is less subjective and less personal, defining correct behavior within a relatively narrow area of activity. While there will likely never be a universally agreed-upon code of ethics for every area of scientific activity, a number of groups have established principles relevant to social media-based research, including the American Public Health Association, the American Medical Informatics Association, and the previously mentioned Association of Internet Researchers. Principles of research ethics and the ethical treatment of persons center on the principle of “do no harm,” but it falls to IRBs to determine whether harm could result from your approach and whether your proposed research is ethical. Even so, review boards can reach different conclusions, as recent work examining attitudes toward the use of social media data for health research has shown.

So where does that leave those of us looking to conduct health research using social media data?

Take a “stop and think” and “when in doubt, ask” approach before finalizing a study and investing time. Help ensure the researcher’s interests are balanced against those of the people involved (i.e., the users who posted the data) by putting yourself in their shoes. Be cognizant of the needs and concerns of vulnerable communities who might require greater protection, but don’t assume that research involving social media data should not be done or that the data cannot be shared. If the research was ethically conducted, then social media data can and should be shared as part of the scientific process to ensure reproducibility, and there is a lot that can be gained from pursuing it.

Headshot of Dr. Graciela Gonzalez-Hernandez

Graciela Gonzalez-Hernandez, MS, PhD, is a recognized expert and leader in natural language processing applied to bioinformatics, medical/clinical informatics, and public health informatics. She is an associate professor with tenure at the Perelman School of Medicine, University of Pennsylvania, where she leads the Health Language Processing Lab within the Institute for Biomedical Informatics and the Department of Biostatistics, Epidemiology, and Informatics.