On the Ethics of Using Social Media Data for Health Research

Guest post by Dr. Graciela Gonzalez-Hernandez, associate professor of informatics at the Perelman School of Medicine, University of Pennsylvania.

Social media has grown in popularity for health-related research as it has become evident that it can be a good source of patient insights. Be it Twitter, Reddit, Instagram, Facebook, Amazon reviews or health forums, researchers have collected and processed user comments and published countless papers on different uses of social media data.

Using these data can be a perfectly acceptable research practice, provided they are used ethically and the research approach is solid. I will not discuss solid scientific principles and statistically sound methods for social media data use here, though. Instead, I will focus on the much-debated ethical principles that should guide observational studies done with social media data.

To help frame our discussion, let’s consider why the ethics of social media data use is called into question. Almost invariably when I present my work in this area or submit a proposal or paper, someone raises the question of ethics, often despite my efforts to address it upfront. I believe this reticence or discomfort comes from the idea that the data can be traced back to specific people and the fear that using the data could result in harm. Some research with social media data might seem innocuous enough. One might think no harm could possibly come from making available the collected data or specific tweets on topics like smoking cessation and the strategies people find effective or not. But consider data focusing on topics such as illegal substance use, addiction recovery, mental health, prescription medication abuse, or pregnancy. Black and white can quickly turn to gray.

Before going further, it is important to understand the fundamental rules for this type of research in an academic setting. In general, researchers who want to use social media data apply to their institutional review board (IRB) for review. Research activities involving human subjects and limited to one or more of the exempt categories defined by federal regulations receive an “exempt determination” rather than “IRB approval.” In the case of social media data, the exemption for existing data, documents, records, and specimens detailed in 45 CFR 46.101(b)(4) generally applies, as long as you don’t contact individual users as part of the research protocol and the data to be studied are openly and publicly available. If you will be contacting individual users, the study becomes more like a clinical trial, needing “informed consent” and full IRB review. (See the National Institutes of Health’s published guidelines for this case.)

Furthermore, exempt studies are so named because they are exempt from some of the federal regulations that apply to human-subjects research. They are not exempt from state laws, institutional policies, or the requirements for ethical research. Most of all, they are not exempt from plain old common sense.

But when it comes to the existing-data exemption, which data are “openly and publicly available” is open to question. To be safe, use only data available to all users of the platform without any extra permissions or approvals. No data from closed forums or groups that would require one to “join” within the platform should be considered “openly and publicly available.” After all, members of such groups generally expect their discussions are “private,” even if the group is large.

Beyond that, when deciding how to use the data or whether to publish the data directly, ask yourself whether revealing the information in a context other than where it was originally posted could result in harm to the people who posted it, either now or later. For example, you could include specific social media posts as examples in a scientific paper, but, if the topic was delicate, you might choose not to publish a post verbatim, instead changing the wording so a search of the host platform would not lead someone to the user. In the case of platforms like Reddit that are built around anonymity, this language modification would not be necessary. If possible, use aggregate data (e.g., counts or topics discussed) rather than individual social media posts.
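As a minimal sketch of the aggregate-data option, the Python below tallies how many posts mention each topic and keeps only the counts, so neither the post text nor the user handle appears in the output. The sample posts, the `TOPIC_KEYWORDS` mapping, and the `topic_counts` function are hypothetical illustrations, not part of any published protocol; a real study would likely need a more careful way of assigning topics than simple keyword matching.

```python
from collections import Counter

# Hypothetical keyword lists, for illustration only.
# A real study would use a validated topic classifier.
TOPIC_KEYWORDS = {
    "smoking cessation": ["quit smoking", "nicotine patch"],
    "medication": ["prescription", "dose"],
}


def topic_counts(posts):
    """Count posts per topic, discarding post text and user identifiers."""
    counts = Counter()
    for post in posts:
        text = post["text"].lower()
        for topic, keywords in TOPIC_KEYWORDS.items():
            # A post counts at most once per topic.
            if any(keyword in text for keyword in keywords):
                counts[topic] += 1
    return dict(counts)


# Made-up example posts; not real user data.
sample_posts = [
    {"user": "user_a", "text": "Day 3 since I quit smoking, the nicotine patch helps."},
    {"user": "user_b", "text": "My doctor changed my prescription dose."},
    {"user": "user_c", "text": "Trying to quit smoking again this year."},
]

summary = topic_counts(sample_posts)
print(summary)
```

Only `summary` would be reported or shared in this approach; the individual posts stay with the research team.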

However you approach your research, datasets used for automatic language processing experiments need to be shared for the results to be reproducible. Which format this takes depends on the data source, but reproducibility does not take a back seat just because these are social media data. To help you further consider the question of how to use or share these data, check out the guidelines published by the Association of Internet Researchers. These guidelines include a comprehensive set of practical questions to help you decide on an ethical approach, and I highly recommend them. In their study of the ethics of social media use, Moreno et al. also address some practical considerations and offer a good summary of the issues.

We are now ready to consider what constitutes ethical research. Ethics, or principles of right conduct, apply to institutions that conduct research, whether in academia or industry. Although ethics is sometimes used interchangeably with morals, what constitutes ethical behavior is less subjective and less personal, defining correct behavior within a relatively narrow area of activity. While there will likely never be a generally agreed-upon code of ethics for every area of scientific activity, a number of groups have established principles relevant to social media-based research, including the American Public Health Association, the American Medical Informatics Association, and the previously mentioned Association of Internet Researchers. Principles of research ethics and ethical treatment of persons center on the policy of “do no harm,” but it falls to IRBs to determine if harm could result from your approach and whether your proposed research is ethical. Even so, review boards might reach differing opinions, as recent work on attitudes toward the use of social media data for health research has shown.

So where does that leave those of us looking to conduct health research using social media data?

Take a “stop and think” and “when in doubt, ask” approach before finalizing a study and investing time. Help ensure the researcher’s interests are balanced against those of the people involved (i.e., the users who posted the data) by putting yourself in their shoes. Be cognizant of the needs and concerns of vulnerable communities who might require greater protection, but don’t assume that research involving social media data should not be done or that the data cannot be shared. If the research was ethically conducted, then social media data can and should be shared as part of the scientific process to ensure reproducibility, and there is a lot that can be gained from pursuing it.

Graciela Gonzalez-Hernandez, MS, PhD, is a recognized expert and leader in natural language processing applied to bioinformatics, medical/clinical informatics, and public health informatics. She is an associate professor with tenure at the Perelman School of Medicine, University of Pennsylvania, where she leads the Health Language Processing Lab within the Institute for Biomedical Informatics and the Department of Biostatistics, Epidemiology, and Informatics.

Information Along the Underground Railroad

A couple years ago, I wrote about how the paintings in Jacob Lawrence’s Migration Series inspired me to think about how the National Library of Medicine gets information to people on the move—people displaced by violence, natural disasters, or economic crises. I felt a similar stirring after viewing the Jeanine Michna-Bales exhibition Photographs of the Underground Railroad at the Phillips Collection last month.

The deep indigo and shadowy black of Michna-Bales’ photographs stand in stark contrast to the oranges, greens, and yellows of Lawrence’s paintings, which occupy a room across the hall at the Phillips, but both have things to tell me.

Michna-Bales’ collection of nighttime photographs immediately pulled me in, helping me sense a whisper of the fear and anxiety escaping slaves might have felt as they slogged their way north toward freedom. The dark, shadowed images required me to peer in closely to detect a house or barn that might have provided a safe place to hide—or concealed danger. The Drinking Gourd constellation, isolated in the night sky, guided the travelers north along dirt roads and winding rivers, while cypress swamps, mangroves, and thick vegetation, barely perceptible in the moonlight, slowed passage.

It’s a chilling piece of history brought to life through the photographer’s lens, but as the exhibition curator underscored, slavery still exists today. More than 20 million people are enslaved around the world. More than 50% are women; 25% are children under the age of 18. These staggering figures cry out for redress.

What can NLM do to help those working to combat this crisis or treat its victims?

We provide information to those on the front lines.

The Library’s literature can help primary care physicians and emergency room staff identify patients at risk and potentially rescue victims of human trafficking. It can help clinicians deliver health care that is both trauma-informed and culturally sensitive, attuned to victims’ needs and backgrounds. It can give educators ways to train health professionals to recognize and help victims, offer policy makers strategies to reduce human trafficking, and encourage the global health community to investigate the social and economic elements that drive such exploitation. The Library also has articles on human trafficking for the horrific purpose of organ removal and others on the relationship between human trafficking and stress-related illnesses and drug use among survivors.

It’s a harrowing collection but a necessary one, if we are to combat this crisis.

To further help those who are fighting this fight, PubMed lists articles similar to the ones initially found, helping to shape a coherent picture of the clinical challenges, health services, and public policies that can counteract this crime or mitigate its effects. We also provide the free full text of publicly funded research on this topic.

We may be able to do even more in the future. I see opportunities to tailor the health information we provide to the personal culture, worries, and recent experiences of the person searching. It’s a bold vision, but reaching the most vulnerable makes it worth the effort.


If you think someone may be a victim of human trafficking, call or encourage them to call the National Human Trafficking Hotline at (888) 373-7888 for help, resources, and information. You may also text 233733.

Socio-legal Barriers to Data Reuse

Envisioning a sustainable data trust

Guest post by Melissa Haendel, PhD, a leader of and advocate for open science initiatives.

The increasing volume and variety of biomedical data have created new opportunities to integrate data for novel analytics and discovery. Despite a number of clinical success stories that rely on data integration (e.g., rare disease diagnostics, cancer therapeutic discovery, drug repurposing), within the academic research community, data reuse is not typically promoted. In fact, data reuse is often considered “not innovative” in funding proposals and has even come under attack. (See the now infamous “research parasites” editorial in The New England Journal of Medicine.)

The FAIR data principles—Findable, Accessible, Interoperable, and Reusable—are a terrific set of goals for all of us to strive for in our data sharing, but they detail little about how to realize effective data reuse. If we are to grow innovation from our collective data resources, we must look to pioneers in data harmonization for insight into the specific advantages and challenges of data reuse at scale. Current data-licensing practices for most public data resources severely hamper data reuse, especially at scale. Integrative platforms such as the Monarch Initiative, the NCATS Biomedical Data Translator, the Gabriella Miller Kids First Data Resource Portal, and myriad other cloud data platforms will be able to accelerate scientific progress more effectively if licensing issues can be resolved. As a member of these various consortia, I want to facilitate the legal use and reuse of increasingly interconnected, derived, and reprocessed data. The community has previously raised this concern in a letter to NIH.

How reusable are most data resources? In our recently published manuscript, we created a rubric for evaluating the reusability of a data resource from the licensing standpoint. We applied this rubric to more than 50 biomedical data and knowledge resources. These assessments and the evaluation platform are openly available at the (Re)usable Data Project (RDP). Each resource was scored on a scale of zero to five stars on the following measures:

  • findability and type of licensing terms
  • scope and completeness of the licensing
  • ability to access the data in a reasonable way
  • restrictions on how the data may be reused, and
  • restrictions on who may reuse the data.

We found that 57% of the resources scored three stars or fewer, indicating that license terms may significantly impede the use, reuse, and redistribution of the data.
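To make the arithmetic behind a summary like this concrete, here is a small Python sketch. It assumes, purely for illustration, that each of the five measures contributes between zero and one star and that a resource's score is the sum; the measure names, resource names, and scores are invented and are not RDP data or the RDP's actual scoring code.

```python
# Hypothetical identifiers for the five licensing measures listed above.
MEASURES = [
    "license_findability",
    "license_scope",
    "data_access",
    "reuse_restrictions",
    "reuser_restrictions",
]


def star_score(measure_scores):
    """Sum per-measure scores (assumed 0 to 1 each) into a 0-5 star total."""
    for name in MEASURES:
        value = measure_scores[name]
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(measure_scores[name] for name in MEASURES)


def share_at_or_below(scores, threshold=3):
    """Fraction of resources whose star total is at or below the threshold."""
    if not scores:
        raise ValueError("no scores given")
    return sum(1 for score in scores if score <= threshold) / len(scores)


# Invented example resources, one score per measure in MEASURES order.
resources = {
    "resource_a": dict(zip(MEASURES, [1, 1, 1, 1, 1])),
    "resource_b": dict(zip(MEASURES, [1, 0.5, 1, 0, 0])),
    "resource_c": dict(zip(MEASURES, [0, 0, 1, 0.5, 0.5])),
    "resource_d": dict(zip(MEASURES, [1, 1, 1, 0.5, 1])),
}

totals = {name: star_score(scores) for name, scores in resources.items()}
share = share_at_or_below(list(totals.values()))
print(totals)
print(f"{share:.0%} of resources scored three stars or fewer")
```

In this toy set, two of the four resources fall at or below three stars; the same calculation over the full set of evaluated resources is what yields a figure like the one reported above.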

Custom licenses constituted the largest single class of licenses found in these data resources. This suggests the resource providers either did not know about standard licenses or believed the standard licenses did not meet their needs. Moreover, while the majority of custom licenses were restrictive, just over two-thirds of the standard licenses were permissive, leading us to wonder whether some needs and intentions are not being met by the existing set of standard permissive licenses. In addition, about 15% of resources had either missing or inconsistent licensing. This ambiguity and lack of clear intent require clarification and possibly legal counsel.

A total of 61.8% of data resources use nonpermissive licenses.

Putting this all together, a majority of resources would not meet basic criteria for legal frictionless use for downstream data integration and redistribution, despite the fact that most of these resources are publicly funded, which should mean the content is freely available for reuse by the public.

If we in the United States have a hard time understanding how we may reuse data given these legal restrictions, we must consider the rest of the world—which presumably we aim to serve—and how hard it would be for anyone in another country to navigate this legalese. I hope the RDP’s findings will encourage the worldwide community to work together to improve licensing practices to facilitate reusable data resources for all.

Given what I have learned from the RDP and a wealth of experience in dealing with these issues, I recommend the following actions:

  • Funding agencies and publishers should ensure that all publicly funded databases and knowledge bases are evaluated against licensing criteria (whether the RDP’s or something similar).
  • Database providers should use these criteria to evaluate their resources from the perspective of a downstream data user and update their licensing terms, if appropriate.
  • Downstream data re-users should provide clear source attribution and should always confirm it is legal to redistribute the data. It is very often the case that it is legal to use the data but not to redistribute it. In addition, many uses are actually illegal.
  • Database providers should guide users on how to cite the resource as a whole, as individual records, or as portions of the content when mashed up in other contexts (which can include schemas, ontologies, and other non-data products). Where relevant, providers should follow best practices declared by a community, for example the Open Biological Ontologies citation policy, which supports using native object identifiers rather than creating new digital objects.
  • Data re-users should follow best practices in identifier provisioning and reference within the reused data so it is clear to downstream users what the license actually applies to.

I believe that, to be useful and sustainable, data repositories and curated knowledge bases need to clearly credit their sources and specify the terms of reuse and redistribution. Unfortunately, these resources are currently and independently making noncompatible choices about how to license their data. The reasons are multifold but often include the requirement for sustainable revenue that is counter to integrative and innovative data science.

Based on the productive discussions my collaborators and I have had with data resource providers, I propose the community work together to develop a “data trust.” In this model, database resource providers could join a collective bargaining organization (perhaps organized as a nonprofit), through which they could make their data available under compatible licensing terms. The aggregate data sources would be free and redistributable for research purposes, but they could also have commercial use terms to support research sustainability. Such a model could leverage value- or use-based revenue to incentivize resource evolution and innovation in support of emerging needs and new technologies, and would be governed by the constituent member organizations.

Melissa Haendel, PhD, leads numerous local, national, and global open science initiatives focused on semantic data integration and disease mechanism discovery and diagnosis, namely, the Monarch Initiative, the Global Alliance for Genomics and Health (GA4GH), the National Center for Data to Health (CD2H), and the NCATS Biomedical Data Translator.

The Wisdom in Asking Questions

The following has been adapted from a commencement address I gave last month.

Congratulations, graduates! You’ve spent years preparing for this day, years of answering questions—on exams, in clinical debriefings, and in response to your patients’ inquiries. Knowing those answers has been essential to getting you to where you are today.

But now, as you launch a career of service to patients and society, you must become as adept at asking questions as you are at answering them. To be successful, you will need to embrace intentional questioning.

Intentional questioning:

  • Involves asking purposeful, well-thought-out, understandable, and well-timed questions
  • Inspires the responder to take the next step into awareness, action, and insight
  • Is not intended to stimulate recall or appraise comprehension, but to engage with another to engender wonder, reasoning, and action

Think back to the first questions you asked: Why is the sky blue? When is dinner? Where is Mom? These questions were motivated by a curiosity about the world, coupled with a need to feel tethered or secure. I want you to return to that childhood questioning—be curious, know your tethers.

Questions convey wonder about the world and about the “other.” Asking questions of our patients helps them reveal themselves and their concerns. Asking questions of science advances the knowledge needed to diagnose and treat the human response to disease, disability, and developmental challenges. Asking questions reveals where new technologies might help resolve complex health problems, and where innovative technologies may have inadvertently disenfranchised some of our sisters and brothers.

So, embrace asking questions, but ask your questions judiciously. Make sure the questions are worthy.

What does it take to ask good questions?

  • Curiosity
  • Interest
  • A compelling need to know
  • Humility
  • An understanding of the knowledge, skills, motivation, and cultural characteristics of the other

Forty years ago, when I attended my own MSN graduation at Penn, there was no iPhone, no internet, and no PubMed. Now I direct the largest biomedical library in the world, and every day five million people use our resources to answer questions. So, right now you could say I’m in the question-answering business.

But I got here by asking questions: How can computers help nursing? In what ways can we help people better take care of themselves? If we broadened the definition of health to encompass the social and behavioral domains, could we improve health overall?

These questions propelled my research forward and shaped my career. But I didn’t even know enough to ask them early on. No one did. No matter how skilled they were, my faculty—like your faculty—could not have anticipated the knowledge nurses would need in ten years, twenty years, fifty years. You must discover that knowledge, often on your own. That is exactly why you must become adept at intentional questioning.

Intentional questioning addresses three realms.

  1. Knowing self
    • Am I ready?
    • What more do I need to know?
    • Who else should be with me?
    • What would my future self wish I had asked of me now?
  2. Knowing the world (which can guide our research)
    • Why?
    • What if … ?
    • Who can help me know this better?
    • What might be, or has been, the impact of innovation?
  3. Knowing others (such as patients)
    • What brings you here?
    • What can I do for you?
    • What questions do you have? (Listening to the types of questions people ask, and the way they ask them, can teach us a lot about how they frame the world and give meaning to the important issues in their lives.)

Questions are the starting point of dialogue and the starting point of engagement.

And once you ask a question, you must be ready to accept the answer. You don’t always have to like it—the answer to the first research question I posed turned out to be the exact opposite of what I wanted it to be, and then I had to do some fast thinking—but you must always deal with the answer.

Not asking questions

Finally, I must point out that sometimes not asking a question is more powerful than asking it.

Let me tell you a story about one of the most important questions never asked.

In Michael Frayn’s Tony Award-winning play “Copenhagen,” Niels Bohr, his wife Margrethe, and Werner Heisenberg reflect on a long-ago evening when Heisenberg visited Bohr to learn the secret of creating heavy water, which would have accelerated Germany’s development of the atomic bomb. Bohr, in a later conversation with his wife, confessed that he deliberately did not ask Heisenberg the one question that would have led Heisenberg along the line of reasoning that could have resulted in Germany successfully creating an atomic bomb.

Why am I telling you this story? To bring home the idea that sometimes the most important aspect of intentional questioning lies in not asking a question.

When during our practices do we intentionally not pose a question?

As nurses, we might hold off because the person is not ready to hear the answer. Questions confront people with uncertainties and consequences, possibly long before a person is ready to face them.

Cultural factors can also influence our decision. Is this a culture in which an individual has the self-efficacy to answer? Or is this a culture in which complex questions are answered by elders, a family network, or friends?

Sometimes we hold questions because the moment demands our attention and we cannot be distracted from the focus and energy needed to resolve the crisis. And sometimes we don’t ask because we recognize that current circumstances—the state of knowledge or measurement or analytics—aren’t at a place to deliver a proper answer.

My wisdom for you

Graduation speakers are supposed to impart wisdom. In my life the deepest wisdom has arisen from conversations that began with questions. So my wisdom for you: Ask questions early and often.

Questions are part of your future—whether judiciously asking a question or intentionally withholding one. Your education will provide a solid foundation on which to formulate those questions and the base of a scaffolding on which to hang your new understanding.

So I leave you with a bold direction: Stop knowing so much—and be ready to ask more questions! You are ready to be intentional questioners. Please embrace the role because someday, I may be your patient.

Photo credit (commencement, top): Angela Radulescu [Flickr (CC BY-NC-SA 2.0)] | cropped