Guest post by Dr. Graciela Gonzalez-Hernandez, associate professor of informatics at the Perelman School of Medicine, University of Pennsylvania.
Social media has grown in popularity for health-related research as it has become evident that it can be a good source of patient insights. Be it Twitter, Reddit, Instagram, Facebook, Amazon reviews or health forums, researchers have collected and processed user comments and published countless papers on different uses of social media data.
Using these data can be a perfectly acceptable research practice, provided they are used ethically and the research approach is solid. I will not discuss solid scientific principles and statistically sound methods for social media data use here, though. Instead, I will focus on the much-debated ethical principles that should guide observational studies done with social media data.
To help frame our discussion, let’s consider why the ethics of social media data use is called into question. Almost invariably when I present my work in this area or submit a proposal or paper, someone raises the question of ethics, often despite my efforts to address it upfront. I believe this hesitancy or discomfort comes from the idea that the data can be traced back to specific people and the fear that using the data could result in harm. Some research with social media data might seem innocuous enough. One might think no harm could possibly come from making available the collected data or specific tweets on topics like smoking cessation and the strategies people find effective or not. But consider data focusing on topics such as illegal substance use, addiction recovery, mental health, prescription medication abuse, or pregnancy. Black and white can quickly turn to gray.
Before going further, it is important to understand the fundamental rules for this type of research in an academic setting. In general, researchers who want to use social media data apply to their institutional review board (IRB) for review. Research activities involving human subjects and limited to one or more of the exempt categories defined by federal regulations receive an “exempt determination” rather than “IRB approval.” In the case of social media data, the exemption for existing data, documents, records, and specimens detailed in 45 CFR 46.101(b)(4) generally applies, as long as you don’t contact individual users as part of the research protocol and the data to be studied are openly and publicly available. If you will be contacting individual users, the study becomes more like a clinical trial, needing “informed consent” and full IRB review. (See the National Institutes of Health’s published guidelines for this case.)
Furthermore, exempt studies are so named because they are exempt from some of the federal regulations that apply to human-subjects research. They are not exempt from state laws, institutional policies, or the requirements for ethical research. Most of all, they are not exempt from plain old common sense.
But when it comes to the existing-data exemption, which data are “openly and publicly available” is open to question. To be safe, use only data available to all users of the platform without any extra permissions or approvals. No data from closed forums or groups that would require one to “join” within the platform should be considered “openly and publicly available.” After all, members of such groups generally expect their discussions to be “private,” even if the group is large.
Beyond that, when deciding how to use the data or whether to publish the data directly, ask yourself whether revealing the information in a context other than where it was originally posted could result in harm to the people who posted it, either now or later. For example, you could include specific social media posts as examples in a scientific paper, but, if the topic was delicate, you might choose not to publish a post verbatim, instead changing the wording so a search of the host platform would not lead someone to the user. In the case of platforms like Reddit that are built around anonymity, this language modification would not be necessary. If possible, use aggregate data (e.g., counts or topics discussed) rather than individual social media posts.
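As a rough illustration of the aggregate approach, here is a minimal Python sketch that reduces a set of collected posts to per-topic counts, so that neither the post text nor the user identifiers appear in what gets reported. The record fields (`user`, `topic`, `text`), the function name, and the sample records are all made up for this example; real data would look different and topics would usually have to be assigned by annotation or a classifier first.

```python
from collections import defaultdict


def aggregate_by_topic(posts):
    """Return per-topic post counts and distinct-user counts.

    The returned summary contains no post text and no user identifiers.
    """
    post_counts = defaultdict(int)
    users_by_topic = defaultdict(set)
    for post in posts:
        topic = post["topic"]
        post_counts[topic] += 1
        users_by_topic[topic].add(post["user"])
    return {
        topic: {"posts": post_counts[topic], "users": len(users_by_topic[topic])}
        for topic in sorted(post_counts)
    }


# Hypothetical records, invented purely for illustration.
sample_posts = [
    {"user": "u1", "topic": "smoking cessation", "text": "..."},
    {"user": "u1", "topic": "smoking cessation", "text": "..."},
    {"user": "u2", "topic": "smoking cessation", "text": "..."},
    {"user": "u3", "topic": "sleep", "text": "..."},
]

summary = aggregate_by_topic(sample_posts)
for topic, counts in summary.items():
    print(topic, counts["posts"], counts["users"])
```

Counting distinct users alongside posts is a design choice in this sketch: it shows whether a topic reflects many people or a few prolific ones. Very small counts on a delicate topic can still point to individuals, so the same "could this cause harm?" question applies to the aggregate table.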
However you approach your research, datasets used for automatic language processing experiments need to be shared for the results to be reproducible. Which format this takes depends on the data source, but reproducibility does not take a back seat just because these are social media data. To help you further consider the question of how to use or share these data, check out the guidelines published by the Association of Internet Researchers. These guidelines include a comprehensive set of practical questions to help you decide on an ethical approach, and I highly recommend them. In their study of the ethics of social media use, Moreno et al. also address some practical considerations and offer a good summary of the issues.
We are now ready to consider what constitutes ethical research. Ethics, or principles of right conduct, apply to institutions that conduct research, whether in academia or industry. Although ethics is sometimes used interchangeably with morals, what constitutes ethical behavior is less subjective and less personal, defining correct behavior within a relatively narrow area of activity. While there will likely never be a generally agreed-upon code of ethics for every area of scientific activity, a number of groups have established principles relevant to social media-based research, including the American Public Health Association, the American Medical Informatics Association, and the previously mentioned Association of Internet Researchers. Principles of research ethics and ethical treatment of persons center on the policy of “do no harm,” but it falls to IRBs to determine if harm could result from your approach and whether your proposed research is ethical. Even so, review boards might have differing opinions, as recent work looking into attitudes toward the use of social media data for health research has shown.
So where does that leave those of us looking to conduct health research using social media data?
Take a “stop and think” and “when in doubt, ask” approach before finalizing a study and investing time. Help ensure the researcher’s interests are balanced against those of the people involved (i.e., the users who posted the data) by putting yourself in their shoes. Be cognizant of the needs and concerns of vulnerable communities who might require greater protection, but don’t assume that research involving social media data should not be done or that the data cannot be shared. If the research was ethically conducted, then social media data can and should be shared as part of the scientific process to ensure reproducibility, and there is a lot to be gained from pursuing this kind of research.
Graciela Gonzalez-Hernandez, MS, PhD, is a recognized expert and leader in natural language processing applied to bioinformatics, medical/clinical informatics, and public health informatics. She is an associate professor with tenure at the Perelman School of Medicine, University of Pennsylvania, where she leads the Health Language Processing Lab within the Institute for Biomedical Informatics and the Department of Biostatistics, Epidemiology, and Informatics.