State of the Art and Science

Jan 2025
Peer-Reviewed

How Should Meaningful Evidence Be Generated From Datasets?

Caroline E. Morton, MRCGP, and Christopher T. Rentsch, PhD
AMA J Ethics. 2025;27(1):E27-33. doi: 10.1001/amajethics.2025.27.

Abstract

Datasets are often considered “ideal” when they are large and contain longitudinal and representative data. But even research that uses ideal datasets might not generate high-quality evidence. This article emphasizes the roles that transparency plays in enhancing observational epidemiological findings’ credibility and relevance and argues that epidemiological research can produce high-quality evidence even when datasets are not ideal. This article also summarizes strategies for bolstering transparency in key phases of research planning and application.

Dataset Size and Scope

In epidemiology research, the quality and believability of findings often hinge on the size and scope of datasets. Datasets are considered to be “ideal” if they are large and the data they contain are longitudinal and largely representative of the underlying population. However, the assumption that large datasets inherently produce high-quality epidemiological research is misleading and should be challenged, as there is more to a high-quality observational study than just the size of the data input. This article summarizes the roles of transparency and systematic reporting of methodologies, data sources, and analytic code in enhancing the credibility of observational studies. This article also posits that high-quality research is possible with datasets that are not ideal, provided that there is a high level of transparency throughout the research process. By examining the different stages of a research study—from conception to execution—this article outlines practical strategies that can be used to increase transparency.

Well-Formulated Research Questions

A critical aspect of epidemiological research is formulating a research question. A well-crafted question guides the entire research process. It helps in defining the scope and design of the study, in identifying an appropriate data source to answer the question, and in selecting appropriate statistical methods to answer the question.

There are excellent resources for guidance on developing health research questions.1,2,3,4 Briefly, the question should be relevant, answerable (through the collection and analysis of data), and specific. It should address a gap in knowledge or a pressing public health issue. The question should also have practical implications for health care, policy, or further research.

A commonly used framework to specify a clinical research question is referred to as PICOT, which contains 5 elements: population to be studied, intervention or exposure used in the study, comparator or reference group for treatment group comparisons, outcome or result to be measured, and timeframe or duration of data collection. As an example of use of this framework, one study of mRNA COVID-19 boosters aimed “to compare the effectiveness [outcome] of a third dose of either the BNT162b2 [exposure] or the mRNA-1273 [comparator] vaccine among US veterans who had completed an mRNA vaccine primary series [population] and received a third dose between 20 October 2021 and 8 February 2022” or between “1 January and 1 March 2022 [timeframe].”5
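
Specifying each PICOT element explicitly before any data are touched also makes the question easy to audit later. As a minimal sketch, assuming a Python-based workflow, the question above could be recorded as a simple structured object; the field names and paraphrased values below are illustrative and are not part of the cited study’s protocol.

    # Minimal sketch: a PICOT question recorded as a structured object.
    # Field values paraphrase the cited booster study (reference 5) for
    # illustration; the structure itself is an assumption, not the study's method.
    picot = {
        "population": "US veterans who completed an mRNA COVID-19 primary series",
        "intervention": "third dose of BNT162b2",
        "comparator": "third dose of mRNA-1273",
        "outcome": "comparative vaccine effectiveness",
        "timeframe": "third doses received 20 October 2021 to 8 February 2022",
    }

    # A simple completeness check before selecting a data source or methods.
    missing = [element for element, value in picot.items() if not value]
    assert not missing, f"Unspecified PICOT elements: {missing}"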

Data Provenance

Key to transparency is understanding the context in which and the intent for which data were collected. Although there are many different types of bias,6,7,8,9,10 it can be helpful to think about them in terms of selection bias and information bias. Selection bias arises when the study sample systematically differs from the underlying population (for example, through self-selection into a study). Information bias arises when key variables are not measured accurately (for example, self-reported disease status). Each of these broad categories encompasses a number of more specific biases that should be considered further, depending on study design and data source.11

Another way to consider potential biases is to break down the stages of data collection into steps. There are 4 key steps at which bias can occur, involving choices about (1) the location, (2) the participants, (3) the research team, and (4) the software used, each of which is described below. Knowledge of the location in which the participant’s data were collected is critical for identifying any potential differential bias introduced by the setting (eg, the need to pay or have health insurance). For example, was a patient being seen in routine health care, in an emergency setting, or at a private health clinic? Next, researchers should consider the participants and whether their health behavior introduces any biases. For example, participants who are captured in a dataset, either through routine care or in a specific research study, can often request that their information be removed at a later stage (ie, “opt-out”), potentially introducing bias after initial data collection. The research team is also an important but underestimated source of potential bias. Recognizing this source of bias requires consideration of team members’ background, training, and unconscious biases, which may affect the types of data that are recorded. Finally, the software used to create the dataset may also introduce biases by prompting researchers or clinicians to enter data in a specific way or to ask particular questions.

Even randomized controlled studies are vulnerable to bias resulting from misallocation of participants, inadequate blinding, or loss of participants to follow-up.

It is not always possible to completely remove biases arising from data provenance, but it is important to acknowledge and account for them when possible in the interests of transparency and accuracy, as these biases might affect both the results and the generalizability of findings to a different population. The questions researchers ask may differ between countries—particularly between those with and without nationalized health care systems—so it is important to understand the context in which and the purpose for which data were collected.

Preregistration

A key approach to improving transparency and trust in research is the preregistration of study protocols.12,13,14,15 By preregistering study protocols, researchers establish a clear blueprint of their research objectives, methods, and analytical plans before data analysis begins. This proactive approach mitigates risk of publication bias,16 which refers to the suppression of whole studies—for example, those without statistically significant results.17 Preregistration also reduces the potential for researchers to disseminate misleading or erroneous results by holding them accountable to their stated methodologies and hypotheses. Transparent documentation of study protocols enables stakeholders, including peer reviewers and readers, to evaluate the integrity and robustness of the study, thereby bolstering confidence in the validity of its findings.

One area of epidemiology in which preregistration is increasingly common is real-world evidence (ie, data derived from sources such as electronic health records, registries, medical claims, and patient self-monitoring) of drug safety and effectiveness.18 As the US Food and Drug Administration expands the use of real-world evidence in its drug approval decision-making processes,19 the use of preregistered protocols is critical to improving transparency. Although sponsors and researchers are required to preregister certain clinical trials and report results to ClinicalTrials.gov,20 a similar system for real-world studies did not exist until recently. The Open Science Framework developed by the Center for Open Science aims to fill that gap by offering a free, open-source platform to preregister protocols in addition to other study materials.21

A common misconception about preregistered study plans is that they are fixed. On the contrary, protocols are flexible and can be edited as the study progresses. Transparency is maintained by documenting all deviations from the original protocol across the lifetime of the study. Beyond improving transparency, preregistration cultivates a culture of collaboration and open science,22,23 encouraging reproducibility within the research community.

Code Sharing

Once data have been made available to researchers, it is usually necessary to “clean” the dataset to get it into a format conducive to analysis. At a minimum, this means applying the prespecified criteria to obtain a narrower dataset. Prespecified criteria might restrict the dataset to a particular population (eg, males 65 years or older) or to patients with a particular duration of follow-up (eg, at least 1 year of follow-up after baseline). Data cleaning also includes the creation of variables of interest, including exposures and outcomes. Covariates and other variables of interest may be important for adjustment of models, stratification, or identifying subpopulations. Dataset preparation is typically carried out using scripted code written in languages such as Python, SAS, Stata, or R.
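
As a minimal sketch of this cleaning step, assuming a Python workflow with the pandas library, the example below applies hypothetical prespecified criteria (males 65 years or older with at least 1 year of follow-up) and derives an exposure variable. The file names, column names, and rules are invented for illustration and do not come from any particular dataset or protocol.

    import pandas as pd

    # Illustrative sketch only: the file, column names, and criteria are hypothetical.
    raw = pd.read_csv(
        "cohort_extract.csv",
        parse_dates=["baseline_date", "last_followup_date"],
    )

    # Apply prespecified inclusion criteria: males aged 65 years or older
    # with at least 1 year of follow-up after baseline.
    followup_days = (raw["last_followup_date"] - raw["baseline_date"]).dt.days
    cohort = raw[
        (raw["sex"] == "M")
        & (raw["age_at_baseline"] >= 65)
        & (followup_days >= 365)
    ].copy()

    # Derive an exposure variable of interest from a raw field.
    cohort["exposed"] = cohort["statin_prescriptions"] > 0

    # Save the analysis-ready dataset for the scripted analysis steps.
    cohort.to_csv("analysis_ready_cohort.csv", index=False)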

Since defining these variables and running analyses are fundamental to understanding how the research protocol was applied to the raw data, researchers should strongly consider publicly sharing code. GitHub is a free service widely used for code sharing.24 A key advantage of GitHub is that every update to code is time stamped, allowing proper version control. Licenses can be applied to the code to allow reuse and adaptation where permitted. Whenever possible, efforts should be made to add good documentation to any code—including, but not limited to, in-line code comments, a README file, and software versions.
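
What counts as good documentation will vary by project, but at a minimum a shared repository can state what each script does, which software versions it expects, and how the files are organized. The header below is one hypothetical way to do this for a Python script; the file names, versions, license, and layout are assumptions for illustration, not a description of any specific repository.

    # clean_cohort.py
    # Purpose:  apply the prespecified inclusion criteria and derive study
    #           variables (definitions are given in the preregistered protocol).
    # Requires: Python 3.11, pandas 2.2 (illustrative versions; record your own).
    # License:  MIT (an example of a permissive license; confirm that reuse is
    #           allowed under the relevant data-sharing agreements).
    #
    # Suggested repository layout (illustrative):
    #   README.md   - study summary, protocol link, and instructions to run the code
    #   codelists/  - code lists used to define exposures, outcomes, and covariates
    #   analysis/   - scripted data management and analysis steps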

One prominent example of good preregistration practices and code sharing is OpenSAFELY.25 OpenSAFELY is a highly secure, transparent, trusted research environment for analysis of electronic health records data arising from primary and secondary care. All platform activity is publicly logged so that anyone at any time can review what code is being run against the data through the OpenSAFELY jobs-server.26 Before any code is submitted, researchers must preregister a study protocol, which is posted publicly. All software for data management and analysis is shared on GitHub, automatically and openly, for scientific review and efficient reuse.27

Interpretation

Studies are carried out to generate answers to important research questions. It is therefore crucial to appropriately interpret results of a given analysis in a way that generates meaningful, clear, and believable evidence. Accordingly, researchers should clearly explain how conclusions were drawn from the data and how analyses were carried out. The findings should also be presented within the wider context of previously published literature. Do the results fit in with what is already known about the topic? If not, more questions should be asked to interrogate what potential biases might be at play.

When interpreting results, special consideration should be given to issues that are known to cause confusion, such as the differences between absolute and relative risk28,29,30 and whether the results can be generalized to a wider population.31,32 A lay summary can clarify potentially confusing issues while helping to explain some of the findings without the statistical jargon. Additionally, a clear, comprehensive figure that conveys a study’s key findings is often what many readers look to first,33 so authors should be mindful of this preference when developing and selecting the results to put in figures. Finally, infographics and expert opinion pieces can aid understanding if placed alongside a particularly controversial or difficult-to-understand analysis.
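
To illustrate the first of these issues: if an outcome occurs in 2% of a comparator group and 1% of a treated group, the relative risk reduction is 50%, while the absolute risk reduction is only 1 percentage point (a number needed to treat of 100). The short Python sketch below works through these quantities; all of the numbers are invented purely for illustration.

    # Hypothetical illustration of absolute vs relative risk; numbers are invented.
    risk_comparator = 0.02  # 2% of the comparator group experience the outcome
    risk_treated = 0.01     # 1% of the treated group experience the outcome

    absolute_risk_reduction = risk_comparator - risk_treated  # 0.01 (1 percentage point)
    relative_risk = risk_treated / risk_comparator            # 0.50
    relative_risk_reduction = 1 - relative_risk               # 0.50 (a 50% reduction)
    number_needed_to_treat = 1 / absolute_risk_reduction      # 100

    print(absolute_risk_reduction, relative_risk_reduction, number_needed_to_treat)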

Need for Institutional Resources

Transparency and open science can support reproducibility and the responsible conduct of research, but they are not a guarantee of scientific rigor or equitable science.34 An epidemiological study can be transparently reported but still come to the wrong conclusions, as was the case for a seminal study on the protective effect of high-density lipoprotein cholesterol on the risk of coronary heart disease.35 Moreover, data sharing and other open science practices can pose a resource burden on scientists without institutional support; these barriers can be particularly marked for scientists in lower-resource settings.36,37,38 Even when resources are available, concerns have been raised that the movement towards open data risks “perpetuating a neocolonial dynamic,” wherein it is necessary for researchers to pay for access or purchase costly software or training in order to use data effectively.39 Finally, data sharing, in particular, requires careful consideration to ensure patient privacy and respect the original consent processes.40,41

Conclusion

High-quality epidemiological research is not always achieved even with an ideal dataset. Transparency is often underutilized as a way to increase the believability—and therefore the meaningfulness—of epidemiological findings from observational research. Transparency can be achieved through specifying a well-formulated research question, acknowledging limitations arising from data provenance, preregistering analysis plans, code sharing, and making measured interpretations. Preregistration and code sharing are pivotal practices for fostering not only transparency and credibility but also accountability for and reproducibility of research designs and analyses. These practices also combat biases and promote a culture of collaboration and open science. Ultimately, increasing the adoption of these modern practices in epidemiology could serve as a cornerstone for building trust among researchers, patients, and the broader public.

References

  1. Kloda L, Bartlett JC. Formulating answerable questions: question negotiation in evidence-based practice. J Can Health Libr Assoc. 2014;34(2):55-60.
  2. Lipowski EE. Developing great research questions. Am J Health Syst Pharm. 2008;65(17):1667-1670.
  3. Thabane L, Thomas T, Ye C, Paul J. Posing the research question: not so simple. Can J Anaesth. 2009;56(1):71-79.
  4. Tully MP. Research: articulating questions, generating hypotheses, and choosing study designs. Can J Hosp Pharm. 2014;67(1):31-34.
  5. Dickerman BA, Gerlovin H, Madenci AL, et al. Comparative effectiveness of third doses of mRNA-based COVID-19 vaccines in US veterans. Nat Microbiol. 2023;8(1):55-63.
  6. May T. How America’s health data infrastructure is being used to fight COVID-19. Datavant blog. May 26, 2020. Accessed July 11, 2024. https://www.datavant.com/blog/how-americas-health-data-infrastructure-is-being-used-to-fight-covid-19
  7. Miksad RA, Abernethy AP. Harnessing the power of real-world evidence (RWE): a checklist to ensure regulatory-grade data quality. Clin Pharmacol Ther. 2018;103(2):202-205.
  8. Boyd AD, Gonzalez-Guarda R, Lawrence K, et al. Potential bias and lack of generalizability in electronic health record data: reflections on health equity from the National Institutes of Health Pragmatic Trials Collaboratory. J Am Med Inform Assoc. 2023;30(9):1561-1566.
  9. Sauer CM, Chen LC, Hyland SL, Girbes A, Elbers P, Celi LA. Leveraging electronic health records for data science: common pitfalls and how to avoid them. Lancet Digit Health. 2022;4(12):e893-e898.
  10. Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. BMJ. 2018;361:k1479.
  11. Lash TL, VanderWeele TJ, Haneuse S, Rothman KJ, eds. Modern Epidemiology. 4th ed. Wolters Kluwer; 2021.
  12. Berger ML, Sox H, Willke RJ, et al. Good practices for real-world data studies of treatment and/or comparative effectiveness: recommendations from the joint ISPOR-ISPE special task force on real-world evidence in health care decision making. Pharmacoepidemiol Drug Saf. 2017;26(9):1033-1039.
  13. Wang SV, Schneeweiss S, Berger ML, et al; joint ISPE-ISPOR Special Task Force on Real World Evidence in Health Care Decision Making. Reporting to improve reproducibility and facilitate validity assessment for healthcare database studies V1.0. Pharmacoepidemiol Drug Saf. 2017;26(9):1018-1032.
  14. Simmons JP, Nelson LD, Simonsohn U. Pre-registration: why and how. J Consum Psychol. 2021;31(1):151-162.
  15. Logg JM, Dorison CA. Pre-registration: weighing costs and benefits for researchers. Organ Behav Hum Decis Process. 2021;167:18-27.
  16. Kicinski M, Springate DA, Kontopantelis E. Publication bias in meta-analyses from the Cochrane Database of Systematic Reviews. Stat Med. 2015;34(20):2781-2793.
  17. Dickersin K. The existence of publication bias and risk factors for its occurrence. JAMA. 1990;263(10):1385-1389.
  18. Wang SV, Pinheiro S, Hua W, et al. STaRT-RWE: structured template for planning and reporting on the implementation of real world evidence studies. BMJ. 2021;372:m4856.
  19. US Food and Drug Administration. Framework for FDA’s Real-World Evidence Program. US Food and Drug Administration; 2018. Accessed February 27, 2024. https://www.fda.gov/media/120060/download?attachment
  20. Clinical trial reporting requirements. ClinicalTrials.gov. Updated September 17, 2024. Accessed September 30, 2024. https://clinicaltrials.gov/policy/reporting-requirements
  21. OSF home. Accessed August 9, 2024. https://osf.io/
  22. Kuhn TS. Historical structure of scientific discovery. Science. 1962;136(3518):760-764.
  23. Mathur MB, Fox MP. Toward open and reproducible epidemiology. Am J Epidemiol. 2023;192(4):658-664.
  24. GitHub. Accessed August 9, 2024. https://github.com/
  25. OpenSAFELY. Accessed August 9, 2024. https://www.opensafely.org/
  26. OpenSAFELY jobs. OpenSAFELY. Accessed February 27, 2024. https://jobs.opensafely.org/
  27. OpenSAFELY. GitHub. Accessed February 27, 2024. https://github.com/OpenSAFELY
  28. Tenny S, Hoffman M. Relative risk. In: StatPearls. StatPearls Publishing; 2024. Accessed November 21, 2024. https://www.ncbi.nlm.nih.gov/books/NBK430824/
  29. Dupont WD, Plummer WD Jr. Understanding the relationship between relative and absolute risk. Cancer. 1996;77(11):2193-2199.
  30. Ranganathan P, Pramesh CS, Aggarwal R. Common pitfalls in statistical analysis: absolute risk reduction, relative risk reduction, and number needed to treat. Perspect Clin Res. 2016;7(1):51-53.
  31. Rothman KJ, Greenland S. Validity and generalizability in epidemiologic studies. In: Balakrishnan N, Colton T, Everitt B, et al, eds. Wiley StatsRef: Statistics Reference Online. Wiley; 2014.
  32. St Sauver JL, Grossardt BR, Leibson CL, Yawn BP, Melton LJ 3rd, Rocca WA. Generalizability of epidemiological findings and public health decisions: an illustration from the Rochester Epidemiology Project. Mayo Clin Proc. 2012;87(2):151-160.
  33. Divecha CA, Tullu MS, Karande S. Utilizing tables, figures, charts and graphs to enhance the readability of a research paper. J Postgrad Med. 2023;69(3):125-131.
  34. Knottnerus JA, Tugwell P. Promoting transparency of research and data needs much more attention. J Clin Epidemiol. 2016;70:1-3.
  35. Davey Smith G, Phillips AN. Correlation without a cause: an epidemiological odyssey. Int J Epidemiol. 2020;49(1):4-14.
  36. Gomes DGE, Pottier P, Crystal-Ornelas R, et al. Why don’t we share data and code? Perceived barriers and benefits to public archiving practices. Proc R Soc B Biol Sci. 2022;289(1987):20221113.
  37. Bezuidenhout L, Chakauya E. Hidden concerns of sharing research data by low/middle-income country scientists. Glob Bioeth. 2018;29(1):39-54.
  38. Quiroga Gutierrez AC, Lindegger DJ, Taji Heravi A, et al. Reproducibility and scientific integrity of big data research in urban public health and digital epidemiology: a call to action. Int J Environ Res Public Health. 2023;20(2):1473.
  39. Evertsz N, Bull S, Pratt B. What constitutes equitable data sharing in global health research? A scoping review of the literature on low-income and middle-income country stakeholders’ perspectives. BMJ Glob Health. 2023;8(3):e010157.
  40. Meyer MN. Practical tips for ethical data sharing. Adv Methods Pract Psychol Sci. 2018;1(1):131-144.
  41. Mostert M, Bredenoord AL, Biesaart MCIH, van Delden JJM. Big data in medical research and EU data protection law: challenges to the consent or anonymise approach. Eur J Hum Genet. 2016;24(7):956-960.

Citation

AMA J Ethics. 2025;27(1):E27-33.

DOI

10.1001/amajethics.2025.27.

Conflict of Interest Disclosure

Authors disclosed no conflicts of interest.

The viewpoints expressed in this article are those of the author(s) and do not necessarily reflect the views and policies of the AMA.