Abstract
Electronic health records (EHRs) have revolutionized the scale, speed, and granularity at which health data can be collated and summarized for epidemiologic purposes. However, population-level analyses of patient-level data are only as reliable as the accuracy and completeness of patient reporting, clinician data entry, and system programming. This commentary on a case argues that responsibility for the validity of EHR data should be shared among key stakeholders, including patients. This commentary also proposes models for EHR data inquiry, data entry, and review processes that incorporate roles of community partners, frontline clinicians, and health science experts.
Case
T is a 43-year-old woman from a rural part of the state who has well-controlled asthma and visits Dr A for an annual checkup. Dr A reviews T’s electronic health record (EHR), noting no documentation of COVID-19 vaccination. Dr A recalls a national study finding that rurality was associated with decreased odds of vaccination and asks T about her reasons for not getting any shots. “I did get 2 shots last year, but I didn’t get them here. That’s probably why you’re not seeing them. I plan to get a booster as soon as it’s available.”
Dr A wonders how many other patients’ EHRs contain incomplete or inaccurate information and why. “Not only does this affect how I plan my time with my patients, but poor-quality data hinders epidemiological surveillance and tracking of population-level vaccine uptake. Wasn’t the harrowing transition we’ve all just made to EHRs supposed to eliminate problems like this?”
Commentary
EHRs have become ubiquitous in clinical care in economically developed settings, including much of the United States. We review the evolution of these tools from their use in clinical care and billing to population-level health studies and pragmatic clinical trials. We then identify sources of biases and inaccuracies in EHR data and consider the ethics and consequences of using EHR-based data in research. Finally, we discuss the responsibilities of maintaining EHR data accuracy and propose ways to promote engagement among key stakeholders (eg, health care systems and payers, EHR developers, patients, clinicians, and researchers) in building an accurate, representative EHR. We illustrate these issues in a study of vaccination outcomes for patients enrolled in rural and urban health systems.
Brief History of the EHR
Clinical information systems were the predecessors of the modern-day EHR and were first utilized in single clinical sites as early as the 1960s.1 Efforts to transform health record keeping with EHR technology were promulgated with the development of the Department of Veterans Affairs’ VistA and Computerized Patient Record System in the 1970s and 1980s.1 Since then, health care organizations have been incentivized, and eventually mandated, to transition from paper charts to EHRs, beginning with the passage of the Health Information Technology for Economic and Clinical Health Act (HITECH) as part of the 2009 American Recovery and Reinvestment Act. Through incentive payments, HITECH sought to maximize EHRs’ potential to improve patient safety (including by minimizing illegible handwriting and standardizing health data collection, entry, and reporting) and build EHRs into the scaffolding of health care delivery.2 With the passage of the 21st Century Cures Act in 2016, the HITECH regulations were expanded to require the use of EHRs.3
In addition to improving patient safety, the EHR has been a boon for researchers. The proliferation of the EHR has enhanced the analyzability of clinical encounters through typed clinical note documentation accompanied by structured billing codes. Patient demographics are routinely collected and grouped by researchers to estimate the prevalence of health behaviors, determine at-risk patient populations, identify health disparities, and screen potential participants for clinical trial enrollment.4 Epidemiologic studies based on EHR data are critical for measuring population-level outcomes, including hospitalization or death. EHRs also allow for standardized data collection with the use and dissemination of templates for clinical notes, transforming unstructured text into structured data elements that can be easily extracted from the EHR.5
Sources of Error in EHR Data
Although the EHR has many benefits, limitations of data collection can affect data quality and bias research findings. Common domains of EHR data quality for research purposes include accuracy, completeness, consistency, credibility, and timeliness (see Table 1).6

Like any system data, EHR data are only as strong as their inputs.4 As the case illustrates, barriers to accurate vaccination documentation begin with the patient-clinician interaction. Inaccurate data inputs may result from poor patient-clinician communication and a lack of patient understanding or opportunity to ask for clarification about the questions being asked.1 When reviewing their EHR, patients not uncommonly perceive mistakes.7 Among 22 889 US participants in the OpenNotes study who read their notes and completed error questions, 4830 (21%) identified an EHR mistake, 2043 (42%) of whom reported that the mistake was serious.8 In addition to mistakes, time constraints could lead to inaccurate EHR data. Additional sources of inaccurate data, which are intrinsic to clinical care and not unique to the EHR, include patient preferences regarding disclosure of sensitive information, receipt of care out of network or at care centers utilizing different EHR systems, asynchronous data entry, and clinician omission (see Table 2).9,10,11,12,13

Data entry tools like drop-down menus, copy-paste features, and automatic laboratory value entries can enhance efficiency but can also contribute to system-level errors and omissions, perpetuating biases and inequities.14,15,16 In the case example, and as reported in a recent study,17 rurality was associated with decreased COVID-19 vaccination. However, data entry was incomplete, confounded by human factors that might be exacerbated in rural settings. Inaccurate or incomplete data entry can contribute to sweeping but biased generalizations about treatment disparities, which very well could exist, but are incompletely ascertained due to missing data.4,5,10,18,19
Effects of Regulations, Missingness, and Representativeness on Research
Individual patient privacy and agency could be at odds with the need for high-quality population-level health data. Governing bodies overseeing research activities provide one layer of protection for patients by minimizing privacy risks and ensuring data security and investigator integrity and compliance with established rules of behavior in research. Deidentification of EHR data (in compliance with the Health Insurance Portability and Accountability Act of 1996 Privacy Rule) is standard practice when collating system-wide EHR data for research purposes.20 However, as EHR data are increasingly stripped of identifiers for subsequent public sharing and analyses, designating research based on such data as “not human subjects research” could jeopardize this oversight, with incompletely understood consequences for population-level studies and inferences. In addition, patient opt-out features can perpetuate biases in the final data based on who is or is not choosing this option.11,12,13
Moreover, when epidemiologists analyze EHR data for population health impacts, they often encounter missingness, wherein data for variables of interest are not available for every included observation for various reasons. Relying solely on quantitative data inputted by clinical teams thus could limit conclusions, telling incomplete stories. Yet focusing solely on solutions like outreach does not improve rural vaccination rates if measurement remains incomplete. Qualitative and mixed methods studies could provide important context for EHR data capture and assist researchers in confirming and contrasting findings derived from the EHR. For example, patient interviews could identify barriers to and mechanisms of vaccine uptake and how and where patients are getting vaccinated. Studies assessing clinician perspectives of EHR data entry options and workflows could also uncover reasons for missing or erroneous patient vaccination history data.
Finally, a fundamental vulnerability of collated, population-level data is whether the included sample is indeed representative of the intended source population. Systemic biases, community engagement, and intersectionality across minoritized groups, sex or gender identity groups, and race or ethnicity groups could influence the visibility of specific populations in EHR data and the resultant output. Population health, health equity, and policy experts have begun to identify sources of and strategies for dealing with bias in the EHR.21 Patient-reported outcomes and health and general literacy are key areas that can be targeted to reduce bias in EHR-based studies.22
Stakeholders’ Responsibilities
Just as the causes of EHR inaccuracies are multifaceted, so are responsibilities for ensuring EHR data fidelity, which are shared among key stakeholders: health care systems, vendors, clinicians, patients, and researchers using the data.23 Vendors and health care systems remain accountable to the general public for EHR functionality, usability, and accuracy. Clinicians must remain engaged with health care systems to ensure their data entry maximizes efficient use of EHR data for clinical documentation, review, and research purposes.10 Reframing the roles of patients as partners in data generation rather than as study subjects or participants could motivate patients to contribute to solutions to high-quality health data collection.4 Specifically, inviting patients to identify strategies for EHR data entry that most align with their preferences could enhance the completeness, credibility, and timeliness of their data. For example, because patient consent is not routinely obtained before using EHR data, patients may be unaware of how third parties could use their data, and reidentification of patients might be easier in certain types of studies, including genetics or rare disease studies or those using diagnostic imaging or clinical text notes.24 “Opt-out” features could return some agency to patients over their EHR data use in research but are not routinely offered across health systems. However, this patient-centered approach could, itself, contribute to biased data, as patients who participate in such efforts might not reflect the overall patient population of interest.
Finally, data scientists and researchers are responsible for ensuring that high-quality EHR data are appropriately analyzed in population-level analyses. They should also remain vigilant in recognizing biased results of the analyses performed25 by explicitly addressing the 5 domains of data quality—accuracy, completeness, consistency, credibility, and timeliness—throughout the research process. Research teams also must defend against data breaches and must routinely address limitations and biases of measured data in study manuscripts and other output.6
Accountable EHR Data Use
How can we improve and innovate EHRs to enhance the accuracy of vaccine documentation and other data? An integrated US health system is an attractive answer and could serve as the platform for clinical data integration, but it is unlikely to gain favor in the current politically polarized environment. The Affordable Care Act (ACA) of 2010 was the first reform to the US health care system in a generation and offered a mechanism to build towards integration by leveraging Medicaid infrastructure for beneficiary data collection,26 but state-level single-payer policies have failed to gain political traction.27 The ACA remains a lightning rod for fiscally conservative policy makers, and, although public support appears to have grown over time, an overhaul to the US health system is unlikely to be successful in the coming decade.
Given this landscape, individuals and organizations should undertake to improve EHR data quality.
- Health information vendors should continually reassess EHR data collection tools and interfaces with patients, clinicians, and data scientists to optimize their product’s usability.
- Clinicians must advocate for meaningful approaches to EHR use to support clinical care, rather than simply plodding through required data fields for billing purposes.
- Organizations should consider expanding, standardizing, and integrating the data collected. Although patient-entered EHR data can be an attractive option to increase accuracy of patient data while also empowering patients to control their health care narrative firsthand, it could still result in data skewed towards patients who are computer literate and have access to broadband internet services. Vaccination and medication use data collection could be standardized by automating linkages between pharmacy manufacturing lots or similar measures and EHRs, thereby improving care tracking. Government- and private, nonprofit-supported applications, such as the Immunization Information System and health information exchanges, also hold promise for integrating and harmonizing health information—from vaccinations to medications, subspecialist evaluations, and clinical testing results—across participating regions.6,7,8,9
Limitations
Despite attention to data quality, there exist no standard methods for assessing EHR data quality.28 We acknowledge that the aforementioned proposed solutions for improving the accuracy of EHR data are practical only for measurable processes and outcomes of care. Other important aspects of care, such as patient-clinician demographic concordance or general communication styles, are not currently captured in structured data available in large health care system EHRs. Alternate care delivery models (minute clinics, telehealth, concierge medicine) might also have uncertain impacts on EHR completeness going forward. As conventions shift in health care delivery and its documentation, key stakeholders are necessary to determine how best to tackle pressing health priorities of the population.
Conclusion
EHR use has revolutionized health information collection and analysis. This growth has led to opportunities to generate important reports about the health of hundreds of millions of people practically in real time. Steadfast commitment to high-quality data collection and reporting is necessary for all parties along the pathway of data generation: from EHR developers, programmers, and vendors to patients, clinicians, and epidemiologists. Pulling back the curtain on how each of these groups generates and interacts with EHR data is imperative to ensure accurate measurement of population-level health outcomes.
References
- Atherton J. Development of the electronic health record. Virtual Mentor. 2011;13(3):186-189.
- Trout KE, Chen LW, Wilson FA, Tak HJ, Palm D. The impact of meaningful use and electronic health records on hospital patient safety. Int J Environ Res Public Health. 2022;19(19):12525.
- Phelan D, Gottlieb D, Mandel JC, et al. Beyond compliance with the 21st Century Cures Act Rule: a patient controlled electronic health information export application programming interface. J Am Med Inform Assoc. 2024;31(4):901-909.
- Hollister B, Bonham VL. Should electronic health record-derived social and behavioral data be used in precision medicine research? AMA J Ethics. 2018;20(9):E873-E880.
- Vuong K, Ivers R, Hall Dykgraaf S, Nixon M, Roberts G, Liaw ST. Ethical considerations regarding the use of pooled data from electronic health records in general practice. Aust J Gen Pract. 2022;51(7):537-540.
- Feder SL. Data quality in electronic health records research: quality domains and assessment methods. West J Nurs Res. 2018;40(5):753-766.
- Klinger EV, Carlini SV, Gonzalez I, et al. Accuracy of race, ethnicity, and language preference in an electronic health record. J Gen Intern Med. 2015;30(6):719-723.
- Bell SK, Delbanco T, Elmore JG, et al. Frequency and types of patient-reported errors in electronic health record ambulatory care notes. JAMA Netw Open. 2020;3(6):e205867.
- Haneuse S, Arterburn D, Daniels MJ. Assessing missing data assumptions in EHR-based studies: a complex and underappreciated task. JAMA Netw Open. 2021;4(2):e210184.
- Bowman S. Impact of electronic health record systems on information integrity: quality and safety implications. Perspect Health Inf Manag. 2013;10(fall):1c.
- El Emam K, Jonker E, Moher E, Arbuckle L. A review of evidence on consent bias in research. Am J Bioeth. 2013;13(4):42-44.
- Kho ME, Duffett M, Willison DJ, Cook DJ, Brouwers MC. Written informed consent and selection bias in observational studies using medical records: systematic review. BMJ. 2009;338:b866.
- de Man Y, Wieland-Jorna Y, Torensma B, et al. Opt-in and opt-out consent procedures for the reuse of routinely recorded health data in scientific research and their consequences for consent rate and consent bias: systematic review. J Med Internet Res. 2023;25:e42131.
- Boyd AD, Gonzalez-Guarda R, Lawrence K, et al. Equity and bias in electronic health records data. Contemp Clin Trials. 2023;130:107238.
- Rozier MD, Patel KK, Cross DA. Electronic health records as biased tools or tools against bias: a conceptual model. Milbank Q. 2022;100(1):134-150.
- Verheij RA, Curcin V, Delaney BC, McGilchrist MM. Possible sources of bias in primary care electronic health record data use and reuse. J Med Internet Res. 2018;20(5):e185.
- Bernstein E, DeRycke EC, Han L, et al. Racial, ethnic, and rural disparities in US veteran COVID-19 vaccine rates. AJPM Focus. 2023;2(3):100094.
- Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med. 2018;178(11):1544-1547.
- Getzen E, Ungar L, Mowery D, Jiang X, Long Q. Mining for equitable health: assessing the impact of missing data in electronic health records. J Biomed Inform. 2023;139:104269.
- Guidance regarding methods for de-identification of protected health information in accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. US Department of Health and Human Services. Reviewed October 25, 2022. Accessed February 22, 2024. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- Sun M, Oliwa T, Peek ME, Tung EL. Negative patient descriptors: documenting racial bias in the electronic health record. Health Aff (Millwood). 2022;41(2):203-211.
- Boyd AD, Gonzalez-Guarda R, Lawrence K, et al. Potential bias and lack of generalizability in electronic health record data: reflections on health equity from the National Institutes of Health Pragmatic Trials Collaboratory. J Am Med Inform Assoc. 2023;30(9):1561-1566.
- Sittig DF, Singh H. Rights and responsibilities of users of electronic health records. CMAJ. 2012;184(13):1479-1483.
- Holmes JH, Beinlich J, Boland MR, et al. Why is the electronic health record so challenging for research and clinical care? Methods Inf Med. 2021;60(1/2):32-48.
- Gianfrancesco MA, Goldstein ND. A narrative review on the validity of electronic health record-based research in epidemiology. BMC Med Res Methodol. 2021;21(1):234.
- Stulberg D. The Patient Protection and Affordable Care Act and reproductive health: harnessing data to improve care. J Health Polit Policy Law. 2013;38(2):441-456.
- Sparer MS. States as policy laboratories: the politics of state-based single-payer proposals. Am J Public Health. 2019;109(11):1511-1514.
- Lewis AE, Weiskopf N, Abrams ZB, et al. Electronic health record data quality assessment and tools: a systematic review. J Am Med Inform Assoc. 2023;30(10):1730-1740.