Evaluating the impact of discordant and missing demographic information on population health assessments using linked electronic health records and Census Bureau microdata
Abstract
Administrative records are increasingly being used to study population-level outcomes, despite high rates of missingness and discrepancies (i.e., discordance) in demographic identifiers across different sources of data, which could reduce the quality of such assessments. Few studies have evaluated the relationship between these phenomena in administrative records and downstream impacts on assessments in consequential domains such as healthcare. We characterize patterns of discordance and missingness of race and ethnicity in electronic health records (EHR; 2010–2021) derived from the American Board of Family Medicine’s primary care registry, linked at the individual level to restricted U.S. Census Bureau microdata (2000, 2010, 2020 Census; American Community Survey 2005–2022). Among 5.86 million linked patients, 19.3% were missing race and ethnicity information in EHRs, and 8.0% had race and ethnicity information that was recorded discordantly between the two sources, with the lowest discordance for White, Black, and Asian patients and the highest for American Indian and Alaska Native, Native Hawaiian and Pacific Islander (NHPI), and Multiracial patients. Missingness and discordance impacted estimation of group differences for all 50 health outcomes we consider, particularly for smaller racial/ethnic groups, such as a 24% change in NHPI Type 2 diabetes diagnosis rates. Our research has three major implications for the work of government agencies, academics, clinicians, and other stakeholders interested in utilizing EHRs for research purposes. First, we demonstrate how the quality of demographic data in administrative records can be comprehensively assessed, which has not previously been possible due to limitations in data access and linkage. Second, we systematically evaluate the impact of discordant and missing demographic information on our ability to accurately estimate disease prevalence. Third, we underscore the importance of evaluating discordance of demographic information both within and across different administrative domains.
Author summary
Population-level assessments in consequential domains such as healthcare depend on large, high-quality administrative data. However, discordance and missingness of demographic information across records can distort analyses conducted by researchers and policymakers. We provide comprehensive evidence of these patterns and characterize them using a dataset of 5.86 million patients in the United States with linked information from electronic health records and restricted U.S. Census Bureau microdata. In particular, we demonstrate how these data quality issues can affect estimation of consequential group-level health outcomes, such as Type 2 diabetes diagnosis rates. Discordance and missingness are widespread and highly concentrated in specific administrative settings such as primary care clinics, creating the potential for error at every geographic scale of assessment. However, much can be done to diagnose and mitigate discordance and missingness, particularly at the point when demographic information is collected. With more complete and concordant demographic information and improved data quality in electronic health records and other administrative records, government agencies, academics, and practitioners can more accurately measure and address health challenges.
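To make the quantities discussed above concrete, the following minimal Python sketch computes a missingness rate, a discordance rate, and the relative change in a group's diagnosis rate on a handful of linked records. It is an illustration, not the authors' pipeline: the records are invented toy data, the function names are hypothetical, and the choices to use all linked records as the denominator, to treat the Census value as the reference, and to report change in relative terms are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy records, NOT study data:
# (race/ethnicity in EHR, race/ethnicity in Census, has diagnosis).
# The EHR value is None when the field is missing.
RECORDS = [
    ("White", "White", True),
    ("White", "White", False),
    ("White", "NHPI", True),    # discordant between sources
    (None, "NHPI", True),       # missing in the EHR
    ("NHPI", "NHPI", True),
    ("Asian", "Asian", False),
    (None, "White", False),     # missing in the EHR
    ("NHPI", "NHPI", False),
]


def missing_rate(records):
    """Share of linked records with no EHR race/ethnicity value."""
    return sum(1 for ehr, _, _ in records if ehr is None) / len(records)


def discordance_rate(records):
    """Share of linked records whose EHR value is present but differs
    from the Census value (assumed denominator: all linked records)."""
    discordant = sum(
        1 for ehr, census, _ in records if ehr is not None and ehr != census
    )
    return discordant / len(records)


def diagnosis_rates(records, source):
    """Diagnosis rate per group, grouping by the 'ehr' or 'census' value.
    Records with a missing group value are dropped from that grouping."""
    index = 0 if source == "ehr" else 1
    counts = defaultdict(lambda: [0, 0])  # group -> [diagnosed, total]
    for record in records:
        group = record[index]
        if group is None:
            continue
        counts[group][0] += record[2]
        counts[group][1] += 1
    return {group: d / n for group, (d, n) in counts.items()}


def relative_change(records, group):
    """Relative change in a group's diagnosis rate when EHR-recorded
    groups are used instead of the Census reference (an assumption)."""
    ehr_rate = diagnosis_rates(records, "ehr")[group]
    census_rate = diagnosis_rates(records, "census")[group]
    return (ehr_rate - census_rate) / census_rate


if __name__ == "__main__":
    print(f"missing: {missing_rate(RECORDS):.1%}")
    print(f"discordant: {discordance_rate(RECORDS):.1%}")
    print(f"NHPI relative change: {relative_change(RECORDS, 'NHPI'):+.1%}")
```

Even in this toy example, the small group is the most exposed: half of the records that the reference source assigns to it are either missing or recorded differently in the EHR, so its estimated diagnosis rate shifts noticeably.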