Reproducibility and data quality: OHDSI takeaways

Data quality in healthcare isn’t a new topic, but the ways to ensure excellence keep improving. Here are some takeaways from the OHDSI symposium.
Written by IMO Health Staff

Observational Health Data Sciences and Informatics (OHDSI) is a multi-stakeholder, interdisciplinary collaborative of professionals engaged in collective research, and its annual symposium brings that community together. OHDSI maintains an international network of databases dedicated to the secondary, or observational, use of health data for medical decision-making informed by large-scale analysis.

At OHDSI’s recent October symposium, methodology for ensuring confidence in the evidence produced by observational research was a primary theme. Both the Food and Drug Administration (FDA) and the European Medicines Agency (EMA) use data generated through observational research to better understand the uses, safety, and efficacy of medicines.

Challenges to ensuring data quality in healthcare

Reproducibility and data quality are critical to research informed by secondary, or observational, data. Reproducibility is the ability of independent researchers to arrive at the same findings when applying the same design and operational choices to the same data source. It is one ingredient of healthcare data’s overall quality, which matters most when primary data is mapped to the OHDSI Observational Medical Outcomes Partnership (OMOP) common data model (CDM) for analysis. In short, confidence in real-world evidence relies on confidence in the data itself.
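
To give a rough sense of what mapping primary data to the CDM involves, here is a minimal sketch, assuming a hypothetical EHR diagnosis extract and an invented source-to-standard lookup; the concept ID is a placeholder, since real mappings come from the OMOP standardized vocabularies.

```python
# Minimal sketch of source-to-CDM mapping; codes and concept IDs are invented.
from datetime import date

# Hypothetical source record as it might appear in an EHR extract.
source_record = {"patient_id": 12345, "dx_code": "E11.9", "dx_date": "2023-04-01"}

# Placeholder source-code -> standard-concept lookup (real mappings come from the OMOP vocabularies).
source_to_standard = {"E11.9": 999001}  # 999001 is a made-up concept ID for illustration

def map_to_condition_occurrence(rec: dict) -> dict:
    """Shape a source diagnosis into CONDITION_OCCURRENCE-like fields."""
    return {
        "person_id": rec["patient_id"],
        "condition_concept_id": source_to_standard.get(rec["dx_code"], 0),  # 0 = unmapped
        "condition_start_date": date.fromisoformat(rec["dx_date"]),
        "condition_source_value": rec["dx_code"],
    }

print(map_to_condition_occurrence(source_record))
```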

But confidence in data isn’t always a guarantee. Common issues in observational research that can lead to conflicting results include confounding (a form of observational study bias), publication bias, and p-hacking. Confounding occurs when what appears to be a causal relationship between a treatment and an outcome is actually driven by a third variable that is not accounted for. Publication bias is the tendency to publish only results that are statistically or clinically significant. P-hacking occurs when researchers select data or alter an analysis until they obtain the desired result.
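
To make the confounding problem concrete, the small simulation below (a toy example, not drawn from the symposium) gives a treatment no true effect on the outcome, yet the naive comparison suggests one because disease severity drives both treatment and outcome; stratifying on the confounder removes the apparent effect.

```python
# Toy simulation of confounding: the treatment has no true effect on the outcome.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

severity = rng.binomial(1, 0.5, n)                              # confounder: disease severity
treated = rng.binomial(1, np.where(severity == 1, 0.8, 0.2))    # sicker patients get treated more often
outcome = rng.binomial(1, np.where(severity == 1, 0.30, 0.05))  # outcome driven by severity only

# Naive comparison mixes severity levels and shows a spurious "effect" (~0.15).
naive = outcome[treated == 1].mean() - outcome[treated == 0].mean()
print(f"Naive risk difference (biased): {naive:.3f}")

# Stratifying on the confounder: within each severity level the difference is ~0.
for s in (0, 1):
    mask = severity == s
    diff = outcome[mask & (treated == 1)].mean() - outcome[mask & (treated == 0)].mean()
    print(f"Severity={s}: risk difference {diff:.3f}")
```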

OHDSI’s role in the solution

To address these issues, OHDSI launched the Large-scale Evidence Generation and Evaluation across a Network of Databases (LEGEND) initiative, designed to generate evidence from observational health data. LEGEND describes best practices, including the application of a systematic causal effect estimation procedure that characterizes the direction and strength of the relationship between a treatment and an outcome. By defining control questions with known answers and running the analysis locally at multiple sites for comparison, researchers can estimate systematic error and correct for biases in the data.
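
The sketch below shows the negative-control idea in highly simplified form: log relative-risk estimates for control questions with no true effect are used to fit an empirical null distribution, and a new estimate is then judged against that distribution rather than the theoretical null. The numbers are invented, and this toy ignores the per-estimate standard errors that OHDSI's full empirical calibration method accounts for.

```python
# Toy empirical calibration: fit a null distribution from negative-control estimates,
# then judge a new estimate against that distribution instead of the theoretical null.
import numpy as np
from scipy import stats

# Hypothetical log relative-risk estimates for negative controls (true effect = 0).
# Systematic bias shows up as a mean shift and/or extra spread.
negative_control_log_rr = np.array([0.12, 0.05, 0.20, -0.02, 0.15, 0.10, 0.18, 0.07, 0.25, 0.09])

mu, sigma = negative_control_log_rr.mean(), negative_control_log_rr.std(ddof=1)
print(f"Empirical null: mean={mu:.3f}, sd={sigma:.3f}")  # nonzero mean hints at systematic error

observed = 0.30  # new log relative-risk estimate of interest (made up)

# Traditional p-value assumes a null centered at 0, using the estimate's own SE (say 0.10).
p_traditional = 2 * stats.norm.sf(abs(observed) / 0.10)

# Calibrated p-value uses the empirical null fitted from the negative controls.
p_calibrated = 2 * stats.norm.sf(abs(observed - mu) / sigma)

print(f"Traditional p-value: {p_traditional:.4f}")
print(f"Calibrated p-value:  {p_calibrated:.4f}")
```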

While LEGEND addresses the validity of study design, data quality is another matter. EHR data is designed for clinical care and billing, not research. It can be incomplete or inaccurate, lack validity or plausibility, offer insufficient granularity, or fail to conform to expected formats. If study data is of poor quality, the integrity of study cohorts can be compromised. OHDSI assesses data quality through the Automated Characterization of Health Information at Large-scale Longitudinal Evidence System (ACHILLES) tool and the newly developed Data Quality Dashboard (DQD). These tools compute summary statistics across data quality dimensions that include conformance, completeness, and plausibility.
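
As a rough illustration of those three dimensions, the sketch below computes conformance, completeness, and plausibility rates over an assumed toy table shaped like an OMOP measurement table; it is not the ACHILLES or DQD logic, and the concept ID and vocabulary set are placeholders.

```python
# Minimal data-quality sketch over a hypothetical OMOP-style measurement table;
# illustrative only, not the ACHILLES or Data Quality Dashboard implementations.
import pandas as pd

measurements = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "measurement_concept_id": [999001, 999001, 0, None, 999001],  # 999001 = made-up concept ID
    "value_as_number": [120, 450, 118, 95, None],                 # e.g. systolic BP in mmHg; 450 is implausible
})

known_concepts = {999001}  # stand-in for the vocabulary of valid concept IDs

# Conformance: concept IDs should resolve to known vocabulary entries.
conformance = measurements["measurement_concept_id"].isin(known_concepts).mean()

# Completeness: how often is the measured value populated?
completeness = measurements["value_as_number"].notna().mean()

# Plausibility: systolic blood pressure outside 40-300 mmHg is suspect.
plausible = measurements["value_as_number"].between(40, 300).mean()

print(f"Conformance:  {conformance:.0%} of rows map to a known concept")
print(f"Completeness: {completeness:.0%} of rows have a value")
print(f"Plausibility: {plausible:.0%} of rows fall in a plausible range")
```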

Data that has been normalized will be more complete and accurate, an improvement in the source data itself. Tools that assess data quality and surface potential issues are essential for confidence in evidence generated from observational research, but starting with better-quality data can significantly improve that evidence from the outset.

For a robust look at the many facets of data quality in healthcare, check out IMO Health’s webinar series addressing five aspects of the topic here.
