The right patients revealed: A fresh approach to rare disease trials

Rare disease trials struggle with fragmented data and recruitment challenges. See how better clinical data can accelerate enrollment.
Written by IMO Health Staff
Executive summary: The 30-second takeaway

Rare disease trials fail when clinical intent is lost between documentation, coding, and research workflows. IMO Health’s approach preserves that intent through granular terminology, structured data normalization, and context-aware NLP embedded directly into EHR workflows. The result is faster activation, cleaner feasibility, higher-quality enrollment, and more precise, scalable rare disease research.

Introduction

Rare disease programs rarely fail because a therapy lacks promise. More often, they falter under the weight of bad data, underpowered designs, and recruitment tactics from another era. The impact is profound: millions in development spend evaporate, and patients wait, sometimes indefinitely, for therapies that could change or save their lives.

What makes rare disease programs uniquely fragile is the gap between clinical documentation and research needs: administrative code sets miss critical nuance, while high-value details (genotype, severity, phenotype, temporality) are buried in unstructured notes. Without a way to surface and standardize that information across sites, feasibility, cohorting, and analytics rest on a weak foundation.

This eBook explains why failures begin long before the first patient in. It then lays out a next-generation model rooted in data normalization, intelligent cohorting, and workflow-native activation to shorten timelines, reduce costs, and raise enrollment quality so no eligible patient is left behind.

Chapter One – Why rare disease trials fail before they start

About 50% of rare disease randomized controlled trials are discontinued or remain unpublished.1 That figure represents more than sunk costs; it reflects missed patients who qualify but are never found, and promising therapies that stall or are abandoned. 

Many otherwise eligible patients, often highly knowledgeable about their condition, still never learn about open studies. This under-enrollment fuels what some call orphan, lost, or unseen diseases, and the burden extends beyond sponsors to families and society.

Two intertwined drivers underpin the problem:

  • Operational drag: labor-intensive pre-screening, narrow site networks, and slow activation that suppress throughput and limit access to community settings.
  • Data fragmentation: insufficient representation of rare disease in electronic health record (EHR) nomenclature and inconsistent, non-research-grade detail flowing into feasibility and accrual.

When the data foundation is thin, the processes built on top of it buckle. Sites overestimate eligible pools, screen failures spike, protocol amendments are required to address criteria that proved unworkable in practice, and timelines slip. 

The common root is clear – rare conditions are poorly represented in administrative code sets, and clinical data lacks the standardization research requires.

Chapter Two – The operational challenge

Sites often struggle to identify and enroll eligible patients fast enough to meet timelines.

Many teams still rely on manual code curation and chart review by coordinators and investigators who already carry clinical loads. That burden discourages community sites, where many patients actually receive care, and concentrates research in large academic centers. At many community sites, protocol complexity, limited IT capacity, and scarce research time lead to non-participation altogether.

Even when sites do participate, many underperform. The cause is not an absence of patients: International Classification of Diseases (ICD)-based filters can pull in clinically similar conditions, inflating lists with false positives and burying truly eligible patients under hours of manual verification. Administrative lists can make pools look large on paper while coordinators lack the capacity to locate the candidates who actually qualify. The result is time diverted from patient outreach and consent to adjudicating false positives.

The common “fix” is to add more sites, which increases oversight complexity and cost without restoring throughput or diversity. 

Operational friction accumulates from feasibility to first-patient-in unless eligibility logic is delivered directly inside site workflows.

Chapter Three – The data challenge

Healthcare data is heterogeneous and siloed. The same clinical reality is encoded differently across systems, and more than 80% of the meaningful detail lives in unstructured text,2 including Subjective, Objective, Assessment, Plan (SOAP) narratives where disease course, severity, and genetics actually reside. Downstream analytics and AI inherit these inconsistencies. Ultimately, without normalization and context-aware extraction, feasibility skews, cohorts drift from clinical intent, and statistical power erodes.

Because the data isn’t aligned to common standards, teams spend significant time and budget on clean-up, and AI models trained on messy inputs lose accuracy and propagate bias. Making the data usable at scale requires consistent mapping of labs, medications, procedures, and diagnoses to industry standards (e.g., LOINC®, RxNorm®, CPT®, SNOMED CT®) so outputs remain comparable across sites and studies.

Compounding the challenge is schema drift across institutions, such as different field names, units, reference ranges, and local vocabularies. Even high-quality single-site datasets can fail to generalize when pooled, unless harmonization and codification are enforced upfront.

Chapter Four – When codes erase clinical intent

Administrative code sets, such as ICD-10-CM, were built for reimbursement, not for research. In rare disease, that mismatch is decisive:

  • Only about 7% of rare diseases have a unique ICD-10-CM code: of the approximately 7,000 rare diseases, only about 500 have a dedicated ICD-10-CM entry.3
  • Codes under-capture disease. For example, sepsis is correctly identified by coding in only approximately 35% of cases.4
  • Approximately 20% error rates have been reported for key conditions.5
  • Biomedical evidence doubles roughly every 12 months, yet major code set transitions, including the move from ICD-10-CM to ICD-11, unfold on decade timelines.6

Even with ongoing additions driven by rare disease communities, administrative coding still fails to carry the nuance research needs. The practical outcome is that distinct diseases documented at the point of care collapse into shared administrative categories, stripping out stage, histology, genotype, and severity, which are precisely the details trials rely on for eligibility, stratification, and endpoints.

Chapter Five – Preserving intent with granular clinical terminology

A better foundation begins with the language clinicians actually use. 

IMO Health’s interface terminology captures real clinical phrasing and maps it into administrative code sets, ensuring that billing and operations continue to function without sacrificing clinical intent. The terminology is continuously curated by medical terminologists and practicing clinicians. It builds on decades of accumulated terms, with releases up to five times per year to meet both regulatory and non-regulatory needs, including client term requests.

Consider muscular dystrophy disorders: Duchenne and Becker are clinically distinct yet often share an ICD category. In IMO Health’s model, they are separately represented, linked to distinct SNOMED concepts, and mapped appropriately to ICD-10-CM. At this level of detail, concepts can also carry genotype and etiologic signals such as biomarkers, specific mutations, and relevant family history, so rare-disease cohorts preserve the biology that eligibility and endpoints depend on.
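The difference between grouping by administrative code and grouping by clinical concept can be sketched in a few lines. This is an illustration only, not IMO Health’s data model: the class, its field names, and the CONCEPT-* identifiers are hypothetical placeholders, and the shared ICD-10-CM code shown (G71.01) should be verified against the current code set before any real use.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClinicalTerm:
    """Hypothetical interface-terminology entry (field names are assumptions)."""
    label: str        # phrasing a clinician would choose at the point of care
    concept_id: str   # placeholder for a distinct reference-terminology concept
    icd10cm: str      # administrative code used for billing


TERMS = [
    ClinicalTerm("Duchenne muscular dystrophy", "CONCEPT-DMD", "G71.01"),
    ClinicalTerm("Becker muscular dystrophy", "CONCEPT-BMD", "G71.01"),
]


def group_by(terms, key):
    """Group term labels by one attribute, e.g. 'icd10cm' or 'concept_id'."""
    groups = defaultdict(list)
    for term in terms:
        groups[getattr(term, key)].append(term.label)
    return dict(groups)


# Grouping by billing code merges the two diseases into one bucket;
# grouping by concept keeps them apart.
by_icd = group_by(TERMS, "icd10cm")
by_concept = group_by(TERMS, "concept_id")
print(by_icd)
print(by_concept)
```

A cohort query written against the concept-level grouping can target one disease without inheriting the other, which is the property the paragraph above describes.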

Scale and interoperability matter:

  • ICD-10-CM: Approximately 75,000 codes
  • IMO Health: More than 1 million clinical concepts across the ecosystem (including synonyms, acronyms, misspellings, variants, mutations, severity, familial patterns) and over 5 million terms that account for complex clinical nuances through greater specificity and language that accurately represents patient care
  • Each concept maps cleanly across ICD-9-CM, ICD-10-CM, and SNOMED CT, enabling consistent cohorting, surveillance, and analytics across datasets and time.

Independent validation indicates a near-identical match to manual review. In a CDC study comparing manual chart review (the gold standard) with ICD-only and ICD with structured-data algorithms for a cardiac condition, the IMO Health terminology-based method matched manual review almost one-to-one, differing on only one chart.7 In practice, that allows machine-driven identification to approach the fidelity of human chart review, without human cycle times.

Chapter Six – Clinical trial enablement solutions

With IMO Health terminology and EHR-native workflows as the foundation, trial operations can change end-to-end and stay inside the tools sites already use:

  • Site selection: Combine claims-based market intelligence with clinician activity signals to identify where providers are actively diagnosing and treating patients with the target condition. Validate those signals against real-world EHR data to confirm actual protocol-aligned patient counts before site activation.
  • Site activation: Provide EHR-native assets that sites can import and use directly in their existing workflows, with minimal IT build. Sponsors can save up to two months of IT enablement. Coordinators can start pre-screening immediately, and providers receive point-of-care alerts inside the EHR.
  • Patient identification: Use high-precision filters to remove noise at the source. In documented programs, this has cut false positives by up to 99%, which reduces screen failures and lets clinical research coordinators (CRCs) spend time on patient outreach instead of adjudication. Once filters and work queues are live, sites report eliminating most manual chart reviews.
  • Trial result collection: Pull structured variables directly from clinical notes and auto-populate the electronic data capture (EDC) from the EHR, eliminating duplicate data entry. This improves completeness (fewer missing fields), reduces entry errors, and speeds interim analyses and data cleaning.

The net effect is faster trial start-up, more accurate site lists, fewer non-performers, cleaner datasets, and a higher proportion of qualified referrals.

Chapter Seven – Case study in Primary Ciliary Dyskinesia

A trial recruitment example centered on Primary Ciliary Dyskinesia (PCD) shows what changes when clinical intent is preserved at scale. In an analysis spanning approximately 300 sites, four major EHRs, and more than 15 million patients, ICD-based querying narrowed the field to about 30,000 records, but produced 97.9% false positives. A major driver of noise was the catch-all ICD label “Other specified congenital malformations of the respiratory system,” which is not PCD and pulls in many unrelated congenital conditions. 

When sponsors relied on claims-coded site lists, more than half of the sites were false positives, meaning they appeared to have PCD on paper but had no true PCD patients, and some organizations had zero relevant patients despite the ICD hits.

Applying IMO Health lexicals changed the picture. Because the terminology encodes etiology, severity, and genetic markers, the candidate pool was narrowed to 645 trial-eligible patients across 129 sites. Operationally, sponsors:

  • Eliminated almost 50% of non-performing sites in this analysis (noting that some programs have reported complete removal of non-performers before start-up).
  • Reduced false positives by more than 97%, dramatically lowering screen failures.
  • Saved over 1,500 hours of manual chart review, shifting coordinator time from adjudication to patient outreach and consent.

Manual review still occurred, but from a much smaller, truer haystack, enabling faster, cleaner pre-screening and more reliable feasibility.
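The false-positive figure in this case study can be checked with simple arithmetic. The sketch below assumes that all 645 trial-eligible patients fall within the roughly 30,000 ICD-flagged records and that both figures are rounded, so the result is approximate; the function name is illustrative.

```python
def false_positive_share(flagged: int, truly_eligible: int) -> float:
    """Fraction of flagged records that are not truly eligible."""
    if flagged <= 0 or not 0 <= truly_eligible <= flagged:
        raise ValueError("need flagged > 0 and 0 <= truly_eligible <= flagged")
    return (flagged - truly_eligible) / flagged


icd_flagged = 30_000   # records returned by ICD-based querying (rounded in the text)
eligible = 645         # trial-eligible patients found with granular terminology

share = false_positive_share(icd_flagged, eligible)
print(f"False-positive share of the ICD list: {share:.2%}")
```

With these rounded inputs the share comes out just under 98%, in line with the 97.9% reported above.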

Chapter Eight – Normalizing structured data

Structured data (labs, procedures, medications, diagnoses/problem list) still varies widely by site. 

Take A1C as an example: it may be recorded as HbA1c, Hemoglobin A1c, or Glycosylated hemoglobin A1c, and each label can be tied to different local or standard codes. IMO Precision Normalize resolves these differences by standardizing names and codes, then mapping each domain to its standard (labs to LOINC, medications to RxNorm, and procedures to CPT) so the output is consistent and analysis-ready across institutions.

What effective data normalization does:

  • Standardizes labels, units, and reference ranges and aligns local/site-specific codes to a single standard concept.
  • De-duplicates overlapping entries (e.g., a lab panel and its individual component tests) and brings different value lists into one consistent set.
  • Maps each domain to the right standard (for instance, labs > LOINC, meds > RxNorm, procedures > CPT) for interoperability.
  • Validates cross-site consistency so pooled analyses reflect biology, not documentation quirks.

The result is a clean, harmonized dataset that teams can use immediately. This cuts weeks or months of custom data wrangling and stabilizes downstream analytics and modeling.
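The label standardization and de-duplication steps listed above can be sketched as follows. This is a minimal illustration, not how IMO Precision Normalize works internally: the synonym table, function names, and row layout are assumptions, unit and reference-range handling are omitted, and the LOINC code shown for hemoglobin A1c (4548-4) should be confirmed against the LOINC release in use.

```python
import re

# Hypothetical synonym table: local lab labels -> (canonical name, LOINC code).
CANONICAL_LABS = {
    "a1c": ("Hemoglobin A1c", "4548-4"),
    "hba1c": ("Hemoglobin A1c", "4548-4"),
    "hemoglobin a1c": ("Hemoglobin A1c", "4548-4"),
    "glycosylated hemoglobin a1c": ("Hemoglobin A1c", "4548-4"),
}


def normalize_label(raw_label):
    """Lowercase, trim, and collapse whitespace, then look up the canonical entry."""
    key = re.sub(r"\s+", " ", raw_label.strip().lower())
    return CANONICAL_LABS.get(key)


def normalize_results(rows):
    """Return (normalized rows, unmapped rows); exact duplicates are dropped."""
    normalized, unmapped, seen = [], [], set()
    for row in rows:
        mapped = normalize_label(row["label"])
        if mapped is None:
            unmapped.append(row)  # keep for manual review instead of guessing
            continue
        name, loinc = mapped
        dedup_key = (row["patient_id"], loinc, row["collected"], row["value"])
        if dedup_key in seen:
            continue
        seen.add(dedup_key)
        normalized.append({
            "patient_id": row["patient_id"],
            "lab": name,
            "loinc": loinc,
            "value": row["value"],
            "collected": row["collected"],
        })
    return normalized, unmapped


rows = [
    {"patient_id": "p1", "label": "HbA1c", "value": 7.2, "collected": "2024-03-01"},
    {"patient_id": "p1", "label": "Hemoglobin  A1c", "value": 7.2, "collected": "2024-03-01"},
    {"patient_id": "p2", "label": "Glycosylated hemoglobin A1c", "value": 6.1, "collected": "2024-03-02"},
    {"patient_id": "p3", "label": "Local panel XYZ", "value": 1.0, "collected": "2024-03-02"},
]
normalized, unmapped = normalize_results(rows)
print(len(normalized), "normalized;", len(unmapped), "left for review")
```

Routing unmapped labels to a review list rather than discarding them keeps the mapping table auditable as new local vocabularies appear.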

Chapter Nine – Unstructured notes with natural language processing (NLP)

Most of the detail that matters for rare disease, such as phenotype, severity, genotype, trajectories, and care context, lives in unstructured clinical notes. To use it at scale, the text must be made machine-readable without losing context.

IMO Health’s clinical natural language processing (NLP) is trained on real clinical language and grounded in decades of terminology curation, so it captures what generic tools miss.

What IMO Health’s clinical NLP system does:

  • Clinical concept extraction: The NLP system scans unstructured free text to find diseases, symptoms, medications, labs, procedures, and genotypic markers.
  • Assertion status detection: Determines whether a clinical concept is asserted as present, absent, or merely possible.
  • Relationship extraction: Uncovers connections between clinical concepts throughout the text and links entities that belong together (e.g., mutation > disease subtype > severity).
  • Temporal information detection: Extracts time-related details associated with the identified clinical concepts to build timelines and eligibility windows.
  • Section header detection: Identifies sections within clinical narrative to add context and organization to extracted entities. It recognizes whether a statement appears in HPI, Assessment/Plan, Family History, etc., because section context changes meaning.
  • Concept codification: Comprehensively maps clinical concepts to IMO Health terms and all relevant standard codes for interoperability. 

Why IMO Health’s approach works better

IMO Health’s clinical NLP is tuned to how clinicians actually write, so it recognizes intent even when terms appear as synonyms, acronyms, misspellings, or local vernacular rather than dictionary-perfect strings. 

Its accuracy is strengthened by more than 30 years of IMO Health terminology development and ongoing review by clinicians and terminologists, with regular updates that continually improve performance.

By pairing the NLP engine with curated terminology and editorial guidelines, the approach avoids the misclassification and drift that are common with generic, out-of-the-box language models, and captures the nuance rare disease work requires.

What the output looks like

The system outputs context-preserving, time-stamped records (e.g., severe cardiomyopathy, present, Assessment/Plan, onset 2023-08) and encodes each item to IMO Health terms and the appropriate standard vocabularies.

Each record retains status (present/absent/possible), temporality (when it occurred), and relationships between entities (including sentiment/context where relevant). As a result, the data is ready for natural-history studies, direct EHR-to-EDC transfer for trials, real-world evidence generation and post-marketing surveillance, and clinical research automation, all while preserving context and accuracy.
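One way to picture such a record is as a small typed structure. The sketch below is an assumption about shape only, not IMO Health’s actual output schema: the class, enum, and field names are hypothetical, and the codes mapping is left empty rather than filled with invented identifiers.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Dict, Optional


class Assertion(Enum):
    PRESENT = "present"
    ABSENT = "absent"
    POSSIBLE = "possible"


@dataclass(frozen=True)
class ExtractedFinding:
    """Hypothetical context-preserving record for one extracted concept."""
    concept: str
    assertion: Assertion
    section: str                      # note section the statement came from
    severity: Optional[str] = None
    onset: Optional[str] = None       # ISO 8601 year-month, e.g. "2023-08"
    codes: Dict[str, str] = field(default_factory=dict)  # vocabulary -> code


def counts_toward_eligibility(finding: ExtractedFinding) -> bool:
    """Only findings asserted as present, outside family history, count."""
    return (
        finding.assertion is Assertion.PRESENT
        and finding.section != "Family History"
    )


# The example from the text: severe cardiomyopathy, present, Assessment/Plan, onset 2023-08.
finding = ExtractedFinding(
    concept="cardiomyopathy",
    assertion=Assertion.PRESENT,
    section="Assessment/Plan",
    severity="severe",
    onset="2023-08",
)
print(asdict(finding))
print(counts_toward_eligibility(finding))
```

Keeping assertion, section, and onset as separate fields lets a downstream eligibility rule filter on each one, instead of re-parsing the note text.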

Why context matters

Phrases like “history of diabetes,” “family history of diabetes,” and “severe diabetes” describe different clinical realities. Treating them as equivalent corrupts feasibility, cohort selection, and endpoints. Context-aware NLP preserves those distinctions at scale.
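A toy rule-based classifier shows why the three phrases must not be merged. It is deliberately simplistic, using regular expressions in place of a trained clinical NLP system, and every name in it is illustrative rather than part of any IMO Health product.

```python
import re

# Ordered rules: the more specific pattern must be checked first, because
# "family history of" also contains "history of".
CONTEXT_RULES = [
    ("family_history", re.compile(r"\bfamily history of\b", re.IGNORECASE)),
    ("personal_history", re.compile(r"\bhistory of\b", re.IGNORECASE)),
]
SEVERITY = re.compile(r"\b(mild|moderate|severe)\b", re.IGNORECASE)


def classify_mention(phrase, concept="diabetes"):
    """Return context and severity for a concept mention, or None if absent."""
    if concept.lower() not in phrase.lower():
        return None
    context = "current"
    for name, pattern in CONTEXT_RULES:
        if pattern.search(phrase):
            context = name
            break
    severity = SEVERITY.search(phrase)
    return {
        "concept": concept,
        "context": context,
        "severity": severity.group(1).lower() if severity else None,
    }


for text in ("history of diabetes", "family history of diabetes", "severe diabetes"):
    print(text, "->", classify_mention(text))
```

A filter that matched only on the word "diabetes" would return all three phrases as one cohort; the context field is what separates them.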

Chapter Ten – Implementation playbook, from protocol to feasibility

Operationalization starts with the protocol. Sponsors share the eligibility language, and IMO Health subject-matter experts translate it into curated value sets – groups of codes and terms that define clinical concepts – tuned to the task.

Because IMO Health is embedded across every major EHR system and used by over 95% of U.S. providers, these assets can be delivered directly into the workflows sites already use every day. Rather than introducing new software or custom infrastructure, sites receive EHR-digestible files with clear import instructions, allowing coordinators to move from protocol criteria to operational screening in minutes.

High-precision value sets minimize false positives for EHR filtering. (Of the patients the filter flags, what percentage are truly eligible?)

High-recall value sets minimize false negatives for broad pre-screening. (Of all truly eligible patients, what percentage does the filter find?)

These lists are jointly reviewed and optimized with the sponsor before they are delivered to sites.
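The two questions posed above correspond to the standard definitions of precision and recall. The sketch below computes both for a narrow and a broad filter over a made-up set of patient identifiers; the data and names are purely illustrative.

```python
def precision_recall(flagged, eligible):
    """Return (precision, recall) for a filter's output.

    precision: of the patients the filter flags, the share truly eligible
    recall:    of all truly eligible patients, the share the filter finds
    """
    true_positives = len(flagged & eligible)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(eligible) if eligible else 0.0
    return precision, recall


# Hypothetical ground truth from chart review.
eligible = {"p1", "p2", "p3", "p4"}

# A narrow value set flags few patients, all correct, but misses half.
narrow = {"p1", "p2"}
# A broad value set finds everyone, at the cost of extra false positives.
broad = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}

narrow_scores = precision_recall(narrow, eligible)
broad_scores = precision_recall(broad, eligible)
print("narrow:", narrow_scores)
print("broad:", broad_scores)
```

In this toy data the narrow set scores high on precision and low on recall, and the broad set does the reverse, which is why the two kinds of value sets serve different stages of screening.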

During feasibility, sites can run these filters before activation to pre-screen for truly eligible patients and to generate actual counts rather than estimates. Claims-based lists are verified, so if an ICD code is shown to capture only a small fraction of the intended population, site lists are re-ranked using IMO Health’s real-time provider insights. The same EHR-digestible translation is then deployed on site to confirm feasibility with eligible patient counts, helping eliminate non-performing sites before they add drag.

Key takeaways – Return on investment (ROI) and what partners gain

Bringing terminology, normalization, clinical NLP, and workflow-native integration together delivers measurable gains across speed, cost, and quality:

  • Activation speed: save up to two months in IT enablement at activation.
  • Noise reduction: cut false positives by up to 99% before CRCs begin review (more than 97% reduction demonstrated in the PCD example).
  • Fewer non-performers: remove approximately 50% of non-performing sites upfront in the PCD analysis (some programs report eliminating all non-performers before start-up).
  • CRC time back: over 1,500 hours of manual chart review saved in the PCD example; in practice, sites eliminate most manual chart reviews once precision filters and queues are live.
  • Data quality: closed-loop EHR-to-EDC capture improves completeness (fewer missing fields) and reduces entry errors, accelerating interim analyses and cleaning cycles.
  • Enrollment quality: fewer screen failures, cohorts that better match clinical intent, and more qualified referrals as in-workflow cues guide providers.
  • Commercial impact: stronger Request for Proposal (RFP) positioning through accurate site lists and operational readiness; over $50,000 saved per underperforming site avoided.

Strategically, these benefits compound across a portfolio: feasibility loops close faster, AI-ready datasets cut analysis rework, and site networks expand into community settings without proportional overhead.

This moves rare disease programs from “rarely ready” to precision-ready.

To learn more about IMO Health’s Life Sciences solutions, visit imohealth.com/life-sciences-solutions/.

Or if you’re ready to accelerate your rare disease research, book a demo with one of our experts at imohealth.com/schedule-a-demo/.

SNOMED® and SNOMED CT® are registered trademarks of SNOMED International.

LOINC® is a registered trademark of Regenstrief Institute, Inc.

CPT® is a registered trademark of the American Medical Association.

RxNorm® is a registered trademark of the U.S. National Library of Medicine. All other product and company names may be trademarks™ or registered® trademarks of their respective holders. Use of them does not imply any affiliation or endorsement.

1. Rees CA, Pica N, Monuteaux MC, Bourgeois FT. Noncompletion and nonpublication of trials studying rare diseases: A cross-sectional analysis. PLOS Medicine. 2019;16(11):e1002966. Accessed via: https://doi.org/10.1371/journal.pmed.1002966

2. Kong HJ. Managing unstructured big data in healthcare system. Healthcare Informatics Research. 2019;25(1):1–2. Accessed via: https://pmc.ncbi.nlm.nih.gov/articles/PMC6372467/

3. Aymé S, Bellet B, Rath A. Rare diseases in ICD11: Making rare diseases visible in health information systems through appropriate coding. Orphanet Journal of Rare Diseases. 2015;10(1):35. Accessed via: https://ojrd.biomedcentral.com/articles/10.1186/s13023-015-0251-8

4. Liu B, Hadzi-Tosev M, Liu Y, Lucier KJ, Garg A, Li S, et al. Accuracy of International Classification of Diseases, 10th Revision codes for identifying sepsis: A systematic review and meta-analysis. Critical Care Explorations. 2022;4(11):e0788. Accessed via: https://journals.lww.com/ccejournal/fulltext/2022/11000/accuracy_of_international_classification_of.7.aspx

5. O’Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring diagnoses: ICD code accuracy. Health Services Research. 2005;40(5 Pt 2):1620–1639. Accessed via: https://pmc.ncbi.nlm.nih.gov/articles/PMC1361216/

6. Fung KW, Xu J, Bodenreider O. The new International Classification of Diseases 11th edition: A comparative analysis with ICD-10 and ICD-10-CM. Journal of the American Medical Informatics Association. 2020;27(5):738–746. Accessed via: https://pmc.ncbi.nlm.nih.gov/articles/PMC7309235/

7. Kottke TE, Baechler CJ. An algorithm that identifies coronary and heart failure events in the electronic health record. Preventing Chronic Disease. 2013;10:E29. Accessed via: https://stacks.cdc.gov/view/cdc/19650
