Most of us have encountered an advertisement for a clinical trial at some point, perhaps online or in the subway. But while these colorful banners often appeal to people with broad health symptoms, such as difficulty walking or hearing, or with specific conditions like schizophrenia, the reality is that qualifying for a clinical trial is very challenging.
Eligibility criteria must be specific and extensive to ensure that only suitable patients are recruited and that clinical trials remain safe and can be evaluated effectively. However, traditional approaches to extracting this information from unstructured trial text are often tedious and prone to human error.
To this end, IMO Health’s team of artificial intelligence (AI) experts, including natural language processing (NLP) scientists, biomedical engineers, and others, developed a generalizable and scalable GPT-based system to pull eligibility criteria from clinical trial documents across disease domains. A recent study demonstrates how combining such models with clinical NLP techniques can significantly streamline the patient recruitment process and expedite the construction of criteria knowledge bases, leading to advancements in medical knowledge and improved patient care.
Last month, IMO Health’s Surabhi Datta, PhD, Sr. Staff NLP Scientist, and Xiaoyan Wang, PhD, FAMIA, Chief Scientist and Senior Vice President, Life Sciences Solutions, presented these findings at a JAMIA Journal Club webinar. AMIA®, or the American Medical Informatics Association®, is a community of professionals dedicated to improving patient care and healthcare reform through informatics. JAMIA is AMIA’s peer-reviewed journal for biomedical and health informatics, covering all activities in the field.
Read on for a recap of this study and its significance.
Objective: Leverage clinical NLP for data extraction
Researchers have used AI and clinical NLP techniques for years to automatically extract eligibility criteria from clinical trial documents. They’ve adopted various approaches, and some have even proposed combining models (an approach called ensemble learning) to enhance results.
Recently, GPT-based models, such as GPT-4, have gained attention for their ability to understand and generate text. However, no prior work had investigated these large language models (LLMs) for criteria information extraction.
IMO Health scientists developed an AI system to close this gap and conducted a research study to assess its capabilities, communicate its strengths, and identify areas for improvement.
Methods: Training and evaluating AutoCriteria
The team pulled clinical trial data for nine diseases from ClinicalTrials.gov and built an information extraction system, AutoCriteria. For each disease, they used three trials for prompt design, five for prompt calibration, and randomly selected 20 to evaluate the system.
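For a concrete picture of the setup, here is a minimal sketch of how trials for one disease might be pulled and partitioned. It assumes the public ClinicalTrials.gov v2 REST API; the endpoint, parameters, condition name, and helper function are illustrative and are not taken from the published AutoCriteria code.

```python
import random

import requests

def fetch_trials(condition: str, n: int) -> list[dict]:
    """Fetch up to n study records for one condition from ClinicalTrials.gov."""
    resp = requests.get(
        "https://clinicaltrials.gov/api/v2/studies",
        params={"query.cond": condition, "pageSize": n},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("studies", [])

# Partition one disease's trials as described in the study:
# 3 for prompt design, 5 for prompt calibration, and a random 20 for evaluation.
trials = fetch_trials("breast cancer", 100)  # condition name is illustrative
random.shuffle(trials)
design, calibration = trials[:3], trials[3:8]
evaluation = random.sample(trials[8:], 20)
```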
AutoCriteria comprises the following modules:
Preprocessing
Trial documents are often long and contain many rules. So, IMO Health experts first split the raw criteria text into two parts: Inclusion (who can join a trial) and Exclusion (who cannot). Then, they split each part into smaller chunks of 200 words and ran their system on each chunk separately, extracting critical details. Finally, they merged the per-chunk results, keeping the inclusion and exclusion criteria separate, as sketched below.
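A minimal sketch of this preprocessing step, assuming the criteria text uses the conventional “Inclusion Criteria:” and “Exclusion Criteria:” headers found in ClinicalTrials.gov records; the function names and sample text are illustrative, not the published implementation.

```python
import re

def split_criteria(raw: str) -> tuple[str, str]:
    """Split raw criteria text into its inclusion and exclusion sections."""
    parts = re.split(r"exclusion criteria:?", raw, flags=re.IGNORECASE)
    inclusion = re.sub(r"inclusion criteria:?", "", parts[0], flags=re.IGNORECASE)
    exclusion = parts[1] if len(parts) > 1 else ""
    return inclusion.strip(), exclusion.strip()

def chunk_words(text: str, size: int = 200) -> list[str]:
    """Break a section into 200-word chunks so each fits in a single prompt."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

raw = (
    "Inclusion Criteria: Age 18 or older. Hemoglobin greater than 10.0 g/dL. "
    "Exclusion Criteria: Prior chemotherapy within 6 months of enrollment."
)
inclusion, exclusion = split_criteria(raw)
inclusion_chunks = chunk_words(inclusion)  # each chunk is sent to the model separately
exclusion_chunks = chunk_words(exclusion)  # results are later merged per section
```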
Knowledge ingestion
IMO Health’s AI team worked with domain experts to identify each disease’s key medical terms and attributes. This information helped the model pinpoint essential details within the inclusion and exclusion criteria.
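The study’s knowledge format is not reproduced here, but conceptually the expert-curated terms can live in a small per-disease lookup that is rendered into the prompt. The structure and every term below are assumptions for illustration.

```python
# Hypothetical expert-curated knowledge: entity types and example terms
# that matter for one disease's eligibility criteria.
DISEASE_KNOWLEDGE = {
    "breast cancer": {
        "Lab test": ["hemoglobin", "absolute neutrophil count", "platelet count"],
        "Biomarker": ["HER2", "ER", "PR"],
        "Therapy": ["chemotherapy", "radiation therapy"],
    },
}

def knowledge_hint(disease: str) -> str:
    """Render the curated terms as a plain-text hint for the prompt."""
    entries = DISEASE_KNOWLEDGE.get(disease, {})
    return "\n".join(
        f"- {entity_type}: {', '.join(terms)}"
        for entity_type, terms in entries.items()
    )
```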
Prompt modeling
The scientists experimented with many prompts, iteratively developing, testing, and calibrating them. They ultimately created two comprehensive prompts, one for inclusion criteria and another for exclusion criteria.
Prompt composition:
1. General instruction
2. [Inclusion Criteria Text]: <criteria text>
3. Query part for Inclusion

Sample output:
- Entity type: Lab test
- Attribute: Hemoglobin
- Value: ≥ 10.0 g/dL
- Modifier: NA
- Source Sentence: Hemoglobin greater than 10.0 g/dL.

This figure shows a sample prompt template and its corresponding output.
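Following the three-part composition in the figure, a prompt for one inclusion chunk could be assembled roughly as follows. The instruction and query wording are placeholders; the study’s actual prompts are not reproduced here.

```python
def build_inclusion_prompt(chunk: str, hint: str) -> str:
    """Assemble the three-part inclusion prompt shown in the figure."""
    return (
        # 1. General instruction (placeholder wording)
        "You are extracting eligibility criteria from a clinical trial document.\n"
        f"Key terms for this disease:\n{hint}\n\n"
        # 2. The inclusion criteria text for this chunk
        f"[Inclusion Criteria Text]: {chunk}\n\n"
        # 3. Query part for inclusion (placeholder wording)
        "For each criterion, list its Entity type, Attribute, Value, "
        "Modifier, and Source Sentence."
    )

prompt = build_inclusion_prompt(
    "Hemoglobin greater than 10.0 g/dL.",
    "- Lab test: hemoglobin, absolute neutrophil count, platelet count",
)
```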
Postprocessing
This step involved processing the model’s responses to address output inconsistencies and integrating medical knowledge through simple rules. For example, a response occasionally contained a missing value for an entity, indicated by a vague placeholder phrase such as “gene name”; these cases were removed.
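One way to implement such a rule, assuming each extracted criterion arrives as a dict with the fields shown in the figure; the placeholder list is illustrative, not the study’s actual rule set.

```python
# Vague placeholders the model occasionally emits instead of a concrete value;
# the entries here are illustrative examples.
VAGUE_PLACEHOLDERS = {"gene name", "lab test name"}

def postprocess(extractions: list[dict]) -> list[dict]:
    """Drop extracted criteria whose Value field is a vague placeholder."""
    return [
        e for e in extractions
        if e.get("Value", "").strip().lower() not in VAGUE_PLACEHOLDERS
    ]

cleaned = postprocess([
    {"Entity type": "Lab test", "Attribute": "Hemoglobin", "Value": "≥ 10.0 g/dL"},
    {"Entity type": "Gene", "Attribute": "Mutation", "Value": "gene name"},  # dropped
])
```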
Evaluation
As part of this stage, the scientists evaluated the prompts manually and calibrated them repeatedly using expert feedback for every disease. For the final system assessment, they reviewed both quantitative metrics, such as precision, recall, and F1 scores, and qualitative ones, such as missing and incorrect criteria entities.
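For reference, the quantitative metrics are the standard ones computed from counts of true positives (TP), false positives (FP), and false negatives (FN) over the extracted criteria entities; this snippet is a generic illustration rather than the study’s evaluation code.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from extraction match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```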
Results: Quantitative and qualitative metrics
The overall accuracy of AutoCriteria in identifying all contextual information across diseases was 78.95%. On the qualitative side, the team’s thematic analysis indicated that “accurate logic interpretation of criteria” was one of the model’s strengths, while “overlooking/neglecting the main criteria” was one of its weaknesses.
Significance: A promising future for clinical NLP in life sciences
This study demonstrates AutoCriteria’s potential to reduce the need for manual annotation when extracting granular eligibility information from trial documents. The prompts developed for the tool also generalize well across disease areas. Ultimately, the study suggests that AutoCriteria is a scalable solution capable of addressing the complexities of clinical trial processes in real-world settings.