LLMs excel in medical coding with terminology, clinical AI

Key Takeaways

General-purpose LLMs struggle with medical coding accuracy, often producing errors without domain-specific support.
Integrating clinical terminology significantly improves LLM performance, enabling more precise and reliable coding.
Techniques like prompt engineering and fine-tuning with curated data enhance LLM accuracy and build trust in AI-driven coding.

In recent years, there has been growing interest in using artificial intelligence (AI), especially large language models (LLMs), to automate the medical coding process. Medical coding, which involves assigning standardized codes like ICD-10 and CPT® to diagnoses and procedures, is a critical but time-consuming administrative task in healthcare.

LLMs are deep learning models trained on vast datasets, which makes them well suited to generating text and automating tasks. But according to a recent Mount Sinai study published in the April 19 online issue of NEJM AI, LLMs are poor medical coders.¹ The study emphasizes the need to refine and validate these technologies before implementing them in large-scale clinical settings.

Data quality is a foundational element of success for AI in healthcare. Poor data quality can severely undermine model performance and analytical accuracy for healthcare technology companies aiming to generate actionable insights from clinical data. Inconsistent, incomplete, or unstructured data can lead to skewed predictions, misclassified clinical events, and impaired decision-making. Ensuring data is clean, standardized, and contextually rich is essential not only for model training but also for building trust in AI-powered solutions among clinicians and administrators.

LLMs display weaknesses in medical coding

Specifically, the study found that out-of-the-box LLMs like GPT-3.5, GPT-4, Gemini Pro, and Llama2-70b Chat performed poorly at medical coding when simply prompted to generate codes from descriptions. Out-of-the-box LLMs are pre-trained models that have not been fine-tuned or adapted for specific tasks, in contrast to specialized LLMs that have been further trained on domain-specific data to improve their performance in particular areas.

The best-performing model, GPT-4, which was OpenAI’s most advanced model at the time of the study, achieved only 46% exact match accuracy for ICD-9 codes, 34% for ICD-10, and 50% for CPT. The models often generated codes that were imprecise or entirely fabricated.
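
To make the metric concrete, the sketch below shows one plausible way to compute exact match accuracy: a prediction counts only if it is identical to the reference code. The function name, the normalization step, and the sample codes are illustrative assumptions, not the study's evaluation code.

```python
def exact_match_accuracy(predicted, reference):
    """Return the fraction of predicted codes identical to the reference codes."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("no items to score")

    def normalize(code):
        # Assumption: compare case-insensitively and ignore surrounding whitespace.
        return code.strip().upper()

    matches = sum(normalize(p) == normalize(r) for p, r in zip(predicted, reference))
    return matches / len(reference)


# Illustrative data only: the third prediction is a near miss, which still scores zero.
reference = ["E11.9", "I10", "J45.909", "J18.9"]
predicted = ["E11.9", "I10", "J45.901", "J18.9"]
print(f"{exact_match_accuracy(predicted, reference):.0%}")  # prints 75%
```

Because a near miss earns no credit, exact match is a strict measure, which is part of why the reported scores are so low.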

The Mount Sinai study aligns with IMO Health’s ongoing exploration of automated medical coding. We too have found that out-of-the-box LLMs, while impressive in many areas, often struggle with the complex and nuanced task of medical coding.

To address these limitations, IMO Health enhances LLMs through a proprietary knowledge layer – the combination of our robust terminology, mapping logic, AI techniques, and domain-specific tools. By combining the power of LLMs with our expertise in medical informatics, we create accurate and reliable automated medical coding solutions that support healthcare providers and improve patient care.

Enhancing LLMs with deep clinical ontologies and informatics

Structured clinical terminology, composed of codified terms from a common clinical vocabulary, can be employed to accurately represent clinical concepts like diseases or lab results. However, managing constantly changing data on millions of clinical terms, concepts, their interrelationships, and complex clinical nuances requires specialized expertise. Improving natural language processing (NLP) model performance demands not only comprehensive, structured clinical terminology but also the specialized expertise to manage it – two critical components of IMO Health’s knowledge layer.

Thirty years in the making, IMO Health remains the most advanced and widely adopted terminology solution in the industry, used by 89% of US physicians, nurses, and physician assistants. With its extensive coverage, versatility across electronic health record (EHR) use cases, meticulous content curation, and well-documented guidelines, IMO Health terminology can significantly enhance LLMs for medical coding.

Spanning 24 active domains, IMO Health terminology includes millions of unique concepts and lexicals. It encompasses industry-standard code sets including ICD-10-CM, ICD-9-CM, SNOMED CT®, CPT, HCPCS, ICD-10-PCS, LOINC®, RxNorm®, NDC, and CVX.

Compared to the Unified Medical Language System (UMLS), IMO Health includes approximately 20% more synonyms per concept and a higher percentage of long and complex terms (Figure 1), reflecting the precise language used in clinical care.

Accurate and up-to-date content curation

The creation and maintenance of IMO Health’s terminology is powered by a team of industry experts, including MDs, RNs, pharmacists, medical laboratory scientists, and credentialed HIM professionals.

Collectively, the team boasts more than 150 years of clinical informatics expertise, over 130 years of experience in health information, and 160 years of clinical practice experience spanning various specialties, such as surgery, oncology, radiology, pediatrics, orthopedics, emergency medicine, and family medicine. The team has put in hundreds of thousands of hours over three decades creating, curating, updating, and maintaining content.

With the upcoming need to support United States Core Data for Interoperability (USCDI) Version 3, maintaining accurate content is crucial to avoid costly penalties. IMO Health’s terminology is already compliant with USCDI v4, which means client data will be compliant through 2028.

Well-documented guidelines and instructions

IMO Health terminology includes a wealth of best practices, industry standards, coding guidelines, and IMO Health-specific rules. These resources are meticulously documented with detailed instructions and rich positive and negative examples. Hundreds of pages of editorial guidelines are designed to ensure consistent and high-quality content, promoting the creation of effective LLM prompts to simplify medical coding tasks.

Decades of access to clinical data

With a decades-long history as the terminology and coding foundation in all major EHRs, IMO Health has accumulated an extensive knowledge base. This includes capturing the clinical terms physicians search for when seeking medical codes as well as the codes they select, along with the distributions of search terms, frequencies, and co-occurrences. These insights enhance LLMs by providing additional context to medical codes.

Enhancing LLMs with proven AI techniques

IMO Health employs several proven strategies to optimize LLMs for medical coding, including advanced prompt engineering, retrieval-augmented generation (RAG), AI agents and tools, and fine-tuning.

Advanced prompt engineering

Prompt engineering – or the act of writing and refining inputs to elicit high-quality outputs – plays a crucial role in guiding LLMs to generate more accurate medical codes. At IMO Health, using ICD-10-CM codes as examples, we have summarized 22 coding rules and incorporated them as part of the prompts. In doing so, we have observed a significant improvement in the accuracy of generated ICD-10-CM codes compared to using simple questions alone.
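
As a rough sketch of how coding rules can be folded into a prompt, the example below prepends a numbered rule list to the clinical description. The three rules, the prompt wording, and the function name are hypothetical placeholders; they are not the 22 rules referred to above.

```python
# Hypothetical placeholder rules for illustration; not IMO Health's actual rule set.
ILLUSTRATIVE_RULES = [
    "Code to the highest level of specificity supported by the documentation.",
    "Do not assign a code for a condition the documentation does not state.",
    "Return one primary code first, then any secondary codes.",
]


def build_coding_prompt(description, rules):
    """Combine a numbered rule list and a clinical description into one prompt."""
    numbered = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        "You are assisting a medical coder with ICD-10-CM.\n"
        "Follow every rule below.\n\n"
        f"Rules:\n{numbered}\n\n"
        f"Clinical description: {description}\n"
        "Answer with the code(s) only, one per line."
    )


prompt = build_coding_prompt("type 2 diabetes without complications", ILLUSTRATIVE_RULES)
print(prompt)
```

Keeping the rules in a list rather than in the prompt string makes it easier to add, remove, or reorder them and measure the effect of each change.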

Retrieval-augmented generation (RAG)

RAG involves having the LLM reference relevant medical coding information retrieved from IMO Health’s terminology and normalization application programming interfaces (APIs).

By leveraging retrieved codes from IMO Health’s terminologies, we minimize the occurrence of fake or inaccurate codes and reduce hallucinations, or outputs that are nonsensical or entirely fabricated (Figure 2).

This approach simplifies the task from generating codes to selecting from pre-existing candidates, thus reducing the reliance on prior knowledge from the base LLMs. As a result, it becomes possible to use smaller LLMs to build lower-cost and faster-running solutions.
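
The shift from generating codes to selecting among candidates can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory dictionary stands in for a terminology service, the word-overlap retriever is a toy, and the model call itself is left out. None of the names below are IMO Health APIs.

```python
import re

# Hypothetical in-memory stand-in for a terminology lookup service.
TERMINOLOGY = {
    "E11.9": "Type 2 diabetes mellitus without complications",
    "E11.65": "Type 2 diabetes mellitus with hyperglycemia",
    "I10": "Essential (primary) hypertension",
    "J18.9": "Pneumonia, unspecified organism",
}


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_candidates(query, k=3):
    """Rank codes by word overlap with the query (a toy retriever)."""
    q = tokens(query)
    scored = [(len(q & tokens(desc)), code) for code, desc in TERMINOLOGY.items()]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [code for _, code in ranked[:k]]


def build_selection_prompt(query, candidates):
    """Ask the model to choose among retrieved codes instead of inventing one."""
    options = "\n".join(f"- {c}: {TERMINOLOGY[c]}" for c in candidates)
    return (
        f"Pick the single best ICD-10-CM code for: {query}\n"
        f"Candidates:\n{options}\n"
        "Answer with one code from the list."
    )


def validate_choice(model_answer, candidates):
    """Reject any answer that is not one of the retrieved candidates."""
    answer = model_answer.strip().upper()
    return answer if answer in candidates else None


candidates = retrieve_candidates("type 2 diabetes with hyperglycemia")
print(candidates)  # prints ['E11.65', 'E11.9']
print(build_selection_prompt("type 2 diabetes with hyperglycemia", candidates))
```

The validation step is what guards against fabricated codes: an answer outside the candidate list is discarded rather than passed downstream.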

Agents and tools

An LLM agent is a specialized AI system designed to perform specific tasks or functions within a larger AI ecosystem. These agents are often built on top of foundational LLMs and are trained to handle particular domains or use cases.

At IMO Health, we formalize mapping and editorial guidelines into prompts to build a chain of thought for LLMs when performing medical coding tasks. We also instruct the LLM to call upon IMO Health tools and APIs, including NLP pipelines² and our normalization solution, IMO Precision Normalize³, when applicable. By using agents, the output becomes explainable, trustworthy, and acceptable to human medical coders, instead of a black box (Figure 3).
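
One way to picture the tool-calling pattern is a small dispatch loop that runs each requested tool and records a trace a human coder can review. In the sketch below the tools are hypothetical local functions with hard-coded lookups, and the sequence of tool requests is supplied by hand where a real agent would have the LLM choose it. It does not call any IMO Health API.

```python
def normalize_term(term):
    """Hypothetical tool: map free text to a canonical term."""
    synonyms = {
        "high blood pressure": "essential hypertension",
        "htn": "essential hypertension",
    }
    cleaned = term.strip().lower()
    return synonyms.get(cleaned, cleaned)


def lookup_code(term):
    """Hypothetical tool: map a canonical term to an ICD-10-CM code."""
    codes = {"essential hypertension": "I10"}
    return codes.get(term)


TOOLS = {"normalize_term": normalize_term, "lookup_code": lookup_code}


def run_agent(user_input, requested_steps):
    """Run tool calls in order, feed each result to the next, and log every step."""
    trace, value = [], user_input
    for tool_name in requested_steps:
        if tool_name not in TOOLS:
            raise KeyError(f"unknown tool: {tool_name}")
        result = TOOLS[tool_name](value)
        trace.append({"tool": tool_name, "input": value, "output": result})
        value = result
    return value, trace


# Assumption: a real agent would let the LLM decide this sequence of tool calls.
code, trace = run_agent("HTN", ["normalize_term", "lookup_code"])
for step in trace:
    print(step)
```

The recorded trace is the point of the design: each intermediate input and output is visible, so a reviewer can see how the final code was reached.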

Fine-tuning

Fine-tuning involves further training the base LLM on high-quality medical coding datasets to improve its understanding of the medical coding task. By exposing the LLM to a large volume of relevant data, including IMO Health terminology synonyms, mapping relationships, and historical product logs, we can fine-tune it to better capture the nuances and intricacies of medical coding.
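
Preparing such data usually starts with turning term-to-code pairs into training records. The sketch below writes synonym pairs as JSON Lines; the `prompt` and `completion` field names and the sample pairs are assumptions, since the expected schema varies by training framework.

```python
import json

# Hypothetical synonym-to-code pairs; a real dataset would come from curated terminology.
SYNONYM_TO_CODE = [
    ("high blood pressure", "I10"),
    ("essential hypertension", "I10"),
    ("type 2 diabetes without complications", "E11.9"),
]


def to_training_records(pairs):
    """Turn (term, code) pairs into prompt/completion records."""
    return [
        {"prompt": f"Assign the ICD-10-CM code for: {term}", "completion": code}
        for term, code in pairs
    ]


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = to_training_records(SYNONYM_TO_CODE)
print(to_jsonl(records))
```

Including several synonyms for the same code, as in the first two pairs, is what exposes the model to the varied language clinicians actually use.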

The result: better performance on medical coding

Improved mapping accuracy

In a recent test on a typical dataset, the top performing out-of-the-box LLM only achieved an accuracy of 55% on ICD-10-CM primary and secondary code prediction. However, when we evaluated the medical coding solution powered by IMO Health’s knowledge layer, which combines LLMs with our proprietary resources and techniques, the performance reached 92% accuracy on the same dataset. This demonstrates the effectiveness of IMO Health’s approach in delivering highly accurate medical coding results (Figure 4).

Enriched results with secondary codes and HCC integration

Using IMO Health as a bridge to medical codes offers several benefits beyond improved accuracy. IMO Health’s terminology not only returns the preferred primary code but also provides preferred secondary codes, cross-referenced to multiple terminologies or code sets. This captures the detailed semantic differences between medical codes, providing a more comprehensive and precise coding output (Figure 5).

IMO Health’s terminology also includes Hierarchical Condition Category (HCC) scores, which are crucial for risk adjustment and reimbursement purposes in value-based care models. By integrating HCC scores directly into the coding process, IMO Health streamlines the workflow and eliminates the need for manual HCC assignment.

Explainable and trustworthy code selections

One of the key advantages of IMO Health’s knowledge layer is its ability to explain why certain medical codes are chosen and why they are more suitable compared to other similar codes. By prompting the LLM with our mapping and terminology resources, the generated explanations are more clinically logical, with fewer hallucinations and false statements. This makes the results more acceptable and trustworthy to medical coders when they review the output.

The explainable nature of code selections is particularly valuable when there is ambiguity or multiple potential codes for a given medical condition or procedure. By providing clear and clinically sound reasoning for the chosen codes, the system instills confidence in medical coders and facilitates a more efficient review process.

This transparency also enables coders to quickly identify and address any potential discrepancies or uncommon cases, further improving the overall accuracy and reliability of the coding output (Figure 6).

Cost-efficiency optimization

Thanks to IMO Health’s comprehensive terminology, synonyms, and mappings, many input diagnosis terms are already covered directly without the need for LLMs. Only the uncovered terms or terms with low confidence scores are sent to LLMs for further analysis.

In an early study, only 25.1% of input diagnosis terms required LLMs, while the overall accuracy on the entire dataset increased from 82.9% to 90.0%, a gain of 7.1 percentage points.

This selective use of LLMs offers significant cost-efficiency benefits. By leveraging IMO Health’s knowledge layer as the foundation and using LLMs judiciously for more complex or ambiguous cases, the system optimizes computational resources and reduces the overall cost of the medical coding process. This cost-efficiency, combined with the high accuracy and explainability of the system, makes IMO Health an attractive solution for healthcare organizations looking to automate their medical coding workflows.
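
The selective routing described in this section can be sketched as a lookup with a fallback: a term goes to the LLM only when the terminology has no entry for it or the entry's confidence is too low. The 0.9 threshold, the confidence values, the sample terms, and the stub standing in for the LLM call are all illustrative assumptions.

```python
def route_term(term, terminology, llm_fallback, min_confidence=0.9):
    """Return (code, source), using the LLM only for uncovered or low-confidence terms."""
    entry = terminology.get(term.strip().lower())
    if entry is not None and entry["confidence"] >= min_confidence:
        return entry["code"], "terminology"
    return llm_fallback(term), "llm"


def stub_llm(term):
    """Placeholder for a real model call; returns a marker instead of a code."""
    return "NEEDS_LLM_CODING"


# Hypothetical terminology entries with made-up confidence scores.
terminology = {
    "high blood pressure": {"code": "I10", "confidence": 0.99},
    "type 2 diabetes without complications": {"code": "E11.9", "confidence": 0.97},
    "htn": {"code": "I10", "confidence": 0.60},
}

terms = [
    "high blood pressure",
    "type 2 diabetes without complications",
    "htn",             # covered, but below the confidence threshold
    "sugar problems",  # not covered at all
]

results = [route_term(t, terminology, stub_llm) for t in terms]
llm_share = sum(source == "llm" for _, source in results) / len(results)
print(f"share of terms sent to the LLM: {llm_share:.0%}")  # prints 50%

# The accuracy gain quoted above, in percentage points.
print(round(90.0 - 82.9, 1))  # prints 7.1
```

Lowering the threshold sends fewer terms to the LLM and cuts cost, at the risk of accepting weaker terminology matches, so the value is worth tuning against a labeled dataset.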

Custom implementation to meet your needs

IMO Health can deploy AI solutions to address the unique cost constraints and risk tolerance of any organization. For organizations that need results quickly, a supervised machine learning implementation may be appropriate. This is a lower-cost, secure option with a short implementation runway. For organizations seeking a more transformative, strategic innovation, our advanced generative AI research and development solution will help meet and exceed data accuracy goals (Figure 7).

Conclusion

IMO Health’s innovations represent a notable breakthrough in medical coding automation. We deliver remarkable accuracy, transparency, and efficiency by combining LLMs with our proprietary knowledge layer, built on rich terminology, advanced AI techniques, and deep clinical expertise.

This knowledge layer empowers LLMs to generate more precise, explainable, and trustworthy coding output, even for complex or ambiguous cases. It includes our extensive terminology foundation, curated mapping logic, clinical editorial guidelines, and proven AI methods such as RAG, prompt engineering, and agent orchestration.

Our approach enables more accurate reimbursement, reduces manual coding work, and improves downstream analytics for population health and risk adjustment. By putting data quality first, IMO Health helps healthcare organizations unlock more value from their data – leading to faster insights, fewer errors, and a stronger return on investment.

Click here to learn more about IMO Health’s knowledge layer and here to learn how our AI-powered solutions simplify clinical workflows and boost healthcare data quality.

¹Soroush, A., Glicksberg, B. S., Zimlichman, E., Barash, Y., Freeman, R., Charney, A. W., … & Klang, E. (2024). Large Language Models Are Poor Medical Coders—Benchmarking of Medical Code Querying. NEJM AI, AIdbp2300040.

²IMO Entity Extraction API: https://developer.imohealth.com/api-catalog/entity-extraction

³IMO Precision Normalize API: https://developer.imohealth.com/api-catalog/imor-precision-normalize-api

⁴IMO Studio: https://studio.imohealth.com/

RxNorm® is a registered trademark of the National Library of Medicine.

SNOMED and SNOMED CT® are registered trademarks of SNOMED International.
