I went to the movie theater a couple of weeks ago to watch a popular and much-talked-about Indian movie with my family. The movie’s title was Ponniyin Selvan. You may be able to pronounce that phrase correctly since it’s written with letters from the English alphabet, but clearly, it has no meaning for someone who speaks only English. Now, if you are from the south of India and speak a regional language called Tamil (in addition to English), then that term would make immediate sense even though it’s written in English.
In a way, this English title for a Tamil movie encapsulates the fundamental problem with applying natural language processing (NLP) techniques to free-text clinical narratives. These narratives can take the form of various documents, including pathology and radiology reports; history and physicals; discharge summaries; and progress notes. On the surface, they all seem to be written in English. But they are not. Often, the sentences in these documents do not conform to the rules of English grammar and are filled with abbreviations, acronyms, and jargon that is unintelligible to the untrained, English-speaking eye.
Each of these clinical documents represents a unique language that can only be understood by a “tribe” that has had formal training in a particular medical discipline. Replace the trained human eye with an NLP algorithm used to read and derive meaning from these narratives, and you are faced with the same problem. An NLP algorithm that has only been trained on well-formed English documents will perform poorly when it encounters real world free-text clinical narratives. Likewise, an NLP algorithm trained on a steady diet of pathology reports will probably perform poorly when it encounters a radiology report.
Training algorithms on robust clinical terminology
“Health care has hundreds of languages” says David Talby, the CTO of John Snow Labs, each with its own language model. If that is true, and we at IMO believe that it is, then there is no “one-size-fits-all” clinical NLP algorithm. We need hundreds of algorithms that are trained specifically to read and understand these multi-lingual clinical narratives. Furthermore, these algorithms need the benefit of clinical terminology that captures the multifarious ways of describing clinical concepts as part of their training regimen.
IMO recognized the multi-lingual and fragmented nature of clinical terminology more than 25 years ago. The company has painstakingly worked to curate millions of commonly used variations of terms for problems, diagnoses, procedures, medications, labs, vaccinations, and much more. These terms – when combined with sophisticated NLP models and arranged in carefully thought through pipelines – have the potential to mitigate healthcare’s “Tower of Babel” problem.