Providers deal with an enormous amount of data each day when caring for their patients, yet they often lack access to the socioeconomic, environmental, and behavioral data that would help them create a more holistic picture at the point of care. Access to clean, accurate, and actionable data is also key to providing high-quality care. However, studies indicate that only 20% of important patient information lives in structured formats, leaving the remaining 80% hidden away in unstructured narratives found in clinical notes, imaging results, and other text-based documents.
Extracting concepts from narrative text
Natural language processing (NLP) is key to producing a representative picture of a given patient that includes at least some of the narrative-based data mentioned above. NLP is a computerized technique for extracting coded entities from uncoded narrative text. There are several different engines for doing so – some proprietary and some open source. Examples of open source NLP engines include cTAKES, Spark NLP, Flair, AllenNLP, spaCy, MedCy, and CoreNLP.
NLP has been used to extract concepts that support reimbursement, population health, and public health initiatives – as well as life sciences research and development. Traditionally, the quality of NLP has been assessed by comparing its output to a gold standard, typically produced by subject matter experts who manually extract concepts from a corpus of text. The performance of NLP engines is judged on sensitivity and specificity, or more often on recall (sensitivity) and precision, which are sometimes combined into a single F score that typically ranges from 70-85%.
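As a rough illustration of the evaluation described above, the sketch below scores a set of extracted concepts against an expert-annotated gold standard. It assumes exact-match comparison of (document, concept) pairs and uses the balanced F1 form of the F score, which is the harmonic mean of precision and recall (sensitivity). The function name and the sample annotations are hypothetical and exist only for this example. Specificity is left out because it depends on true negatives, which are hard to define for open-ended text extraction.

```python
def score_extraction(gold, predicted):
    """Compare predicted concepts to a gold standard using exact matching.

    Returns (precision, recall, f1). Both inputs are iterables of hashable
    items, here (document_id, concept) pairs.
    """
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)    # concepts the engine found correctly
    false_pos = len(predicted - gold)   # concepts the engine invented
    false_neg = len(gold - predicted)   # concepts the engine missed

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations, invented for illustration only.
gold_standard = [
    ("note1", "hypertension"),
    ("note1", "type 2 diabetes mellitus"),
    ("note2", "asthma"),
    ("note2", "chest pain"),
]
engine_output = [
    ("note1", "hypertension"),
    ("note2", "asthma"),
    ("note2", "chest pain"),
    ("note2", "pneumonia"),
]

precision, recall, f1 = score_extraction(gold_standard, engine_output)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Exact matching is the strictest choice. Evaluations that give credit for partial or overlapping spans would need a different matching rule.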
NLP, clinical terminology, and standardized codes
Regardless of the engine used, the dictionaries from which the entities are determined are critically important to the value of the NLP. Extracting entities or concepts that are not at the right level of specificity reduces the value of the tool. And even when the extracted concepts are faithful to the original text, they will not be actionable unless they can be mapped to usable standardized codes.
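To make the role of the dictionary concrete, here is a minimal sketch of dictionary-based extraction. It is not how any particular engine or IMO product works. The dictionary, the function name, and the sample note are all hypothetical, and the codes are included only for illustration and should be checked against the current code set releases. The "chest tightness" entry shows a term that is recognized faithfully but has no mapping, so it comes back flagged as not actionable.

```python
import re

# Hypothetical toy dictionary: surface term -> standardized codes.
# Codes are illustrative; verify against current ICD-10-CM / SNOMED CT releases.
TERM_DICTIONARY = {
    "high blood pressure": {"ICD-10-CM": "I10", "SNOMED CT": "38341003"},
    "hypertension": {"ICD-10-CM": "I10", "SNOMED CT": "38341003"},
    "type 2 diabetes": {"ICD-10-CM": "E11.9", "SNOMED CT": "44054006"},
    "chest tightness": {},  # recognized in text, but no code mapping available
}


def extract_concepts(text):
    """Find dictionary terms in free text and attach any mapped codes."""
    lowered = text.lower()
    found = []
    for term, codes in TERM_DICTIONARY.items():
        pattern = r"\b" + re.escape(term) + r"\b"  # whole-phrase matches only
        for match in re.finditer(pattern, lowered):
            found.append({
                "term": term,
                "start": match.start(),
                "codes": codes,
                "actionable": bool(codes),  # unmapped concepts are not actionable
            })
    return sorted(found, key=lambda concept: concept["start"])


note = "Pt reports chest tightness. History of high blood pressure and type 2 diabetes."
concepts = extract_concepts(note)
for concept in concepts:
    print(concept["term"], concept["codes"], concept["actionable"])
```

A dictionary that covers more of the ways clinicians actually phrase a concept, and that maps each entry to the needed code sets, increases both the number of concepts found and the share of them that downstream systems can use.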
NLP is a natural evolution of what we do here at IMO. Clinical interface terminology (CIT) more closely approximates natural language than standardized reference terminologies do. This means that NLP using CIT will be able to capture highly specific and comprehensive clinical concepts from the narrative. The result would be a much richer set of data and metadata than one could get from mining structured data with traditional techniques.
Since CIT – like the one that powers IMO Core – has curated maps to multiple standard code sets, the extracted concepts can provide significant value to downstream users. In the months and years ahead, we look forward to leveraging our foundational CIT capabilities that are employed at the beginning of the data capture lifecycle to create solutions that can help stakeholders across the healthcare ecosystem – including analytics providers, payers, public health agencies, and life sciences companies.