What’s in a variant? Part II – Making sense of data through harmonization

Part one of our blog series about SARS-CoV-2 variants explained the value of combining viral sequencing data with epidemiological and clinical information in the fight against the pandemic. But challenges remain. To be usable, data must be understandable to all users, and making it so is not as straightforward as it seems.

Harmonization across different terminologies can preserve data specificity and meaning, while also making the information universally accessible. The ongoing COVID-19 pandemic has highlighted how important this harmonized data is – especially when combined with clinical information – to drive research, inform public health, and guide patient care.

COVID-19 and interoperability in healthcare

Viral sequencing data is a critical type of data to collect as the pandemic continues. Its use cases span multiple arenas; however, it can only achieve so much in isolation. For instance, virologists can identify key mutations in a variant’s genome that suggest it may be more transmissible. Yet whether it actually is more contagious remains unknown until epidemiological data comes into play.

Thus, to gain a greater understanding of the impact of SARS-CoV-2 variants, information must be shared between domains and be usable across specialties. This is a challenge, though, because different users rely on different terminologies to capture SARS-CoV-2 data, and those terminologies vary in specificity and meaning.

Variants and terminology

Scientists use established nomenclatures – such as GISAID, Pango, and Nextstrain – when documenting and reporting new SARS-CoV-2 sequences. These conventions are highly specific, which allows them to convey descriptive and taxonomic information. However, they can be confusing and overly complex to those unfamiliar with genomic sequencing.

Additionally, the World Health Organization (WHO) has its own labelling system that groups sequences together based on shared characteristics – such as lineage – under simple umbrella terms represented by Greek letters. Notably, this system is mostly used to refer to those variants which are considered most harmful, like the Delta variant. Its simplicity makes it ideal for communicating information in a way that can be easily understood by policymakers and clinicians, but it is too broad for clinical research. Thus, what is needed is a way to harmonize across scientific and non-scientific terminologies.  

Standardization across terminologies

For data from different sources to work together, there needs to be a comprehensive knowledge model that can represent sequencing data at all levels of specificity. This will allow sequencing data to be captured at the greatest level of precision while also allowing for access at higher, more general levels.

Scientists can capture sequences at a high level of specificity – such as Pango lineage – which can be cross-mapped to more general concepts – such as WHO label – to serve public health or clinical purposes. Note that there isn’t always a one-to-one match across terminologies. For example, the Delta variant actually corresponds to 13 different Pango lineages. This information can be represented hierarchically, which makes it possible to “roll up” from more specific concepts to broader ones.
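To make the idea of rolling up concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any production terminology: the PARENT table, the WHO_LABELS set, and the function names roll_up and count_by_who_label are hypothetical, and the table holds only a small subset of lineages rather than the full set that maps to each WHO label.

```python
from collections import Counter

# Hypothetical, partial hierarchy: each concept points to its broader parent.
# Only a few well-known lineages are listed; a real model would hold many more.
PARENT = {
    "AY.4": "B.1.617.2",       # Pango sublineage -> parent Pango lineage
    "AY.12": "B.1.617.2",
    "B.1.617.2": "Delta",      # Pango lineage -> WHO label
    "B.1.1.7": "Alpha",
    "Delta": "SARS-CoV-2 variant",
    "Alpha": "SARS-CoV-2 variant",
}

# The level of generality we want to report at (assumed for this example).
WHO_LABELS = {"Alpha", "Delta"}


def roll_up(concept, targets=WHO_LABELS):
    """Walk up the hierarchy until a target concept is reached.

    Returns the matching target, or None if the concept has no
    ancestor among the targets.
    """
    seen = set()  # guards against accidental cycles in the table
    while concept is not None and concept not in seen:
        if concept in targets:
            return concept
        seen.add(concept)
        concept = PARENT.get(concept)
    return None


def count_by_who_label(lineages):
    """Aggregate specific lineages into counts per WHO label."""
    return Counter(roll_up(lineage) or "Unlabeled" for lineage in lineages)


if __name__ == "__main__":
    samples = ["AY.4", "B.1.617.2", "B.1.1.7", "B.1.2"]
    for label, n in count_by_who_label(samples).items():
        print(label, n)
```

The data stays stored at its most specific level (the Pango lineage), and the broader WHO label is derived on demand. Several lineages can map to one label without losing the original detail, and a lineage with no WHO label simply falls into an "Unlabeled" group instead of being forced into a category.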


Before SARS-CoV-2 sequencing data can be combined with epidemiological and clinical information, it must be transformed into a format that can be understood by scientists and non-scientists alike. Harmonizing across disparate terminologies with a hierarchical ontology will allow data to be represented in a comprehensive model. This will enable information systems to manage sequencing data at varying levels of specificity without becoming overly complex or unintelligible.

From there, this data can be combined with epidemiological and clinical information. Yet this integration is not without its own challenges. In the final two blogs of this series, we address the challenges of bringing all of this data together, and propose solutions, so that it can be used to further research, guide policy, and inform treatment decisions.

For more on combatting COVID-19 variants, check out part one of our series here.
