NLP for Clinical Notes - Tools and Techniques

Clinicians add clinical notes to the electronic medical record (EMR) at each visit. Most of these notes are unstructured free text, and some are created by dictation software or by medical scribes. Family physicians, and family-practice EMRs such as OSCAR EMR, rely on unstructured clinical notes, which makes them a good target for natural language processing (NLP) tools and techniques.

Because clinical notes are unstructured, they are difficult to analyze for statistical insights. The notes may also need further processing for billing and for generating problem charts, and their analysis is becoming increasingly important for quality assessment.

NLP can help automate the analysis of clinical notes. Below I list some open-source tools for this purpose, some of which I maintain.

Apache cTAKES for NLP

Apache cTAKES (clinical Text Analysis and Knowledge Extraction System) is one of the first open-source NLP systems for extracting clinical information from the unstructured text of electronic health records. Though relatively slow, it is still widely used. I have packaged it as a fast Quarkus application. Quarkus (Supersonic Subatomic Java) is designed primarily for Docker containers, and Quarkus-based containers are easy to deploy and scale on platforms such as Kubernetes.

spaCy and related tools for NLP

spaCy is an open-source Python library for NLP. It is widely used and features named entity recognition (NER), part-of-speech (POS) tagging, dependency parsing, and word vectors. However, spaCy is not designed for clinical workflows and may not be directly usable on clinical text. scispaCy provides spaCy pipelines and models for scientific and biomedical documents, trained on biomedical data. medaCy is a healthcare-specific NLP framework built on spaCy to support the fast prototyping, training, and application of medical NLP models. One advantage of medaCy is that it is fast and lightweight.

UMLS

The Unified Medical Language System (UMLS) is a set of files and software that brings together biomedical vocabularies for health information systems. UMLS provides a set of RESTful APIs for licensed users. I have created a JavaScript wrapper that makes the UMLS APIs easy to call from JavaScript programs; it is available from the npm package repository. See the update on UmlsBERT below.
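
As a rough illustration of what calling a RESTful terminology API involves, here is a minimal Python sketch that builds a concept search request. The base URL and the `string` and `apiKey` parameter names are my assumptions about the UMLS search endpoint and should be checked against the current UMLS API documentation; `build_search_url` and `search_umls` are hypothetical helper names, not part of any wrapper mentioned here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint and parameter names; verify against the UMLS API docs.
UMLS_SEARCH_URL = "https://uts-ws.nlm.nih.gov/rest/search/current"


def build_search_url(term, api_key):
    """Build the URL for a concept search (no network call is made here)."""
    query = urlencode({"string": term, "apiKey": api_key})
    return f"{UMLS_SEARCH_URL}?{query}"


def search_umls(term, api_key):
    """Call the search endpoint and return the decoded JSON response."""
    with urlopen(build_search_url(term, api_key), timeout=30) as response:
        return json.load(response)


# Building the URL needs no licence; search_umls() needs a real API key.
print(build_search_url("chest pain", "YOUR_API_KEY"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test without a UMLS licence.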

MedCAT

Medical Concept Annotation Tool (MedCAT) is a relatively new tool for extracting terms from free text in EMRs and linking them to vocabularies such as UMLS and SNOMED. The paper describing MedCAT is here. MedCAT models can be further refined by training on a domain-specific corpus of text. MedCAT is fast and very useful.
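
To make "extraction and linking" concrete, here is a minimal sketch of dictionary-based concept linking in plain Python. It is not MedCAT's API or algorithm; it only shows the basic idea of mapping surface forms in free text to concept identifiers. The vocabulary, the `DEMO-...` identifiers, and the `link_concepts` function are hypothetical placeholders, not real UMLS or SNOMED content.

```python
import re

# Hypothetical mini-vocabulary: surface form -> placeholder concept ID.
# Real tools load such mappings from vocabularies like UMLS or SNOMED.
VOCAB = {
    "hypertension": "DEMO-001",
    "high blood pressure": "DEMO-001",  # synonym linked to the same concept
    "type 2 diabetes": "DEMO-002",
    "chest pain": "DEMO-003",
}


def link_concepts(text, vocab=VOCAB):
    """Return (start, end, matched_text, concept_id) for each term found."""
    # Put longer terms first so multi-word terms win over shorter overlaps.
    terms = sorted(vocab, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), vocab[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


note = "Pt has high blood pressure and Type 2 diabetes. Denies chest pain."
for start, end, span, concept_id in link_concepts(note):
    print(f"{start:>3}-{end:<3} {span!r} -> {concept_id}")
```

Note that this sketch links "chest pain" even though the note says the patient denies it. Handling negation and ambiguous terms is part of what dedicated clinical NLP tools add on top of simple lookup.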

Word Embeddings for NLP

A word embedding represents each word as a numeric vector, learned so that words with similar meanings have similar vectors. It is one of the most popular deep learning methods for NLP problems. Word2Vec is one method for constructing embeddings, and a word2vec model trained on the entire Wikipedia corpus is available for use. This paper describes the creation of a clinical concept embedding based on a large corpus of clinical documents. I have created a gensim wrapper for this model that can be used for concept similarity search in Python.
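
Concept similarity search over embeddings usually comes down to comparing vectors with cosine similarity. The sketch below shows that idea in plain Python. The three-dimensional vectors are made up for illustration (real embeddings are learned from a corpus and have many more dimensions), and `most_similar` here is a hypothetical helper, not the gensim wrapper mentioned above.

```python
import math

# Made-up vectors for illustration only; not taken from any trained model.
TOY_EMBEDDINGS = {
    "hypertension": [0.9, 0.1, 0.3],
    "high_blood_pressure": [0.8, 0.2, 0.3],
    "fracture": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(term, embeddings=TOY_EMBEDDINGS):
    """Rank all other terms by cosine similarity to `term`, highest first."""
    query = embeddings[term]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != term
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


for other, score in most_similar("hypertension"):
    print(f"{other}: {score:.3f}")
```

With these toy vectors, "high_blood_pressure" ranks above "fracture" for the query "hypertension", which is the behaviour a well-trained clinical embedding should show for related concepts.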

BERT and related

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. Here is the highly cited official paper. BERT has replaced static word embeddings as the most successful NLP technique in most domains, including healthcare. Two of the refined BERT models used in healthcare are BioBERT and ClinicalBERT.

It is vital to deploy these models in a scalable and maintainable manner to be available for use within EMR systems. We are working on such a framework called ‘Serverless on FHIR’. Give me a shout if you want to know more.

Update (2022): Tools for building multi-modal models.

Template for multi-modal machine learning in healthcare using Kedro. Combine reports, tabular data and images using various fusion methods.
https://github.com/dermatologist/kedro-multimodal
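
For readers new to the term, "fusion" refers to how information from different modalities is combined. Two common variants are early fusion (concatenating per-modality feature vectors before a single model) and late fusion (combining the outputs of separate per-modality models). The sketch below illustrates both in plain Python; it is not the template's code, and the function names and feature values are hypothetical.

```python
def early_fusion(*feature_vectors):
    """Concatenate per-modality feature vectors into one input vector."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def late_fusion(*probabilities):
    """Average the probabilities predicted by separate per-modality models."""
    return sum(probabilities) / len(probabilities)


# Hypothetical features for one patient, one vector per modality.
text_features = [0.2, 0.7]        # e.g. from a report encoder
tabular_features = [63.0, 1.0]    # e.g. age and a binary flag
image_features = [0.5, 0.1, 0.4]  # e.g. from an image encoder

# Early fusion: one combined vector that a single model would consume.
print(early_fusion(text_features, tabular_features, image_features))

# Late fusion: each modality has its own model; combine their predictions.
print(late_fusion(0.8, 0.6, 0.7))
```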

Update (May 30, 2021): The ckblib library is now available under the MPL 2.0 license (see below). Feel free to use it in your research.

ckblib by dermatologist

Tools to create a clinical knowledge graph from biomedical literature. Includes wrappers for NCBI Eutils, cTakes annotator and Neo4J

Update (Dec 2020):

Researchers from the University of Waterloo have introduced UmlsBERT. Current clinical embeddings such as BioBERT, described above, are generic models trained further on clinical corpora using transfer learning. Most biomedical ontologies, such as UMLS, define hierarchies over their concepts. UmlsBERT uses this hierarchical grouping information at the pre-training stage to augment the clinical concept embeddings. Table 3 in the paper compares the results with other embeddings, and they are quite impressive. The GitHub repo is at https://github.com/gmichalo/UmlsBERT
Way to go, George Michalopoulos and team!

Update (Mar 2021):

Create a chatbot to talk to a FHIR endpoint using conversational AI!

Update (May 2022):

ICDBigBird: A Contextual Embedding Model for ICD Code Classification: https://arxiv.org/pdf/2204.10408.pdf

Bell Eapen

15 thoughts on “NLP for Clinical Notes – Tools and Techniques”

Bell says:

April 26, 2020 at 9:10 pm

https://github.com/OHNLP/MedTagger

Bell says:

August 12, 2020 at 10:05 pm

https://github.com/medspacy

Bell says:

September 6, 2020 at 11:55 am

https://github.com/ML4LHS/clinspacy

Bell says:

September 23, 2020 at 6:55 pm

https://github.com/ncbi-nlp/bluebert
***** New Dec 5th, 2019: NCBI_BERT is renamed to BlueBERT *****

Bell says:

October 6, 2020 at 3:19 pm

https://github.com/jgc128/mednli

Bell says:

October 6, 2020 at 3:20 pm

https://github.com/sparrow-platform/disease-diagnostics-engine
This is an API to serve disease diagnostics based on UMLS databases. The disease diagnostics engine is a part of Sparrow AI.

Bell says:

December 18, 2020 at 11:13 am

UMLSBert: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus

https://github.com/gmichalo/UmlsBERT
https://arxiv.org/pdf/2010.10391.pdf

Bell says:

December 19, 2020 at 6:22 pm

BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III).

https://github.com/ncbi-nlp/bluebert
https://arxiv.org/abs/1906.05474

Bell says:

January 6, 2021 at 10:46 am

Snow Owl – production-ready, scalable terminology server (SNOMED CT, ICD-10, LOINC, dm+d, ATC and others): https://github.com/b2ihealthcare/snow-owl

Sobia says:

July 6, 2021 at 12:20 am

Hi Bell,

Thanks for the useful information. I am currently working on clinical Notes during my Ph.D. and not exactly using any framework but used Spacy, NegSpacy, and NegEx. So, do you have any idea about NegEx? It’s good to read this article for an early researcher like me.

Bell Eapen says:

October 27, 2021 at 9:44 pm

Thanks, Sobia. I have not used NegEx yet.

Bell says:

October 27, 2021 at 9:41 pm

repository for Publicly Available Clinical BERT Embeddings: https://github.com/EmilyAlsentzer/clinicalBERT

Bell Eapen MD, PhD.