Bell Eapen

Physician | HealthIT Developer | Digital Health Consultant

NLP for Clinical Notes – Tools and Techniques

Clinicians add clinical notes to the EMR on each visit. The clinical notes are unstructured in most cases and can benefit from NLP (natural language processing) tools and techniques. Some are created by dictation software or by medical scribes. Family physicians and family practice-centric EMRs like OSCAR EMR rely on unstructured clinical notes.

natural language processing
NLP for Clinical Notes

Clinical notes, because of the unstructured nature is difficult to analyze for statistical insights. Besides, the notes may require further processing for billing and for generating problem charts. The analysis is becoming increasingly important for quality assessments as well.

NLP can be useful in automated analysis of clinical notes. Here I have listed some of the open-source tools (some maintained by me) for such automated analysis of clinical notes.

Apache cTakes for NLP

Apache cTakes (clinical Text Analysis and Knowledge Extraction System) is one of the first open-source NLP systems that extract clinical information from electronic health record unstructured text. Though it is relatively slow, it is still widely used. I have packaged it as a Quarkus application, that is fast. Quarkus (Supersonic Subatomic Java) is designed primarily for docker containers and the quarkus based containers are easy to be deployed and scaled using platforms such as Kubernetes.

SpaCy and related tools for NLP

SpaCy is an open-source python library for NLP. It features NER, POS tagging, dependency parsing, word vectors and is widely used. But spacy is not designed for clinical workflows and may not be directly usable. Scispacy is SpaCy pipeline and models for scientific/biomedical documents trained on biomedical data. MedaCy is a healthcare-specific NLP framework built over spaCy to support the fast prototyping, training, and application of medical NLP models. One of the advantages of Medacy is that it is fast and lightweight.


Unified Medical Language System (UMLS), is a set of files and software that brings together biomedical vocabularies for health information systems. UMLS provides a set of RESTful APIs for licensed users. I have created a JavaScript wrapper for the UMLS APIs that are easy to be called from JavaScript programs. It is available from the npm package repository. See the update on UmlsBERT below.


Medical  Concept Annotation Tool (MedCAT) is a relatively new tool for extraction and linking of terms from vocabularies such as UMLS and SNOMED for free text in EMRs. The paper describing MedCAT is here. MedCAT models can be further refined by training on a domain-specific corpus of text. MedCAT is fast and very useful.

Word Embeddings for NLP

A word embedding is a weighted model for text where words that have the same meaning have a similar weight. It is one of the most popular methods of deep learning for NLP problems. Word2Vec is a method to construct embeddings and the word2vec model based on the entire Wikipedia corpus is available for use. This paper describes the creation of a clinical concept embedding based on a large corpus of clinical documents. I have created a gensim wrapper for this model that can be used for concept similarity search in python.

BERT and related

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP pre-training developed by Google. Here is the highly cited official paper. BERT has replaced embeddings as the most successful NLP technique in most domains including healthcare. Some of the refined BERT models used in healthcare are BioBERT and ClinicalBERT.

It is vital to deploy these models in a scalable and maintainable manner to be available for use within EMR systems. We are working on such a framework called ‘Serverless on FHIR’. Give me a shout if you want to know more.

Update (Dec 2020):

Researchers from the University of Waterloo have introduced the novel concept of UmlsBERT. Current clinical embedding such as BioBERT described above are generic models, trained further on clinical corpora applying the concept of transfer learning. Most biomedical ontologies such as UMLS define the hierarchies of concepts defined in them. UmlsBERT makes use of these hierarchical group information at the pre-training stage for augmenting the clinical concept embeddings. Table 3 in the paper compares the results with other embeddings, and it is quite impressive. The GitHub repo is here
Way to go George Michalopoulos and team!

Kickstart NLP with UMLS

The UMLS, or Unified Medical Language System, is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems.

Natural Language Processing (NLP) on the vast amount of data captured by electronic medical records (EMR) is gaining popularity. The recent advances in machine learning (ML) algorithms and the democratization of high-performance computing (HPC) have reduced the technical challenges in NLP. However, the real challenge is not the technology or the infrastructure, but the lack of interoperability — in this case, the inconsistent use of terminology systems.

natural language processing

NLP tasks start with recognizing medical terms in the corpus of text and converting it into a standard terminology space such as SNOMED and ICD. This requires a terminology mapping service that can do this mapping in an easy and consistent manner. The Unified Medical Language System (UMLS) terminology server is the most popular for integrating and distributing key terminology, classification and coding standards. The consistent use of  UMLS resources leads to effective and interoperable biomedical information systems and services, including EMRs.

To make things easier, UMLS provides both REST-based and SOAP-based services that can be integrated into software applications. A high-level library that encapsulated these services, making the REST calls easy to the user is required for the efficient use of these resources.  Umlsjs is one such high-level library for the UMLS REST web services for javascript. It is free, open-source and available on NPM, making it easy to integrate into any javascript (for browsers) or any nodejs applications.

The umlsjs package is available on GitHub and the NPM. It is still work in progress and any coding/documentation contributions are welcome. Please read the file on the repository for instructions. If you use it and find any issues, please report it on GitHub.

Natural language processing (NLP) tools for health analytics

Natural language processing (NLP) is the process of using computer algorithms to identify key elements in language and extract meaning from unstructured spoken or written text. NLP combines artificial intelligence, computational linguistics, and other machine learning disciplines.

natural language processing

In the healthcare industry, NLP has many applications such as interpreting clinical documents in an electronic health record. Natural language processing is important in clinical decision support systems by extracting meaningful information from free-text query interfaces. It may reduce transcription costs by allowing providers to dictate their notes, or generate tailored educational materials for patients ready for discharge. At a high-level NLP includes processes such as structure extraction, tokenization, tagging, part of speech identification and lemmatization.

“cTAKES is a natural language processing system for extraction of information from electronic medical record clinical free-text. Originally developed at the Mayo Clinic, it has expanded to being used by various institutions internationally.”

cTAKES is relatively difficult to install and use, especially if the service needs to be shared by several systems. I have integrated cTakes into an easy to use spring boot application that provides REST web services for clinical document annotation. The repository is here.

  • Clone URL

You need a UMLS username and password for deploying the application. RysannMD is an efficient and fast system for annotating clinical documents developed at Ryerson University. Some of my other experiments with NLP are available here.

Are you working on any NLP projects in medicine?