Bell Eapen

Physician | HealthIT Developer | Digital Health Consultant

Kedro for multimodal machine learning in healthcare 

Healthcare data is heterogeneous: it includes reports, tabular data, and images. Combining these modalities into a single model is challenging for several reasons. The data types differ in structure, format, and scale, which makes them hard to integrate. Some modalities may be missing or incomplete for a given patient, which makes it difficult to train a model effectively. Different modalities also need different pre-processing and feature extraction techniques, which further complicates integration. Large-scale, annotated datasets that cover multiple modalities are scarce as well. Despite these challenges, advances in deep learning, multi-task learning and transfer learning are making it possible to develop models that combine multiple modalities of data and achieve reliable performance.


Kedro is an open-source Python framework that helps data scientists and engineers organize their code, increase productivity and collaboration, and make it easier to deploy their models to production. It works with popular libraries such as Pandas, TensorFlow and PySpark, and follows software engineering best practices such as modularity and code reusability. Kedro supplies a standardized structure for organizing code, handling data and configuration, and running experiments. It also includes built-in support for version control, logging, and testing, which makes it easier to implement reproducible and maintainable pipelines. Additionally, Kedro pipelines can be deployed on cloud platforms like AWS, GCP or Azure. This makes it a powerful tool for creating robust and scalable data science and data engineering pipelines.

I have built a few Kedro packages that make multimodal machine learning in healthcare easier. The packages supply prebuilt pipelines for preprocessing image, tabular and text data, and for building fusion models that can be trained on multimodal data and deployed easily. The text preprocessing package currently supports BERT and CNN-text models. There is also a template that you can copy to build your own pipelines on top of the preprocessing pipelines that I have built. Any number and combination of data types are supported. Additionally, like any other Kedro pipeline, these can be deployed on Kubeflow and Vertex AI. Do comment below if you find these tools useful in your research.

Link to the repository below.


kedro-multimodal by dermatologist

Template for multi-modal machine learning in healthcare using Kedro. Combine reports, tabular data and images using various fusion methods.
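
As a rough illustration of what "fusion" means here, the sketch below does the simplest kind: it rescales each modality's feature vector and concatenates them, padding with zeros and a presence flag when a modality is missing. It is plain Python and is not code from kedro-multimodal; the function names, feature sizes and sample values are all assumptions made up for this example.

```python
from typing import Dict, List, Optional

# Hypothetical feature sizes per modality (assumptions for this sketch only).
FEATURE_SIZES = {"text": 3, "tabular": 2, "image": 4}


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale one modality's features to [0, 1] so scales are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(sample: Dict[str, Optional[List[float]]]) -> List[float]:
    """Concatenate modalities in a fixed order (feature-level fusion).

    A missing modality contributes zeros plus a 0.0 presence flag, so the
    fused vector always has the same length.
    """
    fused: List[float] = []
    for name, size in FEATURE_SIZES.items():
        features = sample.get(name)
        if features is None:
            fused.extend([0.0] * size + [0.0])
        else:
            if len(features) != size:
                raise ValueError(f"{name}: expected {size} features")
            fused.extend(min_max_scale(features) + [1.0])
    return fused


# Example patient with no image available (made-up numbers).
patient = {"text": [0.2, 0.9, 0.4], "tabular": [120.0, 80.0], "image": None}
fused_vector = fuse(patient)
```

A downstream model would then be trained on `fused_vector`; real fusion models usually learn the per-modality features rather than taking them as given.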

Using OpenFaaS containers in Kubeflow 

OpenFaaS

OpenFaaS is an open-source framework for building serverless functions with containers. Serverless functions are pieces of code that are executed in response to a specific event, such as an HTTP request or a message being added to a queue. These functions are typically short-lived and only run when needed, which makes them a cost-effective and scalable way to build cloud-native applications. OpenFaaS makes it easy to build, deploy, and manage serverless functions, and the OpenFaaS CLI minimizes the need to write boilerplate code. You can write code in any supported language and deploy it to any cloud provider. OpenFaaS provides a set of base containers that wrap the ‘function’ in a webserver exposing its HTTP service on port 8080 (incidentally the default port for Google Cloud Run). OpenFaaS containers can be deployed directly on Google Cloud Run, and with the faas CLI on any cloud provider.

OpenFaaS ® – Serverless Functions Made Simple

Kubeflow

Kubeflow is a toolkit for building and deploying machine learning models on Kubernetes. Kubeflow is designed to make it easy to build, deploy, and manage end-to-end machine learning pipelines, from data preparation and model training to serving predictions and implementing custom logic. It can be used with any machine learning framework or library. Google’s Vertex AI platform can run Kubeflow pipelines. Kubeflow pipeline components are self-contained code that can perform a step in the machine learning workflow. They are packaged as a docker container and pushed to a container registry that the Kubernetes cluster can access. A Kubeflow component is a containerized command line application that takes input and output as command line arguments.  

OpenFaaS containers expose HTTP services, while Kubeflow containers provide CLI services. That introduces the possibility of tweaking OpenFaaS containers to support CLI invocation, making the same containers usable as Kubeflow components. Below I explain how a minor tweak in the OpenFaaS templates can enable this. 

Let me take the OpenFaaS golang template as an example. The same principle applies to other language templates as well. In the golang-middleware’s main.go, the following lines set the main route and start the server. This exposes the function as a service when the container is deployed on Cloud Run.

 
	http.HandleFunc("/", function.Handle)

	listenUntilShutdown(s, healthInterval, writeTimeout) 

I have added the following lines [see on GitHub] that expose the same function on the command line for Kubeflow.  

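	// Without both an input and an output path, serve HTTP as usual.
	if len(os.Args) < 3 {
		listenUntilShutdown(s, healthInterval, writeTimeout)
	} else {
		// Kubeflow passes the input path first and the output path second.
		dat, err := os.ReadFile(os.Args[1])
		if err != nil {
			log.Fatalf("reading input: %v", err)
		}
		_dat := function.HandleFile(dat)
		if err := os.WriteFile(os.Args[2], _dat, 0644); err != nil {
			log.Fatalf("writing output: %v", err)
		}
	}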

If the input and output file names are supplied on the command line, as in Kubeflow, the handler reads from and writes to those files. The Kubeflow component definition is as below:

implementation:
  container:
    image: <image:version>
    command: [
        'sh',
        '-c',
        'mkdir --parents $(dirname "$1") && /home/app/handler "$0" "$1"',
    ]
    args: [{inputPath: Input 1}, {outputPath: Output 1}]

With this simple tweak, we can use the same container to host the function on any cloud provider as serverless functions and Kubeflow components.  You can pull the modified template from the repo below.
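
The same idea carries over to other language templates. The sketch below is a Python analogue, not the code from the modified template: a single `handle` function is reached either over HTTP on port 8080 or, when two file paths are given, through the command line. The uppercase transform is a placeholder, and the names `handle`, `run_cli`, `serve` and `main` are illustrative assumptions.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path
from typing import List


def handle(data: bytes) -> bytes:
    """Placeholder function body; replace with the real logic."""
    return data.upper()


def run_cli(input_path: str, output_path: str) -> None:
    """Kubeflow-style invocation: read the input file, write the output file."""
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_bytes(handle(Path(input_path).read_bytes()))


class FunctionHandler(BaseHTTPRequestHandler):
    """Serverless-style invocation: the request body is the function input."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = handle(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    HTTPServer(("", port), FunctionHandler).serve_forever()


def main(argv: List[str]) -> None:
    # Two positional arguments mean "input path, output path"; otherwise serve.
    if len(argv) >= 3:
        run_cli(argv[1], argv[2])
    else:
        serve()

# In the container entrypoint: import sys; main(sys.argv)
```

Keeping the business logic in `handle` and the two invocation styles in thin wrappers is what lets one image serve both purposes.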

Open-source for healthcare

This post is meant to be an instruction guide for healthcare professionals who would like to join my projects on GitHub.


What is a contribution?

Contribution is not always coding. You can clean up stuff, add documentation, instructions for others to follow etc. Issues and feature requests should be posted under the ‘issues’ tab and general discussions under the ‘Discussions’ tab if one is available.

How do I contribute?

How do I develop?

  • The .devcontainer folder will have the configuration for the docker container for development.
  • Version bump action (if present) will automatically bump version based on the following terms in a commit message: major/minor/patch. Avoid these words in the commit message unless you want to trigger the action.
  • Most repositories have GH actions to generate and deploy documentation and changelog.
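
To see why those words matter, here is a minimal Python sketch of a keyword-driven bump. It is not the GitHub action itself: the whole-word, case-insensitive matching and the major-before-minor-before-patch precedence are assumptions, and the real action may behave differently.

```python
import re
from typing import Optional


def bump_kind(commit_message: str) -> Optional[str]:
    """Return 'major', 'minor' or 'patch' if the message contains that word."""
    for kind in ("major", "minor", "patch"):
        if re.search(rf"\b{kind}\b", commit_message, flags=re.IGNORECASE):
            return kind
    return None


def bump(version: str, commit_message: str) -> str:
    """Apply a semantic-version bump chosen by the commit message."""
    major, minor, patch = (int(part) for part in version.split("."))
    kind = bump_kind(commit_message)
    if kind == "major":
        return f"{major + 1}.0.0"
    if kind == "minor":
        return f"{major}.{minor + 1}.0"
    if kind == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return version
```

Under these assumptions, a harmless message such as "fix minor typo in docs" would still bump the minor version, which is why it is safer to phrase it differently.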

What do I do next?

  • My repositories (so far) are small enough for beginners to get the big picture and make meaningful contributions.
  • Don’t be discouraged if you make mistakes. That is how we all learn.

There’s no better time than now to choose a repo to contribute to!

Clinical knowledge representation for reuse

The need for computerized clinical decision support is becoming increasingly obvious with the COVID-19 pandemic. The initial emphasis has been on ‘replacing’ the clinician which for a variety of reasons is impossible or impractical. Pragmatically, clinical decision support systems could provide clinical knowledge support for clinicians to make time-sensitive decisions with whatever information they have at the point of patient care.

Siobhán Grayson, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

Providing clinical decision support requires a formal way of representing clinical knowledge and complex algorithms for sophisticated inference. In knowledge management terms, the information needs to be transformed into actionable knowledge. Knowledge needs to be represented and stored in a way conducive to easy inference (knowledge reuse) [1]. I have been exploring this domain for a considerable period of time, from ontologies to RDF datasets. With the advent of popular graph databases (especially Neo4j), graphs seem to be a good knowledge representation method for clinical purposes.

To cut a long story short, I have been working on building a suite of Java libraries to support knowledge extraction, annotation and transformation to a graph schema for inference. I have not open-sourced it yet as I have not decided on what license to use. However, I am posting some preliminary information here to assess interest. Please give me a shout if you share an interest or see some potential applications for this. As always, I am open to collaboration.

The Java package consists of three modules. The ‘library’ module wraps NCBI’s E-Utils API to harvest published article abstracts if that is your knowledge source. Though data extraction from clinical notes in EMRs is a recent trend, it is challenging because of unstructured data and lack of interoperability. The ‘qtakes’ module provides a programmable interface to my quick-ctakes, a Quarkus-based build of Apache cTAKES, which is a fast clinical text annotation engine. Finally, the graph module provides the Neo4j models, repositories and services for abstracting the results as a knowledge graph.

The clinical knowledge graph (ckb) consists of entities such as Disease, Treatment and Anatomy, with appropriate relationships and Cypher queries defined for them. The module exposes services that can be consumed by Java applications. It will be available as a Maven artifact once I complete it.
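
To make the graph idea concrete, here is a small in-memory sketch in Python. It is not the library's Neo4j model: the entity labels follow the post, but the relationship names (`TREATED_WITH`, `AFFECTS`), the class and the sample facts are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class KnowledgeGraph:
    """Toy labelled graph: entities carry a label, edges carry a relationship."""

    def __init__(self) -> None:
        self.labels: Dict[str, str] = {}
        self.edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add_entity(self, name: str, label: str) -> None:
        self.labels[name] = label

    def relate(self, source: str, relationship: str, target: str) -> None:
        self.edges[(source, relationship)].add(target)

    def targets(self, source: str, relationship: str) -> List[str]:
        """Roughly what a one-hop graph query from `source` would return."""
        return sorted(self.edges.get((source, relationship), set()))


graph = KnowledgeGraph()
graph.add_entity("psoriasis", "Disease")
graph.add_entity("methotrexate", "Treatment")
graph.add_entity("skin", "Anatomy")
graph.relate("psoriasis", "TREATED_WITH", "methotrexate")
graph.relate("psoriasis", "AFFECTS", "skin")
```

In a graph database the `targets` lookup would be a query over stored nodes and relationships rather than a dictionary access, but the shape of the question ("what is this disease treated with?") is the same.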

UPDATE: May 30, 2021: The library (ckblib) is now available under MPL 2.0 license (see below). Feel free to use it in your research.

  1. Toward a Theory of Knowledge Reuse: Types of Knowledge Reuse Situations and Factors in Reuse Success. Journal of Management Information Systems. Published online May 31, 2001:57-93. doi:10.1080/07421222.2001.11045671
Cite this article as: Eapen BR. (April 28, 2021). Nuchange.ca - Clinical knowledge representation for reuse. Retrieved September 27, 2023, from https://nuchange.ca/2021/04/clinical-knowledge-representation-for-reuse.html.

COVID vaccination tracking with blockchain

COVID vaccine rollout has the potential to bring relief to billions of people around the world. But as encouraging as these programs may be, it is extremely important to note that a vaccine cannot be as effective if it is not effectively distributed and trusted by the public.

SPQR10, CC BY-SA 4.0 https://creativecommons.org/licenses/by-sa/4.0, via Wikimedia Commons

IBM Blockchain has a vaccine distribution network for manufacturers to proactively monitor for adverse events and improve recall management. Moderna is planning to explore vaccine traceability with the IBM blockchain.

The International Air Transport Association (IATA) is planning to launch a system of digital ‘passports’ as proof that passengers have been vaccinated against COVID-19. Blockchain technology could offer a better data-storage system for such vaccination records. A decentralized blockchain ledger would be anonymous, immutable and transparent and the entries can be publicly audited.

A vaccine blockchain system could support vaccine traceability and smart contract functions and can be used to address the problems of vaccine expiration and vaccine record fraud. Additionally, the use of machine learning models can provide valuable recommendations to immunization practitioners and recipients, allowing them to choose better immunization methods and vaccines as recommended by this study. A blockchain-based system developed by Singapore-based Zuellig Pharma can help governments and healthcare providers manage vaccine distribution and administration. UK hospitals are using blockchain to track the temperature of coronavirus vaccines.

In my opinion, a blockchain application in healthcare should satisfy the following characteristics:

  1. Both patient and provider should have an interest in the decentralized storage of the concerned piece of information. One party may be neutral, but there should not be a conflict of interests.
  2. One or more third parties should have an interest in this information and may have a reason not to trust the patient or provider.
  3. The information should be a dynamic time-bound list that requires periodic updating.
  4. The privacy concern related to the concerned information should be minimal.
  5. The information should not be easy to measure or procure from other sources.

Vaccination satisfies the above criteria, and as such blockchain may be a good solution for this problem. Before the COVID-19 pandemic, I had played a bit with Solidity and made a web application with three different views:

  1. Provider view: From this view, a provider can extend an offer to save the information on the blockchain to a patient.
  2. Patient view: From this view, a patient can accept an offer extended by a provider.
  3. Lookup view: To look up information on any patient.

Vac-chain is a prototype of on-chain storage of vaccination information on the Ethereum blockchain. It uses smart contracts written in Solidity and is built on the Truffle Drizzle box (React/Redux).
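
The offer, accept and lookup flow of the three views can be sketched without a blockchain at all. The Python below is an in-memory stand-in, not the Vac-chain contract: the class, method names and record fields are hypothetical, and there is no identity check, signing or immutability here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    provider: str
    patient: str
    vaccine: str
    date: str


class VaccinationRegistry:
    """In-memory stand-in for contract storage (illustration only)."""

    def __init__(self) -> None:
        self._offers: Dict[Tuple[str, str], Record] = {}
        self._accepted: Dict[str, List[Record]] = {}

    def offer(self, provider: str, patient: str, vaccine: str, date: str) -> None:
        """Provider view: propose a record; it is not visible to lookups yet."""
        self._offers[(provider, patient)] = Record(provider, patient, vaccine, date)

    def accept(self, patient: str, provider: str) -> None:
        """Patient view: accept a pending offer addressed to this patient."""
        record = self._offers.pop((provider, patient), None)
        if record is None:
            raise KeyError("no pending offer from this provider")
        self._accepted.setdefault(patient, []).append(record)

    def lookup(self, patient: str) -> List[Record]:
        """Lookup view: return accepted records only."""
        return list(self._accepted.get(patient, []))
```

The two-step handshake is the point: a record only becomes visible once both the provider and the patient have acted on it.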

Cite this article as: Eapen BR. (March 24, 2021). Nuchange.ca - COVID vaccination tracking with blockchain. Retrieved September 27, 2023, from https://nuchange.ca/2021/03/covid-vaccination-tracking-with-blockchain.html.

Clinical Query Language – Part 1

Clinical Query Language (CQL) is a high-level query language to represent and generate unambiguous quality measures or clinical decision rules. I am not a CQL expert. These are my notes from a system development perspective. I am trying to make sense of this emerging concept and add my notes here in the hope that others may find this useful.

U.S. Navy photo by Chief Warrant Officer 4 Seth Rossman. / Public domain (wikimedia)

Clinical Query Language is designed to be intuitive for clinicians authoring the queries for quality measures and clinical decision support. The decision support rules are mostly alert-type rules at the individual and population level that are calculated from a database (not usually diagnostic decision support). You can use any data model with CQL.

Here is an example segment of CQL:

define "InDemographic":
  AgeInYearsAt(start of MeasurementPeriod) >= 16 and AgeInYearsAt(start of MeasurementPeriod) < 24
    and "Patient"."gender" in "Female Administrative Sex"
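
For readers more used to general-purpose code, the same rule can be read as the Python sketch below. This is only a paraphrase to show what the CQL expresses, not output of any CQL tooling: the function names are made up, and membership in the "Female Administrative Sex" value set is reduced to a plain string comparison, which is an assumption.

```python
from datetime import date


def age_in_years_at(birth_date: date, as_of: date) -> int:
    """Completed years of age on a given date."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)


def in_demographic(birth_date: date, gender: str, period_start: date) -> bool:
    """Age 16 to 23 at the start of the measurement period, and female."""
    age = age_in_years_at(birth_date, period_start)
    return 16 <= age < 24 and gender == "female"
```

The CQL version says the same thing without committing to how birth dates or gender codes are stored, which is the appeal of the language.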

As Clinical Query Language follows strict semantics, you can autogenerate lexers, parsers and visitors using ANTLR. In simple terms, CQL’s semantics can be represented as a ‘grammar’ that ANTLR can read to generate code for processing any CQL in a variety of programming languages, including Java, JavaScript, Python, C# and Go. The CQL grammar files are here: https://cql.hl7.org/08-a-cqlsyntax.html. Incidentally, the CQL grammar inherits from FHIRPath.

If you wish to generate code from these files, there are two things to note:

  • You need to rename CQL.g4 to cql.g4, as grammar names are case sensitive and must correspond to the filename.
  • Put fhirpath.g4 in the same folder as cql.g4, because the cql grammar refers to the fhirpath grammar.

Clinical Query Language aims to provide a high-level, domain-independent language for clinicians that can be translated into low-level database logic. As CQL does not prescribe a data model, an intermediary format linking CQL to the data management logic is required. That format is called the Expression Logical Model (ELM), which we will discuss in part 2.

Kickstart NLP with UMLS

The UMLS, or Unified Medical Language System, is a set of files and software that brings together many health and biomedical vocabularies and standards to enable interoperability between computer systems.

Natural Language Processing (NLP) on the vast amount of data captured by electronic medical records (EMR) is gaining popularity. The recent advances in machine learning (ML) algorithms and the democratization of high-performance computing (HPC) have reduced the technical challenges in NLP. However, the real challenge is not the technology or the infrastructure, but the lack of interoperability — in this case, the inconsistent use of terminology systems.


NLP tasks start with recognizing medical terms in the corpus of text and converting them into a standard terminology space such as SNOMED and ICD. This requires a terminology mapping service that can do the mapping in an easy and consistent manner. The Unified Medical Language System (UMLS) terminology server is the most popular for integrating and distributing key terminology, classification and coding standards. The consistent use of UMLS resources leads to effective and interoperable biomedical information systems and services, including EMRs.

To make things easier, UMLS provides both REST-based and SOAP-based services that can be integrated into software applications. Using these resources efficiently requires a high-level library that encapsulates the services and makes the REST calls easy for the user. Umlsjs is one such high-level JavaScript library for the UMLS REST web services. It is free, open-source and available on NPM, making it easy to integrate into any JavaScript (browser) or Node.js application.
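
To give a feel for what such a wrapper hides, the Python sketch below only builds the URL for a term search. It is not umlsjs: the base URL, the `/search/{version}` path and the `string`, `sabs` and `apiKey` parameter names reflect my understanding of the UTS REST API and should be treated as assumptions to verify against the current UMLS documentation. No request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the UTS REST API; check the UMLS documentation.
UTS_BASE = "https://uts-ws.nlm.nih.gov/rest"


def search_url(term: str, api_key: str, vocabularies: str = "",
               version: str = "current") -> str:
    """Build a term-search URL.

    `vocabularies` is a comma-separated list of source abbreviations,
    for example "SNOMEDCT_US"; an empty string searches all sources.
    """
    params = {"string": term, "apiKey": api_key}
    if vocabularies:
        params["sabs"] = vocabularies
    return f"{UTS_BASE}/search/{version}?{urlencode(params)}"
```

A library such as umlsjs adds the parts omitted here: authentication handling, paging through results and turning the JSON response into usable objects.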

The umlsjs package is available on GitHub and NPM. It is still a work in progress and any coding/documentation contributions are welcome. Please read the CONTRIBUTING.md file on the repository for instructions. If you use it and find any issues, please report them on GitHub.

How to deploy an h2o ai model using OpenFaaS on Digitalocean in 2 minutes

H2O is an open-source, distributed and scalable machine learning platform written in JAVA. H2O supports many of the statistical & machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning and more.  OpenFaaS® (Functions as a Service) is a framework for building Serverless functions easily with Docker. Read my previous post to learn more about OpenFaaS and DO. 

H2O AI model deployment

H2O has a module aptly named sparkling water that allows users to combine the machine learning algorithms of H2O with the capabilities of Spark. Integrating these two open-source environments provides a seamless experience for users who want to make a query using Spark SQL, feed the results into H2O to build a model and make predictions, and then use the results again in Spark. For any given problem, better interoperability between tools provides a better experience.

H2O Driverless AI is a commercial package for automatic machine learning that automates some of the most difficult data science and machine learning workflows such as feature engineering, model validation, model tuning, model selection, and model deployment. H2O also has a popular open-source module called AutoML that automates the process of training a large selection of candidate models. H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. AutoML makes hyperparameter tuning accessible to everyone.

H2O allows you to convert the models to either a Plain Old Java Object (POJO) or a Model Object, Optimized (MOJO), either of which can be easily embedded in any Java environment. The only compilation and runtime dependency for a generated model is the h2o-genmodel.jar file produced as the build output of these packages. You can read more about deploying h2o models here.

I have created an OpenFaaS template for deploying the exported MOJO file using a base java container and the dependencies defined in the gradle build file. Using the OpenFaaS CLI (How to Install) pull my template as below:

mkdir watersplash
cd watersplash

faas-cli template pull https://github.com/dermatologist/java-ext

faas-cli new --lang java-h2o watersplash --prefix your-docker-uname

Copy the exported MOJO zip file to the root folder along with build.gradle and settings.gradle. Make appropriate changes to handle.java as per the needs of the model, as explained here. Add http://digitaloceanIP:8080 to watersplash.yml

provider:
  name: openfaas
  gateway: http://digitaloceanIP:8080

and finally:

 faas-cli up -f watersplash.yml

That’s it! Congratulations! Your model is up and running! Access it at http://digitaloceanIP:8080/function/watersplash
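
Once the function is up, a client call is a plain HTTP POST to the gateway. The Python sketch below only builds the request. The JSON body and its field names are assumptions, since the expected input depends entirely on how the handler for your model parses it.

```python
import json
from urllib.request import Request


def build_request(gateway: str, function_name: str, features: dict) -> Request:
    """Prepare a POST to <gateway>/function/<function_name> with a JSON body."""
    url = f"{gateway.rstrip('/')}/function/{function_name}"
    body = json.dumps(features).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical feature names; use the columns your model was trained on.
request = build_request(
    "http://digitaloceanIP:8080", "watersplash", {"age": 40, "bmi": 27.5}
)
```

Passing `request` to `urllib.request.urlopen` would send it and return the prediction produced by the handler.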

If you get stuck at any stage, give me a shout below. 

Deploy a fastai image classifier using OpenFaaS for serverless on DigitalOcean in 5 easy steps!

Fastai is a python library that simplifies training neural nets using modern best practices. See the fastai website and view the free online course to get started. Fastai lets you develop and improve various NN models with little effort. Some of the deployment strategies are mentioned in their course, but most are not production-ready.

OpenFaaS® (Functions as a Service) is a framework for building Serverless functions easily with Docker that can be deployed on multiple infrastructures including Docker swarm and Kubernetes with little boilerplate code. Serverless is a cloud-computing model in which the cloud provider runs the server, and dynamically manages the allocation of machine resources and can scale to zero if a service is not being used. It is interesting to note that OpenFaaS has the same requirements as the new Google Cloud Run and is interoperable. Read more about OpenFaaS (and install the CLI) from their website.

DigitalOcean: I host all my websites on DigitalOcean (DO) which offers good (in my opinion) cloud services at a low cost. They have data centres in Canada and India. DO supports Kubernetes and Docker Swarm, but they offer a One-Click install of OpenFaaS for as little as $5 per month (You can remove the droplet after the experiment if you like, and you will only be charged for the time you use it.) If you are new to DO, please sign up and setup OpenFaaS as shown here:

In the fastai course, Jeremy creates a dog breed classifier.

As STEP 1, export the model to .pkl as below

learn.export()

This creates the export.pkl file that we will be using later. To deploy, we need a base container to run the prediction workflow. I have created one with Python3 along with the fastai core and vision dependencies (to keep the size small). It is available here: https://hub.docker.com/r/beapen/fastai-vision. But you don’t have to use this container directly. My OpenFaaS template will make this easy for you.

STEP 2: Using the OpenFaaS CLI (How to Install) pull my template as below:

mkdir dog-classifier
cd dog-classifier
faas-cli template pull https://github.com/dermatologist/python3-ml
faas-cli new --lang fastai-vision dog-classifier --prefix your-docker-uname

STEP 3: Copy export.pkl to the model folder

STEP 4: Add http://digitaloceanIP:8080 to dog-classifier.yml

provider:
  name: openfaas
  gateway: http://digitaloceanIP:8080

and finally in STEP 5:

faas-cli up -f dog-classifier.yml

That’s it! Your predictor is up and running! Access it at http://digitaloceanIP:8080/function/dog-classifier

The template has a built-in image uploader interface! If you get stuck at any stage, give me a shout below. More to follow on using OpenFaaS for deploying machine learning workflows!

Serverless on FHIR: Management guidelines for the semi-technical clinician!

Serverless is the new kid on the block with services such as AWS Lambda, Google Cloud Functions or Microsoft Azure Functions. Essentially it lets users deploy a function (Function As A Service or FaaS) on the cloud with very little effort. Requirements such as security, privacy, scaling, and availability are taken care of by the framework itself. As healthcare slowly yet steadily progresses towards machine learning and AI, serverless is sure to make a significant impact on Health IT. Here I will explain serverless (and some related technologies) for semi-technical clinicians and put forward some architectural best practices for using serverless in healthcare with FHIR as the data interchange format.


Let us say, your analyst creates a neural network model based on a few million patient records that can predict the risk for MI from BP, blood sugar, and exercise. Let us call this model r = f(bp, bs, e). The model is so good that you want to use it on a regular basis on your patients and better still, you want to share it with your colleagues. So you contact your IT team to make this happen.

This is what your IT guys currently do: First, they create a web application that can take bp, bs and e as inputs using a standard interface such as REST and return r. Next, they rent a virtual machine (VM) from a cloud provider (such as DigitalOcean). Then they package this application as a container (Docker) and deploy it in the VM. You can now use this as an application from your browser (Chrome), or your EMR (such as OpenMRS or OSCAR) can directly access this function. You can share it with your colleagues, they can access it in their browsers, and you are happy. The VM can support up to 3 users at a time.

In a couple of months, your algorithm becomes so popular that at any one time hundreds of users try to access it and your poor VM crashes most of the time or your users have to wait forever. So you call your IT guys again for a solution. They make 100 copies of your container, but your hospital is reluctant to give you the additional funding required.

Your smart resident notices that your application is being used only in the morning hours, and at night all the 100 containers are virtually sleeping. This is not a good use of the funding dollars. You contact your IT guys again, and they set up Kubernetes for orchestrating the containers according to usage. So, what is Serverless? Serverless is a framework that makes all of this so easy that you may not even need your IT guys to do it. (Well, maybe that is an exaggeration.)

My personal favourite serverless toolset (if you care) is Kubernetes + Knative + riff. I won’t try to explain what the last two are or how to use them. They are so new that they keep changing every day. In essence, your IT team can complete all the above tasks with a few commands typed on the command line on the cloud provider of your choice. The application (function rather) can even scale to zero! (You don’t pay anything when nobody uses it, and more containers are added as users increase, scaling down at night as in your case.)
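
The scaling arithmetic behind the story is simple enough to write down. The Python sketch below uses the numbers from the example (3 users per container, a cap of 100 containers) and is only an illustration: real autoscalers look at more signals than a single user count, and the function name is made up.

```python
import math


def desired_replicas(concurrent_users: int,
                     users_per_container: int = 3,
                     max_replicas: int = 100) -> int:
    """Containers needed for the current load, scaling to zero when idle."""
    if concurrent_users <= 0:
        return 0  # nobody is using the function, so nothing runs
    needed = math.ceil(concurrent_users / users_per_container)
    return min(max_replicas, needed)
```

With these assumptions, 7 users in the morning need 3 containers, 300 users need all 100, and an empty night needs none.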

Best Practices

What are the best practices when you design such useful cloud-based ‘functions’ for healthcare that can be shared by multiple users and organizations? Well, here are my two cents!

First, you need a standard for data exchange. As JSON is the data format for most APIs, FHIR wins hands down here.

Next, APIs need a mechanism to expose their capabilities and properties to the world. For example, r = f(bp, bs, e) needs to tell everyone what it accepts (bp, bs, e) and what it returns (at the bare minimum). FHIR has a resource specifically for this that has been (not so creatively) named as an Endpoint. So, a function endpoint should return a FHIR Endpoint resource with information about itself if there is no payload.

What should the payload be? Payload should be a FHIR Bundle that has all the FHIR Resources that the function needs (bp, bs and e as FHIR Observations in your case). The bundle should also include a FHIR Subscription resource that points to the receiving system (maybe your EMR) for the response ( r ).
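
Put together, the convention could look like the Python sketch below. It is one possible reading of the idea, not an existing implementation: the field names follow FHIR R4 as I understand it, while the URLs, the free-text codes (bp, bs, e), the units and the `Observation?code=mi-risk` criteria are hypothetical. The risk model itself is left out; the handler only shows how the inputs and the reply address are recovered.

```python
from typing import Optional


def describe() -> dict:
    """FHIR Endpoint returned when the function is called with no payload."""
    return {
        "resourceType": "Endpoint",
        "status": "active",
        "name": "MI risk function r = f(bp, bs, e)",
        "connectionType": {
            "system": "http://terminology.hl7.org/CodeSystem/endpoint-connection-type",
            "code": "hl7-fhir-rest",
        },
        "payloadType": [{"text": "Bundle of Observations: bp, bs, e"}],
        "address": "http://example.org/function/mi-risk",  # hypothetical URL
    }


def observation(code_text: str, value: float, unit: str) -> dict:
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"text": code_text},
        "valueQuantity": {"value": value, "unit": unit},
    }


def build_bundle(bp: float, bs: float, e: float, callback_url: str) -> dict:
    """Bundle the three inputs plus a Subscription naming where r should go."""
    subscription = {
        "resourceType": "Subscription",
        "status": "requested",
        "reason": "Deliver the risk score r",
        "criteria": "Observation?code=mi-risk",
        "channel": {"type": "rest-hook", "endpoint": callback_url,
                    "payload": "application/fhir+json"},
    }
    resources = [observation("bp", bp, "mmHg"), observation("bs", bs, "mg/dL"),
                 observation("e", e, "min/week"), subscription]
    return {"resourceType": "Bundle", "type": "collection",
            "entry": [{"resource": r} for r in resources]}


def handle(payload: Optional[dict]) -> dict:
    """No payload: describe the function. Otherwise: unpack inputs and target."""
    if not payload:
        return describe()
    resources = [entry["resource"] for entry in payload["entry"]]
    inputs = {r["code"]["text"]: r["valueQuantity"]["value"]
              for r in resources if r["resourceType"] == "Observation"}
    callback = next(r["channel"]["endpoint"] for r in resources
                    if r["resourceType"] == "Subscription")
    return {"inputs": inputs, "send_result_to": callback}
```

A production version would use coded Observations (for example LOINC) instead of free text, so that any EMR can assemble the bundle without knowing this function's private vocabulary.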

So, what next?

Take the phone and call your IT team. Tell them to take Kubernetes + Knative + riff for a spin! I might do the same and if I do, I will share it here. And last but not the least, click on the blue buttons below! 🙂