Bell Eapen MD, PhD.

Bringing Digital health & Gen AI research to life!

Deploy a fastai image classifier using OpenFaaS for serverless on DigitalOcean in 5 easy steps!

Fastai is a python library that simplifies training neural nets using modern best practices. See the fastai website and view the free online course to get started. Fastai lets you develop and improve various NN models with little effort. Some of the deployment strategies are mentioned in their course, but most are not production-ready.

OpenFaaS® (Functions as a Service) is a framework for building Serverless functions easily with Docker that can be deployed on multiple infrastructures including Docker swarm and Kubernetes with little boilerplate code. Serverless is a cloud-computing model in which the cloud provider runs the server, and dynamically manages the allocation of machine resources and can scale to zero if a service is not being used. It is interesting to note that OpenFaaS has the same requirements as the new Google Cloud Run and is interoperable. Read more about OpenFaaS (and install the CLI) from their website.

DigitalOcean: I host all my websites on DigitalOcean (DO) which offers good (in my opinion) cloud services at a low cost. They have data centres in Canada and India. DO supports Kubernetes and Docker Swarm, but they offer a One-Click install of OpenFaaS for as little as $5 per month (You can remove the droplet after the experiment if you like, and you will only be charged for the time you use it.) If you are new to DO, please sign up and setup OpenFaaS as shown here:

In fastai class, Jeremy creates a dog breed classifier.

As STEP 1, export the model to .pkl as below

learn.export()

This creates the export.pkl file that we will be using later. To deploy we need a base container to run the prediction workflow. I have created one with Python3 along with fastai core and vision dependencies (to keep the size small). It is available here: https://hub.docker.com/r/beapen/fastai-vision But you don’t have to directly use this container. My OpenFaaS template will make this easy for you.

STEP 2: Using the OpenFaaS CLI (How to Install) pull my template as below:

mkdir dog-classifier
cd dog-classifier
faas-cli template pull https://github.com/dermatologist/python3-ml --prefix your-docker-uname
faas-cli new --lang fastai-vision dog-classifier

STEP 3: Copy export.pkl to the model folder

STEP 4: Add http://digitaloceanIP:8080 to dog-classifier.yml

provider:
  name: openfaas
  gateway: http://digitaloceanIP:8080

and finally in STEP 5:

faas-cli up -f dog-classifier.yml

That’s it! Your predictor is up and running! Access it at http://digitaloceanIP:8080/function/dog-classifier

The template has a builtin image uploader interface! If you get stuck at any stage, give me a shout below. More to follow on using OpenFaaS for deploying machine learning workflows!

Machine Learning on Diabetic Retinopathy Images

Artificial intelligence (AI) and Machine Learning (ML) are having a profound impact on the way medicine is being practiced. AI/ML algorithms and techniques fit imaging applications easily and can help with automation. Radiology is the specialty that has benefitted the most from the AI/ML revolution. Melanoma detection in Dermatology is another obvious winner.

Image credit: pixabay.com

Many of the machine learning algorithms are reasonably well known. The real challenge is to get the infrastructure to crunch massive amounts of data, getting the ideal dataset for a problem, optimizing the model for performance and deploying the model for use. If you are relatively new to ML, Kaggle is a useful resource for you to start.

I will briefly introduce Kaggle for those who have not used it before. Kaggle is a platform for posting datasets that you have collected. They also provide ‘kernels’ or computational resources (typically Jupyter Notebooks) for collaborative analysis. The datasets can be made private or public under a variety of license options. Organizations post competitions and reward teams that solve them. Solutions are typically posted as predictions on a test dataset or share the kernel code

I recently noticed a good competition on Kaggle that the eHealth community may find interesting. Aravind Eye Hospital in India has posted a dataset consisting of fundoscopic images of diabetic retinopathy with varying degrees of severity. The dataset consists of thousands of images collected in rural areas by the technicians of Aravind hospital from the rural areas of India. The challenge is to develop a model that can predict the severity of diabetic retinopathy from the fundoscopic image. Further, the successful solutions will be shared with other Ophthalmologists through the 4th Asia Pacific Tele-Ophthalmology Society (APTOS) Symposium.

The competition page is available here: https://www.kaggle.com/c/aptos2019-blindness-detection
Let me know if anybody wants to team up!

Random forest model for predicting the total length of hospital stay (TLOS)

TL;DR here is the Random Forest classifier code:

And an (obvious) upfront disclaimer: This is a learning project. This is not for actual use.

DAD is a database consisting of patient demographics, comorbidities, interventions and the length of stay for the de-identified 10% sample of hospital admissions. DAD (2014-15) has an enhanced dataset with variables that were created at Western to act as flags for ICD-10 and CCI groupings, to make using the file easier.

Here is an experiment with the DAD enhanced dataset to create a Random forest model for predicting the total length of hospital stay (TLOS) in less than 100 lines of code. Random forests are an ensemble classifier, that operates by building multiple decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees. This is a learning project for Apache Spark and Spark ML using pyspark. The accuracy of the model taking all derived categorical variables is low.

I have access to Apache Spark @ CC. If you are installing Spark in your computer you may have to change the following:

SparkContext.setSystemProperty('spark.executor.memory', '48g')
SparkContext.setSystemProperty('spark.driver.memory', '6g')

Some of the commonly tweaked parameters can be changed here:

RF_NUM_TREES = 3
RF_MAX_DEPTH = 4
RF_MAX_BINS = 12

Uncomment the following line to include only variables that you need.

# df.select([c for c in df.columns if c in ['TLOS_CAT', 'COLNAME', 'COLNAME']]).show()

Here is the repo. How can this model be improved? Maybe a PCA before the RF? or Am I missing something important?

 

Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However, the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.

Negative N to Unknown U

The identification of disease specific genes is pivotal in clinical informatics. This paper describes an improved algorithm for machine learning in which the negative N is classified more appropriately as Unknown U.

English: Weka Data Mining Open Software in Java
English: Weka Data Mining Open Software in Java (Photo credit: Wikipedia)

Peng Yang, Xiao-Li Li, Jian-Ping Mei, Chee-Keong Kwoh, and See-Kiong Ng. Positive-Unlabeled Learning for Disease Gene Identification
Bioinformatics first published online August 24, 2012 doi:10.1093/bioinformatics/bts504

SVMs are an important tool in bioinformaticians armamentarium. Weka is a collection of machine learning algorithms for data mining tasks.