Artificial intelligence (AI) and machine learning (ML) are having a profound impact on the way medicine is practiced. AI/ML algorithms and techniques fit imaging applications well and can help with automation. Radiology is the specialty that has benefited the most from the AI/ML revolution. Melanoma detection in dermatology is another obvious winner.
Many machine learning algorithms are reasonably well known. The real challenges are getting the infrastructure to crunch massive amounts of data, finding the ideal dataset for a problem, optimizing the model for performance, and deploying the model for use. If you are relatively new to ML, Kaggle is a useful place to start.
I will briefly introduce Kaggle for those who have not used it before. Kaggle is a platform for posting datasets that you have collected. It also provides ‘kernels’, or computational resources (typically Jupyter Notebooks), for collaborative analysis. The datasets can be made private or public under a variety of license options. Organizations post competitions and reward teams that solve them. Solutions are typically submitted as predictions on a test dataset or as shared kernel code.
I recently noticed a good competition on Kaggle that the eHealth community may find interesting. Aravind Eye Hospital in India has posted a dataset consisting of fundoscopic images of diabetic retinopathy with varying degrees of severity. The dataset consists of thousands of images collected by the technicians of Aravind Eye Hospital in rural areas of India. The challenge is to develop a model that can predict the severity of diabetic retinopathy from the fundoscopic image. Further, the successful solutions will be shared with other ophthalmologists through the 4th Asia Pacific Tele-Ophthalmology Society (APTOS) Symposium.
Health data warehousing is becoming an important requirement for deriving knowledge from the vast amount of health data that healthcare organizations collect. A data warehouse is vital for collaborative and predictive analytics. The first step in designing a data warehouse is to decide on a suitable data model. This is followed by the extract-transform-load (ETL) process that converts source data to the new data model, which is amenable to analytics.
The OHDSI OMOP Common Data Model (CDM) is one such data model that allows for the systematic analysis of disparate observational databases and EMRs. The data from diverse systems needs to be extracted, transformed and loaded into a CDM database. Once a database has been converted to the OMOP CDM, evidence can be generated using standardized analytics tools that are already available.
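The extract, transform and load steps can be sketched in a few lines of Python. This is only an illustrative sketch, not how any OHDSI tool works: the source table `patients`, its columns and the sample rows are hypothetical, and the target is a heavily simplified subset of the CDM person table. The gender concept IDs (8507 for male, 8532 for female) are the standard OMOP concepts; anything else is mapped to 0 here as an assumption.

```python
import sqlite3

# Hypothetical source rows: (patient_id, sex, birth_date). Real EMR schemas differ widely.
SOURCE_ROWS = [
    (1, "M", "1980-05-17"),
    (2, "F", "1975-11-02"),
    (3, "U", "1990-01-30"),
]

# Standard OMOP gender concepts; unmapped values fall back to 0 in this sketch.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def extract(conn):
    """Read raw rows from the hypothetical source table."""
    query = "SELECT patient_id, sex, birth_date FROM patients ORDER BY patient_id"
    return conn.execute(query).fetchall()


def transform(rows):
    """Map source rows to (person_id, gender_concept_id, year_of_birth)."""
    return [
        (patient_id, GENDER_CONCEPTS.get(sex, 0), int(birth_date[:4]))
        for patient_id, sex, birth_date in rows
    ]


def load(conn, persons):
    """Create a simplified person table and insert the transformed rows."""
    conn.execute(
        "CREATE TABLE person ("
        "person_id INTEGER PRIMARY KEY, "
        "gender_concept_id INTEGER, "
        "year_of_birth INTEGER)"
    )
    conn.executemany("INSERT INTO person VALUES (?, ?, ?)", persons)
    conn.commit()


def run_etl():
    """Run the three steps between two in-memory SQLite databases."""
    source = sqlite3.connect(":memory:")
    source.execute("CREATE TABLE patients (patient_id INTEGER, sex TEXT, birth_date TEXT)")
    source.executemany("INSERT INTO patients VALUES (?, ?, ?)", SOURCE_ROWS)

    target = sqlite3.connect(":memory:")
    load(target, transform(extract(source)))
    query = "SELECT person_id, gender_concept_id, year_of_birth FROM person ORDER BY person_id"
    return target.execute(query).fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Keeping the three steps as separate functions means only `extract` and `transform` need to change for a new source system, while `load` stays tied to the target model.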
Each data source requires customized ETL tools for this conversion from the source data to the CDM. The OHDSI ecosystem has made some tools available to help with the ETL process, such as White Rabbit and Rabbit In a Hat. However, the health data warehousing process is still challenging because source databases vary in structure and implementation.
Hephestus is an open-source Python tool for this ETL process, organized into modules to allow code reuse between various ETL tools for open-source EMR systems and data sources. Hephestus uses SQLAlchemy for database connections and for automapping tables to classes, and bonobo for managing ETL. The ultimate aim is to develop a tool that can translate the report from the OHDSI tools into an ETL script with minimal intervention. This is a good Python starter project for eHealth geeks.
Anyone anywhere in the world can build their own environment that can store patient-level observational health data, convert their data to OHDSI’s open community data standards (including the OMOP Common Data Model), run open-source analytics using the OHDSI toolkit, and collaborate in OHDSI research studies that advance our shared mission toward reliable evidence generation. Join the journey!
Disclaimer: Hephestus is just my experiment and is not a part of the official OHDSI toolset.
Natural language processing (NLP) is the process of using computer algorithms to identify key elements in language and extract meaning from unstructured spoken or written text. NLP combines artificial intelligence, computational linguistics, and other machine learning disciplines.
In the healthcare industry, NLP has many applications, such as interpreting clinical documents in an electronic health record. Natural language processing is important in clinical decision support systems, where it extracts meaningful information from free-text query interfaces. It may reduce transcription costs by allowing providers to dictate their notes, or generate tailored educational materials for patients ready for discharge. At a high level, NLP includes processes such as structure extraction, tokenization, tagging, part-of-speech identification and lemmatization.
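To make a few of these steps concrete, here is a toy sketch in plain Python covering sentence splitting (a crude form of structure extraction), tokenization and lemmatization. It is not how cTAKES or any production system works: the regular expressions are simplistic, the `LEMMAS` lookup table is a made-up stand-in for a real lexicon, and part-of-speech tagging is left out because it needs a trained model.

```python
import re

# Made-up lemma lookup for illustration; real systems use a full lexicon or a model.
LEMMAS = {
    "denies": "deny",
    "reports": "report",
    "pains": "pain",
    "headaches": "headache",
}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphanumeric tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", sentence.lower())


def lemmatize(tokens):
    """Replace each token with its lemma when the lookup table knows it."""
    return [LEMMAS.get(token, token) for token in tokens]


def annotate(text):
    """Return one list of lemmatized tokens per sentence."""
    return [lemmatize(tokenize(sentence)) for sentence in split_sentences(text)]


if __name__ == "__main__":
    note = "Patient denies chest pains. Reports mild headaches."
    for sentence_tokens in annotate(note):
        print(sentence_tokens)
```

Even this toy version shows why clinical text is hard: a lookup table cannot tell that "denies" negates the finding that follows it, which is the kind of context dedicated clinical NLP systems are built to handle.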
“cTAKES is a natural language processing system for extraction of information from electronic medical record clinical free-text. Originally developed at the Mayo Clinic, it has expanded to being used by various institutions internationally.”
cTAKES is relatively difficult to install and use, especially if the service needs to be shared by several systems. I have integrated cTAKES into an easy-to-use Spring Boot application that provides REST web services for clinical document annotation. The repository is here.
You need a UMLS username and password for deploying the application. RysannMD is an efficient and fast system for annotating clinical documents developed at Ryerson University. Some of my other experiments with NLP are available here.
And an (obvious) upfront disclaimer: This is a learning project. This is not for actual use.
The Discharge Abstract Database (DAD) research file consists of patient demographics, comorbidities, interventions and the length of stay for a de-identified 10% sample of hospital admissions. DAD (2014-15) has an enhanced dataset with variables that were created at Western to act as flags for ICD-10 and CCI groupings, to make using the file easier.
Here is an experiment with the DAD enhanced dataset to create a random forest model for predicting the total length of hospital stay (TLOS) in fewer than 100 lines of code. A random forest is an ensemble method that builds multiple decision trees at training time and outputs either the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). This is a learning project for Apache Spark and Spark ML using PySpark. The accuracy of the model using all derived categorical variables is low.
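The aggregation step is the easiest part of a random forest to show in isolation. The sketch below is plain Python, not the Spark ML code used in the experiment, and the per-tree predictions are invented values standing in for the output of trained decision trees.

```python
from collections import Counter
from statistics import mean


def aggregate_classification(tree_votes):
    """Return the class predicted by the largest number of trees (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]


def aggregate_regression(tree_predictions):
    """Return the average of the per-tree numeric predictions."""
    return mean(tree_predictions)


if __name__ == "__main__":
    # Invented outputs from three hypothetical trees for a single admission.
    print(aggregate_classification(["short", "long", "short"]))
    print(aggregate_regression([3.0, 5.0, 4.0]))
```

Training the individual trees on bootstrap samples and random feature subsets is what makes their errors partly independent, so that voting or averaging reduces variance; that part is left to the library.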
I have access to Apache Spark @ CC. If you are installing Spark on your own computer, you may have to change the following:
Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However, the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.