Machine learning Archives

Random forest model for predicting the total length of hospital stay (TLOS)

TL;DR here is the Random Forest classifier code:

And an (obvious) upfront disclaimer: This is a learning project. This is not for actual use.

DAD (the Discharge Abstract Database) records patient demographics, comorbidities, interventions and the length of stay for a de-identified 10% sample of hospital admissions. The DAD (2014-15) file has an enhanced version with variables that were created at Western to act as flags for ICD-10 and CCI groupings, which makes the file easier to use.

Here is an experiment with the DAD enhanced dataset: a Random Forest model for predicting the total length of hospital stay (TLOS) in fewer than 100 lines of code. A random forest is an ensemble method that builds multiple decision trees at training time and outputs the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression). This is a learning project for Apache Spark and Spark ML using pyspark. With all the derived categorical variables included, the accuracy of the model is low.
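To make the aggregation step concrete, here is a minimal sketch in plain Python of the textbook rule described above: majority vote for classification, mean for regression. The function names and the "short"/"long" labels are made up for illustration, and Spark ML's own implementation may combine the trees differently.

```python
from collections import Counter
from statistics import mean


def forest_classify(tree_votes):
    """Return the most common class label among the per-tree predictions."""
    return Counter(tree_votes).most_common(1)[0][0]


def forest_regress(tree_predictions):
    """Return the mean of the per-tree numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of three trees for a single admission.
votes = ["short", "long", "short"]
print(forest_classify(votes))

# Hypothetical length-of-stay predictions (in days) from three trees.
print(forest_regress([3.0, 5.0, 4.0]))
```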

I have access to Apache Spark @ CC. If you are installing Spark on your own computer, you may have to change the following memory settings:

SparkContext.setSystemProperty('spark.executor.memory', '48g')
SparkContext.setSystemProperty('spark.driver.memory', '6g')

Some of the commonly tweaked parameters can be changed here:

RF_NUM_TREES = 3
RF_MAX_DEPTH = 4
RF_MAX_BINS = 12
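As a rough sketch of where these constants would end up, the snippet below collects them into keyword arguments for pyspark's RandomForestClassifier (numTrees, maxDepth, maxBins, labelCol and featuresCol are real parameters of that class). The helper names are hypothetical, and using TLOS_CAT as the label column and "features" as the assembled feature column is an assumption, not something taken from the repo. Spark requires maxBins to be at least the number of categories of any categorical feature, so 12 may need raising if a variable has more levels than that.

```python
RF_NUM_TREES = 3
RF_MAX_DEPTH = 4
RF_MAX_BINS = 12


def rf_params(label_col="TLOS_CAT", features_col="features"):
    """Collect the tunable settings as keyword arguments (column names are assumptions)."""
    return {
        "labelCol": label_col,
        "featuresCol": features_col,
        "numTrees": RF_NUM_TREES,
        "maxDepth": RF_MAX_DEPTH,
        "maxBins": RF_MAX_BINS,
    }


def build_classifier(params):
    """Create the Spark ML estimator; needs pyspark and a running Spark session."""
    # Imported lazily so the rest of the snippet loads on machines without Spark.
    from pyspark.ml.classification import RandomForestClassifier

    return RandomForestClassifier(**params)


print(rf_params())
```

More trees and deeper trees usually cost more memory and training time, so it is worth changing one value at a time.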

Uncomment the following line to include only variables that you need.

# df.select([c for c in df.columns if c in ['TLOS_CAT', 'COLNAME', 'COLNAME']]).show()
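The commented line filters df.columns with a list comprehension and then calls .show(), which only prints the selection. Here is a minimal sketch of the same filter in plain Python. Apart from TLOS_CAT, the column names are hypothetical stand-ins for the COLNAME placeholders, and keep_columns is a made-up helper. In pyspark the resulting list would be passed to df.select(...) and the result assigned back to df to actually restrict the columns.

```python
def keep_columns(all_columns, wanted):
    """Return the entries of all_columns that appear in wanted, in original order."""
    wanted = set(wanted)
    return [c for c in all_columns if c in wanted]


# Hypothetical column names; only TLOS_CAT appears in the post.
columns = ["TLOS_CAT", "AGE_GROUP", "SEX", "ADMIT_TYPE"]
selected = keep_columns(columns, ["TLOS_CAT", "SEX"])
print(selected)

# In pyspark, assuming df is the loaded DataFrame:
# df = df.select(selected)
```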

Here is the repo. How can this model be improved? Maybe a PCA before the RF? Or am I missing something important?

Parts of this material are based on the Canadian Institute for Health Information Discharge Abstract Database Research Analytic Files (sampled from fiscal years 2014-15). However, the analysis, conclusions, opinions and statements expressed herein are those of the author(s) and not those of the Canadian Institute for Health Information.

Published by Bell Eapen on August 29, 2018 | Permalink

Negative N to Unknown U

The identification of disease-specific genes is pivotal in clinical informatics. This paper describes an improved machine learning algorithm in which the negative set N is treated more appropriately as an unlabeled (unknown) set U.
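To illustrate what treating the data as positive (P) and unlabeled (U) can look like, here is a minimal sketch of one generic positive-unlabeled heuristic: pick the unlabeled points farthest from the centroid of the positives as "reliable negatives". This is not the algorithm from the paper. The function names and the two-dimensional toy points are invented for illustration.

```python
from math import dist


def centroid(points):
    """Return the coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def reliable_negatives(positives, unlabeled, k):
    """Pick the k unlabeled points farthest from the positive centroid."""
    c = centroid(positives)
    ranked = sorted(unlabeled, key=lambda p: dist(p, c), reverse=True)
    return ranked[:k]


# Toy feature vectors: known disease genes (P) and unlabeled genes (U).
positives = [(1.0, 1.0), (1.2, 0.8)]
unlabeled = [(1.1, 0.9), (5.0, 5.0), (0.9, 1.1), (4.0, 6.0)]
print(reliable_negatives(positives, unlabeled, 2))
```

The unlabeled points that sit close to the positives are left unlabeled rather than forced into the negative class, which is the point of moving from N to U.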


Peng Yang, Xiao-Li Li, Jian-Ping Mei, Chee-Keong Kwoh, and See-Kiong Ng. Positive-Unlabeled Learning for Disease Gene Identification. Bioinformatics, first published online August 24, 2012. doi:10.1093/bioinformatics/bts504

SVMs are an important tool in the bioinformatician's armamentarium. Weka is a collection of machine learning algorithms for data mining tasks.

Published by Bell Eapen on September 25, 2012 | Permalink

Bell Eapen MD, PhD.
