Bell Eapen MD, PhD.

Bringing Digital health & Gen AI research to life!

Six things data scientists in healthcare should know

Healthcare, like most other fields, is eager to get on the data science bandwagon. Data scientists can make a huge difference in the way big data is utilized for clinical decision-making. However, there are paradigmatic differences in the way data scientists from quantitative fields view the world, compared to their clinical counterparts. This is especially true in the emerging fields of machine learning and artificial intelligence. This may lead to considerable inefficiencies. As a person trained in both fields, here is my take on this.

Data scientists (Credit: Dasaptaerwin, CC0, via Wikimedia Commons)

Data scientists should focus on the problem and not the solutions

Data scientists get excited about the latest GPT or BERT, and tend to refine the model a bit more using 10 more GPUs! In the process, they tend to solve problems that do not exist. From my experience practicing medicine in extremely resource-poor areas, simple solutions are valued more than BERT running on Kubernetes! This is true in the developed world as well, where many teams have fundamental data needs that must be tackled first.

Explanation comes before prediction

Emerging machine learning methods prioritize prediction accuracy, compromising on explainability in the process. Clinicians, in most cases, can neither use nor trust a model that arrives at a conclusion without showing how it got there. Hence, in the clinical domain, a simple logistic regression model may be more acceptable than a deep neural network. Parsimony is the key, and a bit of feature selection to ensure parsimony will always be appreciated.

You need to know the clinical terminologies

A basic understanding of clinical terminologies and terminology systems such as SNOMED and ICD is vital. It helps in understanding the clinical community better. Any healthcare analytics effort has to consider variations in terminology and adopt a standard system for consistency. Any tool that data scientists build for the clinical community should support terminology systems.

Biostatistics is more pervasive than you think

Most healthcare professionals are trained in biostatistics. Hence, their thinking leans towards population, sampling, randomization, blinding and showing a ‘statistically significant’ difference. Moving towards machine learning needs a paradigm shift. It may be useful to have a discussion on this at the outset.

Classes are of unequal importance

In healthcare, finding one class (e.g. cancer) is more important than finding the other (e.g. no cancer). One class may need active intervention to save lives. Hence, sensitivity and specificity are more important than accuracy!
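To make the point about unequal classes concrete, here is a minimal sketch in plain Python. The cohort is made up for illustration (100 patients, 5 with cancer), and the function names are my own, not from any library. It shows how a model that never flags cancer can still look good on accuracy.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred == positive:
            fp += 1
        elif truth != positive and pred != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def screening_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: share of true cancer cases that were found.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: share of healthy patients correctly cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }


# Hypothetical cohort: 5 cancer cases (1) among 100 patients.
y_true = [1] * 5 + [0] * 95
# A useless "model" that predicts "no cancer" for everyone.
always_negative = [0] * 100

metrics = screening_metrics(y_true, always_negative)
print(metrics)
```

On this toy cohort the accuracy is 95%, yet the sensitivity is zero: every cancer case is missed. That is why reporting accuracy alone can be misleading on imbalanced clinical data.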

Life is precious!

In healthcare, there is no room for error. Some decisions may have disastrous consequences, while a few others may save lives. As a data scientist in the healthcare domain, you should be cognizant of the fact that healthcare data is different from banking or airline data.

Best peptides for women

In continuation of my previous posts here and in skin deep, let me add a few more things I found recently.

Peptide Synthesis Deprotection (Photo credit: Beige Alert)

The article titled “Systematic Discovery of New Recognition Peptides Mediating Protein Interaction Networks”, published recently in PLOS Biology, describes signalling by short peptide segments. Short peptide segments interact with globular protein domains that share a common sequence pattern (e.g., SH3 binding to PxxP). The authors also point out that sequence comparison experiments are unlikely to discover the optimum short motif. They recommend using data from genome-scale interaction studies. So the methodology adopted by the designers of peptide-21 may not be robust.

OK, now I am going to do something here that I rarely ever do on my blogs: I am going to shamelessly self-promote without actually giving away much information. My plan is to get noticed by the cosmetic tycoons and make some money out of this incessant babbling about beauty peptides 🙂

English: Example of mechanism of direct penetrating peptide (Photo credit: Wikipedia)

I used a newly published algorithm (not mine) for bioinformatics analysis and found some interesting information.

The most commonly occurring tetrapeptide patterns in collagen are in fact GXXG and PXXP, the commonest being GPPG.

The commonest tetrapeptide repeats were GERG, GEKG, GFPG, GENG, GPRG, GHRG.

This is what tetrapeptide-21 is based on, and there is nothing new till here.
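The kind of motif counting described above can be sketched with a simple sliding window. This is not the published algorithm I used, only an illustration: the sequence below is a made-up toy string rather than a real collagen sequence, `count_motif` is a name I chose, and `X` in the pattern is treated as "any residue".

```python
from collections import Counter


def matches(window, pattern):
    """True if the window fits the pattern; 'X' is a wildcard."""
    return len(window) == len(pattern) and all(
        p == "X" or p == residue for p, residue in zip(pattern, window)
    )


def count_motif(sequence, pattern):
    """Count every (overlapping) window of the sequence that fits the pattern."""
    k = len(pattern)
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return Counter(w for w in windows if matches(w, pattern))


# Toy sequence, invented for illustration only.
toy_sequence = "GPPGERGEKGPPGFPG"

gxxg = count_motif(toy_sequence, "GXXG")
pxxp = count_motif(toy_sequence, "PXXP")

print(gxxg.most_common())
print(pxxp.most_common())
```

Counting like this only tells you which tetrapeptides are frequent. It says nothing about which one is likely to have a biological (signal) function, which is a separate question.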

But here comes the most surprising part!

The highest scoring short peptide of probable biological (signal) function does not belong to the above list!!

Hey, do I hear my phone ringing?????

Lignin and Plant tissue utilization

PLOS Computational Biology: Functional Analysis of Metabolic Channeling and Regulation in Lignin Biosynthesis: A Computational Approach:

The growing energy needs of our burgeoning population can only be tackled by advances in biotechnology. Lignin is a phenolic polymer in the plant cell wall (not a protein) that reduces forage digestibility, pulping efficiency, and sugar release. Any improvement in lignin digestibility could have an enormous impact on the utilization of grains like corn and is an area of active research. This article explains a novel computational approach to decipher the intricacies of lignin biosynthesis.

Structure of a typical plant cell (Photo credit: Wikipedia)

Potential ways of improving energy utilization include reducing the indigestible cell-wall fraction, improving the digestibility of the cell wall, reducing the rate of GI passage and increasing the rate of absorption. Lignin reduction has been achieved using anti-sense genes to limit production of key enzymes on the lignin biosynthesis pathway. A genome-wide comparative survey of various grain crops might identify genes that can be targeted by an anti-sense approach. Unfortunately, Bt corn, which has been genetically modified to express the Cry1Ab protein of Bacillus thuringiensis to kill lepidopteran pests, has a higher lignin content than non-Bt corn.

Breaking down cellulose strands with the Lyocell process and cellulase activity inducers like sophorose could improve the digestibility of the cell wall. I wonder whether we could employ any of the tricks we cosmetic dermatologists use to create nanoparticles for increasing absorption!


The 4th Virtual Training Workshop on Bioinformatics in 2010, organized by the Asian Bioinformatics Research and Education Network (ABREN), was a great success, drawing 1,869 participants from 76 countries. Now they are planning to hold the 5th Workshop for two months starting in December 2011. Registration will start on Nov. 4th.

Workshop and registration site:

The art of taking online help:

I am not a big-time researcher with lots of international experience. However, I would like to suggest a few guidelines for young Indian bioinformaticians seeking online help for a project or showcasing their profile online.

How should you address a researcher online? In the research community, people are generally not bothered much about shows of respect. Hence “sir”, “respected sir”, “the most adorable” etc. can be read as a lack of confidence or as too much submissiveness. It is appropriate to address anybody by the second name with the appropriate title. Just using the first name is also OK. However, titles are often taken seriously, and addressing a Dr/Prof as Mr is a cardinal sin even if you add a liberal dose of sir/almighty to that.

Career guidance is often done face to face, over the phone or through forums specifically dedicated to it. However, before posting career guidance questions to a forum, search the forum for similar questions unless your profile is unique. A question like “I am going to finish my kindergarten. What should I do next to become a successful bioinformatician?” is unlikely to fetch many answers. If you don't have enough time to search the forum, don't expect anybody else to send you a personal two-page letter.

The same applies to very broad, open-ended questions. A question like “How is bioinformatics important for clinical medicine?” is unlikely to get much attention. Be as specific as possible. Do not expect others to provide complete answers on a platter. Answers will mostly be very short, incomplete and often cryptic (because you may not know what the other person is talking about). Be ready to do some background research on the answer rather than asking for more information.

Posting your profile in online forums is also an art. Bioinformatics is a very broad field and employers look for certain specific skills which you may not always have. I often see sequence analysis, genomics, proteomics, computer programming, PERL, RUBY, EMERALD, systems biology, drug designing, structural and molecular biology, talking, reading and sleeping in the skill set, with everyone competing to make the complete list. In reality, nobody can be a complete bioinformatician, and it is better to showcase your core competency, which needs to be substantiated by your projects or publications.

Please post your comments / criticisms / suggestions here.

Bioinformatics Projects

I feel bioinformatics projects broadly fall into three categories.

  1. Academic projects as part of a UG or PG course
  2. Professional projects for biotech/drug companies
  3. Hobbyist/personal projects

When you do an academic project it is important to achieve a preset target within a limited time frame. Hence you have to adopt a bottom up approach wherein you know what your final result is going to be and work your way up. It is always better to keep it simple. You always have time to do more complicated things later on.

Professional projects also have a predefined goal. They have a wider scope, but often have the backing of a team. Funding is also available. This is what most of us aspire to do once we become full-fledged professionals.

The third type of project is for people who are not primarily bioinformaticians, but who try to explore this nascent specialty with their own field as the entry point. They often try a top-down approach and may not always be successful!

Let me suggest the following topics, categorized based on your area of expertise.

  • IT: Prepare a database (organism, disease, or any other) and deploy it online.
  • Microbiology: Select an organism and do a comparative genomic study.
  • Biochemistry: Model a pathway using systems biology tools and discuss its clinical significance.
  • Pharmacology: Docking studies and study of the targets of existing drugs.
  • Clinicians: Expression profile study of any disease of interest.
  • Vet / Agri: Functional genomic study of any chosen gene.