Bell Eapen MD, PhD.

Bringing Digital health & Gen AI research to life!

Come, join us to make generative AI in healthcare more accessible! 

ChatGPT captured the imagination of the healthcare world, though it also led to the rather misguided belief that all healthcare needs is a chatbot application that can make API calls. A more realistic and practical way to leverage generative AI in healthcare is to focus on specific problems that can benefit from its ability to synthesize and augment data, generate hypotheses and explanations, and enhance communication and education. 

Generative AI Image credit: Bovee and Thill, CC BY 2.0, via Wikimedia Commons

One of the main challenges of applying generative AI in healthcare is that it requires a high level of technical expertise and resources to develop and deploy solutions. This creates a barrier for many healthcare organizations, especially smaller ones, that do not have the capacity or the budget to build or purchase customized applications. As a result, generative AI applications are often limited to large health systems that can invest in innovation and experimentation. Needless to say, this has widened the already big digital healthcare disparity. 

One of my goals is to use some of the experience that I have gained as part of an early adopter team to increase the use and availability of Gen AI in regions where it can save lives. I think it is essential to incorporate this mission in the design thinking itself if we want to create applications that we can scale everywhere. What I envision is a platform that can host and support a variety of generative AI applications that can be easily accessed and integrated by healthcare organizations and professionals. The platform would provide the necessary infrastructure, tools, and services to enable developers and users to create, customize, and deploy generative AI solutions for various healthcare problems. The platform would also foster a community of practice and collaboration among different stakeholders, such as researchers, clinicians, educators, and patients, who can share their insights, feedback, and best practices. 

I have done some initial work, guided by my experience with OpenMRS, and I have been greatly inspired by Bhamini. The focus is on modular design at both the UI and API layers. OpenMRS O3 and LangServe templates show promise for modular design. I hope to release the first iteration on GitHub in late August 2024. 

Do reach out in the comments below if you wish to join this endeavour, and together we can shape the future of healthcare with generative AI. 

Why is RAG not suitable for all Generative AI applications in healthcare?

Retrieval-augmented generation (RAG) is a method of generating natural language that leverages external knowledge sources, such as large-scale text corpora. RAG first retrieves a set of relevant documents for a given input query or context and then uses these documents as additional input for a neural language model that generates the output text. RAG aims to improve the factual accuracy, diversity, and informativeness of the generated text by incorporating knowledge from the retrieved documents. 
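
The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: the bag-of-words similarity stands in for a real embedding model, the corpus is a hypothetical toy example (not clinical guidance), and the final call to a language model is left out, so the sketch stops at the augmented prompt.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and drop surrounding punctuation."""
    tokens = (tok.strip(".,;:!?()").lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True)
    return ranked[:k]


def build_prompt(query: str, documents: list[str]) -> str:
    """Assemble the augmented prompt that would be sent to the language model."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical toy corpus, for illustration only.
CORPUS = [
    "Metformin is a first-line oral medication for type 2 diabetes.",
    "Amoxicillin is a penicillin-class antibiotic.",
    "Regular exercise helps control blood glucose in type 2 diabetes.",
]

query = "What is a first-line medication for type 2 diabetes?"
top_docs = retrieve(query, CORPUS, k=2)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Whatever the generator does next, it only sees what `retrieve` returns, so the quality of the output is bounded by the quality of the corpus and the ranking.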

RAG applications
Image credit: Nomen4Omen with relabelling by Felix QW, CC BY-SA 3.0 DE, via Wikimedia Commons

However, RAG may not be suitable for all healthcare applications, for the following reasons: 

– RAG relies on the quality and relevance of the retrieved documents, which may not always be available or accurate for specific healthcare domains or tasks. For example, if the task is to generate a personalized treatment plan for a patient based on their medical history and symptoms, RAG may not be able to retrieve any relevant documents from a general-domain corpus, or it may retrieve outdated or inaccurate information that could harm the patient’s health. 

– RAG may not be able to capture the complex and nuanced context of healthcare scenarios, such as the patient’s preferences, values, goals, emotions, or social factors. These aspects may not be explicitly stated in the retrieved documents, or they may require additional knowledge and reasoning to infer. For example, if the task is to generate empathetic and supportive messages for a patient who is diagnosed with a terminal illness, RAG may not be able to consider the patient’s psychological state, coping strategies, or family situation, and may generate generic or inappropriate responses that could worsen the patient’s distress.

– RAG may not be reliable for summarizing a patient’s medical history, as it may not be able to extract the most relevant and important information from the retrieved documents, which may contain a lot of noise, redundancy, or inconsistency. For example, if the task is to generate a concise summary of a patient’s chronic conditions, medications, allergies, and surgeries, RAG may not be able to filter out irrelevant or outdated information, such as the patient’s demographics, vital signs, test results, or minor complaints, or it may include conflicting or duplicate information from different sources. This could lead to a confusing or inaccurate summary that could misinform the patient or the healthcare provider. 

Therefore, RAG is not suitable for all Generative AI applications in healthcare, and it may require careful design, evaluation, and adaptation to ensure its safety, reliability, and effectiveness in specific healthcare contexts and tasks. 

Cite this article as: Eapen BR. (May 11, 2024). Why is RAG not suitable for all Generative AI applications in healthcare? Retrieved July 17, 2024, from

Grounding vs RAG in Healthcare Applications 

Both grounding and RAG (Retrieval-Augmented Generation) play significant roles in enhancing LLMs’ capabilities and effectiveness and in reducing hallucinations. In this post, I delve into the subtle differences between RAG and grounding, exploring their use in generative AI applications in healthcare. 

Grounding vs RAG
Image credit: Michael Havens, CC BY 2.0, via Wikimedia Commons

What is RAG? 

RAG, short for Retrieval-Augmented Generation, represents a paradigm shift in the field of generative AI. It combines the power of retrieval-based models with the fluency and creativity of generative models, resulting in a versatile approach to natural language processing tasks.  

  • RAG has two components: the first retrieves relevant information, and the second generates text based on the retrieved context. 
  • By incorporating a retrieval mechanism into the generation process, RAG can enhance the model’s ability to access external knowledge sources and incorporate them seamlessly into its responses. 
  • RAG models excel in tasks that require a balance of factual accuracy and linguistic fluency, such as question-answering, summarization, decision support and dialogue generation. 

Understanding Grounding 

On the other hand, grounding in the context of AI refers to the process of connecting language to its real-world referents or grounding sources in perception, action, or interaction. 

  • It helps models establish connections between words, phrases, and concepts in the text and their corresponding real-world entities or experiences. 
  • Through grounding, AI systems can learn to associate abstract concepts with concrete objects, actions, or situations, enabling more effective communication and interaction with humans. 
  • It serves as a bridge between language and perception, enabling AI models to interpret and generate language in a contextually appropriate manner. 

Contrasting RAG and Grounding 

While RAG and grounding both contribute to enhancing AI models’ performance and capabilities, they operate at different levels and serve distinct purposes in the generative AI landscape. 

  • RAG focuses on improving the generation process by incorporating a retrieval mechanism for accessing external knowledge sources and enhancing the model’s output fluency. 
  • Grounding, on the other hand, emphasizes connecting language to real-world referents, ensuring that AI systems can understand and generate language in a contextually meaningful way. 
  • In general, grounding uses a simpler and faster model with a lower temperature setting, while RAG uses more “knowledgeable” models at higher temperatures. 
  • Grounding can also be achieved by finetuning a model on the grounding sources. 
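
To make the temperature point concrete, here is a small sketch of how temperature reshapes the probabilities a model samples from. The logits are hypothetical values for three candidate next tokens, and the temperatures 0.2 and 1.5 are chosen only for illustration. A low temperature concentrates almost all probability on the top candidate, which suits output that should stay close to its sources. A higher temperature spreads probability out and allows more varied output.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to sampling probabilities at a given temperature (> 0)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 1.5)  # flatter, more varied sampling

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```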

Implications for Healthcare applications 

In the domain of healthcare, grounding is especially useful when the primary intent is to retrieve information for the clinician at the point of patient care. Typically, generation is based on patient information or policy information and the emphasis is on generating content that does not deviate much from the grounding sources. The variation from the source can be quantitatively measured and monitored easily. 
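
One simple way to put a number on "variation from the source" is to check how much of the generated text is literally supported by the grounding source. The sketch below uses bigram overlap as that measure. This is an assumption made for illustration, not an established clinical metric, and the example texts are hypothetical.

```python
def ngrams(text: str, n: int = 2) -> list[tuple[str, ...]]:
    """Split text into lowercase word n-grams, ignoring surrounding punctuation."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def source_support(output: str, source: str, n: int = 2) -> float:
    """Fraction of the output's n-grams that also occur in the source (0.0 to 1.0)."""
    out = ngrams(output, n)
    if not out:
        return 0.0
    src = set(ngrams(source, n))
    return sum(1 for gram in out if gram in src) / len(out)


# Hypothetical grounding source and two candidate outputs.
source = "Patient is allergic to penicillin. Current medications: metformin 500 mg twice daily."
faithful = "Patient is allergic to penicillin."
drifted = "Patient is allergic to sulfa drugs and takes insulin."

print(source_support(faithful, source))
print(source_support(drifted, source))
```

A score tracked over time, with an alert when it drops below a chosen threshold, is one way such monitoring could be set up.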

In contrast, RAG is useful in situations where LLMs are actively involved in interpreting the information provided to them in the prompt and making inferences, decisions or recommendations based on the knowledge originally captured in the model itself. The expectation is not to base the output on the input, but to use the provided information for intelligent and useful interpretations. It is difficult to quantitatively assess and monitor RAG and some form of qualitative assessment is often needed. 

In conclusion, RAG and grounding represent essential components in the advancement of generative AI and LLMs. By understanding the nuances of these concepts and their implications for healthcare, researchers and practitioners can harness their potential to create more intelligent and contextually aware applications. 

Navigating the Complexities of Gen AI in Medicine: 5 Development Blunders to Avoid

Below, I have listed five critical missteps that you should steer clear of to ensure the successful integration of Gen AI in Medicine. This post is primarily for healthcare professionals managing a software team developing a Gen AI application.

Image credit: Nicolas Rougier, GPL via Wikimedia Commons

#1 Focus on requirements

Gen AI is an evolving technological landscape. ChatGPT’s user interface makes it look simple. Even a simple interface to any LLM is useful for mundane clinical chores (provided PHI is handled appropriately). However, developing an application that can automate tasks or assist clinical decision-making requires considerable engineering. It’s crucial to define clear and detailed requirements for your Gen AI solution. Without a comprehensive understanding of the needs and constraints, your project can easily become misaligned with clinical goals. Ensure that your AI application is not only technically sound but also meets the specific demands of healthcare settings. This precision will guide your software development process, avoiding costly detours or features that do not add value to healthcare providers or patients.

#2 Avoid solutioning

When working with your software team, be wary of dictating specific semi-technical solutions too early in the process. Most applications require techniques beyond prompting in a text window. It’s essential to allow your engineering team to explore and assess various options that can best meet the outlined requirements. By fostering an environment where creative and innovative problem-solving flourishes, you enable the team to find the most effective and sustainable technological path. This approach can also lead to discoveries of new capabilities of Gen AI that could benefit your project.

#3 Prioritize features

It’s essential to prioritize the features that will bring the most value to the end-user. Engage with stakeholders, including clinicians and patients, to understand what functionalities are most critical for their workflow and care delivery. This collaborative approach ensures the practicality of the AI application and aligns it with user needs. Avoid overloading your app with unnecessary features that complicate the user experience and detract from the core value proposition. Instead, aim for a lean product with high-impact features.

Gen AI app development is a time-consuming and technically challenging process. It is important to keep this in mind while prioritizing. Time and resource management are key in this regard. Allocate sufficient time for your team to refine their work, ensuring that each feature is developed with quality and precision. This disciplined approach to scheduling also helps in avoiding burnout among your team members, which is common in high-pressure development environments. Remember, a feature-packed application that lacks reliability or user-friendliness is less likely to be embraced by the healthcare community. Focus on delivering a polished, useful tool.

#4 You may never get it right the first time when it comes to Gen AI in Medicine

Accept that perfection is unattainable on the initial try. In the world of software, especially with Gen AI, iterative testing and refinement are key. Encourage your team to build a Minimum Viable Product (MVP) and then improve it through user feedback and continuous development cycles. This iterative process is crucial to adapt to the ever-changing needs of healthcare professionals and to integrate the latest advancements in AI. Also, don’t underestimate the value of user testing; real-world feedback is invaluable.

#5 Avoid technology pivots and information overloads

Avoiding abrupt technological shifts late in the development cycle is critical. Such pivots can be costly and disruptive, derailing the project timeline. Stay committed to the chosen technology stack unless significant, unforeseeable impediments arise. Additionally, guard against overwhelming your team with excessive information. While staying informed is crucial, too much data can paralyze decision-making. Strive for a balance that empowers your team with the knowledge they need to be effective without causing analysis paralysis.

In my next post, I will explain the symbols and notations that I employ in my Gen AI in Medicine development process. BTW, what is your next Gen AI in Medicine project?

Medprompt: How to architect LLM solutions for healthcare.

Leveraging the power of advanced machine learning, particularly large language models (LLMs), has increasingly become a transformative element in healthcare and medicine. The applications of LLMs in healthcare are multifaceted, showing immense potential to improve patient outcomes, streamline administrative tasks, and foster medical research and innovation.

Image Credit: David S. Soriano, CC BY-SA 4.0, via Wikimedia Commons

Architecting LLM solutions in the healthcare domain is challenging because of the intricacies associated with healthcare data and the complex nature of healthcare applications. In this post, I will give some recommendations based on the widely popular LangChain library, giving some examples.

The first step is to define the overarching problem you are trying to solve. It can be as broad as getting the right information about a patient to the doctor. Next, subdivide the problem into subproblems that can be tackled separately. For example, in the above case, we need to find the patient’s health record, convert it into an easily searchable form (embedding), find areas of interest in the record, and generate a summary or an answer to the specific question. Next, find solutions for each subproblem, which may or may not require an LLM. Finally, design the orchestrator that can stitch everything together.

LangChain has some useful abstractions that will help in the last two steps. If the solution does not involve an LLM and mostly involves data retrieval and transformations, use the tool abstraction. If you need one or more LLM calls to achieve it, use the chain abstraction. Agents are the orchestrators that can stitch everything together. It is important to carefully craft the prompts for the chains and agents. Rigorous testing is vital. This includes technical performance and validation of the model’s recommendations by healthcare professionals to ensure they are accurate and clinically relevant.
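
The division of labour between tools, chains and the orchestrator can be sketched in plain Python. The code below deliberately does not use LangChain's API: every name in it (`fetch_record`, `find_relevant`, `make_answer_chain`, `orchestrate`, `fake_llm`) is a hypothetical stand-in, and the record store is a toy example. One simplification to note: a real agent typically lets the LLM decide which tool to call next, whereas this orchestrator runs the steps in a fixed order.

```python
from typing import Callable

# Hypothetical in-memory "health record" store, for illustration only.
RECORDS = {
    "patient-1": [
        "2021: Diagnosed with type 2 diabetes.",
        "2022: Started metformin.",
        "2023: Reported seasonal allergies.",
    ]
}


def fetch_record(patient_id: str) -> list[str]:
    """Tool: pure data retrieval, no LLM involved."""
    return RECORDS.get(patient_id, [])


def find_relevant(entries: list[str], question: str) -> list[str]:
    """Tool: keep entries that share a keyword with the question."""
    keywords = {w.strip("?.,").lower() for w in question.split() if len(w) > 3}
    return [e for e in entries if keywords & {w.strip(".,:").lower() for w in e.split()}]


def make_answer_chain(llm: Callable[[str], str]) -> Callable[[list[str], str], str]:
    """Chain: formats a prompt and makes one LLM call."""
    def chain(entries: list[str], question: str) -> str:
        prompt = "Record:\n" + "\n".join(entries) + f"\n\nQuestion: {question}"
        return llm(prompt)
    return chain


def orchestrate(patient_id: str, question: str, llm: Callable[[str], str]) -> str:
    """Orchestrator: stitches the tools and the chain together in a fixed order."""
    entries = fetch_record(patient_id)
    relevant = find_relevant(entries, question) or entries
    return make_answer_chain(llm)(relevant, question)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: echoes the first record line it was given."""
    return prompt.splitlines()[1]


print(orchestrate("patient-1", "When was metformin started?", fake_llm))
```

Keeping the retrieval steps as LLM-free tools means they can be unit tested like any other function, which leaves only the chain and its prompt for clinical validation.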

MEDPrompt coming soon!

Named Entity Recognition using LLMs: a cTakes alternative?

TLDR: The targeted distillation method described may be useful for creating an LLM-based cTakes alternative for Named Entity Recognition. However, the recipe is not available yet. 

Image credit: Wikimedia

Named Entity Recognition is essential in clinical documents because it enhances patient safety, supports efficient healthcare workflows, aids in research and analytics, and ensures compliance with regulations. It enables healthcare organizations to harness the valuable information contained in clinical documents for improved patient care and outcomes. 

Though Large Language Models (LLMs) can perform Named Entity Recognition (NER), the capability can be improved by fine-tuning, where you provide the model with input text that contains named entities and their associated labels. The model learns to recognize these entities and classify them into predefined categories. However, as described before, fine-tuning LLMs is challenging due to the need for substantial, high-quality labelled data, the risk of overfitting on limited datasets, complex hyperparameter tuning, the requirement for computational resources, domain adaptation difficulties, ethical considerations, the interpretability of results, and the necessity of defining appropriate evaluation metrics. 
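
As an illustration of what "text with entities and their associated labels" can look like, the sketch below converts annotated spans into per-token BIO tags (Begin, Inside, Outside). BIO tagging is one common labelling scheme for NER and is used here as an assumption; the sentence, the spans and the label names `SYMPTOM` and `MEDICATION` are hypothetical.

```python
def spans_to_bio(tokens: list[str], spans: list[tuple[int, int, str]]) -> list[str]:
    """Convert (start, end, label) token spans, end exclusive, into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):  # remaining tokens of the entity
            tags[i] = f"I-{label}"
    return tags


# Hypothetical annotated sentence.
tokens = ["Patient", "denies", "chest", "pain", "and", "takes", "aspirin", "daily"]
spans = [(2, 4, "SYMPTOM"), (6, 7, "MEDICATION")]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```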

Targeted distillation of Large Language Models (LLMs) is a process where a smaller model is trained to mimic the behaviour of a larger, pre-trained LLM but only for specific tasks or domains. It distills the essential knowledge of the LLM, making it more efficient and suitable for particular applications, reducing computational demands.  

The paper referenced below describes targeted distillation with mission-focused instruction tuning to train student models that can excel in a broad application class. The authors present a general recipe for such targeted distillation from LLMs and demonstrate it for open-domain NER. Their recipe may be useful for creating efficient distilled models that can perform NER on clinical documents, a potential alternative to cTakes. Though the authors have open-sourced their generic UniversalNER model, they haven’t released the distillation recipe code yet. 

REF: Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2023). UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. ArXiv. /abs/2308.03279 

Distilling LLMs to small task-specific models

Deploying large language models (LLMs) can be difficult because they require a lot of memory and computing power to run efficiently. Companies want to create smaller task-specific LLMs that are cheap and easy to deploy. Such small models may even be more interpretable, an important consideration in healthcare.

Distilling LLMs

Distilling LLMs refers to the process of training a smaller, more efficient model to mimic the behaviour of a larger, more complex LLM. This is done by training the smaller model on the same task as the larger model but using the predictions of the larger model as “soft targets” or guidance during training. The goal of distillation is to transfer the knowledge and capabilities of the larger model to the smaller model, without requiring the same level of computational resources.
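
The "soft targets" idea can be written down as a loss: the student is penalized by how far its softened output distribution is from the teacher's. The sketch below uses the KL divergence between temperature-softened distributions, a common formulation in knowledge distillation. The logits and the temperature of 2.0 are hypothetical, and a real training loop would compute this with a deep learning framework rather than plain Python.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Temperature-softened softmax; higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: list[float], student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over the softened distributions (the soft targets)."""
    p = softmax(teacher_logits, temperature)  # teacher's soft targets
    q = softmax(student_logits, temperature)  # student's predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits over three classes.
teacher = [3.0, 1.0, 0.2]
student_close = [2.8, 1.1, 0.3]
student_far = [0.2, 1.0, 3.0]

print(distillation_loss(teacher, student_close))
print(distillation_loss(teacher, student_far))
```

The student that ranks the classes like the teacher gets a loss near zero, while the one that inverts the ranking is penalized heavily.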

Distilling step-by-step is an efficient distillation method proposed by Google that requires less training data. The intuition is that training on rationales generated by chain-of-thought prompting, alongside the labels, frames distillation as multi-task learning and improves its performance. We can use ground truth labels or use a teacher LLM to generate the labels and rationales. Ground truth labels are the correct labels for the data, and they are typically obtained from human annotators. The rationale for each label can be generated by asking the teacher LLM for a short explanation of why that label applies.
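
The multi-task framing can be illustrated by how a single annotated input becomes two training pairs for a seq2seq student: one that asks for the label and one that asks for the rationale. This is a minimal sketch under stated assumptions. The task prefixes `[label]` and `[rationale]`, the `Annotated` class, the example text and the loss weight are all chosen for illustration and are not taken from the original repository.

```python
from dataclasses import dataclass


@dataclass
class Annotated:
    text: str
    label: str      # ground truth or teacher-generated
    rationale: str  # teacher-generated explanation


def to_multitask_examples(item: Annotated) -> list[tuple[str, str]]:
    """Turn one annotated input into two (input, target) seq2seq training pairs."""
    return [
        (f"[label] {item.text}", item.label),
        (f"[rationale] {item.text}", item.rationale),
    ]


def combined_loss(label_loss: float, rationale_loss: float, weight: float = 1.0) -> float:
    """Multi-task objective: label loss plus a weighted rationale loss."""
    return label_loss + weight * rationale_loss


# Hypothetical example.
item = Annotated(
    text="Is the note about a medication? Note: Started metformin 500 mg.",
    label="yes",
    rationale="The note mentions starting metformin, which is a medication.",
)

pairs = to_multitask_examples(item)
for source_text, target_text in pairs:
    print(source_text, "->", target_text)
```

At inference time only the label task is needed, so the rationales add training signal without adding cost when the small model is deployed.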

The paper on the method is here and the repository is here. I have converted the code from the original repository into a tool that can distill any seq2seq model into a smaller model based on a generic schema. See the repository below. The original paper uses Google’s T5-v1 model, part of the T5 (Text-to-Text Transfer Transformer) family, which is based on the Transformer architecture. You can find more open-source base models for distilling on Hugging Face. The next plan is to use this method to create a model that can predict the FHIR filter for this repository.

Distilling LLMs step by step!

I will update this post regularly with my findings and notes on distilling models. Also, please check out my post on NLP tools in healthcare.

Kedro for multimodal machine learning in healthcare 

Healthcare data is heterogeneous, with several types of data such as reports, tabular data, and images. Combining multiple modalities of data into a single model can be challenging for several reasons. One challenge is that the diverse types of data may have different structures, formats, and scales, which can make it difficult to integrate them into a single model. Additionally, some modalities of data may be missing or incomplete, which can make it difficult to train a model effectively. Another challenge is that different modalities of data may require different types of pre-processing and feature extraction techniques, which can further complicate the integration process. Furthermore, the lack of large-scale, annotated datasets that have multiple modalities of data can also be a challenge. Despite these challenges, advances in deep learning, multi-task learning and transfer learning are making it possible to develop models that can effectively combine multiple modalities of data and achieve reliable performance. 

Kedro for multimodal machine learning

Kedro is an open-source Python framework that helps data scientists and engineers organize their code, increase productivity and collaboration, and make it easier to deploy their models to production. It works with popular libraries such as Pandas, TensorFlow and PySpark, and follows best practices from software engineering, such as modularity and code reusability. Kedro supplies a standardized structure for organizing code, handling data and configuration, and running experiments. It also includes support for data versioning, logging, and testing, making it easier to implement reproducible and maintainable pipelines. Additionally, Kedro pipelines can be deployed on cloud platforms like AWS, GCP or Azure. This makes it a powerful tool for creating robust and scalable data science and data engineering pipelines. 

I have built a few Kedro packages that can make multi-modal machine learning easy in healthcare. The packages supply prebuilt pipelines for preprocessing image, tabular and text data, and for building fusion models that can be trained on multi-modal data for easy deployment. The text preprocessing package currently supports BERT and CNN-text models. There is also a template that you can copy to build your own pipelines, making use of the preprocessing pipelines that I have built. Any number and combination of data types are supported. Additionally, like any other Kedro pipeline, these can be deployed on Kubeflow and Vertex AI. Do comment below if you find these tools useful in your research. 
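
To give a feel for what a fusion step does, here is a minimal plain-Python sketch of one simple approach: concatenating per-modality feature vectors into a single input for a downstream model. It does not use the API of the packages above. The function names, feature values and weights are hypothetical, and in practice the feature vectors would come from the preprocessing pipelines, with the final model being a trained network rather than a fixed linear score.

```python
def fuse_features(modalities: dict[str, list[float]], order: list[str]) -> list[float]:
    """Fusion by concatenation: join per-modality feature vectors in a fixed order."""
    fused: list[float] = []
    for name in order:
        fused.extend(modalities[name])
    return fused


def linear_score(features: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Toy downstream model: a weighted sum over the fused feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical per-modality features for one patient.
modalities = {
    "image": [0.2, 0.8],
    "tabular": [1.0, 0.0],
    "text": [0.5],
}
order = ["image", "tabular", "text"]

fused = fuse_features(modalities, order)
score = linear_score(fused, weights=[1.0] * len(fused))
print(fused, score)
```

Fixing the modality order matters: the downstream model learns weights by position, so the same order has to be used at training and at inference time.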

kedro-multimodal by dermatologist

Template for multi-modal machine learning in healthcare using Kedro. Combine reports, tabular data and images using various fusion methods.