The integration of generative AI into healthcare has the potential to revolutionize the industry, from drug discovery to personalized medicine. However, the success of these applications hinges on the availability of high-quality, curated datasets such as MIMIC. These datasets are crucial for training and testing AI models to ensure they can perform tasks accurately and reliably.
The Medical Information Mart for Intensive Care (MIMIC) dataset is a comprehensive, freely accessible database developed by the Laboratory for Computational Physiology at MIT. It includes deidentified health data from over 40,000 critical care patients admitted to the Beth Israel Deaconess Medical Center between 2001 and 2012. The dataset encompasses a wide range of information, such as demographics, vital signs, laboratory test results, medications, and caregiver notes. MIMIC is notable for its detailed and granular data, which supports diverse research applications in epidemiology, clinical decision-making, and the development of electronic health tools. The open nature of the dataset allows for reproducibility and broad use in the scientific community, making it a valuable resource for advancing healthcare research.
MIMIC-IV has been converted into the Fast Healthcare Interoperability Resources (FHIR) format and exported as newline-delimited JSON (ndjson). FHIR provides a structured way to represent healthcare data, ensuring consistency and reducing the complexity of data integration. However, importing the ndjson export into a FHIR server can be challenging. Having MIMIC-IV loaded onto a FHIR server provides a consistent and reproducible environment for developing and testing generative AI applications against standardized healthcare data, which could ultimately lead to more robust and reliable AI applications in healthcare. Here I show you how to do it in two easy steps using Docker and the MIMIC-IV demo dataset.
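For readers new to ndjson: each line of the file is one complete JSON document, here one FHIR resource per line. The Python sketch below shows how such a file could be read. The two sample resources are made up for illustration and are not taken from MIMIC-IV.

```python
import json


def parse_ndjson(text):
    """Parse newline-delimited JSON: one JSON document per non-empty line."""
    resources = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        resources.append(json.loads(line))
    return resources


# Hypothetical sample, NOT real MIMIC-IV data.
SAMPLE = (
    '{"resourceType": "Patient", "id": "example-1", "gender": "female"}\n'
    '{"resourceType": "Patient", "id": "example-2", "gender": "male"}\n'
)

patients = parse_ndjson(SAMPLE)
```

Because every line stands on its own, a server can stream and batch the file without loading it all into memory, which is what makes the format suitable for bulk import.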
STEP 1: Start the FHIR server
Use Docker Compose to spin up the latest HAPI FHIR server with bulk data import enabled, using the docker-compose.yml file below.
version: "3.7"
services:
  fhir:
    image: hapiproject/hapi:latest
    ports:
      - 8080:8080
    restart: "unless-stopped"
    environment:
      - hapi.fhir.bulkdata.enabled=true
      - hapi.fhir.bulk_export_enabled=true
      - hapi.fhir.bulk_import_enabled=true
      - hapi.fhir.cors.enabled=true
      - hapi.fhir.cors.allow_origin=*
      - hapi.fhir.enforce_referential_integrity_on_write=false
      - hapi.fhir.enforce_referential_integrity_on_delete=false
      - "spring.datasource.url=jdbc:postgresql://postgres-db:5432/postgres"
      - "spring.datasource.username=postgres"
      - "spring.datasource.password=postgres"
      - "spring.datasource.driverClassName=org.postgresql.Driver"
      - "spring.jpa.properties.hibernate.dialect=ca.uhn.fhir.jpa.model.dialect.HapiFhirPostgres94Dialect"
  postgres-db:
    image: postgis/postgis:16-3.4
    restart: "unless-stopped"
    environment:
      - POSTGRES_USER=postgres
      - POSTGRES_PASSWORD=postgres
      - POSTGRES_DB=postgres
    ports:
      - 5432:5432
    volumes:
      - postgres-db:/var/lib/postgresql/data
volumes:
  postgres-db: ~
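Please note that referential integrity on write is set to false. Bulk import loads resources in no guaranteed order, and the request in Step 2 imports only some resource types, so a resource may reference another that has not been loaded; with the check enabled, such writes would be rejected.
Run docker compose up to start the server. It will be available at the following base URL: http://localhost:8080/fhir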
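The server can take a while to become ready after the containers start. As a minimal sketch, the Python function below polls the standard FHIR metadata endpoint (the CapabilityStatement) until it answers. The function name, timeout values, and the injectable opener parameter are my own choices for illustration, not part of HAPI.

```python
import time
import urllib.request


def wait_for_fhir(base_url="http://localhost:8080/fhir", timeout_s=300,
                  interval_s=5, opener=urllib.request.urlopen):
    """Poll GET [base]/metadata until it returns HTTP 200 or time runs out.

    `opener` is injectable so the function can be tested without a server.
    Returns True if the server answered, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(f"{base_url}/metadata", timeout=10) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # URLError, HTTPError, refused connections, and timeouts are
            # all OSError subclasses: the server is simply not up yet.
            pass
        time.sleep(interval_s)
    return False
```

Calling wait_for_fhir() with no arguments checks the base URL used in this post and gives up after five minutes.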
STEP 2: Send a POST request to the $import endpoint.
The full MIMIC-IV dataset is available here for credentialed users. The demo dataset used in the request below is available here. You don’t have to download the dataset: the request below contains the URLs of the demo data files, so all you need is an internet connection for the Docker environment. Anyone can access the files, as long as they comply with the terms of the license specified on this page.
The FHIR $import operation allows for bulk data import into a FHIR server. The request body is a Parameters resource that lists each resource type to import together with the URL of its ndjson data file. I use the VS Code REST Client extension to make the request, and the format below aligns with its requirements. However, you can make the POST request in any way you prefer.
###
POST http://localhost:8080/fhir/$import HTTP/1.1
Prefer: respond-async
Content-Type: application/fhir+json

{
  "resourceType": "Parameters",
  "parameter": [ {
    "name": "inputFormat",
    "valueCode": "application/fhir+ndjson"
  }, {
    "name": "inputSource",
    "valueUri": "http://example.com/fhir/"
  }, {
    "name": "storageDetail",
    "part": [ {
      "name": "type",
      "valueCode": "https"
    }, {
      "name": "credentialHttpBasic",
      "valueString": "admin:password"
    }, {
      "name": "maxBatchResourceCount",
      "valueString": "500"
    } ]
  }, {
    "name": "input",
    "part": [ {
      "name": "type",
      "valueCode": "Observation"
    }, {
      "name": "url",
      "valueUri": "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir/ObservationLabevents.ndjson"
    } ]
  }, {
    "name": "input",
    "part": [ {
      "name": "type",
      "valueCode": "Medication"
    }, {
      "name": "url",
      "valueUri": "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir/Medication.ndjson"
    } ]
  }, {
    "name": "input",
    "part": [ {
      "name": "type",
      "valueCode": "Procedure"
    }, {
      "name": "url",
      "valueUri": "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir/Procedure.ndjson"
    } ]
  }, {
    "name": "input",
    "part": [ {
      "name": "type",
      "valueCode": "Condition"
    }, {
      "name": "url",
      "valueUri": "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir/Condition.ndjson"
    } ]
  }, {
    "name": "input",
    "part": [ {
      "name": "type",
      "valueCode": "Patient"
    }, {
      "name": "url",
      "valueUri": "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir/Patient.ndjson"
    } ]
  } ]
}
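The request body above follows a regular pattern: three fixed parameters followed by one input entry per file. If you prefer a script to the REST Client, a sketch along these lines could generate the same body and prepare the POST using only the Python standard library. The function names are hypothetical; the parameter values are copied from the request above.

```python
import json
import urllib.request

MIMIC_DEMO_BASE = "https://physionet.org/files/mimic-iv-fhir-demo/2.0/mimic-fhir"


def build_import_parameters(files, batch_size=500):
    """Build the Parameters resource for $import.

    files: list of (resource_type, filename) pairs, e.g.
           [("Patient", "Patient.ndjson")].
    """
    parameter = [
        {"name": "inputFormat", "valueCode": "application/fhir+ndjson"},
        {"name": "inputSource", "valueUri": "http://example.com/fhir/"},
        {"name": "storageDetail", "part": [
            {"name": "type", "valueCode": "https"},
            {"name": "credentialHttpBasic", "valueString": "admin:password"},
            {"name": "maxBatchResourceCount", "valueString": str(batch_size)},
        ]},
    ]
    # One "input" entry per ndjson file, as in the request above.
    for resource_type, filename in files:
        parameter.append({"name": "input", "part": [
            {"name": "type", "valueCode": resource_type},
            {"name": "url", "valueUri": f"{MIMIC_DEMO_BASE}/{filename}"},
        ]})
    return {"resourceType": "Parameters", "parameter": parameter}


def build_import_request(base_url, body):
    """Prepare (but do not send) the asynchronous $import POST."""
    return urllib.request.Request(
        f"{base_url}/$import",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Prefer": "respond-async",
            "Content-Type": "application/fhir+json",
        },
        method="POST",
    )
```

To send it, pass the result of build_import_request("http://localhost:8080/fhir", body) to urllib.request.urlopen. Adding another file from the demo dataset is then a one-line change to the files list.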
That’s it! It takes a few minutes for the bulk import to complete, depending on your system resources.
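One way to check that the import has landed is to ask the server how many resources of each type it holds. A FHIR search with _summary=count returns a Bundle whose total field carries the count. The sketch below assumes the base URL from Step 1; the helper names are mine.

```python
import json
import urllib.parse
import urllib.request


def count_url(base_url, resource_type):
    """URL for a count-only search of one resource type."""
    query = urllib.parse.urlencode({"_summary": "count"})
    return f"{base_url}/{resource_type}?{query}"


def total_from_bundle(bundle):
    """Extract the total from a search Bundle (None if the server omits it)."""
    if bundle.get("resourceType") != "Bundle":
        raise ValueError("expected a Bundle resource")
    return bundle.get("total")


def count_resources(base_url, resource_type):
    """Ask the FHIR server how many resources of this type it stores."""
    req = urllib.request.Request(
        count_url(base_url, resource_type),
        headers={"Accept": "application/fhir+json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return total_from_bundle(json.load(resp))
```

For example, count_resources("http://localhost:8080/fhir", "Patient") should return a non-zero number once the Patient file has been imported. Counts may keep rising while the asynchronous job is still running.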
Feel free to reach out if you’re interested in collaborating on developing a gold QA dataset for testing clinician-facing GenAI applications. My research is centered on creating and validating clinician-facing chatbots.
- Loading MIMIC dataset onto a FHIR server in two easy steps - November 20, 2024