RDF and Population Informatics

I was an RDF fan even before I used it for dermbase. I promptly signed the Yosemite Manifesto and blogged about it last year. After gaining more experience with regional health information exchange initiatives, I still feel that RDF is important, but in a different way.

Most federated regional clinical viewers query host databases, convert the results into an intermediary format (mostly XML or HL7), apply filters, and then provide a consolidated view in desktop and mobile browsers as HTML embellished with jQuery. Though this does not look like a particularly scalable technology, it works remarkably well in a regional context. Federated clinical viewers also attempt to create data warehouses on top of the clinical viewer. Such data warehouses have enormous potential in population informatics, and RDF could be an ideal framework for this purpose.

RDF is a proven technology that is schema agnostic. In this context, however, the biggest advantage of RDF is its data-atomic nature, which enables each data element to be queried, changed, or deleted independently of any other data element. RDF blank nodes can be used to effectively anonymize the data. From a data analytics perspective, representation in RDF makes the data amenable to “reasoning” to discover new knowledge.
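To make the "data-atomic" idea concrete, here is a minimal sketch in plain Python. It does not use a real RDF library: a set of (subject, predicate, object) tuples stands in for an RDF graph, and the `ex:` names, the `new_blank_node` helper, and the `match` helper are all hypothetical, invented for illustration.

```python
import uuid

# A triple is (subject, predicate, object); a plain set of triples
# stands in for an RDF graph in this sketch.

def new_blank_node():
    """Return a blank-node style label that carries no patient identifier."""
    return "_:b" + uuid.uuid4().hex

def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return {
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

graph = set()
patient = new_blank_node()  # the subject is a blank node, not a name or ID

# Each fact about the patient is its own independent triple.
graph.add((patient, "ex:hasDiagnosis", "ex:Psoriasis"))
graph.add((patient, "ex:ageBand", "40-49"))
graph.add((patient, "ex:region", "ex:RegionA"))

# Delete one data element without touching the others.
graph.discard((patient, "ex:region", "ex:RegionA"))

# Change one data element: remove the old triple, add the new one.
graph.discard((patient, "ex:ageBand", "40-49"))
graph.add((patient, "ex:ageBand", "50-59"))

# Query one data element on its own.
diagnoses = match(graph, p="ex:hasDiagnosis")
```

In a real deployment the set would be replaced by a triple store and `match` by a SPARQL query, but the point is the same: every statement can be handled on its own.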

Genomic data analytics has revolutionized pre-clinical research. The growing popularity of Health Information Technology (HIT) and Health Information Exchange (HIE) has not yet had a similar impact on population health. There are some fundamental differences between genomic and clinical data.

The fundamental characteristics of genomic data are:

1. The data format is simple though it can be annotated in different ways.
2. Raw data is collected first without consideration of relevance. Hypothesis formulation and analysis come later.
3. The data is mostly anonymous.
4. The format and analysis protocol remain the same.

Clinical data has different characteristics:

1. The data is often complex and hence it is difficult to have a uniform format.
2. Data is collected to prove or disprove a hypothesis/diagnosis. Hence only relevant data is collected.
3. The data is often tagged to an individual.
4. The analysis protocol and data collection depend on the hypothesis/diagnosis.

An RDF framework would allow population data to be abstracted from the everyday HIE data used in clinical practice, with both operating within the same ecosystem. The framework would also give clinical data the analytics-friendly qualities of genomic data. The clinical viewer can push data into the RDF repository without a separate warehousing process, thereby reducing overhead and increasing relevance. New-generation wearable devices and monitors can push anonymized raw data directly into the RDF repository. Because the data is anonymized before it reaches the repository, the privacy and security concerns of this architecture should be minimal.
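The push step described above could look roughly like the following sketch. It is plain Python with no RDF library, and everything in it is an assumption made for illustration: the field names, the `IDENTIFYING_FIELDS` list, the `ex:` predicates, and the `push` function do not come from any real clinical viewer or device API.

```python
import itertools

# Hypothetical list of fields that must never reach the repository.
IDENTIFYING_FIELDS = {"name", "health_card_number", "address"}

_ids = itertools.count(1)
repository = set()  # stands in for the RDF repository

def record_to_triples(record):
    """Turn one flat record into triples hanging off a fresh blank node."""
    subject = f"_:rec{next(_ids)}"
    return {
        (subject, f"ex:{field}", str(value))
        for field, value in record.items()
        if field not in IDENTIFYING_FIELDS  # drop identifiers at the source
    }

def push(record):
    """Add a record's triples to the repository; no warehouse schema needed."""
    triples = record_to_triples(record)
    repository.update(triples)
    return triples

# A record as the clinical viewer might hold it (hypothetical fields).
push({
    "name": "Jane Doe",
    "health_card_number": "0000",
    "diagnosis": "eczema",
    "visit_year": 2013,
})

# A raw reading as a wearable device might send it (hypothetical fields).
push({"heart_rate_bpm": 72, "recorded_on": "2013-06-01"})
```

Note that the two records have completely different fields, yet both land in the same repository without any schema change.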

There is another hitherto unexplored advantage for such a clinical RDF repository. Temporal data related to climate change and other events such as natural calamities can also be pushed into the “structureless” RDF repository, making it possible to assess the population health impact of such events.
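As a rough illustration of how event data and clinical data could sit side by side, the sketch below stores a made-up heat wave and a few made-up visits in the same set of triples and then finds the visits that fall inside the event window. The data, the `ex:` predicates, and both helper functions are hypothetical; a real repository would answer this with a query language such as SPARQL.

```python
from datetime import date

# Event triples and clinical triples share one "structureless" graph.
graph = {
    ("_:e1", "ex:type", "ex:HeatWave"),
    ("_:e1", "ex:startDate", "2013-07-10"),
    ("_:e1", "ex:endDate", "2013-07-20"),
    ("_:v1", "ex:diagnosis", "ex:HeatStroke"),
    ("_:v1", "ex:visitDate", "2013-07-12"),
    ("_:v2", "ex:diagnosis", "ex:Eczema"),
    ("_:v2", "ex:visitDate", "2013-06-01"),
    ("_:v3", "ex:diagnosis", "ex:Dehydration"),
    ("_:v3", "ex:visitDate", "2013-07-20"),
}

def value_of(graph, subject, predicate):
    """Return the object of the first matching triple, or None."""
    for s, p, o in graph:
        if s == subject and p == predicate:
            return o
    return None

def visits_during(graph, event):
    """Return the visit nodes whose date falls within the event window."""
    start = date.fromisoformat(value_of(graph, event, "ex:startDate"))
    end = date.fromisoformat(value_of(graph, event, "ex:endDate"))
    return sorted(
        s for s, p, o in graph
        if p == "ex:visitDate" and start <= date.fromisoformat(o) <= end
    )

affected = visits_during(graph, "_:e1")
```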