A closer look at unstructured healthcare data

What is the difference between structured and unstructured data?

Structured data consists of a variable name and a value. Values can be numerical (e.g. height, weight and blood pressure), categorical (e.g. blood type) or ordinal (e.g. stages of a disease diagnosis). Structured data can be automatically combined and processed because it has straightforward boundaries and is created and stored in a standardized format.
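As a rough illustration, a structured record might be modeled like the Python sketch below. The class names, field names and values are hypothetical, chosen only to show one categorical, one ordinal and two numerical variables.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class BloodType(Enum):
    """Categorical variable: labels with no inherent order."""
    A = "A"
    B = "B"
    AB = "AB"
    O = "O"


class DiseaseStage(IntEnum):
    """Ordinal variable: labels with a meaningful order."""
    STAGE_1 = 1
    STAGE_2 = 2
    STAGE_3 = 3
    STAGE_4 = 4


@dataclass
class PatientRecord:
    """Hypothetical structured record: every field has a name and a typed value."""
    height_cm: float        # numerical
    weight_kg: float        # numerical
    blood_type: BloodType   # categorical
    stage: DiseaseStage     # ordinal


records = [
    PatientRecord(170.0, 68.5, BloodType.A, DiseaseStage.STAGE_2),
    PatientRecord(182.0, 90.0, BloodType.O, DiseaseStage.STAGE_3),
]

# Because every record shares the same fields, combining them is straightforward.
mean_weight = sum(r.weight_kg for r in records) / len(records)
most_advanced = max(r.stage for r in records)
```

Because the fields are standardized, aggregate values such as the mean weight or the most advanced stage can be computed without any interpretation of free text.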

Unstructured data, on the other hand, is not readily machine readable and therefore requires considerable pre-processing before it can be used with analysis tools. Any data not stored in a pre-defined database format is considered unstructured.

An estimated 80% of healthcare data is unstructured. The most common unstructured sources include EHR free-form text fields, discharge summaries, progress notes, physician clinical notes, lab reports, socioeconomic data, medical images and any faxed records.

Why is most healthcare data unstructured?

A significant portion of unstructured data was never meant to be structured and simply doesn't fit neatly into structured value fields. Clinical notes, for example, are usually complex and heterogeneous and cannot be mapped to the predefined structure of a data table. Medical images, such as X-rays and MRIs, are generally indecipherable to all except highly trained professionals and likewise cannot be translated into a data table or a relational database.

The industry's reliance on faxing also bears some responsibility for the prevalence of unstructured data in healthcare. Despite broad adoption of EHR systems in the past decade, fax remains the primary means of exchanging patient information. Most EHRs still do not properly support interoperability and only offer integration with other users of their system. Providers who need to exchange patient information often resort to printing and faxing structured records, even to share information within their own organization. Records received by fax are either filed away or scanned and attached to the patient record as static PDFs. Extra steps, including Optical Character Recognition (OCR), must be applied to convert the scanned record into machine-readable text before it can be re-structured for use with analysis tools.

Why is unstructured healthcare data uniquely challenging?

Translating the language of healthcare presents unique challenges due to the inconsistent and esoteric nature of medical terminology. There is a wide variety of ways to express the same clinical concepts across organizations and specialties. Clinical jargon, acronyms, abbreviations and misspellings are all commonly found in both structured and unstructured data, further complicating the process of conversion and categorization.
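To make the terminology problem concrete, here is a minimal Python sketch that maps several surface forms onto one canonical concept. The lookup table and the function name `normalize_term` are hypothetical; a real system would rely on a curated clinical vocabulary rather than a hand-written dictionary.

```python
import re
from typing import Optional

# Hypothetical lookup table: many surface forms map to one canonical concept.
CANONICAL = {
    "htn": "hypertension",                  # abbreviation
    "high blood pressure": "hypertension",  # lay term
    "hypertension": "hypertension",
    "hypertenshun": "hypertension",         # misspelling
    "t2dm": "type 2 diabetes",              # abbreviation
    "dm2": "type 2 diabetes",               # abbreviation
    "type ii diabetes": "type 2 diabetes",  # roman numeral variant
    "type 2 diabetes": "type 2 diabetes",
}


def normalize_term(raw: str) -> Optional[str]:
    """Return the canonical concept for a raw term, or None if unknown."""
    # Lowercase, trim, and collapse repeated whitespace before the lookup.
    key = re.sub(r"\s+", " ", raw.strip().lower())
    return CANONICAL.get(key)
```

Even this toy table needs a separate entry for each abbreviation, lay term and misspelling, which hints at why exhaustive manual mapping does not scale across organizations and specialties.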

While the insights contained within unstructured data are essential to risk assessment and population health initiatives, they are typically not incorporated into analysis. Converting unstructured data into structured inputs is an expensive and complex endeavor that requires a specialized combination of machine learning algorithms and clinically trained natural language processing (NLP). As a result, many plans and providers resort to inefficient and unreliable manual processes to extract insights from unstructured data.

NLP and Machine Learning Can Unlock Insights In Unstructured Data

Recent advancements in NLP and machine learning have made it increasingly possible to unlock the valuable insights trapped in unstructured data. Using deep learning techniques, clinical natural language processing algorithms are trained to understand the many nuances in the language of healthcare and the complexity of clinical conditions. These algorithms can decipher patterns, relationships and relevant assertions even within the complexities of unstructured text.

NLP can be particularly useful for identifying indications of disease within medical records. Text mining using NLP techniques can recognize the many different variations of possible indicators of disease and surface reliable suggestions for providers to assess at the point of care.
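The general idea can be sketched with a simple rule-based stand-in, shown below in Python. This is not a trained clinical NLP model: the patterns, the negation cues and the function name `suggest_conditions` are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical patterns: each condition maps to a regex covering its variants.
PATTERNS = {
    "type 2 diabetes": re.compile(r"\b(t2dm|dm2|type (2|ii) diabetes)\b", re.I),
    "hypertension": re.compile(r"\b(htn|hypertension|high blood pressure)\b", re.I),
}

# Very rough negation cues, searched only in the text before a match
# within the same sentence.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b[^.]*$", re.I)


def suggest_conditions(note: str) -> List[str]:
    """Return a sorted list of conditions mentioned without a preceding negation cue."""
    found = set()
    # Split the note into sentences on terminal punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", note):
        for condition, pattern in PATTERNS.items():
            match = pattern.search(sentence)
            # Check only the first mention in each sentence.
            if match and not NEGATION.search(sentence[:match.start()]):
                found.add(condition)
    return sorted(found)


note = "Pt with long-standing HTN, on lisinopril. Denies chest pain. No evidence of T2DM."
suggestions = suggest_conditions(note)
```

In this sketch the negated mention of T2DM is skipped while the HTN mention is surfaced. Hand-written rules like these break down quickly on real notes, which is where trained models earn their place.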

How MDPortals Can Help

MDPortals collects patients’ records from every possible source across the continuum of care, synthesizes the captured data into one composite health record, abstracts all HCC opportunities, and delivers them directly into your EHR, all in a matter of hours.

The data MDPortals extracts from a wide range of disparate sources comes in a variety of incompatible and unstructured formats. The MDPortals synthesis engine prepares the ingested data for use with analysis tools and point-of-care guidance by transforming the raw inputs into a clean and reliable picture of a patient’s health.

The MDPortals synthesis engine utilizes OCR, advanced parsing techniques, Clinical Natural Language Processing (cNLP), proprietary logic and enterprise master patient index (EMPI) matching to clean, de-duplicate and enrich all structured and unstructured inputs from data acquisition.

Faxed records, free-text physician notes, scanned charts and other unstructured inputs are converted into machine-readable text and mapped to the appropriate section in the patient’s structured record.

The final package is delivered as a single longitudinal record into each patient’s chart within your EHR, viewable natively. The MDPortals Compendium is compatible with any Meaningful Use certified EHR.

Contact us to learn more