Imagine wanting to analyze the notes a doctor has typed into the electronic health record (EHR) of a patient who has tested positive for COVID-19, including detailed descriptions of symptoms and complications. These notes contain nuances that could be vital to understanding how the disease develops, how it is transmitted, and which treatments are most effective with the fewest side effects. Doing this in a timely and efficient manner can be critical for better prevention, preparedness and even a cure.
Similarly, vast amounts of healthcare and bio-medical data on various diseases are available through physician notes, insurance claims, EHRs, medical journals, news feeds, social media, etc. This data has little utility unless it is mined and given structure. Emerging text processing techniques and resources open up an ocean of opportunities for insight, analysis and deduction that mimic the behavior of experts in healthcare and related domains.
This post illustrates the use of some of the latest technologies and resources to mine concepts from the bio-medical domain, and applies Teradata Vantage’s advanced analytics capabilities to analyze the data and predict useful diagnoses and prescriptions.
Electronic Health Records (EHR) are digital patient information records entered by a physician or clinician after each visit or examination. The entries are manual, free-form text containing a variety of medical information, including patient demographics, diseases, anatomy, medication, treatments, dosages, etc., all of which lacks structure. These records are often grammatically incorrect and contain misspelled names and acronyms that are difficult to disambiguate across different contexts of usage.
To process such complex and irregular domain-specific text, we need powerful tools that can disambiguate, mine and structure the text, which in turn provides the ground for further advanced analytics:
- One powerful instrument for cleaning and shaping text is regular expressions (regex). Using Vantage’s regex functions, the text is transformed by removing non-ASCII characters and mark-up tags, and by performing sentence segmentation and other text normalization tasks.
- Next, we use MetaMap, an important entity recognition tool that maps biomedical text to Unified Medical Language System (UMLS) concepts. It uses a knowledge-intensive approach coupled with natural language processing and computational linguistics to categorize concepts and acronyms into 137 possible types and groups of categories. It is a key, freely available resource for understanding medical information, intended to promote and improve healthcare services. Through API calls, we transformed our dataset into a rich corpus tagged with medical entities and their inter-relations. An example of an entity-tagged sentence is shown in Figure 1.
Figure 1: Bio-Medical Entity Recognition
- A syntactic dependency parser gives grammatical structure to a sentence, which in turn enables deeper analysis of the opinion expressed. Concept negations, conjunctions and adjectival terms help to extract aspect information and opinionated terms from the sentence. This makes it possible to identify the sentiment associated with a specific term or concept at a finer level, rather than a jumbled sentiment at the coarse sentence level. To build the dependency parse, advanced NLP libraries in Python are a good choice, whereas for sentiment analysis, built-in models and trainers are available within Vantage.
Figure 2: Dependency Parse of Opinionated Sentence
Figure 3: Disorder type mentions in each visit report along with sentiment
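The regex clean-up step described above can be sketched in a few lines. The example below is a minimal illustration that uses Python’s standard `re` module as a stand-in; it is not Vantage’s regex functions, and the function names `normalize_note` and `split_sentences`, the patterns, and the sample note are all assumptions made for illustration. The sentence splitter is deliberately naive and will mis-split on abbreviations such as "Dr. Smith".

```python
from __future__ import annotations

import re

# Illustrative patterns (assumptions, not the ones used in the original work).
TAG_RE = re.compile(r"<[^>]+>")              # mark-up tags such as <br> or <b>
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")  # any run of non-ASCII characters
WHITESPACE_RE = re.compile(r"\s+")           # runs of spaces, tabs, newlines
# Naive boundary: '.', '!' or '?' followed by whitespace and an uppercase letter.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def normalize_note(text: str) -> str:
    """Replace tags and non-ASCII characters with spaces, then collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = NON_ASCII_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def split_sentences(text: str) -> list[str]:
    """Normalize a note and split it into sentences with the naive boundary rule."""
    return [s for s in SENTENCE_BOUNDARY_RE.split(normalize_note(text)) if s]


if __name__ == "__main__":
    # Made-up note text, for illustration only.
    raw = "Pt reports <b>fever</b> and dry cough.<br>No chest pain. Temp 38.5\u00b0C today."
    for sentence in split_sentences(raw):
        print(sentence)
```

Note that dropping non-ASCII characters also discards symbols such as the degree sign, so in practice one might map common symbols to ASCII equivalents before this step.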
For each examination report, using the features for categories such as medication, diseases and body parts, along with the sentiment associated with each aspect, we can build advanced models of a patient’s medical condition. Using native Vantage capabilities, we built example analytics to obtain useful insights and deductions:
- Using the features for disorders and anatomy, we build a classifier to predict a possible diagnosis for a patient. Such analytics can assist a physician’s decision-making when prescribing medication and treatments, taking into account the patient’s past and present conditions along with the historical treatment record.
- Clustering physician reports based on various types of features, particularly disorder and anatomy, can reveal related examinations and patients with related symptoms and diseases. This is particularly useful when profiling patients based on their illness patterns.
- Using nPath, it’s possible to obtain a trace of prescribed medication and visualize how physicians have treated cases of a particular medical condition.
Figure 4: Clustering visualized using PCA and t-SNE graphs
Figure 5: nPath tracing medication prescription
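To make the idea of a medication trace more concrete, here is a small plain-Python sketch of the underlying logic: order each patient’s prescriptions by visit date, collapse consecutive repeats, and count how often each resulting sequence occurs. This is not Vantage’s path analysis function, only an approximation of what such a trace computes. The function name `medication_paths`, the `(patient_id, visit_date, medication)` row layout, and the placeholder drug names are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def medication_paths(events):
    """Count the ordered medication sequences seen across patients.

    `events` is an iterable of (patient_id, visit_date, medication) tuples,
    a hypothetical layout chosen for this sketch.
    """
    by_patient = defaultdict(list)
    for patient_id, visit_date, medication in events:
        by_patient[patient_id].append((visit_date, medication))

    paths = Counter()
    for visits in by_patient.values():
        # Order each patient's prescriptions chronologically.
        ordered = [med for _, med in sorted(visits)]
        # Collapse consecutive repeats so a refill does not lengthen the path.
        collapsed = [m for i, m in enumerate(ordered) if i == 0 or m != ordered[i - 1]]
        paths[tuple(collapsed)] += 1
    return paths


if __name__ == "__main__":
    # Made-up events with placeholder drug names, for illustration only.
    events = [
        ("p1", date(2020, 3, 1), "drug_a"),
        ("p1", date(2020, 3, 5), "drug_b"),
        ("p2", date(2020, 3, 2), "drug_a"),
        ("p2", date(2020, 3, 4), "drug_a"),
        ("p2", date(2020, 3, 9), "drug_b"),
        ("p3", date(2020, 3, 3), "drug_c"),
    ]
    for path, count in medication_paths(events).most_common():
        print(" -> ".join(path), count)
```

Counting whole paths keeps the sketch short; a fuller analysis would typically also filter the events to patients who share a given medical condition before tracing.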
Healthcare, pharmaceutical and cosmetic companies are looking to AI-enabled technologies for useful insight into medical diagnosis. The approach presented here showcases Teradata’s ability to combine Vantage’s advanced analytics offering, seamlessly integrated with open-source tools and techniques in text processing, to decipher complex healthcare-related issues pertinent to industry requirements.