Authors : Ramanan SV, Radhakrishna K, Waghmare A, Raj T, Nathan SP, Sreerama SM, Sampath S
Publication Year : 2016
Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt aimed at evaluating unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India, with markups for diseases, procedures, and lab parameters, their attributes, as well as key demographic information and administrative variables such as patient outcomes. In this process, we have constructed guidelines for an annotation scheme useful to clinicians in the Indian context. We evaluated the performance of an NLP engine, Cocoa, on a cohort of these Indian clinical records. We have produced an annotated corpus of roughly 90 thousand words, which to our knowledge is the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores across the major categories, namely disease/symptoms, procedures, laboratory parameters and outcomes, are 0.856, 0.834, 0.961 and 0.872 respectively. These results are competitive with results from recent shared tasks based on US records. The annotated corpus and associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHR records are largely absent.