Mining For Errors in Electronic Medical Records

 

This is a joint project of Chris Jermaine at Rice and Elmer Bernstam at the University of Texas Health Science Center.

 

The work is supported by the National Science Foundation as a collaborative project entitled: "Data Mining and Cleaning for Medical Data Warehouses"; the NSF award number is 0964526 at Rice and 0964613 at U Texas. Initial work on the project was funded through a seed grant from the John & Ann Doerr Fund for Computational Biomedicine.

 

The Promise of EMR Analysis

 

One of the most important benefits of the move towards electronic medical records (EMRs) is the potential for discovering important, clinically-actionable knowledge by analyzing large amounts of warehoused medical data. For example:

 

1.      EMRs could be used to address the efficacy vs. effectiveness question. Efficacy asks, "Does X work in a clinical trial?" (e.g., "Does aspirin prevent heart attacks in a randomized clinical trial?"). Effectiveness asks, "Does X have the predicted effect in actual practice?" (e.g., "Do people in the community who are given aspirin actually have fewer heart attacks?").

2.      EMRs could be used to find support for (or against) specific hypotheses that health scientists might want to test (e.g., "Is there an association between X and Y in patient type Z?").

3.      EMR databases could also be used in conjunction with computer algorithms to automatically discover key patterns (new diseases, cancer risk factors, etc.) whose existence is now totally unsuspected.

 

The Problem with EMRs. Unfortunately, we are a long way from using EMRs to answer these sorts of questions. EMRs consist of both structured information (such as billing codes and demographic information) and unstructured information (such as text and images), and this information is not always easy to understand or extract, nor is it always reliable. Demographic data are perhaps of the highest quality, since patient birthdate, race, and gender are often recorded as columns in a database table. But even simple demographic data cannot be taken at face value; the most common patient weight in many databases is zero pounds, simply because that is the default value, or because no one bothered to code an unknown or unmeasured weight as a "null". Laboratory results tend to be somewhat less reliable, since results reported by different laboratories cause data integration problems---different labs report different results in different ways. The diagnosis and medical procedure information---which is perhaps of greatest importance when mining the data---is very unreliable. Diagnoses and treatments are usually stored in EMRs as codes that were originally extracted from electronic billing records. In practice, these codes often bear little resemblance to the actual diagnoses and treatments listed on the patients' charts: a "migraine" on the chart becomes a "headache" in the billing data; a patient with some confusion due to a medication described on his chart has "paranoid personality disorder" listed in the billing data; and so on. Such inaccuracies result in systematic biases that severely impact the utility of any analysis applied to the data. To understand or extract diagnoses, it is often necessary to look at text-based clinical notes, which introduces all manner of difficulty, since automated processing of text in a domain as diverse and unconstrained as medicine is a very hard problem.

 

Specific Problems Addressed

 

During the first year of the project, we have worked on the following problems:

 

Development of a Gold-Standard Dataset. A fundamental problem in medical informatics is document classification. For example, imagine that a medical researcher is trying to design a clinical trial. Using a database of medical records, the researcher wants to identify patients aged 40 to 50 who are in remission from breast cancer. Once identified, these patients can be contacted to inquire about possible participation in the trial.

 

The difficulty is that the database that needs to be analyzed may have information on half a million patients, and the fraction of patients who meet the criteria in question might be 0.1%. This means that automatic or semi-automatic methods must be used to classify the data. In order to design and evaluate such methods, it is critical to have a "gold-standard" database of pre-labeled data. One of the sub-projects we have undertaken during the first year of the project is the development of just such a database using the U Texas clinical data warehouse. Specifically, we have produced a database of 10,000 electronic medical records, where each record has been labeled with whether or not it corresponds to a patient with breast cancer. We did this by first randomly choosing 10,000 records from the U Texas database, and then developing, through trial and error, a keyword-based query that cut the data set down from 10,000 records to 1,000 records in such a way that we are quite sure we did not lose any breast cancer cases. This involved two domain experts (medical doctors) coming up with a set of keywords and billing codes that would catch anyone and everyone with breast cancer---including medications for breast cancer, different names for the disease, and different tests for the disease. Then, the experts went through all 1,000 records that were returned and reviewed each of them by hand, cutting the 1,000 down to around 80 that were actual breast cancer cases.
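To make the filtering step concrete, the sketch below shows one way such a high-recall keyword-and-code prefilter could be implemented. The specific keywords, ICD-9 codes, and record layout are illustrative assumptions for the example, not the actual query used on the U Texas warehouse.

    # Illustrative high-recall prefilter for candidate breast cancer records.
    # The keywords, billing codes, and record layout are assumptions made for
    # this example, not the actual query used in the project.

    BREAST_CANCER_KEYWORDS = {
        "breast cancer", "breast carcinoma", "ductal carcinoma",
        "lobular carcinoma", "mastectomy", "lumpectomy",
        "tamoxifen", "anastrozole",          # example medications
        "mammogram", "breast biopsy",        # example tests
    }

    BREAST_CANCER_ICD9_PREFIXES = ("174", "175", "233.0", "V10.3")  # example codes

    def is_candidate(record):
        """Return True if the record *might* describe breast cancer.

        The filter is deliberately broad: the goal is high recall, with
        precision recovered later by manual expert review.
        """
        text = record["notes"].lower()
        if any(kw in text for kw in BREAST_CANCER_KEYWORDS):
            return True
        return any(code.startswith(BREAST_CANCER_ICD9_PREFIXES)
                   for code in record["billing_codes"])

    def prefilter(records):
        """Cut a large random sample down to a small candidate set for review."""
        return [r for r in records if is_candidate(r)]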

 

Automatic Highlighting of Discriminative Passages in Text. Stated simply, the task of labeling medical records is very difficult. We strongly believe that no currently available, fully automatic tool for labeling electronic medical records can simultaneously produce high recall and high precision. We have found that support vector machines, decision trees, random forests, and even domain-specific, ontology-based AI methods frequently fail for a variety of reasons. For example, imagine two records whose clinical notes (free text) describe two women who have had the same set of tests (mammograms, MRIs, biopsies), leading to the same set of keywords in each record. But in the end, one woman has cancer and one does not. The lack of cancer is indicated by the sentence "The mass was found to be benign", and the presence by the sentence "The mass was found to be malignant." Any method that correctly identifies the lack of cancer would have to understand that the benign mass was being tested for breast cancer cells. And any method that correctly recognizes the breast cancer would have to understand that the malignant mass was found in breast tissue. And even when a sentence clearly states that a patient has breast cancer, things are not so simple: "The patient's mother has breast cancer. The mother was diagnosed at age 35, the patient at age 32."

 

This leads us to believe that there is no classification methodology on the horizon that is going to solve this problem, meaning that if high precision and recall are mandatory, then manual or semi-manual methods must be used. But the problem with having a human go through a large database of free-text medical records is that labeling the documents can take a very long time, with negative cases being the most difficult. Some narratives have tens of thousands of words, and if the narrative does not describe someone with the desired condition, the human labeler may feel it necessary to read the entire document with some care in order to determine that there is no indication that the patient has the condition. Thus, we have developed a method that takes as input a set of free-text documents, as well as a set of (likely incorrect) labels. Our method goes through the documents and automatically learns to underline the passages in the text that a human labeler should look at most carefully in order to determine whether the document is a positive or a negative example. For example, if the goal is to find records corresponding to breast cancer, our method might take as input the text:

 

On 12/10/10, the patient visited her doctor complaining of a persistent cough and runny nose. This was diagnosed as a likely allergy and an OTC antihistamine was recommended. At that time, the primary care physician noted that the patient had not had a mammogram for nearly two years, despite a family history of breast cancer. The patient had a mammogram on 1/1/11. The resulting image was poor and so a second was ordered and performed two weeks later. A mass with an irregular border was found. The mass was biopsied and found to be malignant.

 

And output (underlined passages shown here between underscores):

 

On 12/10/10, the patient visited her doctor complaining of a persistent cough and runny nose. This was diagnosed as a likely allergy and an OTC antihistamine was recommended. At that time, the primary care physician noted that the patient had not had a mammogram for nearly two years, despite a _family history of breast cancer_. _The patient had a mammogram on 1/1/11._ The resulting image was poor and so a second was ordered and performed two weeks later. _A mass with an irregular border was found. The mass was biopsied and found to be malignant._

 

We are currently designing a trial where the goal is to determine whether or not this sort of underlining can make human coders more efficient, while at the same time not compromising accuracy.
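As a rough illustration of how such underlining could be produced from noisy labels, the sketch below fits a simple linear classifier on the (possibly incorrect) labels and scores each sentence by how strongly its terms are associated with the positive class; the highest-scoring sentences are the ones flagged for the labeler's attention. This is a minimal baseline under assumed inputs, not the project's actual method.

    # Minimal sketch: score sentences by how strongly their terms are associated
    # with the (noisy) positive label, then flag the top-scoring sentences.
    # A simple linear baseline, not the project's actual model.
    import re
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_highlighter(documents, noisy_labels):
        vec = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
        X = vec.fit_transform(documents)
        clf = LogisticRegression(max_iter=1000).fit(X, noisy_labels)
        return vec, clf

    def highlight(document, vec, clf, top_k=2):
        """Return the top_k sentences most indicative of a positive label."""
        sentences = re.split(r"(?<=[.!?])\s+", document)
        scores = clf.decision_function(vec.transform(sentences))
        ranked = np.argsort(scores)[::-1][:top_k]
        return [sentences[i] for i in sorted(ranked)]

In the breast cancer example above, such a baseline would tend to surface sentences containing terms like "mammogram", "mass", and "malignant", which is the kind of guidance the underlining is meant to give a human coder.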

 

Grouping and Visualization for Co-Occurrence Analysis. The goal of this sub-project is to develop automatic tools that help an analyst understand the co-morbidities that are present in the database, as well as identify those "outlier" conditions that have unique co-occurrence patterns, unlike any other. We have developed a probabilistic graphical model in which each node is one particular condition. An edge is placed between two conditions if there is a person in the warehouse who has both of them (the conditions exhibited by each patient are extracted using standard medical concept-extraction software). Our model then groups the morbidities into classes so that there are a large number of links within each class. Furthermore, each class is positioned in a latent, Euclidean space, in such a way that two classes that are close to one another tend to have many links between them, while two classes that are far apart tend to have fewer links. The net result is that the various conditions within a class tend to co-occur with one another, and conditions in two different classes that are close to one another in the latent space also tend to co-occur with one another. Furthermore, the conditions that end up in "outlier" classes (those far from the center of the Euclidean space) are those that either do not co-occur with any other condition, or that exhibit co-occurrence patterns that are seemingly random (that is, the condition does not seem to have an affinity for any other condition).
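As a small illustration of the first step only (a sketch under an assumed patient-to-conditions input format, not the latent-space model itself), the co-occurrence graph the model starts from can be built by linking any two conditions that appear together in at least one patient's record:

    # Build the condition co-occurrence graph that the latent-space model
    # starts from: nodes are conditions, and an edge links two conditions that
    # occur together in at least one patient's record. The patient -> conditions
    # input format is an assumption made for illustration.
    from itertools import combinations
    from collections import Counter

    def build_cooccurrence_graph(patient_conditions):
        """patient_conditions: dict mapping patient id -> set of condition codes."""
        edge_counts = Counter()
        for conditions in patient_conditions.values():
            for a, b in combinations(sorted(conditions), 2):
                edge_counts[(a, b)] += 1
        return edge_counts  # edge -> number of patients with both conditions

    # Example:
    # graph = build_cooccurrence_graph({
    #     "p1": {"diabetes", "hypertension"},
    #     "p2": {"hypertension", "migraine"},
    # })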

 

To date, we have designed and implemented a statistical model and associated algorithms for this problem. We are currently in the process of applying our methods to the U Texas clinical data warehouse; as a proof of concept, we have applied our methods to social network analysis. If you are interested in learning more about our method, it is described in detail in the paper covering that proof-of-concept application:

 

Zhuhua Cai, Chris Jermaine: The Latent Community Model for Detecting Sybil Attacks in Social Networks. Appeared in NDSS 2012.

 

High Dimensional Data Imputation. Medical records often contain missing data, and it is necessary to impute those data before further analysis. This is particularly the case with time series data, which tend to be very high-dimensional and are ubiquitous in medical records. We considered the problem of imputation in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at O(nm^2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately-sized, high-dimensional data sets.
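For concreteness, the standard pipeline described above looks roughly like the following, using scikit-learn's graphical lasso as one example of a regularized inverse-covariance estimator (the data and parameter values are illustrative); it is this estimate-then-invert step that becomes prohibitive as the dimension m grows.

    # The conventional route to a sparse precision (inverse covariance) matrix:
    # estimate the covariance from the data, then apply a regularized inversion
    # such as the graphical lasso. Both steps scale badly with the dimension m.
    import numpy as np
    from sklearn.covariance import GraphicalLasso

    X = np.random.randn(500, 200)        # n = 500 samples, m = 200 dimensions
    model = GraphicalLasso(alpha=0.05)   # alpha controls sparsity of the precision matrix
    model.fit(X)                         # O(n m^2) covariance estimate plus a costly inversion
    precision = model.precision_         # sparsified inverse covariance used by a GMRF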

 

Our objective was to develop an alternative to the standard GMRF that does not require a priori estimation of this covariance matrix. Along these lines, we proposed a very simple alternative to the GMRF for high-dimensional imputation called the pairwise Gaussian random field, or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation with the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear in p, where p will typically be much smaller than the number of entries in an m-by-m covariance or precision matrix. This is described in detail in the following paper:

 

Zhuhua Cai, Christopher M. Jermaine, Zografoula Vagena, Dionysios Logothetis, Luis Leopoldo Perez: The Pairwise Gaussian Random Field for High-Dimensional Data Imputation. Appeared in ICDM 2013.
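To give a flavor of the pairwise idea (this is a toy sketch assuming standardized data and positively correlated pairs, not the paper's PGRF implementation), each user-supplied pair contributes a simple Gaussian factor that penalizes disagreement between the two dimensions, and missing entries can then be imputed by iterating coordinate-wise conditional-mean updates whose per-sweep cost is proportional to p rather than m^2:

    # Toy illustration of the pairwise idea (not the paper's PGRF implementation):
    # the user supplies (dimension, dimension) pairs believed to be correlated,
    # each pair contributes a Gaussian factor penalizing disagreement between the
    # two (standardized) dimensions, and missing values are imputed by iterating
    # coordinate-wise conditional-mean updates.
    import numpy as np

    def impute_pairwise(x, observed_mask, pairs, weights=None, sweeps=50):
        """x: 1-D float array (a standardized record), arbitrary values where unobserved.
        observed_mask: boolean array, True where x was actually measured.
        pairs: list of (i, j) index pairs the user believes are correlated."""
        weights = weights if weights is not None else {p: 1.0 for p in pairs}
        neighbors = {}
        for (i, j) in pairs:
            neighbors.setdefault(i, []).append((j, weights[(i, j)]))
            neighbors.setdefault(j, []).append((i, weights[(i, j)]))
        x = x.copy()
        x[~observed_mask] = 0.0                  # initialize missing entries at the mean
        for _ in range(sweeps):                  # cost per sweep is O(p), not O(m^2)
            for i in np.where(~observed_mask)[0]:
                nbrs = neighbors.get(i, [])
                if nbrs:                         # conditional mean under the pairwise factors
                    x[i] = sum(w * x[j] for j, w in nbrs) / sum(w for _, w in nbrs)
        return x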

 

Other Problems. We have also looked at several other problems in the area, including:

 

1. Development of a Bayesian method to count the number of patients in an EMR database who exhibit a certain condition (see the sketch after this list).

2. Using an EMR database for real-time biosurveillance---for example, trying to discover emerging associations in real time, such as the now widely known and dangerous relationship between Vioxx use and acute coronary events.

3. Record de-duplication for EMR databases.
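To illustrate the flavor of item 1 (a sketch using a simple Beta-Binomial model, which may differ from the method actually developed in the project), one can hand-label a small random sample of records and obtain a posterior over the number of patients in the full database who have the condition:

    # Simple Beta-Binomial sketch for item 1: label a small random sample of
    # records by hand, then form a posterior over how many of the n_total
    # records in the whole database have the condition. The project's actual
    # Bayesian model may differ.
    import numpy as np

    def posterior_count(n_sampled, n_positive_in_sample, n_total,
                        alpha=1.0, beta=1.0, draws=10000, seed=0):
        rng = np.random.default_rng(seed)
        # Posterior over the prevalence under a Beta(alpha, beta) prior.
        prevalence = rng.beta(alpha + n_positive_in_sample,
                              beta + n_sampled - n_positive_in_sample, size=draws)
        # Posterior over the number of positives among the unsampled records.
        counts = n_positive_in_sample + rng.binomial(n_total - n_sampled, prevalence)
        return np.percentile(counts, [2.5, 50, 97.5])  # 95% credible interval and median

    # e.g. posterior_count(n_sampled=1000, n_positive_in_sample=80, n_total=500000)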

 

Project Participants

 

The following people are either funded by this project or have worked on the project:

 

 

Last modified Jun 13th, 2016.