Mining For Errors in
Electronic Medical Records
This is a joint project of Chris Jermaine at Rice and Elmer Bernstam at the University of Texas Health Science Center.
The
work is supported by the National Science Foundation as a collaborative project
entitled: "Data Mining and Cleaning for Medical Data Warehouses"; the NSF award
number is 0964526
at Rice and 0964613
at U Texas. Initial work on the project was funded through a seed grant from
the John & Ann Doerr Fund for Computational Biomedicine.
One of the most important benefits of the
move towards electronic medical records (EMRs) is the potential for discovering
important, clinically-actionable knowledge by analyzing large amounts of
warehoused medical data. For example:
1.
EMRs could be
used to address the efficacy vs. effectiveness question.
Efficacy is, "Does X work in a
clinical trial?" E.g., "Does aspirin
prevent heart attacks in a randomized clinical trial?" Effectiveness is, "Does X
have the predicted effect in actual practice?" E.g., "Do people in the
community who are given aspirin actually have fewer heart attacks?"
2.
EMRs could be
used to find support for (or against) specific hypotheses that health
scientists might want to test (e.g., "Is there an association between X and
Y in patient type Z?").
3.
EMR databases
could also be used in conjunction with computer algorithms to automatically
discover key patterns (new diseases, cancer risk factors, etc.) whose existence
is now totally unsuspected.
The Problem with EMRs. Unfortunately, we are a long way from using
EMRs to answer these sorts of questions.
EMRs consist of both structured information (such as billing codes and
demographic information) and unstructured information (such as text and
images); this information is not always easy to understand or extract, nor is
it always reliable. Demographic data are perhaps of the highest quality, since
patient birthdate, race, and gender are often
recorded as columns in a database table. But even simple demographic data
cannot be taken at face value; the most common patient weight in many databases
is zero pounds, simply because that is the default value or because no one
bothered to code an unknown or un-measured weight using a "null". Laboratory
results may be somewhat less reliable, since results reported by different
laboratories cause data integration problems---different labs report their
results in different ways. The diagnosis and
medical procedure information---which
are perhaps of greatest importance when
mining the data---are very unreliable.
Diagnoses and treatments are usually stored in EMRs as codes that were
originally extracted from electronic billing records.
In practice, these codes often bear little resemblance to the
actual diagnoses and treatments listed on the patients' charts.
"migraine" on the chart becomes a
"headache" in the billing data; a patient who has some confusion due to a
medication described on his chart has "paranoid personality disorder" listed in
the billing data, and so on. Such
inaccuracies will result in systematic biases that severely impact the utility
of any analysis that is applied to the data.
To understand or extract diagnoses, it is often necessary to look at
text-based clinical notes, which introduces all manner of difficulty, since
automated processing of text in a domain as diverse and unconstrained as medicine
is a very difficult problem.
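As a concrete illustration of the default-value problem described above, a data-cleaning pass can convert known sentinel values (such as a weight of zero pounds) to explicit nulls before any analysis. The field names and sentinel sets below are hypothetical; this is a minimal sketch, not the project's actual cleaning pipeline:

```python
# Sketch: flagging default "zero" values in demographic fields as missing.
# Column names and sentinel values are hypothetical, for illustration only.

SENTINELS = {"weight_lb": {0, 0.0}, "height_in": {0, 0.0}}

def clean_record(record):
    """Replace known default/sentinel values with None so that downstream
    analysis does not mistake them for real measurements."""
    cleaned = dict(record)
    for field, bad_values in SENTINELS.items():
        if cleaned.get(field) in bad_values:
            cleaned[field] = None
    return cleaned

records = [
    {"patient_id": 1, "weight_lb": 150, "height_in": 65},
    {"patient_id": 2, "weight_lb": 0, "height_in": 0},  # defaults, not data
]
cleaned = [clean_record(r) for r in records]
```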
During the first year of the project, we have
worked on the following problems:
Development of a Gold-Standard Dataset. A fundamental problem in medical informatics
is document classification. For
example, imagine that a medical researcher is trying to design a clinical
trial. Using a database of medical records, the researcher is trying to
identify patients aged 40 to 50 who are in remission from breast cancer.
Once they have been identified, they can be
contacted to inquire about possible participation in the trial.
The difficulty is that the database that
needs to be analyzed may have information on half a million patients, and the
fraction of patients who meet the criteria in question might be 0.1%. This
means that automatic or semi-automatic methods must be used to classify the
data. In order to design and evaluate such methods, it is critical to have a "gold-standard"
database of pre-labeled data. One of the sub-projects we have undertaken over
the first year of the project is developing just such a database using the U
Texas clinical data warehouse.
Specifically, we have produced a database of 10,000 electronic medical
records where each medical record has been labeled with whether or not the
record corresponds to a patient with breast cancer. We did this by first randomly
choosing 10,000 records from the U Texas database, and then developing a keyword-based
query through trial and error that was able to cut down the data set from
10,000 records to 1,000 records in such a way that we are quite sure that we
did not lose any breast cancer cases.
This involved two domain experts (medical doctors) coming up with a set
of keywords and billing codes that would catch anyone and everyone with breast
cancer---this includes medications for breast cancer, different names for the
disease, and different tests for the disease. Then, the experts went through
all of the 1,000 records that were returned and reviewed each of them by hand,
cutting down the 1,000 to around 80 that were actual breast cancer cases.
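The two-stage process above can be sketched as follows: a deliberately broad, high-recall keyword and billing-code screen, whose survivors are then reviewed by hand. The keywords and codes below are illustrative placeholders, not the actual list the domain experts developed:

```python
# Sketch of the two-stage labeling pipeline: a high-recall automatic screen
# followed by manual expert review. Keywords and codes are illustrative only.

KEYWORDS = {"breast cancer", "mammogram", "tamoxifen", "mastectomy"}
BILLING_CODES = {"174.9", "V10.3"}  # hypothetical example codes

def might_have_breast_cancer(record):
    """High-recall screen: keep any record that mentions any keyword in its
    free text or carries any of the billing codes. False positives are fine;
    they are removed later by the human reviewers."""
    text = record["text"].lower()
    if any(kw in text for kw in KEYWORDS):
        return True
    return bool(BILLING_CODES & set(record["codes"]))

records = [
    {"text": "Screening mammogram, no abnormality.", "codes": []},
    {"text": "Visit for seasonal allergies.", "codes": ["477.9"]},
]
candidates = [r for r in records if might_have_breast_cancer(r)]
```

Note that the screen is tuned entirely for recall; precision comes only from the subsequent manual pass.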
Automatic Highlighting of Discriminative
Passages in Text. Stated
simply, the task of labeling medical records is very difficult. We strongly believe
that no currently available, fully automatic tool for the task of labeling
electronic medical records can simultaneously produce high recall and high
precision. We have found that support vector machines, decision trees, random
forests, and even domain-specific, ontology-based AI-like methods fail
frequently for a variety of reasons. For example, imagine that we have two
records for two different women. In the clinical notes (free text), the two
women have the same set of tests (mammograms, MRIs, biopsies), leading to the
same set of keywords in each record. But in the end, one woman has cancer
and one does not. The lack of cancer is indicated via the sentence "The mass
was found to be benign", and the presence via the sentence "The mass was found
to be malignant." Any method that correctly identifies the lack of cancer would
have to understand that the benign mass was being tested for breast cancer
cells. And any method correctly
recognizing the breast cancer would have to understand that the malignant mass
was found in breast tissue. And even when a sentence clearly states that a
patient has breast cancer, things are not so simple: "The patient's mother has
breast cancer. The mother was diagnosed at age 35, the patient at age 32."
This leads us to believe that there is no
classification methodology on the horizon that is going to solve this problem,
meaning that if high precision and recall are mandatory, then manual or semi-manual
methods must be used. But the problem with having a human go through a large
database of free-text medical records is that it can take a very long time to
label the documents, with negative cases being the most difficult. Some
narratives have tens of thousands of words, and if the narrative does not
describe someone with the desired condition, the human labeler may feel that it
is necessary to read the entire document with some care in order to determine
that there is no indication that the patient has the condition. Thus, we have
developed a method that takes as input a set of free-text documents, as well as
a set of (likely incorrect) labels. Our method goes through the documents, and
automatically learns to underline the passages in the text that a human labeler
should look at most carefully in order to determine whether the document
is a positive or a negative example. For example, if the goal is to find records
corresponding to breast cancer, our method might take as input the text:
On 12/10/10, the patient visited her doctor complaining of a persistent cough and runny nose. This was diagnosed as a likely allergy and an OTC antihistamine was recommended. At that time, the primary care physician noted that the patient had not had a mammogram for nearly two years, despite a family history of breast cancer. The patient had a mammogram on 1/1/11. The resulting image was poor and so a second was ordered and performed two weeks later. A mass with an irregular border was found. The mass was biopsied and found to be malignant.
And output (the underlining is marked here with underscores):
On 12/10/10, the patient visited her doctor complaining of a persistent cough and runny nose. This was diagnosed as a likely allergy and an OTC antihistamine was recommended. At that time, the primary care physician noted that _the patient had not had a mammogram for nearly two years, despite a family history of breast cancer_. The patient had a mammogram on 1/1/11. The resulting image was poor and so a second was ordered and performed two weeks later. _A mass with an irregular border was found. The mass was biopsied and found to be malignant._
We are currently designing a trial where the
goal is to determine whether or not this sort of underlining can make human
coders more efficient, while at the same time not compromising accuracy.
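One simple way such underlining could work, sketched here as a general idea and not necessarily our exact algorithm, is to learn per-word log-odds scores from the (possibly noisy) document labels and then surface the highest-scoring sentences to the labeler:

```python
# Sketch: rank sentences by summed per-word log-odds learned from noisy
# labels, so the most discriminative passages surface first. This is an
# illustrative stand-in for the project's actual highlighting method.
import math
from collections import Counter

def word_log_odds(docs, labels, alpha=1.0):
    """Smoothed log-odds of each word appearing in positive vs. negative docs."""
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        (pos if y else neg).update(doc.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {w: math.log((pos[w] + alpha) / (n_pos + alpha * len(vocab)))
              - math.log((neg[w] + alpha) / (n_neg + alpha * len(vocab)))
            for w in vocab}

def top_sentences(text, scores, k=1):
    """Return the k sentences with the highest total word score."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    ranked = sorted(sentences, key=lambda s: -sum(
        scores.get(w, 0.0) for w in s.lower().split()))
    return ranked[:k]

docs = ["mass biopsied and found to be malignant",
        "cough and runny nose diagnosed as allergy"]
scores = word_log_odds(docs, [1, 0])
note = ("Patient complained of a cough. "
        "The mass was biopsied and found to be malignant.")
```

With these toy training documents, the "malignant" sentence in the note scores highest and would be underlined first.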
Grouping and Visualization for Co-Occurrence Analysis. The goal is
to develop automatic tools that help an analyst to understand the co-morbidities
that are present in the database, as well as to identify those
"outlier" conditions that have unique co-occurrence patterns, unlike
any other. We have developed a probabilistic graphical model, where each node
in the model is one particular condition. An edge is placed between two
conditions if there is a person in the warehouse who has both of the
conditions (the conditions exhibited by each patient are extracted using
standard medical concept extraction software). Our model then groups the conditions into
classes so that there are a large number of links within each class.
Furthermore, each class is positioned in a
latent, Euclidean space, in such a way that two classes that are close to one
another tend to have a lot of links between them, but two classes that are far
apart tend to have fewer links. The net result of this is that the various conditions
within a class will tend to co-occur with one another, and that conditions that
are in two different classes that are close to one another in the latent space
will also tend to co-occur with one another. Furthermore, those conditions that
are in "outlier" classes (those that are far from the center of the
Euclidean space) are those that either do not co-occur with any other, or those
that exhibit co-occurrence patterns that are seemingly random (that is, the condition
does not seem to have an affinity for any other condition).
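The first step, building the co-occurrence graph that the model then groups, can be sketched as follows. The per-patient condition lists are invented for illustration; in the project they come from concept-extraction software run over the warehouse:

```python
# Sketch: build the condition co-occurrence graph. An edge exists between
# two conditions whenever at least one patient exhibits both; the count on
# each edge is how many patients do. Patient data here is invented.
from collections import Counter
from itertools import combinations

def cooccurrence_edges(patients):
    """patients: iterable of per-patient condition lists.
    Returns a Counter mapping each unordered condition pair to the number
    of patients who exhibit both conditions."""
    edges = Counter()
    for conditions in patients:
        for a, b in combinations(sorted(set(conditions)), 2):
            edges[(a, b)] += 1
    return edges

patients = [
    ["diabetes", "hypertension", "obesity"],
    ["diabetes", "hypertension"],
    ["migraine"],
]
edges = cooccurrence_edges(patients)
```

A condition like the isolated "migraine" above ends up with no edges at all, which is exactly the kind of node the latent-space model would push toward an "outlier" class.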
To date, we have designed and implemented a statistical model and
associated algorithms for this problem. We are currently in the process of
applying our methods to the U Texas clinical data warehouse; as a proof of
concept we have applied our methods to social network analysis. If you are
interested in learning more about our method, it is described in detail in a
paper that describes that proof-of-concept application:
Zhuhua Cai, Chris Jermaine: The Latent Community Model for
Detecting Sybil Attacks in Social Networks. Appeared in NDSS 2012.
High Dimensional Data Imputation.
Often, medical records contain missing data, and it is necessary to impute that data for future analysis. This is particularly the case with time series data, which tend to be very high-dimensional and are ubiquitous in medical records. We considered the problem of imputation in very high-dimensional data with an arbitrary covariance structure. The modern solution to this problem is the Gaussian Markov random field (GMRF). The problem with applying a GMRF to very high-dimensional data imputation is that while the GMRF model itself can be useful even for data having tens of thousands of dimensions, utilizing a GMRF requires access to a sparsified, inverse covariance matrix for the data. Computing this matrix using even state-of-the-art methods is very costly, as it typically requires first estimating the covariance matrix from the data (at an O(nm^2) cost for m dimensions and n data points) and then performing a regularized inversion of the estimated covariance matrix, which is also very expensive. This is impractical for even moderately-sized, high-dimensional data sets.
Our objective was to develop an alternative for the standard GMRF that does not require a priori estimation of this covariance matrix.
Along these lines, we proposed a very simple alternative to the GMRF for high-dimensional imputation called the pairwise Gaussian random field, or PGRF for short. The PGRF is a graphical, factor-based model. Unlike traditional Gaussian or GMRF models, a PGRF does not require a covariance or correlation matrix as input. Instead, a PGRF takes as input a set of p (dimension, dimension) pairs for which the user suspects there might be a strong correlation or anti-correlation. This set of pairs defines the graphical structure of the model, with a simple Gaussian factor associated with each of the p (dimension, dimension) pairs. Using this structure, it is easy to perform simultaneous inference and imputation with the model. The key benefit of the approach is that the time required for the PGRF to perform inference is approximately linear in p, where p will typically be much smaller than the number of entries in an m-by-m covariance or precision matrix.
This is described in detail in the following paper:
Zhuhua Cai, Christopher M. Jermaine, Zografoula Vagena, Dionysios Logothetis, Luis Leopoldo Perez:
The Pairwise Gaussian Random Field for High-Dimensional Data Imputation. Appeared in ICDM 2013.
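To convey the intuition behind pairwise-factor imputation, here is a highly simplified sketch. It is not the PGRF inference algorithm from the paper, just an iterative conditional-mean fill-in driven by the user-supplied (dimension, dimension) pairs, whose cost per sweep is linear in the number of pairs:

```python
# Simplified sketch of pairwise-factor imputation (NOT the paper's PGRF
# algorithm): each supplied (i, j) pair contributes a Gaussian factor pulling
# x_i and x_j together, so a missing entry is repeatedly set to the
# precision-weighted mean of its neighbors.

def impute(x, pairs, weights, iters=50):
    """x: list of values with None for missing entries.
    pairs: (i, j) index pairs suspected to be strongly correlated.
    weights: one factor precision per pair."""
    neighbors = {i: [] for i in range(len(x))}
    for (i, j), w in zip(pairs, weights):
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    filled = [0.0 if v is None else v for v in x]
    missing = [i for i, v in enumerate(x) if v is None]
    for _ in range(iters):
        for i in missing:
            num = sum(w * filled[j] for j, w in neighbors[i])
            den = sum(w for _, w in neighbors[i])
            if den > 0:
                filled[i] = num / den  # conditional mean given neighbors
    return filled

# Dimensions 0-1 and 1-2 are declared correlated; dimension 1 is missing.
result = impute([2.0, None, 4.0], pairs=[(0, 1), (1, 2)], weights=[1.0, 1.0])
```

Each sweep touches only the p factors, which is the source of the linear-in-p behavior described above.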
Other Problems. We have also looked at several other problems
in the area, including: 1.
Development of a Bayesian method to count the number of patients in an EMR
database who exhibit a certain condition. 2.
Using an EMR database for real-time biosurveillance---for example, trying to
discover, in real time, emerging associations, such as the (now) widely known
and dangerous relationship between Vioxx use and acute coronary events. 3.
Record de-duplication for EMR databases. The following people are either funded by this project or
have worked on the project: Project
Participants
Last modified Jun 13th, 2016.