Matching for Non-Random Studies

Experimental designs such as A/B testing are a cornerstone of
statistical practice. By randomly assigning treatments to subjects, we
can test the effect of a test versus a control (as in a clinical trial
for a proposed new drug) or can determine which of several web page
layouts for a promotional offer receives the largest response. Designed,
controlled experiments are a common feature of much of scientific and
business research. The internet is a natural platform from which to
launch tests on almost any topic, and the principles of randomization
are easily understood.

Unfortunately, not all data are collected in this manner. Observational
studies are still commonplace. In medicine, many published results rely
on the observation of patients as they receive care, as opposed to
participating in a planned study. On the internet, offers are often sent
to whitelists which are made up of non-randomly selected potential
customers. And in manufacturing, it may not always be possible to
control environmental variables such as temperature and humidity, or
operational characteristics such as traffic intensity.

The fact that treatments are not randomly assigned does not
necessarily mean that the data subsequently collected are of no use. In
this note we will discuss ways that researchers can adjust their
analysis to account for the way in which treatments were assigned or
observed. We begin with a study of men who were admitted to hospital
with suspicion of heart attack. There were 400 subjects, aged 40-70,
with mortality within 30 days as the outcome. The treatment of interest
in this case is whether a new therapy involving a test medication is
more effective than the standard therapy. The issue is that the new
therapy was not assigned at random; rather, patient prognosis partly
determined which treatment was given.

The raw data show a lower mortality rate for the treatment group:

Outcomes of Study

           Alive   Dead
Control      168     40
Test         165     27

This corresponds to an odds ratio of 0.687, suggesting lower overall
mortality for the treatment group (although a test for the hypothesis of
equal mortality rates does not reject at α = 0.05). The odds ratio in
this case is the odds of death under the test divided by the odds of
death under the control; values less than 1 indicate better survival
for test subjects than for control subjects.
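As a quick check on the arithmetic, here is a minimal sketch that computes the odds ratio from the table above, together with a Wald-type interval on the log-odds scale. The function name is ours, and the interval method is an assumption, since the note does not say how its confidence limits were obtained.

```python
import math

def odds_ratio_with_ci(dead_test, alive_test, dead_ctrl, alive_ctrl, z=1.96):
    """Odds ratio of death (test vs. control) with a Wald interval on the log scale."""
    or_hat = (dead_test / alive_test) / (dead_ctrl / alive_ctrl)
    # Standard error of log(OR) from the four cell counts (Woolf's formula).
    se_log = math.sqrt(1 / dead_test + 1 / alive_test + 1 / dead_ctrl + 1 / alive_ctrl)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Counts taken from the Outcomes of Study table.
or_hat, lower, upper = odds_ratio_with_ci(dead_test=27, alive_test=165,
                                          dead_ctrl=40, alive_ctrl=168)
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The point estimate reproduces the 0.687 quoted above, and the interval spans 1, which agrees with the non-significant test.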

However, as suggested above, the test and control groups do not appear
to be comparable in their covariates, as we would expect them to be in
a fully randomized design. Older subjects and those with higher
severity scores (a clinical rating of disease severity) received the
test at higher rates than the control:

[Figures: age and severity score, test vs. control]

Quantification of the treatment effect is at least partially confounded
with some of the underlying conditions that correlate with increased
expected mortality. So how to proceed? Regression (in this case,
logistic regression) is typically used to measure the effect of one
variable conditional on the levels of the others, adjusting for
imbalances in the distributions of the explanatory factors. Fitting a
logistic regression with a binary variable for treatment
(0 = Control, 1 = Test) and linear terms for Age, Severity,
and Risk Score (another prognostic assessment), we find that the odds
ratio for test vs. control drops to 0.549 and is significantly less
than 1 at α = 0.05, with a 95% confidence interval of (0.310, 0.969).
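To make the link between the fitted model and the reported odds ratio explicit, the relationship can be written out as follows. The coefficient names are notation introduced here, and the interval formula assumes a Wald-type interval, which the note does not confirm.

```latex
\begin{aligned}
\operatorname{logit} P(\text{death})
  &= \beta_0 + \beta_Z Z + \beta_A\,\text{Age}
     + \beta_S\,\text{Severity} + \beta_R\,\text{Risk Score} \\
\text{OR}_{\text{test vs. control}} &= e^{\beta_Z}, \qquad
95\%\ \text{CI} = \exp\!\left(\hat\beta_Z \pm 1.96\,\widehat{\mathrm{SE}}(\hat\beta_Z)\right)
\end{aligned}
```

Under this reading, the reported odds ratio of 0.549 corresponds to $\hat\beta_Z = \ln 0.549 \approx -0.60$.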

However, this is an observational study, and to be useful, we need to
have more evidence that the effect of the test treatment can be measured
cleanly apart from the other variables AND from any (possibly
unintentional) selection biases that might arise due to the way
treatments were assigned to patients. That is, we need an unbiased
estimate of the treatment effect, i.e., the effect the test treatment
would have if applied to the entire population of interest. To do this,
we need to introduce the concept of a counterfactual.

Propensity Scores and Other Matching Methods

In a perfect world, we would be able to measure the effect of both the
treatment and control on each subject. In most cases, including our
example on heart attacks, such a measurement is neither available nor
possible. If we denote the response to the test treatment as $Y_1$ and
the response to the control treatment as $Y_0$, only one of these is
observable: subjects that get the test reveal $Y_1$ and subjects that
get the control reveal $Y_0$. For each subject, the unobserved outcome
is called a counterfactual, a conceptual quantity that is never
realized. If we use $Z$ to indicate which treatment was applied
($Z = 1$ for test and $Z = 0$ for control), then the observed response
$Y$ picks out whichever of the two potential outcomes was actually
realized:

$$Y = Z Y_1 + (1 - Z) Y_0$$

For each subject we are interested in $Y_1 - Y_0$, the difference
between the two potential outcomes, only one of which is observed. When
looking at the target population as a whole, we are likely interested
in the average treatment effect:

$$\Delta = E(Y_1 - Y_0) = E(Y_1) - E(Y_0)$$

where $E$ denotes expectation or average. Unless the treatment
assignment is independent of the other factors, estimates of $\Delta$
might be biased. Matching and related approaches based on propensity
scores are among the techniques that can be used to make the assignment
of test and control appear more random.
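To see how non-random assignment can bias a naive comparison, here is a small simulation sketch using the potential-outcomes notation above. Everything in it is hypothetical: the binary "severe" confounder, the death risks, and the assignment probabilities are made-up values chosen for illustration, not estimates from the heart attack study.

```python
import random

def simulate(n, confounded, seed=0):
    """Simulate potential outcomes (1 = died) under one assignment mechanism.

    Returns the naive difference in observed death rates (test minus
    control) and the true average effect E(Y1 - Y0) in the sample.
    """
    rng = random.Random(seed)
    dead = {0: 0, 1: 0}   # observed deaths by arm
    size = {0: 0, 1: 0}   # subjects by arm
    total_effect = 0
    for _ in range(n):
        severe = rng.random() < 0.5          # hypothetical binary confounder
        p0 = 0.30 if severe else 0.10        # death risk under control
        p1 = p0 - 0.05                       # test lowers the risk by 5 points
        y0 = int(rng.random() < p0)          # potential outcome under control
        y1 = int(rng.random() < p1)          # potential outcome under test
        if confounded:
            p_test = 0.8 if severe else 0.2  # sicker patients get the test more often
        else:
            p_test = 0.5                     # coin-flip assignment
        z = int(rng.random() < p_test)
        y = z * y1 + (1 - z) * y0            # only one outcome is ever observed
        dead[z] += y
        size[z] += 1
        total_effect += y1 - y0
    naive = dead[1] / size[1] - dead[0] / size[0]
    return naive, total_effect / n

naive_rand, ate_rand = simulate(50_000, confounded=False)
naive_conf, ate_conf = simulate(50_000, confounded=True)
print(f"randomized: naive = {naive_rand:+.3f}, true = {ate_rand:+.3f}")
print(f"confounded: naive = {naive_conf:+.3f}, true = {ate_conf:+.3f}")
```

With coin-flip assignment the naive difference tracks the true effect of about -0.05. When sicker patients are steered toward the test, the naive comparison drifts toward +0.07 in expectation, making a beneficial treatment look harmful.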

A propensity score is an estimate of the probability that a subject is
assigned to the test group, given that subject's observed
characteristics. If assignment is completely random, then the
propensity score is simply 0.50 for all subjects (assuming test and
control groups of equal size). It can be shown that if all the
characteristics that influence both $Z$ (assignment to test or control)
and $(Y_0, Y_1)$ are known (e.g., age, severity), then partitioning on
this set of confounders $X$ will allow us to develop an unbiased
estimate of $\Delta$.

One form of partitioning is straightforward matching of test and
control subjects that share the same values of all confounders.
Matching may be easy to do when there are only one or two confounders,
but it rapidly gets harder as the number of potential confounders
increases. A more flexible approach is propensity score matching: it
can also be shown that partitioning on the propensity scores $p(x)$ is
as good as partitioning on the raw $X$-variables themselves, as long as
all subjects have a chance of being selected for both test and control.
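As an illustration of partitioning on a confounder, the sketch below compares a pooled (naive) difference in death rates with one computed inside each stratum and then averaged. The counts are hypothetical, built so that the test lowers the death rate by 5 points in both strata while being given mostly to severe cases.

```python
# Hypothetical counts: stratum -> arm -> (subjects, deaths)
strata = {
    "mild":   {"test": (100, 5),   "control": (400, 40)},
    "severe": {"test": (400, 100), "control": (100, 30)},
}

def naive_difference(strata):
    """Pooled death rate in test minus pooled death rate in control."""
    rate = {}
    for arm in ("test", "control"):
        n = sum(cells[arm][0] for cells in strata.values())
        deaths = sum(cells[arm][1] for cells in strata.values())
        rate[arm] = deaths / n
    return rate["test"] - rate["control"]

def stratified_difference(strata):
    """Average the within-stratum differences, weighted by stratum size."""
    total = sum(n for cells in strata.values() for n, _ in cells.values())
    estimate = 0.0
    for cells in strata.values():
        (n_t, d_t), (n_c, d_c) = cells["test"], cells["control"]
        estimate += (n_t + n_c) / total * (d_t / n_t - d_c / n_c)
    return estimate

naive = naive_difference(strata)
adjusted = stratified_difference(strata)
print(f"naive = {naive:+.3f}, stratified = {adjusted:+.3f}")
```

The pooled comparison has the wrong sign because the test arm is dominated by severe cases. Comparing like with like inside each stratum recovers the 5-point reduction built into the counts.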

In our example, what does this imply? To obtain propensity scores, we
can build a logistic regression model with the assignment indicator $Z$
as the response, so that the fitted values estimate the probability
that a subject is assigned to the test group. This gives us estimated
propensity scores $\hat{p}(x)$. Doing this in our example finds that
age, severity, and risk score are all significant at α = 0.05 for
predicting the treatment assignment.
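The following is a minimal sketch of estimating propensity scores by logistic regression, written in plain Python with a single covariate so that it stays self-contained. The data are simulated, the true slope of 0.8 per decade of age is an arbitrary choice, and `fit_logistic` is a hypothetical helper rather than the code behind the results in this note. In practice a statistics library would handle several covariates at once.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_logistic(xs, zs, iterations=25):
    """Fit P(Z = 1 | x) = sigmoid(a + b * x) by Newton-Raphson; returns (a, b)."""
    a = b = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = sigmoid(a + b * x)
            w = p * (1 - p)
            g0 += z - p              # gradient w.r.t. intercept
            g1 += (z - p) * x        # gradient w.r.t. slope
            h00 += w                 # entries of the information matrix X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (X'WX) * step = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Simulated data: older subjects are more likely to receive the test.
rng = random.Random(1)
ages = [rng.uniform(40, 70) for _ in range(2000)]
xs = [(age - 55) / 10 for age in ages]               # centred age, in decades
zs = [int(rng.random() < sigmoid(0.8 * x)) for x in xs]

a_hat, b_hat = fit_logistic(xs, zs)
p_hat = [sigmoid(a_hat + b_hat * x) for x in xs]     # estimated propensity scores
print(f"intercept = {a_hat:.2f}, slope per decade = {b_hat:.2f}")
```

Each entry of `p_hat` is the estimated probability that the corresponding subject would be assigned to the test group, which is the quantity used for matching or weighting.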

There are a variety of approaches we can take at this point. One is to
use the estimated propensity scores to match test and control subjects,
as if we had conducted a randomized matched-pairs design that assigns
the test treatment to one subject in each pair. Doing this, we get an
estimated odds ratio of 0.511, with 95% confidence limits of
(0.268, 0.973), only slightly lower than that found via logistic
regression.
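One simple way to form such pairs is greedy nearest-neighbour matching on the propensity score. The sketch below is one possible variant (1:1, without replacement, with a caliper on the score difference). The note does not say which matching algorithm produced the estimate above, and the subject IDs, scores, and caliper value here are hypothetical.

```python
def match_pairs(test_scores, control_scores, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Both arguments map subject id -> estimated propensity score. A test
    subject stays unmatched if no remaining control is within the caliper.
    """
    available = dict(control_scores)   # controls not yet used
    pairs = []
    for t_id, t_ps in sorted(test_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control on the propensity score.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs

test_scores = {"t1": 0.30, "t2": 0.55, "t3": 0.90}
control_scores = {"c1": 0.28, "c2": 0.50, "c3": 0.57, "c4": 0.10}
pairs = match_pairs(test_scores, control_scores)
print(pairs)
```

Here `t3` has no control within the caliper and is left out, which is how matching trims test subjects that have no comparable control.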

A second approach is to use the propensity score to weight subjects and
proceed as if we had run a randomized design, with each subject's
weight given by the reciprocal of $\hat{p}(x)$. This approach gives an
estimated odds ratio of 0.567 with confidence limits of
(0.301, 1.064), again very close to that obtained by the original
logistic regression.
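The weighting idea can be sketched with inverse probability weights. The note does not spell out the exact weights used, so this sketch assumes the common convention of 1/p for test subjects and 1/(1 - p) for control subjects. The subjects are hypothetical, with known propensity scores of 0.2 and 0.8.

```python
# Hypothetical cells: (propensity score, z, subjects, deaths)
cells = [(0.2, 1, 100, 5), (0.2, 0, 400, 40),
         (0.8, 1, 400, 100), (0.8, 0, 100, 30)]

# Expand the cells into one (propensity, z, y) record per subject.
subjects = []
for p, z, n, deaths in cells:
    subjects += [(p, z, 1)] * deaths + [(p, z, 0)] * (n - deaths)

def ipw_means(subjects):
    """Inverse-probability-weighted estimates of E(Y1) and E(Y0)."""
    n = len(subjects)
    mean_y1 = sum(z * y / p for p, z, y in subjects) / n
    mean_y0 = sum((1 - z) * y / (1 - p) for p, z, y in subjects) / n
    return mean_y1, mean_y0

mean_y1, mean_y0 = ipw_means(subjects)
ate = mean_y1 - mean_y0
# Marginal odds ratio of death, test vs. control, from the weighted rates.
marginal_or = (mean_y1 / (1 - mean_y1)) / (mean_y0 / (1 - mean_y0))
print(f"E(Y1) = {mean_y1:.3f}, E(Y0) = {mean_y0:.3f}, "
      f"difference = {ate:+.3f}, OR = {marginal_or:.3f}")
```

Subjects who received a treatment they were unlikely to get count for more, which rebalances the two arms toward the full population.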

It is important when doing any matching or score adjustment to make
sure that the adjusted groups (test and control) resemble each other as
much as possible. To determine the effect of the matching, we look at
each variable in turn, regressed against both treatment and propensity
score. In the following plot, filled circles show the t-statistics for
the treatment difference in each confounder variable before adjusting
for propensity score, and open circles show them after adjustment.
The propensity score appears to have made the test and control groups
look much more similar to each other than before.
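A simpler balance diagnostic than the regression t-statistics described above is the difference in covariate means between test and control, before and after weighting. The sketch below swaps in that simpler check. It uses hypothetical subjects and assumes inverse probability weights of 1/p for test and 1/(1 - p) for control.

```python
# Hypothetical cells: (propensity score, z, severe indicator, subjects)
cells = [(0.2, 1, 0, 100), (0.2, 0, 0, 400),
         (0.8, 1, 1, 400), (0.8, 0, 1, 100)]
subjects = [(p, z, x) for p, z, x, n in cells for _ in range(n)]

def mean_difference(subjects, weighted):
    """Test-minus-control difference in the (optionally weighted) covariate mean."""
    sums = {0: 0.0, 1: 0.0}
    weights = {0: 0.0, 1: 0.0}
    for p, z, x in subjects:
        if weighted:
            w = 1 / p if z else 1 / (1 - p)   # assumed inverse probability weights
        else:
            w = 1.0
        sums[z] += w * x
        weights[z] += w
    return sums[1] / weights[1] - sums[0] / weights[0]

before = mean_difference(subjects, weighted=False)
after = mean_difference(subjects, weighted=True)
print(f"difference in share of severe cases: before = {before:+.2f}, after = {after:+.2f}")
```

A difference close to zero after adjustment is the numerical counterpart of the open circles moving toward zero in the plot.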

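[Figure: t-statistics for the treatment difference in each confounder, before (filled circles) and after (open circles) propensity score adjustment]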

Summarizing the results, we see that all the adjustment methods give
similar results in terms of measuring the effect of the test treatment.
It is often the case that the logistic regression estimate of the
conditional effect of treatment, given the confounding variables, is
very similar to the estimates of the marginal effect of the treatment
on the population produced by the adjusted analyses. Whenever there is
interest in estimating the effects of treatments from observational
data, a propensity-based analysis is recommended to ensure that
potential biases due to non-random design are mitigated as far as
possible.

Estimates and 95% Confidence Limits for Odds Ratio of Test vs. Control

                      Estimate   Lower CI   Upper CI
Unadjusted               0.687      0.403      1.171
Logistic regression      0.549      0.310      0.969
Matched cases            0.511      0.268      0.973
Propensity-weighted      0.567      0.301      1.064

Acknowledgements and Resources

This work is based in part on a modified version of an analysis by Ben
Cowling (http://web.hku.hk/~bcowling/examples/propensity.htm#thanks).

To read more about propensity scores and matching, see Dehejia and Wahba
(http://www.uh.edu/~adkugler/Dehejia&Wahba.pdf) and Rosenbaum and
Rubin
(http://www.stat.cmu.edu/~ryantibs/journalclub/rosenbaum_1983.pdf).

MatchIt is an R package that implements many forms of matching
methods to better balance data sets
(https://gking.harvard.edu/matchit).
