In the field of machine learning and particularly in supervised learning, correlation is crucial to predict the target variable with the help of the feature variables. Rarely do we think about causation and the actual effect of a single feature variable or covariate on the target or response. Some even go so far as to say that „correlation trumps causation“ like in the book „Big Data: A Revolution That Will Transform How We Live, Work, and Think“ by Viktor Mayer-Schönberger and Kenneth Cukier. Following their reasoning, with Big Data there is no need to think about causation anymore, since nonparametric models will do just fine using correlation alone. For many practical use cases, this point of view may seem acceptable — but surely not for all.

Consider for instance you are managing an advertisement campaign with a budget allowing you to send discount vouchers to 10,000 of your customers. Obviously, you want to maximise the outcome of the campaign, meaning you want to focus on customers that buy because they received a voucher. If \(y_{1i}\) and \(y_{0i}\) are the amounts of money spent by a customer \(i\) described by \(x_i\) that either received or did not receive a voucher, we want to find a subset \(I’\subset{}I\) where \(I\) is the set of all former customers with \(|I’|=10,000\) so that \(\sum_{i\in I‘}y_{1i} – y_{0i}\) is maximised.

Another example would be to estimate the effect of additional booking options in an online marketplace. A common use case of an online vehicle marketplace is to provide estimations to sellers on the effect that additional booking options (e.g. highlighting, top of page etc.) may have on the selling time of a given vehicle. Likewise, many dealers will be interested in knowing how changing the price of a vehicle will affect the probability of selling the vehicle within a certain period of time or the expected selling time.

These two examples outline the need for methods to estimate the actual causal effect of a controllable covariate onto the response. Before we dip our toes into the deep sea of causal inference, let’s consider a few general aspects. The first and maybe not so obvious point when coming from a supervised learning background is that causal inference is thinking about *what did not happen*. That means that we don’t know for instance how much a customer that received a voucher would have ordered if they had not received one. This fundamental problem basically renders it an unsupervised learning problem. More interesting aspects of causal inference are summarized in a blog post by Macartan Humphreys on 10 Things to Know About Causal Inference. The question remains how can we estimate the causal effect of a controllable covariate?

## Strongly ignorable

The answer to this question lies in another important (and actually the original) use case of causal inference, which is the analysis of therapy effects. In a best-case scenario, the effect of a therapy can be determined in a randomized trial by comparing the response of a treatment group to a control group. In a randomized trial, the allocation of participants to the test or control group is random and thus independent of any covariates \(X\). Following the original paper of Rosenbaum & Rubin ^{2}, in a randomized trial the treatment assignment \(Z\) and the (unobservable) potential outcomes \({Y_1, Y_0}\) are conditionally independent given the covariates \(X\), i.e.

Furthermore, we assume that each participant in the experiment has a chance to receive each treatment, i.e. \(0 < p(Z=1|x) < 1\). The treatment assignment is said to be *strongly ignorable* if those two conditions hold for our observed covariates \(x\). As already mentioned, in a randomized experiment the treatment assignment is strongly ignorable.

## Causal effect in a randomized trial

In a randomized trial, the strong ignorability of \(Z\) allows us to estimate the effect of the treatment by comparing the response of the treatment group with that of the control group. The following approach may be used to estimate the individual effect of an additional booking option on a vehicle’s selling time with machine learning methods like a random forest:

- Train the model with the covariates \(X\) and \(Z\) as feature and response \(Y\) as target,
- predict for a given \(x\) the response \(\hat{y}_1\) with \(Z=1\) and \(\hat{y}_0\) with \(Z=0\),
- calculate the effect with \(\hat{y}_1 – \hat{y}_0\) or \(\frac{\hat{y}_1}{\hat{y}_0}\).

In a real-world, big data problem we often have no control over the experimental setup, we are just left with the data. This happens for instance in observational studies. Imagine for instance the treatment with an experimental drug having strong side-effects that might cure a life-threatening disease. A controlled, randomized experiment where the control groups gets a placebo might be impracticable or even unethical.

In such a nonrandomized experiment, there is no proper mathematical way to check if the treatment is strongly ignorable^{3}. According to Pearl^{4} by now assuming strong ignorability we are basically assuming that our covariate set \(X\) is *admissible*, i.e. \(p(y|\mathrm{do}(z))=\sum_{x}p(y|x,z)p(x)\). Here, Pearl’s \(\mathrm{do}\)-notation \(p(y|\mathrm{do}(z))\) denotes the “causal effect” of \(Z\) on \(Y\), i.e. the distribution of \(Y\) after setting variable \(Z\) to a constant \(Z = z\) by external intervention. In practice the assumption of admissibility of \(X\) is often used to estimate a causal effect. This led to wrong results in some studies as well as controversies^{3}, so one should always be aware that the entire causal analysis depends on the validity of this assumption.

So far, we have not only assumed admissibility of \(X\) but also a randomized trial for our approach. Therefore, we should check beforehand that

,

(which is a necessary but not sufficient condition), before applying the former approach. This can be verified by using \(X\) to predict \(Z\). If this is not possible and thus \(p(z|x) = p(z)\), the former approach is viable. But what if \(Z\) is not independent of \(X\) — as is often the case with real-world data. For instance, dealers of expensive vehicle brands might be more willing to spend money and therefore tend to use more booking options. The data from the last marketing campaign presumingly includes a bias induced by the current strategy of the marketing department on how to pick customers that get a voucher. In these cases, we have to isolate the effect of \(Z\) from our covariates \(X\).

## Propensity score

By predicting \(Z\) based on \(X\), we have estimated the *propensity score*, i.e. \(p(Z=1|x)\). This of course assumes that we have used a classification method that returns probabilities for the classes \(Z=1\) and \(Z=0\). Let \(e_i=p(Z=1|x_i)\) be the propensity score of the \(i\)-th observation, i.e. the propensity of the \(i\)-th participant getting the treatment (\(Z=1\)).

We can use the propensity score to define weights \(w_i\) to create a synthetic sample in which the distribution of measured baseline covariates is independent of treatment assignment^{5}, i.e.

\(w_i=\frac{z_i}{e_i}+\frac{1-z_i}{1-e_i},\)

where \(z_i\) indicates if the \(i\)-th subject was treated.

The covariates from our data sample \(x_i\) are then weighted by \(w_i\) to eliminate the correlation between \(X\) and \(Z\), which is a technique known as *inverse probability of treatment weighting* (IPTW). This allows us to estimate the causal effect via the following approach:

- Train a model with covariates \(X\) to predict \(Z\),
- calculate the propensities scores \(e_i\) by applying the trained model to all \(x_i\),
- train a second model with covariates \(X\) and \(Z\) as features and response \(Y\) as target by using \(w_i\) as sample weight for the \(i\)-th observation,
- use this model to predict the causal effect like in the randomized trial approach.

IPTW is based on a simple intuition. For a randomized trial with \(p(Z=1)=k\) the propensity score would be equal for all patients, i.e. \(e_i=\frac{1}{k}\) and thus \(w_i=k\). In a nonrandomized trial, we would assign low weights to samples where the assignment of treatment matches our expectation and high weights otherwise. By doing so, we draw the attention of the machine learning algorithm to the observations where the effect of treatment is most prevalent, i.e. least confounded with the covariates.

## Python implementation

We can set up a synthetic experiment to demonstrate and evaluate this method with the help of Python and Scikit-Learn. A synthetic experiment is appropriate to address the fundamental problem of causal inference described above. With real data, we don’t know what would have happened if we had not treated someone, sent a voucher or not booked that additional option and vice versa. Therefore, we derive a model that describes the relationship of \(X\) to \(Y\) as well as the effect of \(Z\) on \(Y\). We use this model to generate observational data where for each sample \(x_i\) we either have \(Z=1\) or \(Z=0\) and thus incomplete information in our data. Our task is now to estimate the effect of \(Z\) with the help of the generated data.

The following portion of this article is available for download as a Jupyter notebook.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 | from math import exp, log import numpy as np import pandas as pd import scipy as sp from scipy import stats from scipy.special import expit from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor from sklearn.linear_model import LogisticRegression from sklearn.naive_bayes import GaussianNB from sklearn.calibration import CalibratedClassifierCV import statsmodels.api as sm import statsmodels.formula.api as smf import matplotlib.pyplot as plt %matplotlib inline import seaborn as sns sns.set_style("whitegrid") sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5}) plt.rcParams['figure.figsize'] = 10, 8 np.seterr(divide='ignore', invalid='ignore') np.random.seed(42) |

### Model

Assume we have patients characterized by sex and age suffering from some disease with different severities. The recovery time of a patient depends only on the sex, age, severity and if the patient is on medication. Let the expected recovery time in days \(t_{recovery}\) be defined as

\(\mathbf{E}(t_{recovery}) = \exp(2+0.5\cdot{}I_{male}+0.03\cdot{}age+2\cdot{}severity-1\cdot{}I_{medication}),\)

where \(I\) is an indicator function.

Furthermore, we will assume a Poisson distribution in order to generate some synthetic data of our patient’s recovery time. Due to our definition, treating the disease with medication reduces the recovery time to \(exp(-1)\approx 0.37\) of the recovery time having no treatment. Although the recovery time is specific to each patients, i.e. his/her features, the effect of a reduction to 37% of the recovery time without medication is the same for all patients.

1 2 3 4 5 | def exp_recovery_time(sex, age, severity, medication): return exp(2+0.5*sex+0.03*age+2*severity-1*medication) def rvs_recovery_time(sex, age, severity, medication, *args): return stats.poisson.rvs(exp_recovery_time(sex, age, severity, medication)) |

For the features of the patients we will use a Beta distribution to show how badly the disease struck the patients, a Gamma distribution for the age of our patients and a Bernoulli distribution for the gender of the patients.

1 2 3 4 5 6 | N = 10000 # number of observations, i.e. patients sexes = np.random.randint(0, 2, size=N) # sex == 1 if male otherwise female ages_dist = stats.gamma(8, scale=4) ages = ages_dist.rvs(size=N) sev_dist = stats.beta(3, 1.5) severties = sev_dist.rvs(size=N) |

It’s always a good idea to take a look at the nontrivial distributions:

1 2 3 4 5 6 7 8 9 10 11 12 | f, (ax1, ax2) = plt.subplots(2) ax1.set_title('age') ax1.set_xlabel('years') ax1.set_ylabel('density') ax1.set_xlim(0, np.max(ages)) ax2.set_title('severity') ax2.set_xlabel('0 = lowest severity, 1 = highest severity') ax2.set_ylabel('density') ax2.set_xlim(0, 1) sns.distplot(ages, ax=ax1) sns.distplot(severties, ax=ax2) plt.tight_layout(); |

### Randomized trial

In a controlled randomized trial we randomly select patients and assign them with a chance of 50% to either treatment or control. Therefore, the assignment of treatement is completely random and independent.

1 | meds = np.random.randint(0, 2, size=N) |

We assemble everything in a dataframe also including a constant column.

1 2 3 4 5 6 | const = np.ones(N) df_rnd = pd.DataFrame(dict(sex=sexes, age=ages, severity=severties, medication=meds, const=const)) features = ['sex', 'age', 'severity', 'medication', 'const'] df_rnd = df_rnd[features] # to enforce column order df_rnd['recovery'] = df_rnd.apply(lambda x: rvs_recovery_time(*x) , axis=1) df_rnd.head() |

1 | df_rnd.describe() |

By construction, there is no correlation between medication and any other covariate.

1 | sns.heatmap(df_rnd.corr(), vmin=-1, vmax=1); |

To get started we use a Poisson regression to estimate the coefficients of our formula for \(\mathbf{E}(t_{recovery})\) from the generated data. Of course we expect to see approximately the same coefficients since Poisson regression assumes the exact same model that generated our data.

1 2 3 | glm = sm.GLM(df_rnd['recovery'], df_rnd[features], family=sm.families.Poisson()) res = glm.fit() res.summary() |

Now we use a randome forest which is a pretty standard machine learning method to estimate the individual effects of the treatment on the patients. We fit the model and predict for each patient the recovery time assuming medication, i.e. medication column is 1, as well as assuming no medication, i.e. medication column is 0. Subsequently we divide the prediction assuming medication by the prediction assuming no medication to get an estimation of the treatment’s effect.

1 2 3 4 | reg = RandomForestRegressor() X = df_rnd[features].as_matrix() y = df_rnd['recovery'].values reg.fit(X, y) |

1 2 3 4 5 | RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False) |

1 2 3 4 5 6 7 8 | X_neg = np.copy(X) # set the medication column to 0 X_neg[:, df_rnd.columns.get_loc('medication')] = 0 X_pos = np.copy(X) # set the medication column to 1 X_pos[:, df_rnd.columns.get_loc('medication')] = 1 preds_rnd = reg.predict(X_pos) / reg.predict(X_neg) |

Let’s take a look at the distribution of individual effects. Even though we are assuming no model by using a random forest, our estimations of the treatment effect look decent.

1 2 3 4 5 6 | ax = sns.distplot(preds_rnd) ax.set_xlabel('treatment effect') ax.set_ylabel('density') plt.axvline(np.mean(preds_rnd), label='mean') plt.axvline(np.exp(-1), color='r', label='truth') plt.legend(); |

### Nonrandomized trial

To make things a bit more interesting, we put now patients on a treatment depending on their sex and severity of the illness. Since men often suffer more than women from the same illness, e.g. man flu, they tend to complain more and thus are more likely to convince the doctor of prescibing a medication. Thereafter we generate the recovery time again and follow the same procedure as before in the randomized trial.

1 2 | def get_medication(sex, age, severity, medication, *args): return int(1/3*sex + 2/3*severity + 0.15*np.random.randn() > 0.8) |

1 2 3 4 | df_obs = df_rnd.copy().drop('recovery', axis=1) df_obs['medication'] = df_obs.apply(lambda x: get_medication(*x), axis=1) df_obs['recovery'] = df_obs.apply(lambda x: rvs_recovery_time(*x), axis=1) df_obs.describe() |

1 | sns.heatmap(df_obs.corr(), vmin=-1, vmax=1); |

1 2 3 | glm = sm.GLM(df_obs['recovery'], df_obs[features], family=sm.families.Poisson()) res = glm.fit() res.summary() |

The first stunning result is that the Poisson regression is still able to correctly estimate the coefficients of our model. This is due to the model dependence and in realistic cases actually a bad thing. In a nutshell, model dependence means that the inference of the causal effect depends on the chosen model. In our case, the assumptions about the relation of the covariates in the Poisson regression extrapolates our data and thus makes our results heavily depend on the Poisson model. Since we also used a Poisson model to generate the data we are lucky but this is in reality rarely the case. Let’s check how the random forest performs.

1 2 3 4 | reg = RandomForestRegressor() X = df_obs[features].as_matrix() y = df_obs['recovery'].values reg.fit(X, y) |

1 2 3 4 5 | RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False) |

1 2 3 4 5 6 | X_neg = np.copy(X) X_neg[:, df_obs.columns.get_loc('medication')] = 0 X_pos = np.copy(X) X_pos[:, df_obs.columns.get_loc('medication')] = 1 preds_no_rnd = reg.predict(X_pos) / reg.predict(X_neg) |

1 2 3 4 5 6 | ax = sns.distplot(preds_no_rnd) ax.set_xlabel('treatment effect') ax.set_ylabel('density') plt.axvline(np.mean(preds_no_rnd), label='mean') plt.axvline(np.exp(-1), color='r', label='truth') plt.legend(); |

The distribution is now quite skewed and we can see that for a lot of our patients their individual treatment effect is heavily underestimated. This due to the fact that the random forest overrates the impact of a patient’s sex which is highly correlated with the medication.

### Inverse probability of treatment weighting

To deminish the impact of other covariates onto the effect of medication we will no calculate the propensity score and use inverse probability of treatment weighting (IPTW). In order to do that we use a classification to predict the probability of a patient to be treated. This can be accomplished by Scikit-Learn’s `predict_proba`

method that is available for most classificators. Don’t be fooled by the name though, in most cases (logistic regression is an exception) the probabilites are not calibrated and cannot be relied on. To fix this, we use the CalibratedClassifierCV in order to get proper probabilities (and it doesn’t hurt applying it for logistic regression too). After that we calculate the inverse probability of treatment weights and pass those as sample weights to the estimator during the fit.

1 2 3 4 5 6 7 8 9 10 | # classifier to estimate the propensity score cls = LogisticRegression(random_state=42) #cls = GaussianNB() # another possible propensity score estimator # calibration of the classifier cls = CalibratedClassifierCV(cls) X = df_obs[features].drop(['medication'], axis=1).as_matrix() y = df_obs['medication'].values cls.fit(X, y) |

1 2 3 4 5 | CalibratedClassifierCV(base_estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1, penalty='l2', random_state=42, solver='liblinear', tol=0.0001, verbose=0, warm_start=False), cv=3, method='sigmoid') |

1 2 | propensity = pd.DataFrame(cls.predict_proba(X)) propensity.head() |

We can see that the propensity scores of our patients in the randomized trial vary a lot as expected.

1 2 | propensity = pd.DataFrame(cls.predict_proba(X)) propensity.head() |

Only for comparison we can also plot the propensity scores of the randomized trail and check if the propensity score is \(\frac{1}{2}\) as expected.

1 2 3 4 5 6 7 8 | X = df_rnd[features].drop(['medication'], axis=1).as_matrix() y = df_rnd['medication'].values cls.fit(X, y) ax = sns.distplot(cls.predict_proba(X)[:,1]); ax.set_xlim(0, 1) ax.set_title("Propensity scores of randomized trial") ax.set_xlabel("propensity scores") ax.set_ylabel('density'); |

We now calculate at this point the inverse probability of treatment weights (IPTWs) with the help of the propensity scores of the nonrandomized trial.

1 2 | # DataFrame's lookup method extracts the column index provided by df2['medication'] for each row df_obs['iptw'] = 1. / propensity.lookup(np.arange(propensity.shape[0]), df_obs['medication']) |

1 | df_obs.describe() |

The poisson regression benefits from using the IPTWs as weights since the Z-scores of the coefficients increase.

1 2 3 4 5 | glm = sm.GLM(df_obs['recovery'], df_obs[features], family=sm.families.Poisson(), freq_weights=df_obs['iptw']) res = glm.fit() res.summary() |

Let’s check how our random forest does with the help of IPTWs.

1 2 3 4 | reg = RandomForestRegressor(random_state=42) X = df_obs[features].as_matrix() y = df_obs['recovery'].values reg.fit(X, y, sample_weight=df_obs['iptw'].values) |

1 2 3 4 5 | RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1, oob_score=False, random_state=42, verbose=0, warm_start=False) |

1 2 3 4 5 6 | X_neg = np.copy(X) X_neg[:, df_obs.columns.get_loc('medication')] = 0 X_pos = np.copy(X) X_pos[:, df_obs.columns.get_loc('medication')] = 1 preds_propensity = reg.predict(X_pos) / reg.predict(X_neg) |

1 2 3 4 5 6 | ax = sns.distplot(preds_propensity) ax.set_xlabel('treatment effect') ax.set_ylabel('density') plt.axvline(np.mean(preds_propensity), label='mean') plt.axvline(np.exp(-1), color='r', label='truth') plt.legend(); |

After taking a brief look at the distribution we see that using IPTW drastically improved the estimation of the treatment’s causal effect. On second glance though, it can also be seen that for a few samples we have over-estimation beyond 1. Looking at those patients it can be seen that all of them are female. This indicates that the causal effect for these cases could not be captured maybe due to the bagging approach in random forests. For most of the patients the estimation of the causal effect is improved though.

A direct comparison is given below, showing the estimations of treatment effects for the *randomized* trail, the *non-randomized* trail and the *non-randomized with IPTW* application.

1 2 3 4 5 6 | sns.distplot(preds_rnd, label='randomized') sns.distplot(preds_no_rnd, label='non-randomized') ax = sns.distplot(preds_propensity, label='non-randomized with IPTW') ax.set_xlabel('treatment effect') ax.set_ylabel('density') plt.legend(); |

The actual trick of IPTW is that sample weights are chosen in such a way that the correlation of other covariates and the medication is decreased. With the help of a weighted correlation this can also be illustrated. Remarkably enough, neither Numpy, Scipy, Pandas nor StatsModels seem to directly provide a weigthed correlation function, only weighted covariance, which we use to define a weighted correlation.

1 2 3 4 5 6 | def weighted_corr(m, w=None): if w is None: w = np.ones(m.shape[0]) cov = np.cov(m, rowvar=False, aweights=w, ddof=0) sigma = np.sqrt(np.diag(cov)) return cov / np.outer(sigma, sigma) |

Here is the original correlation of the nonrandomized trial again by setting all weights to 1.

1 2 3 4 | sel_cols = [col for col in df_obs.columns if col != 'iptw'] orig_corr = weighted_corr(df_obs[sel_cols].as_matrix(), w=np.ones(df_obs.shape[0])) orig_corr = pd.DataFrame(orig_corr, index=sel_cols, columns=sel_cols) orig_corr |