The following is one of two posts published alongside the JustCause framework, which we developed at inovex as a tool to foster good scientific practice in the field of Causal Inference. If you are not familiar with the field yet, consider reading the first article on the topic, which gives a high-level conceptual overview and also dives into the theory behind treatment effect estimation in more depth. Here, I will work through a synthetic example to show the efficacy of causal inference in campaign targeting.

Treatment effect estimation is a field of research spread across a wide range of industries. From medicine, the field where the name naturally makes sense, to the social sciences and econometrics, treatments can be found and studied in many places. In all scenarios we are essentially interested in estimating the effect some form of treatment has on a user, patient or group. In this post you’ll learn why this is difficult and how it can be done, using a practical example. When you’re done reading, you should have a better understanding of where and why it makes sense to use Causal Inference and how it helps to model a specific kind of problem.

The Campaign Targeting Use-Case

To make the topic more tangible, I want to work along a more or less realistic example: a marketing campaign.

Underlying Data

Imagine we’re running an online shop and have a user database with roughly 40,000 entries. We’ve collected some features for each user. To be honest, these are not the features you’d expect an online shop to collect; we’re just using this data because it is part of the public UCI Machine Learning Repository. Still, for the sake of the example, let’s imagine that we ran a marketing campaign last month targeting some of these users according to a hunch of the marketing department. They figured that people with a higher balance generally spend more and were therefore convinced that it pays to target these customers.

Head of the pd.DataFrame containing cleaned data from the banking dataset.

Formalising Treatments

Before continuing in our quest to trump the marketing team with some simple causal inference, let’s formalise the problem.

We model a user \(i\) with their \(d\) features \(X_i = (x_{i1}, \ldots, x_{id})\). The so-called treatment \(T_i\) for user \(i\) is, in our case, whether or not the user received marketing in the last campaign. Specifically, \(T_i = 1\) if the user is among the 10,000 users with the highest balance and \(T_i = 0\) otherwise. Note that we omit the index \(i\) for brevity when describing the distributions below.

What we are interested in is the outcome \(Y\), the spending of the user in the online shop in the month after the campaign. Following the Potential Outcomes framework of Neyman & Rubin, the treatment effect \(\tau_i\) of user \(i\) is defined as the difference between the potential outcome \(Y_i(1)\), had the user received marketing, and the potential outcome \(Y_i(0)\), had they not received it:

\(\tau(x_i) = \tau_i = Y_i(1) - Y_i(0)\).

The two outcomes \(Y(1)\) and \(Y(0)\) are potential in the sense that only one of them is ever realised—factual—while the other remains unobserved, or as Pearl would say, counterfactual.

We denote further by \(Y_{cf}\) the outcomes that are unobserved and by \(Y_f\) or simply \(Y\) the observed outcomes, which are determined by

\(Y_i = Y_i(1) \cdot T_i + Y_i(0) \cdot (1-T_i)\)
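To make the notation concrete, here is a minimal sketch with four made-up users. The numbers are purely illustrative and not taken from the dataset; the point is only to show how the factual and counterfactual outcomes follow from the two potential outcomes and the treatment assignment.

```python
import numpy as np

# Hypothetical potential outcomes for four users (made-up numbers).
y0 = np.array([100.0, 250.0, 80.0, 400.0])   # Y(0): spending without marketing
y1 = np.array([300.0, 260.0, 380.0, 450.0])  # Y(1): spending with marketing
t = np.array([1, 0, 0, 1])                   # T: who actually received marketing

# Factual (observed) outcome: Y = Y(1) * T + Y(0) * (1 - T)
y_factual = y1 * t + y0 * (1 - t)            # 300, 250, 80, 450

# Counterfactual outcome: the potential outcome that was NOT realised
y_counterfactual = y1 * (1 - t) + y0 * t     # 100, 260, 380, 400

# Individual treatment effect; needs both potential outcomes,
# so it can only be computed on synthetic data.
tau = y1 - y0                                # 200, 10, 300, 50
```

In real data only `y_factual` and `t` would be available, which is exactly why `tau` cannot be computed directly.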

The fact that we only ever observe one of the two outcomes for each unit \(i\) is called the Fundamental Problem of Causal Inference (FPCI) by Paul Holland, and it is this FPCI that forces us to use synthetic data in this example: in order to show you, dear reader, the efficacy of the proposed methods for estimating \(\tau_i\), we need ground truth, which is never available for real data. Thus we go ahead and generate outcomes based on a model we make up.

Modelling the Data

Let’s say we model our user behaviour and outcomes as follows:

\(Y(0) = \frac{(85 - X_{age})^2}{5} + I_{manager} \cdot 150 + MinMax_{(-1000, 10000)}(X_{balance})\)

where \(I_{manager}\) is an indicator function for the job feature, yielding one if and only if the feature is equal to 'management', and \(MinMax\) is a scaling function squashing the value into the range from -1,000 to +10,000. Conceptually, our outcome is the purchase volume in our online shop in the month after the campaign. Thus the treatment effect is the difference in the spending of a customer depending on whether they received marketing or not.

The intuition we want to model behind this simple combination of features is that young people generally spend more, managers spend more than other jobs and that people with a higher account balance spend more. Don’t ask me how we know the account balance of our customers, we just do.

Now we define the true treatment effect as

\(\tau = (85 - X_{age}) \cdot 10 + I_{edu} \cdot 200 + I_{highedu} \cdot 100 - I_{married} \cdot 100 + \mathcal{N}(0, 102)\)

where the intuition is that young people are more likely to respond to marketing. The higher your education, the less likely you are to respond to marketing because you know it’s just a hoax anyway. And if you’re married you have to argue with your significant other about the purchase and thus won’t respond as much. It’s obvious, isn’t it? To round it off, we add some Gaussian noise, because people are different.

The treated outcome is then simply \(Y(1) = Y(0) + \tau\).

Now, let’s return to the hunch of our marketing department. They figured, somewhat correctly, that people with a higher balance spend more, and thus assigned marketing to the 10,000 people with the highest balance, ignoring everything else.

The results of this previous campaign are what we have for our study of treatment effects. The data is called observational because we only observe it post hoc, without control over who was treated. If we had instead assigned treatment randomly across all customers, we would have a randomized controlled trial (RCT), which would enable us to estimate the treatment effects much more precisely (read why this is so in the other article). But for now, we want to work with this biased data, because that is closer to what we see in the wild. After all, running an RCT is expensive because of the opportunity cost.

The modelling of treatment and outcome is implemented in Python. The whole notebook, including plots and data preparation, can be found here or viewed here. The data is here.
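The data generating process described above can be sketched roughly as follows. This is not the notebook's code: the function names and the random stand-in population are assumptions made for illustration, `min_max` reflects one reading of the scaling function (a linear rescaling onto the range from -1,000 to 10,000), and the noise scale `noise_std` is a placeholder parameter rather than a value taken from the formula.

```python
import numpy as np


def min_max(x, low=-1_000.0, high=10_000.0):
    """Rescale x linearly so its minimum maps to `low` and its maximum to `high`."""
    x = np.asarray(x, dtype=float)
    return low + (x - x.min()) / (x.max() - x.min()) * (high - low)


def control_outcome(age, is_manager, balance):
    """Y(0): spending without marketing, following the formula above."""
    return (85 - age) ** 2 / 5 + is_manager * 150 + min_max(balance)


def treatment_effect(age, edu, high_edu, married, rng, noise_std):
    """tau: additional spending caused by marketing, plus Gaussian noise."""
    noise = rng.normal(0.0, noise_std, size=len(age))
    return (85 - age) * 10 + edu * 200 + high_edu * 100 - married * 100 + noise


def assign_by_balance(balance, n_treated):
    """T = 1 for the n_treated users with the highest balance, else 0."""
    t = np.zeros(len(balance), dtype=int)
    t[np.argsort(balance)[-n_treated:]] = 1
    return t


# Random stand-in population (NOT the real banking data; all draws are assumptions).
rng = np.random.default_rng(0)
n = 40_000
age = rng.integers(18, 86, size=n)
is_manager = rng.integers(0, 2, size=n)
edu = rng.integers(0, 2, size=n)
high_edu = rng.integers(0, 2, size=n)
married = rng.integers(0, 2, size=n)
balance = rng.normal(1_500.0, 3_000.0, size=n)

y0 = control_outcome(age, is_manager, balance)
# noise_std=10.0 is a placeholder, not a value confirmed by the article.
tau = treatment_effect(age, edu, high_edu, married, rng, noise_std=10.0)
y1 = y0 + tau                                  # treated outcome Y(1) = Y(0) + tau
t = assign_by_balance(balance, n_treated=10_000)
y = y1 * t + y0 * (1 - t)                      # observed (factual) outcome
```

Only `y`, `t` and the features would be visible to a learner; `y0`, `y1` and `tau` are the ground truth we keep aside for evaluation.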

Targeting the Most Effective Group

For the next campaign our boss has imposed some tighter austerity measures on us and we are only allowed to send marketing to 2000 individuals. Thus, we had better choose them wisely. Let’s compare different approaches.

Note: We assume that the response behaviour of the individual hasn’t changed since the last campaign. That is to say, our model of potential outcomes remains the same.

Target Users with Highest Balance

If we stick to the assumption of the first campaign and target the 2000 people with the highest balance, we only gain a total of 886,450 €. That is to say, the difference between the scenario without marketing and the one with marketing amounts to about 886k € given the data generating process above. This makes sense if we look at how we modelled treatment effects. We didn’t include account balance in the calculation of \(\tau\). So while it is true that people with a high account balance tend to spend more in general (\(Y(0)\)), they are not any more responsive to marketing. Essentially, all the benefit we get from targeting the 2000 people with the highest balance comes down to luck.

Influence of Scaled Balance on the control outcome as well as the treatment effect. We see that the guess of the marketing department is correct, but also that it is useless for our goal of targeting the maximum treatment effect.

Note that we can only calculate this ground truth because we have synthetic data and know the \(\tau_i\) of all instances. The money earned is the sum of the true treatment effects \(\tau_i\) over the targeted users.
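The gain of a targeting rule, that is the summed true treatment effect of the selected users, could be computed along these lines. The helper `total_gain` and the toy numbers are hypothetical and only illustrate the calculation; the article's figures come from the full synthetic dataset.

```python
import numpy as np


def total_gain(true_tau, score, k):
    """Sum the true treatment effects of the k users with the highest score."""
    top_k = np.argsort(score)[-k:]
    return true_tau[top_k].sum()


# Toy example with five users (made-up numbers).
true_tau = np.array([50.0, 400.0, 120.0, 10.0, 300.0])
balance = np.array([9_000.0, 200.0, 5_000.0, 7_500.0, 1_000.0])

# Targeting by balance picks users 0 and 3: 50 + 10 = 60
gain_by_balance = total_gain(true_tau, balance, k=2)

# The best possible choice picks users 1 and 4: 400 + 300 = 700
gain_oracle = total_gain(true_tau, true_tau, k=2)
```

Any targeting rule fits this pattern by swapping the score: an estimated treatment effect for a learned model, or the negated age to pick the youngest users.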

Target Users Based on Causal Learning

Now comes the interesting part. We run a very simple T-Learner on the observational data we’ve collected from our previous marketing campaign and use that learner to estimate the treatment effects of all customers. We then assign treatment to the 2000 customers who have the highest estimated treatment effect. And voilà: 1,597,590 € of total gain. That’s almost double the total effect of our previous target.

In order to estimate the effect, the T-Learner, where the T stands for two learners, employs two linear regressions. One learns the outcome of the treated instances and the other the outcome of the so-called control instances, each given the features. We can write this as estimating an expected value:

\(\mu_0(x) \cong \mathbb{E}[Y \mid X=x, T=0]\)

\(\mu_1(x) \cong \mathbb{E}[Y \mid X=x, T=1]\)

If these estimates are correct, we can approximate the treatment effect on instances with

\(\hat{\tau}(x) = \mu_1(x) - \mu_0(x)\)

It’s really that simple, and yet very powerful in our example. This is largely because both the untreated outcome \(Y(0)\) and the treatment effect \(\tau\) are essentially linear combinations of features, which the T-Learner has no trouble learning from the data.
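The idea of the T-Learner can be sketched in plain NumPy as shown here. This is not the JustCause API: the helper names, the toy data and its coefficients are assumptions chosen so that the example is self-contained, and the toy outcome is deliberately linear so that two least-squares fits are well specified.

```python
import numpy as np


def fit_linear(x, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, x):
    """Evaluate the fitted linear model on the rows of x."""
    return np.column_stack([np.ones(len(x)), x]) @ coef


def t_learner_effects(x, t, y):
    """Fit mu_0 on controls and mu_1 on treated; return mu_1(x) - mu_0(x) for all rows."""
    coef_0 = fit_linear(x[t == 0], y[t == 0])
    coef_1 = fit_linear(x[t == 1], y[t == 1])
    return predict_linear(coef_1, x) - predict_linear(coef_0, x)


# Toy observational data (made-up, linear on purpose).
rng = np.random.default_rng(42)
n = 5_000
age = rng.uniform(18, 85, size=n)
balance = rng.normal(1_500.0, 3_000.0, size=n)
x = np.column_stack([age, balance])

true_tau = (85 - age) * 10                       # effect depends on age only
y0 = 0.1 * balance + 5 * (85 - age) + rng.normal(0.0, 10.0, size=n)
y1 = y0 + true_tau
t = (balance > np.quantile(balance, 0.75)).astype(int)  # biased assignment by balance
y = np.where(t == 1, y1, y0)                     # only the factual outcome is observed

tau_hat = t_learner_effects(x, t, y)
targets = np.argsort(tau_hat)[-500:]             # users with the highest estimated effect
```

Even though treatment was assigned by balance, the selected `targets` end up being the youngest users in this toy setup, because that is where the estimated effect is largest.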

Using our JustCause framework, this takes only a few lines of code.

Target Youngest Users First

Finally, we can compare that to a very informed guess. Namely, if we target the 2000 youngest people, we gain a total of 1,463,361 € more than without the marketing campaign. This is pretty good, and it becomes clear why if we look at the model of \(\tau\), where age plays the major and most distinct role. Still, the T-Learner outperformed our best guess without knowing anything about the data generating process beyond what is in the data.

We see that age has a slight influence on the untreated outcome in general, but has a direct effect on the treatment effect. Thus targeting for age is a good approach.

What Happened Here?

The T-Learner and the informed guess both fare well compared to targeting by balance, as our imaginary marketing department recommended. This is because they both rely on the importance of age to target users, and according to the synthetic data generating process (DGP) we defined above, age is the most important driver of the treatment effect. The difference is that the T-Learner finds this importance only by looking at the data, while the guess must be informed by some background information. In our case, the difficulty of targeting the right group lies in the fact that the effect of treatment (that is, marketing) is not related to the general behaviour of customers.

Check out the notebook if you’re interested in an uplift plot.

Take-Aways

I hope, after reading the article, you are now aware of the notation and idea behind the Potential Outcomes framework and how it relates to the specific use-case of a marketing campaign study. Furthermore, you should be at least somewhat convinced that employing a simple treatment effect estimation technique can be useful.

If you want to dive deeper into the theory and learn about our framework JustCause, check out the other article.