Causal Modeling

Last lecture we started to see some of the shortcomings in Bayesian Networks (BNs) to answer more nuanced queries of a causal nature.

Instead, we need some stronger causal assumptions about the system in order to answer these interesting queries.

As such, today we take our first steps towards the next rung in the causal hierarchy -- look at that little fellow climb!

So, as we ascend the causal ladder, let's start by thinking about the new tools we have, the properties of the tools we need, and how we can start to address the issues raised with Bayesian Networks for Causal Queries that we discussed last class.

Although correlation doesn't equal causation, we do assume that there is no correlation without causation, and that we can model a system in terms of what affects what (assumptions about causal relations between variables) and to what degree (statistical relations).

A causal model is a type of generative model, i.e., a model that asserts how variables of interest came to attain the values they did. Causal models are typically specified in two pieces:

  • Causal Graph: depicts which variables are causes of which other variables, where each effect is a function of its causes: $$effect = f_{effect}(causes)$$

  • Statistical Model: describes the probabilistic relationships between variables, usually by encoding the degree to which causes influence their effects.

Some notes on the above, very general definition:

  • Generative models are largely the focus of those in the empirical and AI sciences because we care about in-depth analyses of the relationships between variables, either to make claims that affect policy or to inform the decisions our agents should make.

  • Generative models differ much in purpose from many traditional machine learning models that are just meant to *work* in some prediction task, like Naive Bayes or Neural Networks.

So, what are some of the endeavors surrounding causal models? Let's take a brief overview!

Causal Endeavors

Causal models appear in two main fields at the intersection of AI, Machine Learning, Statistics, and the Empirical Sciences (even some philosophy as well!):

In Causal Discovery, the objective is to learn the causal graph, generally from some offline dataset, but increasingly from the experience of learning agents as well.

Depicted, causal discovery is concerned with taking some data and being able to assert the cause-effect relationships from it:

The type of data matters heavily for causal discovery (e.g., if we have associational vs. interventional data), which we'll discuss next class.

Causal discovery is a very difficult problem when analyzing offline, observational data, and generally requires much domain knowledge, compromise, and strong assumptions to automate.

The upshot: intelligent agents who can interact with their environments have an easier time of it, which is why this course compares causality with reinforcement learning!

Personally, I believe this union is a great avenue for further research, so as this course progresses, feel free to share your thoughts, questions, and suggested explorations!

Now, either discovered or assumed, what do we do with causal models?

In Causal Inference, it is assumed that the Causal Model is known / correct, and then is employed in answering causal and (if capable) counterfactual queries of interest.

Causal inference will be the main topic of this course, but there's plenty to be said on both -- we'll see some examples of each as we continue... starting with our first causal model...

Causal Bayesian Networks (CBNs)

Examining our definition for the typical components of a causal model, it seems we already know of one that's close: a Bayesian Network!

BNs already define the notion of a structure and a statistical model, but the structure is only for managing conditional independence relationships, so let's strengthen the claims that this structure makes...

A Causal Bayesian Network (CBN) is a Bayesian Network wherein the structure encodes causal dependence, such that for two variables \(X \rightarrow Y\), we say that \(X\) is a "direct cause" of \(Y\), meaning that \(Y\) responds to manipulation of \(X\) but not vice versa. Succinctly: $$Parents(Y) \rightarrow Y \Rightarrow Y = f_Y(Parents(Y))$$

Intuitively, these relationships reflect the example that \(Rain \rightarrow SidewalkWet\) means that if we were to cause it to rain (rain dance? cloud seeding?), then the propensity of the sidewalk to be wet will change, BUT causing the sidewalk to be wet (e.g., hosing it down) does not affect the likelihood of rain.
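
To make the asymmetry concrete, here is a minimal sketch in Python. The CPT numbers and function names are hypothetical (nothing in these notes specifies them); the point is only that forcing Rain changes the chance of a wet sidewalk, forcing the sidewalk wet leaves the chance of rain alone, and merely *observing* a wet sidewalk is still informative about rain.

```python
# Hypothetical CPTs for Rain -> SidewalkWet (illustrative numbers only).
P_RAIN = {0: 0.8, 1: 0.2}              # P(Rain = r)
P_WET1_GIVEN_RAIN = {0: 0.1, 1: 0.9}   # P(Wet = 1 | Rain = r)


def p_wet_if_rain_forced(r):
    """P(Wet = 1) when Rain is forced to r: Wet still listens to its cause."""
    return P_WET1_GIVEN_RAIN[r]


def p_rain_if_wet_forced(w):
    """P(Rain = 1) when Wet is forced to w: hosing the sidewalk bypasses the
    Rain -> Wet mechanism, so Rain keeps its prior whatever w is."""
    return P_RAIN[1]


def p_rain_if_wet_observed(w):
    """P(Rain = 1 | Wet = w) by Bayes' rule: observation flows 'upstream'."""
    def likelihood(r):
        p_wet = P_WET1_GIVEN_RAIN[r]
        return p_wet if w == 1 else 1.0 - p_wet
    joint = {r: P_RAIN[r] * likelihood(r) for r in (0, 1)}
    return joint[1] / (joint[0] + joint[1])
```

Under these made-up numbers, forcing rain moves the chance of a wet sidewalk from 0.1 to 0.9, while forcing the sidewalk wet leaves the chance of rain at its prior of 0.2.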

Note that the only mechanical change between a BN and a CBN is that for a CBN, we are claiming that the structure indicates the correct direction of edges from causes to effects. In other words, we can defend that there are no observationally equivalent models that should be considered apart from the one we're defending.

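A necessary condition for a Causal Bayesian Network is that it exhibits the Global Markov Property, meaning that every independence implied by the CBN structure \(G\) holds in the underlying data: $$G.dsep(X, Y, Z) = True~\Rightarrow~X \indep Y~|~Z$$ The converse direction, \(X \indep Y~|~Z~\Rightarrow~G.dsep(X, Y, Z) = True\), is what it means for \(G\) to be faithful to the data.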
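
For readers who want to experiment with the \(G.dsep(X, Y, Z)\) test, here is one possible sketch in pure Python. It uses the standard "moralized ancestral graph" criterion rather than any particular library, and the function names are hypothetical. The example graph is the structure of this lecture's Vaping-Heart Disease model (\(S \rightarrow V\), \(S \rightarrow E\), \(V \rightarrow H\), \(E \rightarrow H\)).

```python
from itertools import combinations

# Parent sets for the VHD structure: S -> V, S -> E, V -> H, E -> H.
PARENTS = {"S": set(), "V": {"S"}, "E": {"S"}, "H": {"V", "E"}}


def ancestors_of(nodes, parents):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(parents[n])
    return seen


def dsep(x, y, z, parents=PARENTS):
    """True iff x and y are d-separated given the set z (x, y not in z)."""
    keep = ancestors_of({x, y} | set(z), parents)
    # Moralize the ancestral subgraph: undirected parent-child links,
    # plus a link between every pair of parents that share a child.
    nbrs = {n: set() for n in keep}
    for child in keep:
        for p in parents[child]:
            nbrs[child].add(p)
            nbrs[p].add(child)
        for a, b in combinations(parents[child], 2):
            nbrs[a].add(b)
            nbrs[b].add(a)
    # d-separated iff z blocks every undirected path from x to y.
    seen, stack = set(z), [x]
    while stack:
        n = stack.pop()
        if n == y:
            return False
        if n not in seen:
            seen.add(n)
            stack.extend(nbrs[n] - seen)
    return True
```

For instance, `dsep("V", "E", {"S"})` holds because \(S\) is the only common cause of \(V\) and \(E\), while `dsep("V", "E", {"S", "H"})` fails because conditioning on the common effect \(H\) opens a path between them.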

Crucial to this endeavor, there might be other variables that influence whether or not the sidewalk is wet, with or without rain, and so the capacity of our CBN to answer causal queries depends heavily on what variables in the system we've captured or haven't.

Latent Variables in CBNs

Latent / Exogenous / Omitted Variables are those that affect the observed variables of the system but whose states are not recorded in our dataset / model. They are the reason why, despite worrying about causality, we still need the statistical expressiveness of a CBN's CPTs.

In the Causal Philosophy of Science, when we put a box (the model) around the part of the universe we want to measure and make inferences about, the model should still be able, in some way, to account for the universe that is "outside the box."

As such, latent variables can either be safely ignored or be nuisances, depending on where they exist in the model.

In Markovian Models, we assert that no latent variable affects more than one observed variable in the system.

As an example in our Vaping Heart Disease (VHD) model, if it is indeed Markovian, we would assume that there exist some unmodeled latent causes that account for the variability in an effect's response to its cause.

Why do you think these types of latent variables / Markovian models are desirable for causal inference?

Markovian Models are the most ideal for Causal Inference because the CPT probabilities encode latent influences that *only* affect single variables, and so the latent variables introduce no spurious correlations (i.e., correlations that carry no causal information) between observed variables.

That said, we'd be naive to consider only Markovian Models that represent the "blue sky" conditions for causal inference.

Rather, we should have models that are honest enough to determine which causal questions they can answer, and which they cannot.

A Semi-Markovian Model is one in which one or more latent variables are common causes of two or more observed variables in the system; such latent variables are known as unobserved confounders (UCs).

Graphically, we represent unobserved confounders using dashed lines, sometimes summarizing those that we don't know about through a confounding arc indicating non-causal dependence between variables.
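
A small numeric sketch may help show why unobserved confounders are a nuisance. The system below is entirely hypothetical (the variable names \(U\), \(A\), \(B\) and all numbers are assumptions for illustration): a latent \(U\) is a common cause of \(A\) and \(B\), and there is no edge between \(A\) and \(B\) at all, yet observing \(A\) still shifts our belief about \(B\).

```python
# Hypothetical semi-Markovian system: latent U -> A and U -> B, no A -> B edge.
P_U = {0: 0.5, 1: 0.5}             # P(U = u), never recorded in the dataset
P_A1_GIVEN_U = {0: 0.1, 1: 0.9}    # P(A = 1 | U = u)
P_B1_GIVEN_U = {0: 0.1, 1: 0.9}    # P(B = 1 | U = u)


def p_b1():
    """Marginal P(B = 1), summing out the latent U."""
    return sum(P_U[u] * P_B1_GIVEN_U[u] for u in P_U)


def p_b1_given_a1():
    """Observational P(B = 1 | A = 1), summing out the latent U."""
    joint_ab = sum(P_U[u] * P_A1_GIVEN_U[u] * P_B1_GIVEN_U[u] for u in P_U)
    marginal_a = sum(P_U[u] * P_A1_GIVEN_U[u] for u in P_U)
    return joint_ab / marginal_a
```

With these made-up numbers, \(P(B = 1) = 0.5\) but \(P(B = 1 | A = 1) = 0.82\): a strong dependence that carries no causal information about \(A\) and \(B\).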

In our VHD example, suppose we know / suspect that there are other latent variables that mutually affect propensities for Stress and Exercise. The model:

Some notes on the above:

  • Not all causal queries are contaminated by confounding! Part of the arsenal of causal inference is a set of tricks for answering important questions *despite* the presence or suspected presence of UCs.

  • We'll need new tools / mechanics to discuss just what causal queries are estimable from the model and data we have, and the ability to both detect and account for unobserved confounding will be a prevalent future topic.

  • Defense of a Markovian vs. Semi-Markovian Model stems largely from the necessary condition of satisfying the Global Markov Property AND background knowledge about the system.

Warning: If you're thinking: "Hey, correlation doesn't always equal causation, how can we just slap 'Causal' on a Bayesian Network?" you're not alone! A future lecture will address issues with Causal Bayesian Networks, how to defend them, and their shortcomings that demand stronger tools.

In brief: any time we conclude that our BN is causal (a CBN), we must be able to do so on sound, scientific, and defendable grounds.

Some relationships are easier to defend than others, and we'll see later how we can go about doing so.


We'll skip past the hard part of defending the causal nature of a Bayesian Network for now, and first see why it might be desirable to have a CBN to begin with!

Returning to our Vaping-Heart Disease (VHD) example, for now, let's suppose we're happy with our original BN and are able to (for now, hand-wavingly) defend it as a Markovian CBN. Let's attempt to answer the causal question: "Does vaping cause heart disease?"

Even supposing the structure of our CBN is correct, we still have a problem with the Bayesian conditioning operation for examining the difference in conditional probabilities.

To remind us of this problem:

[Reflect] How can we isolate the effect of \(V \rightarrow H\) using just the probabilistic parameters in our CBN?

I dunno... how about we just chuck the "backdoor" path?

More or less, yeah! That's what we're going to do through what's called an... intervention.


I know what you're thinking... and for any Always Sunny fans out there, I have to reference with:

In our context, however...

An intervention is defined as the act of forcing a variable to attain some value, as though through external force; it represents a hypothetical and modular surgery to the system of causes-and-effects.

As such, here's another way of thinking about causal queries:

Intuition 1: Causal queries can be thought of as "What if?" questions, wherein we measure the effect of forcing some variable to attain some value apart from the normal / natural causes that would otherwise decide it.

With this intuition, we can rephrase our question of "Does vaping cause heart disease" as "What if we forced someone to vape? Would that change our belief about their likelihood of having heart disease?"

Obviously, forcing someone to vape is unethical, but it guides how we will approach causal queries, as motivated by the second piece of intuition:

Intuition 2: Because we're asking causal queries in a CBN, wherein edges encode cause-effect relationships, the idea of "forcing" a variable to attain some value modularly affects only the variable being forced, and should not affect any other cause-effect relationships.

So, let's consider how we can start to formalize the notion of these so-called interventions.

If an intervention forces a variable to attain some value *apart from its normal causal influences,* what effect would an intervention have on the network's structure?

Since any normal / natural causal influences have no effect on an intervened variable, and we do not wish for information to flow from the intervened variable to any of its causes, we can sever any inbound edges to the intervened-upon variable!

Let's formalize these intuitions... or dare I say... in-do-itions (that'll be funny in a second).

The \(do\) Operator

Because observing evidence is different from intervening on some variable, we use the notation \(do(X = x)\) to indicate that a variable \(X\) has been forced to some value \(X = x\) apart from its normal causes.

Heh... do.

To add to that vocabulary, for artificial agents, an observed decision is sometimes called an "action" and a hypothesized intervention an "act".

"HEY! That's the name of your lab! The ACT Lab!" you might remark... and now you're in on the joke.

Structurally, the effect of an intervention \(do(X = x)\) creates a "mutilated" subgraph denoted \(G_{X=x}\) (abbreviated \(G_x\)), which consists of the structure in the original network \(G\) with all inbound edges to \(X\) removed.
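
As a rough sketch, graph mutilation is a one-liner if the network is stored as a map from each node to its set of parents. The representation and the name `mutilate` are assumptions for illustration; the example graph is the VHD structure (\(S \rightarrow V\), \(S \rightarrow E\), \(V \rightarrow H\), \(E \rightarrow H\)).

```python
def mutilate(parents, intervened):
    """Return a copy of the graph with every edge INTO an intervened node removed.

    `parents` maps each node to the set of its parents; `intervened` is the
    set of nodes being forced. Outbound edges are left untouched.
    """
    return {node: (set() if node in intervened else set(ps))
            for node, ps in parents.items()}


# VHD structure: S -> V, S -> E, V -> H, E -> H.
G = {"S": set(), "V": {"S"}, "E": {"S"}, "H": {"V", "E"}}

# do(V = v): V loses its inbound edge from S, everything else is unchanged.
G_V = mutilate(G, {"V"})
```

Note that the value \(v\) plays no role in the structural surgery; it only matters later, when the remaining CPTs are evaluated.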

So, returning to our motivating example, an intervention on the Vaping variable would look like the following, structurally:

[Reflect] How do you think this relates to how humans hypothesize, or ask questions of "What if?" Do you ignore causes of a hypothetical and focus only on its outcomes?

This will be an important question later, since intelligent agents may need to predict the outcomes of their actions.

OK, so we've got this mutilated (and yes, that's the formal, gory adjective) subgraph, which seems to represent the intervention's structural effect, but how does an intervention affect the semantics?

To answer that question, we can use Intuition #2 from above, considering that the intervention is modular, and should not tamper with any of the other causal relationships.

Because of this modularity, performing inference in the intervened graph \({G_{V}}\) equates to determining which parameters (i.e., CPTs) from the un-intervened system represented in \(G\) are the same.

This is because, for some intervention \(do(X = x)\) on a variable \(X\) in a Markovian Model:

  • The only causal relationship modified by the intervention \(do(X=x)\) is that of \(X\) itself, which becomes \(f_X = x\).

  • The causal relationships of all variables that are not descendants of \(X\) will remain unaffected.

  • The causal relationships of all descendants of \(X\) already encode their behavior for different values of \(X\).

We'll see why this nice property holds for Markovian models next time, when we remove that restriction and see a more general class of Structural Causal Models (SCMs).

That said, the modularity of a Markovian SCM establishes the following equivalences, if we denote the CPTs in the intervened graph as \(P_{G_{V}}(...)\):

\begin{eqnarray} P(S) &=& P_{G_{V}}(S) \\ P(E | S) &=& P_{G_{V}}(E | S) \\ P(H | do(V = v), E) &=& P_{G_{V}}(H | V = v, E) = P(H | V = v, E) \end{eqnarray}

The only asymmetry is that by forcing \(do(V = v)\), we replace the CPT for \(V\) and instead have \(P_{G_{V}}(V = v) = 1\).

Consider the Markovian Factorization of the original network (repeated below); how would this be affected by an intervention \(do(V = v)\)? $$P(S, V, E, H) = P(S)P(V|S)P(E|S)P(H|V,E) = \text{MF in Original Graph}$$ $$P(S, E, H | do(V = v)) = P_{G_{V}}(S, E, H) = \text{???} = \text{MF in Interventional Subgraph}$$

Since we are forcing \(do(V = v)\) (i.e., \(V = v\) with certainty apart from its usual causes), this is equivalent to removing the CPT for \(V\) since \(P(V = v | do(V = v)) = 1\). As such, we have: $$P(S, E, H | do(V = v)) = P_{G_{V}}(S, E, H) = P(S)P(E|S)P(H|V=v,E)$$

This motivating example is but a special case of the more general rule for the semantic effect of interventions:

In a Markovian CBN, this semantic effect of an intervention leads to what is known as the Truncated Product Formula / Manipulation Rule, which is the Markovian Factorization with all CPTs except the intervened-upon variable's (and with \(X = x\) fixed wherever \(X\) appears as a parent), or formally: $$P(V_0, V_1, ... | do(X = x)) = P_{X=x}(V_0, V_1, ...) = \prod_{V_i \in V \setminus \{X\}} P(V_i | PA(V_i))$$


You knew this was coming, but dreaded another computation from the review -- I empathize, this is why we have computers.

That said, it's nice to see the mechanics of causal inference in action, so let's do so now.

Using the heart-disease BN as a CBN, compute the likelihood of acquiring heart disease *if* an individual started vaping, i.e.: $$P(H = 1 | do(V = 1))$$

As it turns out, the steps for enumeration inference are the same, with the only differences being in the Markovian Factorization.

Step 1: \(Q = \{H\}, e = \{do(V = 1)\}, Y = \{E, S\}\). Want to find: $$P(H = 1 | do(V = 1)) = P_{G_{V=1}}(H = 1)$$

Note: if there were additional *observed* evidence to account for, we could perform the usual Bayesian inference on the mutilated subgraph (but there isn't, so...).

Step 2: Find \(P_{G_{V=1}}(H = 1)\): \begin{eqnarray} P_{G_{V=1}}(H = 1) &=& \sum_{e, s} P_{G_{V=1}}(H = 1, E = e, S = s) \\ &=& \sum_{e, s} P(S = s) P(E = e | S = s) P(H = 1 | V = 1, E = e) \\ &=& P(S = 0) P(E = 0 | S = 0) P(H = 1 | V = 1, E = 0) \\ &+& P(S = 0) P(E = 1 | S = 0) P(H = 1 | V = 1, E = 1) \\ &+& P(S = 1) P(E = 0 | S = 1) P(H = 1 | V = 1, E = 0) \\ &+& P(S = 1) P(E = 1 | S = 1) P(H = 1 | V = 1, E = 1) \\ &=& 0.3*0.6*0.8 + 0.3*0.4*0.6 + 0.7*0.2*0.8 + 0.7*0.8*0.6 \\ &=& 0.664 \end{eqnarray}

Step 3: If there were any observed evidence, we would have to find \(P(e)\) here and then normalize in the next step... but there isn't so...

Step 4: Normalize: but no observed evidence so... nothing to do here either -- we were done in Step 2!

Final answer: \(P(H = 1 | do(V = 1)) = P_{G_{V=1}}(H = 1) = 0.664\)
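
Since "this is why we have computers," the enumeration above can be sketched in a few lines of Python. The CPT entries are read directly off Step 2; the dictionary and function names are hypothetical. Note that the CPT for \(V\) never appears, which is exactly the truncation.

```python
from itertools import product

# CPT entries as they appear in the Step 2 enumeration.
P_S = {0: 0.3, 1: 0.7}                       # P(S = s)
P_E_GIVEN_S = {(0, 0): 0.6, (1, 0): 0.4,     # P(E = e | S = s), key is (e, s)
               (0, 1): 0.2, (1, 1): 0.8}
P_H1_GIVEN_V1_E = {0: 0.8, 1: 0.6}           # P(H = 1 | V = 1, E = e)


def p_h1_do_v1():
    """P(H = 1 | do(V = 1)) via the truncated product, summing out S and E."""
    return sum(P_S[s] * P_E_GIVEN_S[(e, s)] * P_H1_GIVEN_V1_E[e]
               for s, e in product((0, 1), repeat=2))
```

Calling `p_h1_do_v1()` reproduces the hand-computed value of 0.664 (up to floating-point rounding).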

Let's compare the associational and interventional quantities we've examined today:

  • Associational / Observational: "What are the chances of getting heart disease if we observe someone vaping?" $$P(H = 1 | V = 1) = 0.650$$

  • Causal / Interventional: "What are the chances of getting heart disease *were someone* to vape?" $$P(H = 1 | do(V = 1)) = 0.664$$

Hmm, well, doesn't seem like a huge difference, does it -- a mere ~1.5 percentage points is all we have to show for our pains?

Although it may not seem like a lot, keep the following in mind:

  1. This is a small network, and therefore there is little chance for things to go completely haywire. In larger systems with different parameterizations, these query outcomes can be much more dramatically different.

  2. Note also the difference in information passed through spurious paths in the network based on the associational vs. causal queries (depicted below).

Note the CPTs for each node updated for observing \(V = 1\) in the original graph \(G\).

Sorry for the quality -- if you want to play around, this is a small Java app called SamIAm from UCLA -- it's a bit dated but gets the job done.

Now, note the CPTs for each node updated (or lack thereof) for intervening on \(do(V = 1)\) in the mutilated subgraph \(G_{V=1}\).

Conceptual Miscellany

How about a few brain ticklers to test our theoretical understanding of interventions?

Q1: Using the heart disease example above, and without performing any inference, would the following equivalence hold? Why or why not? $$P(H = 1 | do(V = 1), S = 1) \stackrel{?}{=} P(H = 1 | V = 1, S = 1)$$

Yes! This is because, by conditioning on \(S = 1\), we would block the spurious path highlighted above by the rules of d-separation. As such, these expressions would be equivalent. This relationship is significant and will be explored more a bit later.
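
A short derivation, using only CPT entries quoted in the enumeration example earlier, suggests why the two sides agree. On the interventional side, the truncated product in \(G_{V=1}\) followed by ordinary conditioning on \(S = 1\) gives: \begin{eqnarray} P(H = 1 | do(V = 1), S = 1) &=& \sum_{e} P(E = e | S = 1) P(H = 1 | V = 1, E = e) \\ &=& 0.2*0.8 + 0.8*0.6 \\ &=& 0.64 \end{eqnarray}

On the associational side, we use two independences read off the original graph: \(E\) is independent of \(V\) given its only parent \(S\), and \(H\) is independent of \(S\) given its parents \(V, E\): \begin{eqnarray} P(H = 1 | V = 1, S = 1) &=& \sum_{e} P(E = e | V = 1, S = 1) P(H = 1 | V = 1, S = 1, E = e) \\ &=& \sum_{e} P(E = e | S = 1) P(H = 1 | V = 1, E = e) \\ &=& 0.64 \end{eqnarray}

The CPT for \(V\) never enters either expression, which is why the two queries coincide here.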

Q2: would it make sense to have a \(do\)-expression as a query? i.e., on the left hand side of the conditioning bar of a probabilistic expression?

No! The do-operator encodes our "What if" and is treated as a special kind of evidence wherein we force the intervened variable to some value; as this is assumed to be done with certainty, there would be no ambiguity for it as a query, and therefore this would be vacuous to write.

Q3: could we hypothesize multiple interventions on some model? What would that look like if so?

Sure! Simply remove all inbound edges to each of the intervened variables, and then apply the truncated product rule to remove each of the intervened variables' CPTs from the product!
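
As a sketch of what this looks like in the VHD model (the choice of intervening on both \(V\) and \(E\) is purely illustrative), dropping the CPTs of both intervened variables from the Markovian Factorization leaves: $$P(S, H | do(V = v), do(E = e)) = P(S)P(H|V=v,E=e)$$

Structurally, both \(V\) and \(E\) lose their inbound edges from \(S\), so in this doubly mutilated graph \(S\) is no longer connected to \(H\) at all.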

And there you have it! Our first steps into Tier 2 of the Causal Hierarchy.

But wait! We didn't *really* answer our original query: "What is the *causal effect* of vaping on heart disease?"

Average Causal Effects

To answer the question of what effect Vaping has on Heart Disease, we can consider a mock experiment wherein we forced 1/2 the population to vape and the other 1/2 to abstain from vaping, and then examined the difference in heart disease incidence between these two groups!

Sounds ethical to me! </s> </dangling-tags>

We'll talk more about sources of data and the plausible estimation of certain queries of interest in the next class.

That said, this causal query has a particular format...


The Average Causal Effect of some intervention \(do(X = x)\) on some set of query variables \(Y\) is the difference in interventional queries: $$ACE = P(Y|do(X = x)) - P(Y|do(X = x')), \quad x,x' \in X$$

So, to compute the ACE of Vaping on Heart disease (assuming that the CBN we had modeled above is correct), we can compute the likelihood of attaining heart disease from forcing the population to Vape, vs. their likelihood if we force them to abstain: $$P(H=1|do(V=1)) - P(H=1|do(V=0)) = 0.664 - 0.164 = 0.500$$

Wow! Turns out the ACE of vaping is \(+0.50\), i.e., a 50 percentage point increase in the chance of attaining heart disease!

Risk Difference

It turns out our original, mistaken attempt to measure the ACE using the Bayesian conditioning operation was not causal, but still pertains to a metric of interest in epidemiology:

The Risk Difference (RD) compares the risk for some set of query variables \(Y\) among those already known / observed to meet some evidenced criteria \(X = x\) against the risk among those observed with \(X = x'\), and is the difference of associational queries: $$RD = P(Y|X = x) - P(Y|X = x'), \quad x,x' \in X$$

To compute the RD of Vaping on Heart Disease, we compare the risk among those who are known to Vape vs. those who are known not to: $$P(H=1|V=1) - P(H=1|V=0) = 0.649 - 0.182 = 0.467$$

Again, although the difference between the ACE and RD is only ~3 percentage points, this distinction can be compounded in larger models, and the RD answers a very different, associational question than does its causal counterpart, the ACE.
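
To keep the two metrics straight, here is a small sketch that computes both from the query values quoted in this lecture. The function names are hypothetical; the arithmetic is just the two definitions side by side.

```python
def ace(p_y_do_x, p_y_do_x_prime):
    """Average Causal Effect: difference of two INTERVENTIONAL queries."""
    return p_y_do_x - p_y_do_x_prime


def risk_difference(p_y_given_x, p_y_given_x_prime):
    """Risk Difference: difference of two ASSOCIATIONAL queries."""
    return p_y_given_x - p_y_given_x_prime


# Query values quoted in this lecture for the VHD model.
vhd_ace = ace(0.664, 0.164)               # P(H=1|do(V=1)) - P(H=1|do(V=0))
vhd_rd = risk_difference(0.649, 0.182)    # P(H=1|V=1) - P(H=1|V=0)
gap = vhd_ace - vhd_rd                    # how far the RD is from the ACE
```

The two functions are arithmetically identical; what separates the ACE from the RD is entirely which kind of query (interventional or associational) is fed into them.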

All of today's (lengthy!) lesson seems to hinge on our models being correct, and as we'll see, on the data supporting them. Next time, we'll think about how we might still be able to answer queries of interest even if there are some quirks in either. Stay tuned!
