Causal Inference in Python: Applying Causal Inference in the Tech Industry
How many buyers will an additional dollar of online marketing bring in? Which customers will only buy when given a discount coupon? How do you establish an optimal pricing strategy? The best way to determine how the levers at our disposal affect the business metrics we want to drive is through causal inference. In this book, author Matheus Facure, senior data scientist at Nubank, explains the largely untapped potential of causal inference for estimating impacts and effects. Managers, data scientists, and business analysts will learn classical causal inference methods like randomized controlled trials (A/B tests), linear regression, propensity score, synthetic controls, and difference-in-differences. Each method is accompanied by an application in industry to serve as a grounding example. With this book, you…

406 pages, Paperback

Published August 22, 2023


About the author

Matheus Facure

2 books · 6 followers

Ratings & Reviews


Community Reviews

5 stars: 24 (72%)
4 stars: 7 (21%)
3 stars: 2 (6%)
2 stars: 0 (0%)
1 star: 0 (0%)
Emre Sevinç
175 reviews · 430 followers
September 5, 2023
Causality is of course not a new concept, though it feels like the technical and practical interest in it has been going through a renaissance recently. You know what I mean if you've already read The Book of Why: The New Science of Cause and Effect and come across the discussions around it. Even if you haven't read that book, you've surely heard the mantra "correlation is not causation" and wondered "then what is causation, and how does one tease it out of a data set, for Pearl's sake!?" Or how to gather and measure that data to begin with?

If you're in that mood, with a practical bent, anxious about how to establish causality in your daily job, and that job is related to finance, banking, running marketing campaigns for all sorts of companies, or setting up experiments to find out if A causes B, then this book hits a sweet spot! The author starts the book with a few super practical, industrial examples to motivate why causality matters and why simply predicting things based on good old data science techniques is not enough; if those examples don't draw your attention to the importance of causality, I don't know what will.

Being an O'Reilly book, "Causal Inference in Python: Applying Causal Inference in the Tech Industry" is geared toward people who need to get things done without writing a Ph.D. thesis, so the author immediately focuses on the basics and on how you can practically apply them to concrete data sets, using Python and popular Python-based causal data science libraries. I really like that aspect: even though the author expects some basic grounding in data science, statistics, probability, etc., he explains most of the topics succinctly, afterwards demonstrating everything in Python code, applied to realistic data sets.

The book is very practical, but make no mistake, some of the chapters will definitely require critical reading, intensive hacking, and studying: I, for one, will need to re-read parts related to "Noise Reduction Techniques", "Causal Contextual Bandits", "Metalearners", "Difference-in-Differences", "Panel Data", and of course "Causal Discovery" (trying to 'learn' the causal graph from data, etc.).

Another strong point of the book is the references to critical and up-to-date papers in each chapter. The author obviously didn't want to turn the book into a 1,000-page academic tome, so he opted to provide the reader with those references, and there's definitely a lot of high-quality food for thought in those academic papers.

My only quibble about the book is the lack of Python code for the plots, but I don't consider it a deal breaker.

I would definitely recommend this book to data scientists who want to learn about the exciting field of causality.

Following this book, I also plan to read the following books:

- Causal Inference and Discovery in Python: Unlock the secrets of modern causal machine learning with DoWhy, EconML, PyTorch and more
- Applied Causal Inference
- Causal Inference: What If
- Causal Analysis: Impact Evaluation and Causal Machine Learning with Applications in R
- Elements of Causal Inference: Foundations and Learning Algorithms
- Machine Learning for Causal Inference

And I want to watch the following:

- Causal Data Science Meeting 2022 Keynote by Judea Pearl
- Why we need causal thinking for all data applications: Richard McElreath (Max Planck Institute)
- Causal Data Science Meeting 2021 - Keynote 1: "Causality-inspired Machine Learning – What Can Causality Do For ML?" by Sara Magliacane

Daeus
387 reviews · 3 followers
February 21, 2025
Such an excellent, niche book, going in depth with industry examples for data scientists to apply causal inference (with clear visuals, code, intuition, etc.). I skimmed through some of the more complex sections (cough cough, effect heterogeneity), but overall great content and well-explained concepts. Highly recommended to anyone who does data analysis, especially data science in tech.

Quotes/notes
Causal Inference overview
- "Causal inference is the science of inferring causation from association and understanding when and why they differ." ... "bias is what makes association different from causation." ... "The bias is given by how the treated and control groups differ regardless of treatment."
- "If you take a deeper look at the types of questions you want to answer with causal inference, you will see they are mostly of the 'what if' type. I'm sorry to be the one that says it, but machine learning (ML) is just awful at those types of questions." ... "You can do all sorts of beautiful things with machine learning. The only requirement is to frame your problems as predictive ones."
- "The fundamental problem of causal inference is that you can never observe the same unit with and without treatment." ... "Once you've learned the concept of potential outcome, you can restate the fundamental problem of causal inference: you can never know the individual treatment effect because you only observe one of the potential outcomes." Why we look at average treatment effect (ATE).
- "One way to see causal inference is as a missing data problem. To inference the causal quantities of interest, you must impute the missing potential outcomes."
- "Once you make treated and control groups interchangeable, expressing the causal effect in terms of observable quantities in the data becomes trivial."
- "Whenever you do causal inference without RCTs, you should always ask yourself what would be the perfect experiment to answer your question. Even if that ideal experiment is not feasible, it serves as a valuable benchmark. It often sheds some light on how you can discover the causal effect even without such an experiment."
Confidence intervals
- "you can never be sure that the mean of your experiment matches the true platonic and ideal mean. However, with the standard error, you can create an interval that will contain the true mean in 95% of the experiments you run. In real life, you don't have the luxury of simulating the same experiment with multiple datasets. You often only have one. But you can draw from the idea if simulating multiple experiments to construct a confidence interval."
- "you now know that the conversion rate would follow a normal distribution [CLT], if you could run multiple similar experiments. The best estimate you have for the mean of that (unknown) distribution is the mean from your small experiment. Moreover, the standard error serves as your estimate of the standard deviation of that unknown distribution for the sample mean. So, if you multiply the standard error by 2 [1.96 for a normal distribution] and add and subtract it from the mean of your experiments you will construct a 95% confidence interval for the true mean."
- "overlapping confidence intervals is not enough to say that the difference between the groups is not statistically significant, however, if they didn't overlap, that would mean they are statistically different. In other words, nonoverlappimg confidence intervals is conservative evidence for statistical significance."
Hypothesis testing
- "Technically speaking using the normal distribution here is not accurate. Instead you should be using the T distribution with degrees of freedom equal to the sample size minus the number of parameters you've estimated (2, since you are comparing two means). However, with sample sizes above 100, the distinction between the two is of little practical importance."
- "The p-value is not P(Ho|data), but rather P(data|Ho)."
- "The probability that a test correctly rejects the null hypothesis is called the power of the test. It is not only a useful concept if you want to figure out the sample size you need for an experiment, but also for detecting issues on poorly run experiments."
- Note: power analysis. Assuming 80% power and 95% significance, and that the standard error is the same for treatment and control (estimated beforehand from the control group), you can estimate, before running the experiment, the sample size needed to detect a given treatment effect.
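A sketch of that calculation under the normal approximation (the function name and numbers are mine, not the book's):

```python
from scipy import stats

def sample_size_per_arm(delta, sigma, power=0.80, alpha=0.05):
    # n per arm = (z_{1-alpha/2} + z_{power})^2 * 2 * sigma^2 / delta^2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96 for a 95% test
    z_power = stats.norm.ppf(power)          # 0.84 for 80% power
    return int(((z_alpha + z_power) ** 2 * 2 * sigma ** 2) / delta ** 2) + 1

# E.g., detecting a 1 p.p. lift when the control conversion std is ~0.27.
print(sample_size_per_arm(delta=0.01, sigma=0.27))
```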
Graphical causal models
- "Some scientists use the term structural causal model (SCM) to refer to a unifying language of causal inference. These models are composed of graphs and causal equations."
- "The acronym [DAG] stands for directed acyclic graph. The directed part tells you that the edges have a direction, as opposed to undirected graphs, like a social network, for example. The acyclic part tells you that the graph has no loops or cycles. Causal graphs are usually directed and acyclic because causality is nonreversible." .... "This language of graphical models will help you clarify your thinking about causality, as it makes your beliefs about how the world works explicit."
- "explaining away, because one cause already explains the effect, making the other cause less likely."
- "to adjust for confounding bias, you need to adjust for the common cause of the treatment and the outcome." When that's not possible, you can use "surrogate confounders." ... "In many important... research questions, confounders are a major issue, since you can never be sure that you've controlled for all of them.... but... in the industry, you are mostly interested in learning the causal effects of things that your company can control-like prices, customer service, and marketing budget - so that you can optimize them."
- "When you can't measure all common causes.... it is often much more fruitful to shift the discussion from 'Am I measuring all my confounders?' to 'How strong should the unmeasured co-founders be to change my analysis significantly?' That is the main idea behind sensitivity analysis." ... "even when the causal quantity you care about can't be point identified, you can still use observable data to place bounds around it. The process is called partial identification."
Adjusting for Bias
- Causal inference vs. ML terminology: features vs. covariates/independent variables; weights vs. parameters or coefficients; target vs. outcome or dependent variable.
- "You can think about linear regression in this context as a dimensionality reduction algorithm."
- "If you can predict T [treatment] using other variables, it means it's not random [eg if you can predict credit limits from income, etc (confounder variable), when looking at if credit limits affect default rates]. However you can make T look as good as random once you control for all confounder variables X. To do so, you can use linear regression to predict it from the confounder and then take the residuals of that regression T~ [ie you predict treatment and take differences between prediction and actual for residuals]. By definition, T~ cannot be predicted by the other X variables that you've already used to predict T. Quite elegantly, T~ is a version of the treatment that is not associated (uncorrelated) with any other variable in X."
This took some time to digest, but if we have credit limit as a treatment T, and income as a confounder X, you can use linreg to predict T and take the residuals/difference which rules out that confounder.
- "FWL [Frisch-Waugh-Lovell Theorem]-style orthogonalization is the first major debiasing technique you have at your disposal. Its a simple yet powerful way to make nonexperimental data look as if the treatment had been randomized.... three steps...."1. A debiasing step, where you regression the treatment on confounders X and obtained the treatment residuals T~ = T - T^. 2. A denoising step, where you regress the outcome variable Y on the confounder variables X and obtain the outcome residuals Y~ = Y - Y^. 3. An outcome model where you regress the outcome residual Y~ on the treatment residual T~ to obtain an estimate for the causal effect of T on Y."
- "As to how the FWL theorem works with nonlinear data, it is exactly like before, but now you have to apply the nonlinearity [e.g. Square root] first."
- "it would be great if the credit limit was randomized, as that would make it pretty straightforward to estimate its effect on default rate and customer spend. The thing is that this experiment would be incredibly expensive. You would be giving random credit lines to very risky customers, who would probably default and cause a huge loss." ... "The way around this conundrum is not the ideal randomized controlled trial, but it is the next best thing: stratified or conditional random experiments.... you instead create multiple local experiments, where you draw samples from different distributions, depending on customer covariates.... this sort of experiment [credit score buckets split into groups] explores credit lines that are not too far from what is probably the optimal line, which lowers the cost of the test to a more manageable amount."
- "running a regression with a binary treatment is exactly the same as comparing the average between treated and control group."
- "If you have a ton of groups, adding one dummy for each is not only tedious, but also computationally expensive. You would be creating lots and lots of columns that are mostly zero. You can solve this easily by applying FWL and understanding how regressions orthogonalizes the treatment when it comes to dummies. ... the debiasing step in FWL involves predicting the treatment from the covariates, in this case the dummies.... to get residuals [for the treatment], subtract that group average from the treatment. Since this approach subtracts the average treatment, it is sometimes a referred to as de-meaning the treatment... the parameter estimate you got here is exactly the same as the one you got when adding dummies to your model.... the idea goes by the name fixed effects, since you are controlling for anything that is fixed within a group. It comes from the literature of causal inference with temporal structures (panel data)."
- "Despite the fancy name, MMMs [marketing mix modeling] are just regressions of sales on marketing strategy indicators and some confounders. For example, let's say you want to know the effect of your budget on TV, social media, and search advertisements on your products sales. You can run regression model where each unit i is a day.... salesi = Do + B1TVi + B2Sociali + B3Searchi + D1comletitorSalesi + D2Monthi + D3Trend + Ei.... To account for the fact that you might have increased your marketing budget on a good month, you can adjust for this confounder by including additional controls in your regression. For example, you could include your competitors sales, a dummy for each month, and a trend variable."
- "But what kind of variables should you include in X? Again, its not because adding variables adjusts for them that you want to include everything in your regression model.... you don't want to include common effects (colliders) or mediators, as those would induce selection bias.... there is a bias-variance tradeoff when it comes to including certain variables in your regression."
- "[Linear regression] can be used not only to adjust for confounders, but also to reduce noise. For example, of you have data from a properly randomized A/B test, you don't need to worry about bias. But you can still use a regression as a noise reduction tool. Just include variables that are highly predictive of the outcome (and that don't induce selection bias)..... the most famous one is CUPED... [which] is very similar to just doing the denoising part of the FWL theorem."
- "On one hand, if you want to get rid of all the biases, you must include all the covariates after all, they are confounders that need to be adjusted. On the other hand, adjusting for causes of the treatment will increase the variance of your estimator. .... If you know one of the confounders is a strong predictor of the treatment and a weak predictor of the outcome, you can choose to drop it from the model... Now be warned! This will bias your estimate. But maybe this is a price worth paying if it also decreases variance significantly." ... "sometimes it is worth accepting a bit of bias in order to reduce variance."
- "The core of this chapter was orthogonalization as a means to make treatment look as good as randomly assigned if conditional independence holds... you can adjust for the confounding bias due to X by regressing T on X and obtaining the residuals. Those residuals can be seen as a debased version of the treatment."
Propensity score
- "[propensity weighting]... involves modeling the treatment assignment mechanism and using the model's prediction to reweight the data, instead of building residuals like in orthogonalization." ... "The propensity score can be viewed as a dimensionality reduction technique." ... "Think about it. If they have the exact same probability of receiving the treatment, the only reason one of them did it and the other didn't is by pure chance."
Effect heterogeneity
- This section is very complicated... instead of the general impact of a treatment, it looks at how the treatment can affect people differently (the conditional average treatment effect).
Panel Data
DID (difference in difference)
- "A panel is a data structure that has repeated observations across time. The fact that you observe the same unit in multiple time periods allows you to see, for the same unit, what happens before and after a treatment takes place. This makes panel data a promising alternative to identifying causal effects when randomization is not possible."
- Some divergence from typical DID assumptions: it can take time for the treatment to reach its full effect, and adoption can be staggered (different treatment timings). Fortunately, with panel data, unobserved confounders don't block the average treatment effect on the treated, as long as they are constant across time for the same unit.
- "A lot for the promise [of panel data causal inference] comes from the fact that having an extra time dimension allows you to estimate counterfactuals for the treated not only from the control units, but also from the treated units' past."
Synthetic control
- "While DID works great if you have a relatively large number of units N compared to time periods T, it falls short when the reverse is true. In contrast, synthetic control was designed to work with very few, even one, treatment unit. The idea behind it is quite simple: combine the control units in order to craft a synthetic control that would approximately the behavior of the treated units in absence of treatment."
- "Synthetic control is nothing more than a regression that uses the outcome of the control as features to try to predict the average outcome of the treated units. The trick is that it does this by using only the pre-intervention period..... As you can see, you have one weight for each control city [Sao Paulo example]. Usually, regression is used when you have a bunch of units (large N), which allows you to use the uniform as the rows and covariates as the columns. But synthetic control is designed to work when you have relatively few units, but a larger time horizon T(pre). In order to do that, SC quite literally flips the data on its head, using units as if they were covariates. This is why synthetic control is also called horizontal regression." .... "simple regression is not commonly used as a method to build synthetic controls. Because of the relatively large number of columns (control cities), it tends to overfit, not generalize to the post-intervention period. For this reason, the original synthetic control method is not a simple regression, but one that imposes some reasonable and intuitive constraints [all weights are positive and add up to one]." Or use a lasso or ridge regression for regularization.
- "The gradual increase is frequently observed in marketing since it usually requires time for individuals to ask action after seeing an advertisement. Additionally, the effect wearing off can often be attributed to the novelty effect that gradually fades over time."
- "just fit a model to predict pre-treatment outcome of the treated on a bunch of equally pre-treatmemt time series and use that model to make predictions on the post-treatment period.... the result is a set of weights which, when multiplied by the control units, yields a synthetic control: a combination of control units that approximately the behavior of the treated units, at least in the pre-intervention period."
Alternative Experiment Design
- "What if, instead of having to use panel data to identify a treatment effect, you had to design an experiment to gather that data?"
- "Fortunately, the t-test you learned in the previous chapter doesn't make an assumption on how the units were selected, so you can use it here [when designing an experiment using synthetic control design]."
- "If the effect of rising prices dissipates rather quickly once they go back to the normal level, the company can turn the price increases on and off multiple times and do a sequence of before-and-after comparisons. This approach is named switchback experiments, and it is great for when you have just one or a very small number of units. But for it to work, the order of the carryover effect must be small. That is, the effect of the treatment cannot propagate to many periods after the treatment. For instance... food delivery." Eg when you cannot do an A/B test and operate in a small market so cannot do synthetic control experiments.

Errors/Typos from my copy
- Pg. 53, bottom. Should be: "appear less than 5% of the time," not 95%.
3 reviews · 1 follower
May 30, 2023
Finished reading the early version on O'Reilly because I didn't want to wait a couple more months.

This book is really a masterpiece that fills the gap between the theory of causal inference in academia and the practical application of causal inference in industry. Absolutely THE BOOK that I've been looking for all these years.

Despite the typos in the early version, I will definitely recommend this book to any data scientist, applied scientist, or data enthusiast who is interested in learning causal inference. I will also definitely be buying the paperback version in August.
10 reviews
April 14, 2025
An excellent introduction to the much-neglected topic of causal inference. Matheus has a talent for writing and teaching. The only criticism I have is that the book is in need of a second, revised edition, as the field has moved rapidly and made this book feel a bit dated, especially with regard to the now more mature software libraries. Should we eventually get an update, I would have no reservation in giving it a full 5 stars. Nevertheless, it still stands as a very good beginner book on this topic.
Dirk
165 reviews · 10 followers
May 27, 2024
Great book for data scientists: a good mix of theory, examples, and code. Makes complex topics tangible. Covers a good range of methods.

Nitpick: I found some minor errata.
Josua Naiborhu
61 reviews · 1 follower
March 17, 2025
Reading the first chapter alongside simple instances of causal inference use cases genuinely helped me comprehend it in an intuitive way. The author really knows how to explain things. It's going to be my go-to resource when it comes to explaining the impact of the predictions I present to my clients. I could say this is the book you need to understand causal inference from an industry perspective. Kudos to the author!
