An encoding generative modeling approach to dimension reduction and covariate adjustment in causal inference with observational studies

Significance Causal inference has become increasingly essential in modern observational studies with rich covariate information. However, it is often challenging to estimate causal effects with high-dimensional covariates. Here, we introduce an encoding generative modeling (EGM) approach that handles high-dimensional covariates through a dependency-aware dimension reduction strategy; the key idea is to identify a latent covariate feature set (e.g., latent confounders) that affects both treatment and outcome. EGM provides a flexible and powerful framework for developing deep learning-based estimates of the structural equation model that describes the causal relations among variables. Comprehensive numerical experiments suggest that the proposed method is effective and scalable in estimating the causal effect of one variable on another under various settings.


A: Proof of identifiability of µ(x)
We have
µ(x) = E[Y(x)] = ∫ E[f0(x, Zy, U2) | Z0 = z0] pZ0(z0) dz0 = ∫ E[f0(X, Zy, U2) | X = x, Z0 = z0] pZ0(z0) dz0.

B: Proof of Theorem 1 (Bound of excess risk)
We make two assumptions on the function classes FM and DM.
(Uniformly equi-continuous) FM is uniformly equi-continuous: for every ε > 0, there exists a δ > 0 such that |f(x) − f(y)| < ε whenever d(x, y) < δ, for any f ∈ FM. Note that we consider scalar functions, or each scalar output of a vector-valued function, in FM.
(b-uniformly bounded) FM and DM are b-uniformly bounded: there exists some b > 0 such that ||f||∞ ≤ b for any f ∈ FM or f ∈ DM.
For commonly used neural network models, the first assumption is satisfied because the weights are real-valued and the activation functions are continuous. The second assumption is also satisfied when the input domain is bounded and the network weights are regularized.
Next, by using the triangle inequality, we obtain a first decomposition of the excess risk. Then, by the definition of the empirical risk minimizer, we further have

En[||X − ĥM,n(Z0, Z2)||²] ≤ En[||X − h⁰M(Z0, Z2)||²],
En[||V − ĝM,n(êM,n(V))||²] ≤ En[||V − g⁰M(e⁰M(V))||²],
d(P_êM,n(V), P_Z^emp; AM) ≤ d(P_e⁰M(V), P_Z^emp; AM),   [3]

where (f⁰M, h⁰M, e⁰M, g⁰M) is the solution minimizing the true risk R⁰(f, h, e, g) and P_Z^emp is the empirical distribution of Z. Using the triangle inequality again, the excess risk can be bounded by a sum of terms αM,n, βM,n, γM,n, and ζM,n. Next, we rewrite the distribution distance term in γM,n in terms of DM, the class of binary discriminator networks that classify the measurable sets in AM. Now αM,n, βM,n, γM,n, and ζM,n all have the same form, 2 sup_{F∈F} |(En − E0)F|, for some function class F. For any b-uniformly bounded function class F, the uniform law of large numbers states that for all n ≥ 1 and δ ≥ 0,

sup_{F∈F} |(En − E0)F| ≤ 2Rn(F) + δ

with probability at least 1 − 2 exp(−nδ²/(2b²)). The empirical Rademacher complexity is defined as

Rn(F) = Eσ[ sup_{F∈F} |(1/n) Σ_{i=1}^{n} σi F(vi)| ],

where the σi are i.i.d. draws from the Rademacher distribution with P(σi = 1) = P(σi = −1) = 1/2. The Rademacher complexity measures the richness of a function class with respect to a probability distribution.
Note that F is a scalar function (e.g., the loss applied to a neural network output); for example, F = (Y − f(X, Z0, Z1))² in the αM,n term. We next discuss how to obtain upper bounds for αM,n, βM,n, γM,n, and ζM,n.
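To make the Rademacher complexity concrete, the following sketch estimates the empirical Rademacher complexity of a toy function class by Monte Carlo over random sign vectors; the function class and data here are illustrative stand-ins, not the neural network classes FM or DM.

import numpy as np

def empirical_rademacher(fn_class, samples, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_{F in fn_class} |(1/n) sum_i sigma_i F(v_i)| ]."""
    rng = np.random.default_rng(seed)
    n = len(samples)
    # Evaluate every function of the class on every sample: shape (|fn_class|, n).
    values = np.array([[F(v) for v in samples] for F in fn_class])
    sups = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)       # i.i.d. Rademacher signs
        sups.append(np.max(np.abs(values @ sigma) / n))
    return float(np.mean(sups))

# Toy usage: a small class of bounded linear functions on 2-d inputs.
rng = np.random.default_rng(1)
V = rng.normal(size=(500, 2))
weights = rng.normal(size=(20, 2))
fn_class = [lambda v, w=w: float(np.clip(v @ w, -1.0, 1.0)) for w in weights]
print(empirical_rademacher(fn_class, V))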
Upper bound for αM,n and βM,n. The squared loss functions of the neural network outputs in the αM,n and βM,n terms satisfy the Lipschitz condition on bounded inputs, so these terms can be upper bounded by Rn(Lλ ∘ FM), where Lλ denotes the λ-Lipschitz squared loss. Note that the Lipschitz condition holds because the domain is bounded.
Combining these bounds yields the probability inequality [11], where step (*) is based on dividing δ into two parts with ratio (4 + 2p) : 2 and step (**) is based on the fact that the latter event implies the former. Note that if we restrict each F ∈ FM to be a scalar output of a neural network, such as f, h, ei, or gi (the i-th output), we can further use Talagrand's contraction principle (1) (see Lemma 4.2 and Theorem 10.2 therein) to upper bound Rn(Lλ ∘ FM) in terms of Rn(FM). In particular, the term ζM,n can be upper bounded by pRn(Lλ ∘ FM), since each component in the summation is a squared loss that satisfies the Lipschitz condition on a bounded domain. In bounding ζM,n, we treat separately the Rademacher complexities of the scalar squared loss and of the vector-valued squared loss.
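For reference, the contraction step uses Talagrand's contraction inequality in a form such as

Rn(Lλ ∘ FM) ≤ 2λ Rn(FM),

for a λ-Lipschitz loss Lλ with Lλ(0) = 0; the constant 2 reflects the absolute-value convention in the definition of Rn above and may differ under other conventions.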

C: Example for Assumption 2
Assumption 2 is expected to hold with a small δ when the dimension of the covariates V can be effectively reduced. This is illustrated by the following simulation study. Let V ∈ V be a continuous random vector following a multivariate Gaussian distribution V ∼ N(µ, Σ), where µ ∈ R^p and Σ ∈ R^{p×p}. We aim to find an encoding function e and a generative/decoder function g with mappings e : V → R^q and g : R^q → V, where q ≪ p. First, we factorize the covariance matrix as

Σ = U Λ U^T,   [12]

where the columns of U are the eigenvectors associated with the eigenvalues on the diagonal of Λ. We sort the eigenvalues in descending order, Λ = diag(λ1, ..., λp) with λi ≥ λj for any i < j.
By a linear transformation, it is easily shown that T = (UΛ^{1/2})^{−1}(V − µ) follows a standard multivariate Gaussian distribution, T ∼ N(0, I). This linear transformation can be viewed as the underlying encoding function, with a standard Gaussian distribution in the latent space. In the dimension reduction scenario, a small fraction of the eigenvalues of Σ is expected to explain the majority of the total variation in V. We therefore design the following generating process.
We set p = 50 and q = 13, and choose the diagonal elements of Λ such that the first 13 principal components explain 95.96% of the variation contained in V. To generate V, the mean vector µ is sampled from a uniform distribution, µi ∼ U(−1, 1), and the covariance matrix Σ is constructed by Equation (12), where the columns of U form a random orthonormal basis. To construct the features of V for predicting treatment X and outcome Y, we set the three components e⁰₀(V), e⁰₁(V), and e⁰₂(V) of the encoder network e as functions of the ti(V) as in Equation [14], where ti(V) denotes the i-th element of the linear transformation T defined above. It is easily shown that e⁰k(V) ∼ N(0, 1) for k ∈ {0, 1, 2}, which satisfies the independent normal distribution in the latent space. The treatment and outcome are then generated based on the features of V as

Y = f(X, e⁰₀(V), e⁰₁(V)) + ε1,
X = h(e⁰₀(V), e⁰₂(V)) + ε2,   [15]

where e⁰₀(V), e⁰₁(V), and e⁰₂(V) can be considered as the constructed features of V for predicting X and Y. For implementation, we set the first three parts of the encoder network e to be the fixed functions e⁰₀(·), e⁰₁(·), and e⁰₂(·). The fourth part of the encoder e is trainable and is set to be 10-dimensional. According to Principal Component Analysis (PCA) (2), the theoretical optimal reconstruction error using a q-dimensional feature is

Lrec = Σ_{i=q+1}^{p} λi.   [16]

In the above simulation example, Lrec = 1.907. We then generate N = 50,000 i.i.d. samples of V and use them to train the above partially fixed encoder-decoder model. To avoid overfitting of the neural networks, we additionally generate 10,000 hold-out samples of V. As shown in Fig. S1, the empirical reconstruction error on the held-out data reaches its minimum (2.339) at iteration 109,600. In this simulation, δ in Assumption 2 can be as small as 0.432, which accounts for less than 1% of the total variation contained in V (Σ_{i=1}^{p} λi).
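A minimal sketch of this simulation in Python/NumPy is given below; the eigenvalue profile, the random orthonormal basis, and the use of the first q whitened coordinates are illustrative assumptions standing in for the exact choices in Equations [12]-[14].

import numpy as np

rng = np.random.default_rng(0)
p, q, N = 50, 13, 50000

# Random orthonormal basis U via QR decomposition, and a decaying eigenvalue profile
# (illustrative; the paper fixes Lambda so the first 13 PCs explain ~96% of the variance).
U, _ = np.linalg.qr(rng.normal(size=(p, p)))
lam = np.sort(np.exp(-0.3 * np.arange(p)) * 10)[::-1]      # descending eigenvalues
Sigma = U @ np.diag(lam) @ U.T                              # Equation [12]
mu = rng.uniform(-1, 1, size=p)

V = rng.multivariate_normal(mu, Sigma, size=N)

# Whitening transform T = (U Lambda^{1/2})^{-1} (V - mu); each coordinate is N(0, 1).
A = U @ np.diag(np.sqrt(lam))
T = (V - mu) @ np.linalg.inv(A).T

# PCA-optimal reconstruction error with a q-dimensional code (Equation [16]).
L_rec_optimal = lam[q:].sum()
print("optimal reconstruction error:", L_rec_optimal)

# Sanity check: reconstruct V from its first q whitened coordinates and compare.
T_q = np.zeros_like(T)
T_q[:, :q] = T[:, :q]
V_hat = T_q @ A.T + mu
empirical_err = np.mean(np.sum((V - V_hat) ** 2, axis=1))
print("empirical reconstruction error:", empirical_err)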
During the CausalEGM model training, we aim to optimize the (F, H, E, G) networks to approximate the underlying true functions f⁰, h⁰, e⁰, and g⁰. Suppose (e⁰₀, e⁰₁, e⁰₂, e⁰₃) denotes the four components of the encoder function e⁰. We want the latent variable Z = (Z0, Z1, Z2, Z3) to have a desired distribution (e.g., a standard normal distribution).

Lee et al.
We followed a data-generation process similar to that in (6): let ε1 ∼ N(0, 1) and ε2 ∼ N(0, 1). The covariates are generated by V = (V1, ..., Vp) ∼ N(0, Σ), where diag(Σ) = 1 and Σi,j = 0.5 for |i − j| = 1. The treatment is generated by X = Φ(3V θ) + 0.75 ε1 − 0.5, where θj = 1/j². The outcome is then generated as a function of the treatment, the covariates, and ε2, following (6).

Twins. This dataset contains data on 71,345 twins, including their weights (used as treatment), mortality, and 50 other covariates (so p = 50), derived from all births in the USA between 1989 and 1991. Similar to (7), we first filtered the data by limiting the weight to less than 2 kilograms; 4,821 pairs of twins were kept for further analysis. We set the weight as the continuous treatment variable. We then simulate the risk of death (outcome) under a model in which higher weight leads to a lower death rate in general. Let Y be the Bernoulli variable where Y = 1 indicates death, and let R be the death risk, which depends on the covariates. We simulate the outcome as

Y(x) ∼ Bernoulli(R(x)),   R(x) = −2/(1 + e^{−3x}) + V γ + ε,

where γ ∈ R^{p×1} with γi ∼ N(0, 0.025²), and ε ∼ N(0, 0.25²). Taking the response variable as R instead of Y, the ADRF is then µ(x) = E[R(x)].

ACIC 2018. For binary treatment settings, we downloaded the datasets from the 2018 Atlantic Causal Inference Conference (ACIC) competition. These datasets utilize the Linked Births and Infant Deaths Database (LBIDD), based on real-world medical measurements (8). The LBIDD data are semi-synthetic: 117 measured covariates are given, and the treatment and outcome are simulated under different data-generating processes. We chose nine datasets by selecting the most complicated generating process (e.g., the highest degree of the generating function) with sample sizes ranging from 1,000 to 50,000. Details of the datasets used in the study are provided in Table S1.
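A compact sketch of two of the simulated generation steps above: the Lee et al. covariate/treatment model (the outcome model is omitted since its exact form follows (6)) and the Twins death-risk model; the sample size n, the dimension p, and the reuse of the same covariates for both parts are illustrative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, p = 5000, 20                                       # illustrative sizes

# Lee et al.: covariates with unit variance and 0.5 correlation on the first off-diagonal.
Sigma = np.eye(p)
i = np.arange(p - 1)
Sigma[i, i + 1] = Sigma[i + 1, i] = 0.5
V = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
theta = 1.0 / (np.arange(1, p + 1) ** 2)
X = norm.cdf(3.0 * (V @ theta)) + 0.75 * rng.normal(size=n) - 0.5   # treatment

# Twins: simulated death risk as a function of the weight x (treatment) and covariates v.
def death_risk(x, v, gamma, eps):
    return -2.0 / (1.0 + np.exp(-3.0 * x)) + v @ gamma + eps

gamma = rng.normal(0.0, 0.025, size=p)                # p stands in for the 50 Twins covariates
eps = rng.normal(0.0, 0.25, size=n)
R = death_risk(X, V, gamma, eps)                      # response taken as R; Y(x) ~ Bernoulli(R(x))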

F: Baseline Methods
For the continuous treatment setting, three different baselines were used.

Ordinary Least Squares regression (OLS). OLS first fits a linear regression model for Y | (X, V). For each value of the treatment x, the estimated ADRF is then (1/n) Σ_{i=1}^{n} ls(x, vi), where ls is the fitted linear model.
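A minimal sklearn-based sketch of this estimator, assuming arrays X (treatment), V (covariates), and Y (outcome), e.g., X and V as generated in the datasets sketch above with Y from an outcome model; the grid of treatment values is illustrative.

import numpy as np
from sklearn.linear_model import LinearRegression

def ols_adrf(X, V, Y, x_grid):
    """Estimate the ADRF by fitting Y ~ (X, V) and averaging predictions over the sample."""
    design = np.column_stack([X, V])                          # regressors: treatment plus covariates
    ls = LinearRegression().fit(design, Y)
    adrf = []
    for x in x_grid:
        design_x = np.column_stack([np.full(len(V), x), V])   # set everyone's treatment to x
        adrf.append(ls.predict(design_x).mean())              # (1/n) sum_i ls(x, v_i)
    return np.array(adrf)

# Example usage:
# mu_hat = ols_adrf(X, V, Y, x_grid=np.linspace(X.min(), X.max(), 20))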

Regression Prediction Estimator (REG).
See (9-11). The prima facie estimator regresses the outcome on the treatment without considering covariates. REG generalizes the prima facie estimator by taking the covariates into account in the regression. Unlike OLS, REG fits a quadratic ADRF: Y(x) = α0 + α1 x + α2 x².
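One plausible reading of this estimator, sketched below, regresses the outcome on (x, x²) together with the covariates and reads the ADRF off the treatment coefficients; the exact covariate adjustment used in (9-11) may differ.

import numpy as np
from sklearn.linear_model import LinearRegression

def reg_adrf(X, V, Y, x_grid):
    """Quadratic-ADRF regression prediction estimator (illustrative covariate adjustment)."""
    design = np.column_stack([X, X ** 2, V])
    model = LinearRegression().fit(design, Y)
    a1, a2 = model.coef_[0], model.coef_[1]
    # Average the covariate contribution over the empirical covariate distribution.
    baseline = model.intercept_ + V @ model.coef_[2:]
    return np.array([a1 * x + a2 * x ** 2 + baseline.mean() for x in x_grid])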

Double Debiased Machine Learning Estimator (DML).
See (6). DML is a kernel-based machine learning approach that combines a doubly robust moment function with cross-fitting. Various machine learning methods can be used to estimate the conditional expectation function and the conditional density. We used the "Lasso", "random forest", and "neural network" learners provided by the DML toolkit as three variants, denoted DML(lasso), DML(rf), and DML(nn).
For the binary treatment setting, six baseline methods were used.
CFR. See (12). This method estimated the individual treatment effect (ITE) by using neural networks to learn a low-dimensional representation of the covariates and two outcome functions, respectively. An integral probability metric was further introduced to balance the distributions of the treated and control groups. We used two variants of CFR for comparison, referred to as TARNET and CFRNET.
Dragonnet. See (13). This method used a three-headed architecture: two heads for outcome estimation and one head for propensity-score estimation. Note that Dragonnet uses essentially the same architecture as CFR if the propensity-score head is removed.

CEVAE. See (14). This is a variational autoencoder-based method for estimating the treatment effect, in which the CFR architecture (12) was used in the inference network and the latent variables were modeled as multivariate normal.

GANITE.
See (15). This method exploited a generative adversarial network (GAN) to generate the counterfactual outcomes through adversarial training.
CausalForest. See (16). This method built random forests to estimate heterogeneous treatment effects and is applicable in binary treatment settings. CausalForest is an ensemble method consisting of multiple causal trees.
Furthermore, we explored a natural strategy of adding a moderate level of random noise to the discrete covariates to make them more continuous while keeping them distinguishable. We adapted the first dataset from the SI Appendix, where the Vi are i.i.d. with Vi ∼ Bernoulli(0.5) for i = 1, ..., 10. In the generation process, Vy is the concatenation of V1:K and V(K+5):10. We then varied K to set different numbers of confounders. We observed that adding Gaussian noise (N(0, 0.2²)) could slightly improve the model performance for different K (Figure S4); a sketch of this noise-adding step is given below. To conclude, we can add Gaussian noise to the discrete covariates to mitigate the discreteness when the number of covariates is not large. However, adding noise to all covariates may be harmful to the model, as the total variance can accumulate when the latent confounding variable is generated from a combination of many covariates. In the case of high-dimensional covariates, users should run CausalEGM directly without adding any noise.
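A minimal sketch of the noise-adding step, assuming a matrix of binary covariates; the noise scale follows the N(0, 0.2²) setting above.

import numpy as np

def add_jitter(V_discrete, scale=0.2, seed=0):
    """Add small Gaussian noise to discrete covariates so they become continuous
    but remain distinguishable (values near 0 vs. near 1)."""
    rng = np.random.default_rng(seed)
    return V_discrete + rng.normal(0.0, scale, size=V_discrete.shape)

# Example: 10 i.i.d. Bernoulli(0.5) covariates, as in the adapted dataset above.
rng = np.random.default_rng(1)
V = rng.binomial(1, 0.5, size=(5000, 10)).astype(float)
V_jittered = add_jitter(V)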

I: Ablation Studies
We conducted detailed experiments to investigate whether the adversarial training in the covariate space and the reconstruction error in the latent space are necessary. In our model design, adversarial training in the latent space is necessary to guarantee the independence of the latent variables, and the reconstruction of V is required to ensure that the latent features retain all the information contained in the original covariates. We therefore designed experiments on two datasets to quantitatively evaluate the contributions of the adversarial training in the covariate space and the reconstruction in the latent space. As shown in Table S3, the reconstruction of the latent features benefits model training and yields slightly better performance, whereas the adversarial training in the covariate space may not improve training, as distribution matching in a high-dimensional space can be difficult.

J: Robustness and Scalability
To demonstrate the robustness and scalability of CausalEGM, we designed a series of experiments as follows. For the robustness analysis, it is important to evaluate how the dimension of the latent features Z affects the performance of the CausalEGM model. We consider both the total dimension of the latent features Z and the dimension of the common latent features Z0 that affect both treatment and outcome. For the continuous treatment experiments, we use the Hirano and Imbens dataset as an example. On the one hand, the dimensions of (Z0, Z1, Z2, Z3) are set to {(k, k, k, 7k) | k = 1, 2, ..., 5}, where k = 1 is the default used in the main results. On the other hand, we vary the dimension of Z0 from 1 to 10 while the dimensions of the other latent features Zi (i = 1, 2, 3) are fixed at (1, 1, 7), respectively. The performance shows only small fluctuations when varying the dimension of either all latent features or only the common latent features (Figure S2A-B). We use similar settings for the robustness analysis in the binary experiment, taking a dataset from LBIDD with a sample size of 1,000 as an example. On the one hand, the dimensions of (Z0, Z1, Z2, Z3) are set to {(k, k, 2k, 2k) | k = 1, 2, ..., 5}, where k = 3 is the default. On the other hand, we vary the dimension of Z0 from 1 to 10 while the dimensions of the other latent features Zi (i = 1, 2, 3) are fixed at (3, 6, 6). The performance does not change significantly when varying the dimensions of the latent features (Figure S2C-D). These experiments in both the continuous and binary treatment settings demonstrate the robustness of CausalEGM with respect to the choice of latent dimensions.
For the scalability analysis, we are interested in 1) whether CausalEGM can handle datasets with large sample sizes and 2) whether CausalEGM can handle datasets with a large number of covariates. We designed the following experiments to test the scalability of CausalEGM. For the continuous treatment experiment, we selected the Hirano and Imbens dataset. We first change the number of covariates from 50 to 10,000 while the sample size is fixed at 10,000. Note that only OLS, DML(lasso), and CausalEGM can handle more than 1,000 covariates, while the other comparison methods fail (Figure S3A). Next, we fix the number of covariates at 100 while changing the sample size from 10³ to 10⁶. Only OLS, REG, DML(lasso), and CausalEGM are able to handle a sample size as large as 10⁶ (Figure S3B). Except for the small-sample situation (e.g., 1,000 samples), where CausalEGM achieves performance comparable to DML(nn) and DML(rf), CausalEGM consistently outperforms all comparison methods when either the number of covariates or the sample size is varied. Similarly, for the binary treatment experiment, we chose one of the largest datasets from LBIDD, with a sample size of 50,000. For this semi-synthetic dataset, we increase the number of covariates by adding new covariates that are linear combinations of the existing covariates, with combination coefficients drawn from the standard normal distribution, and we increase the sample size by randomly repeating existing samples (see the sketch below). We tested the performance of CausalEGM and the best baseline method, CausalForest. We first change the number of covariates from 500 to 50,000 while the sample size is 50,000. The performance of CausalForest first increases slightly and then decreases, while CausalEGM is generally more robust to changes in the number of covariates (Figure S3C). Next, we fix the number of covariates at the original 177 while changing the sample size from 50,000 to 5,000,000. CausalForest fails when the sample size reaches 1 million, whereas CausalEGM is capable of handling extremely large datasets with more than 5 million samples (Figure S3D). All methods were benchmarked on the Stanford Sherlock computing cluster, where the memory usage for each method was limited to 50 GB and the running time was limited to 7 days in the scalability experiments. To sum up, CausalEGM can handle significantly larger datasets than CausalForest.
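A small sketch of the two augmentation steps used in this scalability experiment, assuming a covariate matrix V; the target sizes are illustrative.

import numpy as np

def augment_covariates(V, n_new, seed=0):
    """Add n_new covariates that are linear combinations of the existing ones,
    with combination coefficients drawn from the standard normal distribution."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(V.shape[1], n_new))
    return np.hstack([V, V @ W])

def augment_samples(V, target_n, seed=0):
    """Increase the sample size by randomly repeating existing samples."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, V.shape[0], size=target_n - V.shape[0])
    return np.vstack([V, V[idx]])

When repeating samples, the same repeated indices would of course also be applied to the treatment and outcome vectors.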

Fig. S4. Model performance with different numbers of confounders using the original discrete covariates (CausalEGM) or adding Gaussian noise to the discrete covariates (CausalEGM+).