gFormulaMI - G-formula for Causal Inference via Multiple Imputation
Jonathan Bartlett
gFormulaMI.Rmd
Introduction
In this vignette we introduce the functionality of the
gFormulaMI
. The package implements an approach to
constructing a G-formula estimator using multiple imputation methods and
software, as described by Bartlett et al (2023).
First we use the simulated dataset provided in the package to
demonstrate how to impute potential outcomes using
gFormulaImpute
. We then show how the resulting imputed
datasets can be analysed using the syntheticPool
function.
Lastly, we illustrate how the functions can be used when the dataset has
missing values to begin with (i.e. regular missing data).
Imputing using gFormulaImpute
First, we load the package
The first few rows of the simulated dataset that ships with the package look like
head(simDataFullyObs)
#> l0 a0 l1 a1 l2 a2 y
#> 1 -0.6332580 1 0.59545963 1 2.7644551 1 3.4091077
#> 2 0.1182498 0 0.62444135 0 -0.9677095 1 -0.8586307
#> 3 0.4833364 1 -0.53836723 0 -0.7088896 0 -0.4378790
#> 4 0.6775931 0 0.57759241 1 2.0477487 1 3.8760967
#> 5 0.4278554 1 1.50188037 1 3.4155576 1 4.4027209
#> 6 0.6437421 1 -0.01322485 1 2.2178911 1 4.1302417
Here the l
variables correspond to a confounder measured
at baseline (l0
) and two follow-up time points
(l1
and l2
). The time-varying binary treatment
variable is stored in a0
, a1
and
a2
. The final outcome variable is y
.
Next we use gFormulaImpute
to impute potential outcomes
under two treatment regimes of interest. To do this,
gFormulaImpute
uses the mice
multiple
imputation function from the mice
package. If passed a
regular data frame, as here, gFormulaImpute
expects (and
requires) there to be no missing values in the data frame. To run, we
have to tell gFormulaImpute
which variables correspond to
the time-varying treatment and the treatment regime(s) that we want it
to create imputations for:
set.seed(7626)
#impute synthetic datasets under two regimes of interest
imps <- gFormulaImpute(data=simDataFullyObs,M=10,trtVars=c("a0","a1","a2"),
trtRegimes=list(c(0,0,0),c(1,1,1)))
#> [1] "Input data is a regular data frame."
#> [1] "Variables imputed using:"
#> l0 a0 l1 a1 l2 a2 y regime
#> "norm" "" "norm" "" "norm" "" "norm" ""
#> [1] "Predictor matrix is set to:"
#> l0 a0 l1 a1 l2 a2 y regime
#> l0 0 0 0 0 0 0 0 0
#> a0 1 0 0 0 0 0 0 0
#> l1 1 1 0 0 0 0 0 0
#> a1 1 1 1 0 0 0 0 0
#> l2 1 1 1 1 0 0 0 0
#> a2 1 1 1 1 1 0 0 0
#> y 1 1 1 1 1 1 0 0
#> regime 1 1 1 1 1 1 1 0
Here we have called gFormulaImpute
requesting 10
imputations to be generated for two regimes of interest, corresponding
to no treatment at each time point (0,0,0) and treatment at each time
point (1,1,1).
Imputation model specification
The printed output above tells us what imputation methods have been
used to impute (simulate) the time-varying confounders and outcome. By
default, gFormulaImpute
specifies to impute numeric
variables using normal linear regression, which is why here
l0
, l1
, l2
and y
are
being imputed using method norm
. If there had been binary
factors as time-varying confounders, these would be imputed using
logistic regression. Factor time-varying confounders which are unordered
are imputed using polytomous regression and a proportional odds model is
used for ordered factors. If you want to modify these defaults, you can
pass a method
vector to gFormulaImpute
, which
specifies which of the imputation methods available in the
mice
package to use when imputing each column of the data
frame.
The output also shows the predictorMatrix
argument used
when calling the mice
imputation function. This shows, in a
given row, which other variables (indicated by a 1) will be used as
covariates in the imputation model for the variable labelled in that
row. Thus here, l1
is being imputed using l0
and a0
. For variables which are not being imputed (as
indicated by an empty entry in the vector of imputation methods
printed), the corresponding row in the predictor matrix has no effect on
anything. In particular, the treatment variables a0
,
a1
and a2
are not being imputed based on a
regression model.
You will notice that the printed predictorMatrix
has 1s
in the lower diagonal and 0s in the upper diagonal. This is because
g-formula and hence gFormulaImpute
, imputes sequentially
forwards in time, using the past (i.e. to the left) variables as
covariates. Because of this, the data frame you pass to
gFormulaImpute
must have the variables ordered in time as
in the example above, so that the correct covariates are used in each
model. If you wish to modify the predictor matrix used, you can pass a
custom one using the predictorMatrix
argument of
gFormulaImpute
.
Note that unlike in the usual missing data situation, no iteration is
required when gFormulaImpute
calls the mice
imputation function, and thus when gFormulaImpute
calls
mice
it sets its maxit
argument to 1.
Number of imputations
In our call above we asked gFormulaImpute
to create 10
imputations. Later when we analyse the imputations, the special pooling
rules we will apply can, with a small number of imputations, produce
negative variances for parameter estimates. To avoid this, we recommend
using at least 25 imputations in general (we used 10 here for the sake
of speed).
Number of individuals to simulate
In our call above we did not specify the nSim
argument.
This argument specifies how many individuals to simulate longitudinal
histories for in each of the synthetic imputed datasets. Since we did
not specify a value, the default is to simulate the same number as the
number of rows in the data frame passed to gFormulaImpute
(here 5,000). If more than one treatment regime is specified,
nSim
rows are simulated for each regime.
Analysing the imputations
gFormulaImpute
creates a set of synthetic imputed
datasets. That is, the imputed datasets created contain imputed
(simulated) longitudinal histories based on the fitted models, under the
treatment regimes specified. The original rows in the data frame passed
to gFormulaImpute
do not appear in the imputed datasets
returned to us. The first few rows of the first imputed dataset are:
head(mice::complete(imps))
#> l0 a0 l1 a1 l2 a2 y regime
#> 1 0.6177218 0 0.6397720 0 -0.0003761256 0 0.5996190 1
#> 2 0.2044563 0 -0.6073790 0 -1.7856679957 0 -1.7246571 1
#> 3 0.4327859 0 -0.3925652 0 -0.7555212806 0 -1.5840459 1
#> 4 0.6059553 0 2.4764117 0 0.7312018102 0 0.6342567 1
#> 5 0.2985589 0 0.9223022 0 0.0734605439 0 -1.1511625 1
#> 6 -0.1763264 0 0.6122369 0 0.8838681514 0 2.0858243 1
We see that in the first rows of the first imputed dataset,
a0
, a1
and a2
are always zero,
because the first treatment regime we specified to impute for was
(0,0,0). We also have an additional variable, called
regime
. This factor variable records which of the specified
treatment regimes each row corresponds to. Thus here, in the first few
rows of the data frame, the regime is 1.
To analyse the imputed datasets we first run our desired analysis of
the imputed longitudinal histories. Here we will simply compare the mean
of the final outcome variable y
between the two regimes we
have imputed for. To do this, we fit a linear model with y
as the dependent variable and regime
as covariate:
Because the imputed datasets we have analysed are fully synthetic, we
cannot (or rather should not) pool our estimates using Rubin’s standard
combination rules (e.g. as implemented in pool
in the
mice
package). Doing so results in variances which are
larger than they should be. Instead we use the synthetic imputation
combination rules developed by Raghunathan et al (2003), implemented in
the syntheticPool
function:
syntheticPool(fits)
#> Estimate Within Between Total df
#> (Intercept) -0.02071539 0.0008125695 0.001763004 0.001126735 3.038045
#> factor(regime)2 2.96593502 0.0016251389 0.004203106 0.002998278 3.784951
#> 95% CI L 95% CI U p
#> (Intercept) -0.1267874 0.08535658 5.803156e-01
#> factor(regime)2 2.8104386 3.12143146 1.304839e-06
Here the intercept corresponds to the mean of y
under
our first regime (0,0,0). The factor(regime)2
coefficient
corresponds to difference in the mean of y
between the
second regime and the first. Here we see that the second regime has a
mean outcome about 3 higher than the first. As well as the point
estimate, we see the within, between and total imputation variances.
Here we can see that the total variances are less than the between,
which never happens with Rubin’s regular pooling rules. This is because
in the synthetic pooling rules of Raghunthan et al (2003), the total
variance is the between minus the within imputation variance (plus a
correction for the finite number of imputations performed).
Datasets with missing values
Often the longitudinal dataset we have has some missing values. In
this context, one approach is to use multiple imputation to impute these
missing values using imputation software, assuming missing at random.
Having imputed these, we can then pass the imputed datasets to
gFormulaImpute
to perform the synthetic imputation
step.
To illustrate these steps, let’s now create a new dataset from the simulated one, making some values missing:
simDataMissing <- simDataFullyObs
simDataMissing$l0[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l1[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$l2[runif(nrow(simDataMissing))<0.2] <- NA
simDataMissing$y[runif(nrow(simDataMissing))<0.2] <- NA
Here we have simply made around 20% of values missing in
l0
, l1
, l2
and y
,
completely randomly.
Next, we impute the simDataMissing
data frame using the
mice
function:
simDataMissingImps <- mice::mice(simDataMissing,m=10,printFlag=FALSE)
In a real substantive analysis we should take more care to think
about how we specify the imputation models. For more discussion of this
point, see Bartlett et al
(2023). In particular, note that the default behaviour of the
mice
function is to impute numeric variables using the
predictive mean matching method, rather than normal linear regression
(as in the gFormulaImpute
function).
We can now pass our multiply imputed datasets to
gFormulaImpute
to create the synthetic imputations as
before, but this time let’s try and generate 20 synthetic imputated
datasets:
imps2 <- gFormulaImpute(data=simDataMissingImps,M=20,trtVars=c("a0","a1","a2"),
trtRegimes=list(c(0,0,0),c(1,1,1)))
#> [1] "Input data is a mice created multiple imputation object."
#> [1] "Value passed to M being ignored."
#> [1] "Number of synthetic imputations to be generated set to 10 as in mids object passed to gFormulaImpute."
#> [1] "Variables imputed using:"
#> l0 a0 l1 a1 l2 a2 y regime
#> "norm" "" "norm" "" "norm" "" "norm" ""
#> [1] "Predictor matrix is set to:"
#> l0 a0 l1 a1 l2 a2 y regime
#> l0 0 0 0 0 0 0 0 0
#> a0 1 0 0 0 0 0 0 0
#> l1 1 1 0 0 0 0 0 0
#> a1 1 1 1 0 0 0 0 0
#> l2 1 1 1 1 0 0 0 0
#> a2 1 1 1 1 1 0 0 0
#> y 1 1 1 1 1 1 0 0
#> regime 1 1 1 1 1 1 1 0
The output looks much the same as the first time we called
gFormulaImpute
, apart from the first few lines of output.
gFormulaImpute
can see that there are only 10 imputations
of the original dataset, and it refuses to generate a different number
of imputations to the number in the mids
multiple
imputation dataset object we have passed to it. The imps2
object thus contains 10 imputations, which can be analysed using
syntheticPool
in the same way as before.
References
Bartlett JW, Olarte Parra C, Daniel RM. G-formula for causal inference via multiple imputation. 2023 arXiv 2301.12026
Raghunathan TE, Reiter JP, Rubin DB. 2003. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), p.1-16.