G-formula multiple imputation — gFormulaImpute • gFormulaMI

gFormulaImpute creates multiple imputed synthetic datasets of longitudinal histories under specified treatment regimes of interest, based on the G-formula.

Usage

gFormulaImpute(
  data,
  M = 50,
  trtVars,
  trtRegimes,
  nSim = NULL,
  micePrintFlag = FALSE,
  silent = FALSE,
  method = NULL,
  predictorMatrix = NULL,
  missingDataCheck = TRUE
)

Arguments

data: The observed data frame
M: The number of imputed datasets to generate
trtVars: A vector of variable names indicating the time-varying treatment variables
trtRegimes: A vector specifying the treatment regime of interest, or a list of vectors specifying the treatment regimes of interest
nSim: The number of individuals to simulate in each imputed dataset. Defaults to number of individuals in observed data
micePrintFlag: TRUE/FALSE specifying whether the output from the call(s) to mice should be printed
silent: TRUE/FALSE indicating whether to print output to console (FALSE) or not (TRUE)
method: An optional method argument to pass to mice. If specified, this should be a named vector of methods, indicating which imputation method to use for each time-varying confounder variable and final outcome. If not specified, the default is to impute continuous variables using normal linear regression (norm), binary variables using logistic regression (logreg), polytomous regression for unordered factors and proportional odds model for ordered factors
predictorMatrix: An optional predictor matrix to specify which variables to use as predictors in the imputation models. The default is to impute sequentially, i.e. impute using all variables to the left of the variable being imputed as covariates
missingDataCheck: TRUE/FALSE indicating whether gFormulaMI checks, when passed a regular data frame, whether there any missing values.

Value

an S3 object of class mids (multiply imputed dataset)

Details

gFormulaImpute creates multiple imputed synthetic datasets of longitudinal histories under specified treatment regimes of interest, based on the G-formula, as described by Bartlett et al (2025) doi:10.1177/09622802251316971 . Specifically, to the observed data frame, an additional nSim rows are added in which all variables are set to missing, except the time-varying treatment variables. The latter are set to the values as specified in the trtRegimes argument. If multiple treatment regimes are specified, nSim rows are added for each of the specified treatment regimes.

gFormulaImpute uses the mice package to impute the potential outcome values of the time-varying confounders and outcome in the synthetic datasets. Imputation is performed sequentially from left to right in the data frame. As such, the variables must be ordered in time in the input data frame, with the time-varying confounders at each time followed by the corresponding treatment variable at that time.

For the data argument, gFormulaImpute expects either a fully observed (complete) data frame, or else a set of multiple imputation stored in an object of class mids (from the mice package).

Unlike with Rubin's regular multiple imputation pooling rules, it is possible for the pooling rules developed by Raghunathan et al (2003) to give negative variance estimates. The probability of this occurring is reduced by increasing M and/or nSim.

gFormulaImpute returns an object of class mids. This can be analysed using the same methods that imputed datasets from mice can be analysed with (see examples). However, Rubin's standard pooling rules are not valid for analysis of the synthetic datasets. Instead, the synthetic variance estimator of Raghunathan et al (2003) must be used, as implemented in the syntheticPool function.

The development of the gFormulaMI package was supported by a grant from the UK Medical Research Council (MR/T023953/1).

References

Bartlett JW, Olarte Parra C, Granger E, Keogh RH., van Zwet EW and Daniel RM, 2025. G-formula with multiple imputation for causal inference with incomplete data. Statistical Methods in Medical Research.

Raghunathan TE, Reiter JP, Rubin DB. 2003. Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), p.1-16.

Author

Jonathan Bartlett jonathan.bartlett1@lshtm.ac.uk

Examples

set.seed(7626)
#impute synthetic datasets under two regimes of interest
imps <- gFormulaImpute(data=simDataFullyObs,M=10,
                        trtVars=c("a0","a1","a2"),
                        trtRegimes=list(c(0,0,0),c(1,1,1)))
#> [1] "Input data is a regular data frame."
#> [1] "Variables imputed using:"
#>     l0     a0     l1     a1     l2     a2      y regime 
#> "norm"     "" "norm"     "" "norm"     "" "norm"     "" 
#> [1] "Predictor matrix is set to:"
#>        l0 a0 l1 a1 l2 a2 y regime
#> l0      0  0  0  0  0  0 0      0
#> a0      1  0  0  0  0  0 0      0
#> l1      1  1  0  0  0  0 0      0
#> a1      1  1  1  0  0  0 0      0
#> l2      1  1  1  1  0  0 0      0
#> a2      1  1  1  1  1  0 0      0
#> y       1  1  1  1  1  1 0      0
#> regime  1  1  1  1  1  1 1      0
#fit linear model to final outcome with regime as covariate
fits <- with(imps, lm(y~factor(regime)))
#pool results using Raghunathan et al 2003 rules
syntheticPool(fits)
#>                    Estimate       Within     Between       Total       df
#> (Intercept)     -0.02071539 0.0008125695 0.001763004 0.001126735 3.038045
#> factor(regime)2  2.96593502 0.0016251389 0.004203106 0.002998278 3.784951
#>                   95% CI L   95% CI U            p
#> (Intercept)     -0.1267874 0.08535658 5.803156e-01
#> factor(regime)2  2.8104386 3.12143146 1.304839e-06