13. Standardization and the parametric G-formula: Stata

library(Statamarkdown)
/***************************************************************
Stata code for Causal Inference: What If by Miguel Hernan & Jamie Robins
Date: 10/10/2019
Author: Eleanor Murray 
For errors contact: ejmurray@bu.edu
***************************************************************/

Program 13.1

  • Estimating the mean outcome within levels of treatment and confounders: Data from NHEFS
  • Section 13.2
use ./data/nhefs-formatted, clear

/* Estimate the the conditional mean outcome within strata of quitting 
smoking and covariates, among the uncensored */
glm wt82_71 qsmk sex race c.age##c.age ib(last).education ///
  c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ///
  ib(last).exercise ib(last).active c.wt71##c.wt71 ///
  qsmk##c.smokeintensity
predict meanY
summarize meanY

/*Look at the predicted value for subject ID = 24770*/
list meanY if seqn == 24770

/*Observed mean outcome for comparison */
summarize wt82_71
note: 1.qsmk omitted because of collinearity.
note: smokeintensity omitted because of collinearity.

Iteration 0:  Log likelihood = -5328.5765  

Generalized linear models                         Number of obs   =      1,566
Optimization     : ML                             Residual df     =      1,545
                                                  Scale parameter =    53.5683
Deviance         =  82763.02862                   (1/df) Deviance =    53.5683
Pearson          =  82763.02862                   (1/df) Pearson  =    53.5683

Variance function: V(u) = 1                       [Gaussian]
Link function    : g(u) = u                       [Identity]

                                                  AIC             =   6.832154
Log likelihood   = -5328.576456                   BIC             =   71397.58

------------------------------------------------------------------------------------
                   |                 OIM
           wt82_71 | Coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------------+----------------------------------------------------------------
              qsmk |   2.559594   .8091486     3.16   0.002      .973692    4.145496
               sex |  -1.430272   .4689576    -3.05   0.002    -2.349412   -.5111317
              race |   .5601096   .5818888     0.96   0.336    -.5803714    1.700591
               age |   .3596353   .1633188     2.20   0.028     .0395364    .6797342
                   |
       c.age#c.age |   -.006101   .0017261    -3.53   0.000    -.0094841   -.0027178
                   |
         education |
                1  |    .194977   .7413692     0.26   0.793     -1.25808    1.648034
                2  |   .9854211   .7012116     1.41   0.160    -.3889285    2.359771
                3  |   .7512894   .6339153     1.19   0.236    -.4911617    1.993741
                4  |   1.686547   .8716593     1.93   0.053    -.0218744    3.394967
                   |
    smokeintensity |   .0491365   .0517254     0.95   0.342    -.0522435    .1505165
                   |
  c.smokeintensity#|
  c.smokeintensity |  -.0009907    .000938    -1.06   0.291    -.0028292    .0008479
                   |
          smokeyrs |   .1343686   .0917122     1.47   0.143     -.045384    .3141212
                   |
        c.smokeyrs#|
        c.smokeyrs |  -.0018664   .0015437    -1.21   0.227    -.0048921    .0011592
                   |
          exercise |
                0  |  -.3539128   .5588587    -0.63   0.527    -1.449256    .7414301
                1  |  -.0579374   .4316468    -0.13   0.893    -.9039497    .7880749
                   |
            active |
                0  |   .2613779   .6845577     0.38   0.703     -1.08033    1.603086
                1  |  -.6861916   .6739131    -1.02   0.309    -2.007037    .6346539
                   |
              wt71 |   .0455018   .0833709     0.55   0.585    -.1179022    .2089058
                   |
     c.wt71#c.wt71 |  -.0009653   .0005247    -1.84   0.066    -.0019937    .0000631
                   |
              qsmk |
Smoking cessation  |          0  (omitted)
    smokeintensity |          0  (omitted)
                   |
              qsmk#|
  c.smokeintensity |
Smoking cessation  |   .0466628   .0351448     1.33   0.184    -.0222197    .1155453
                   |
             _cons |  -1.690608   4.388883    -0.39   0.700    -10.29266    6.911444
------------------------------------------------------------------------------------

(option mu assumed; predicted mean wt82_71)

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       meanY |      1,566      2.6383    3.034683  -10.87582   9.876489

      +----------+
      |    meanY |
      |----------|
 960. | .3421569 |
      +----------+

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     wt82_71 |      1,566      2.6383    7.879913  -41.28047   48.53839

Program 13.2

  • Standardizing the mean outcome to the baseline confounders
  • Data from Table 2.2
  • Section 13.3
clear
input str10 ID L A Y
"Rheia"     0 0 0 
"Kronos"    0 0 1 
"Demeter"   0 0 0 
"Hades"     0 0 0 
"Hestia"    0 1 0 
"Poseidon"  0 1 0 
"Hera"      0 1 0 
"Zeus"      0 1 1 
"Artemis"   1 0 1
"Apollo"    1 0 1
"Leto"      1 0 0
"Ares"      1 1 1
"Athena"    1 1 1
"Hephaestus" 1 1 1
"Aphrodite" 1 1 1
"Cyclope"   1 1 1
"Persephone" 1 1 1
"Hermes"    1 1 0
"Hebe"      1 1 0
"Dionysus"  1 1 0 
end

/* i. Data set up for standardization: 
 - create 3 copies of each subject first, 
 - duplicate the dataset and create a variable `interv` which indicates
which copy is the duplicate (interv =1) */
expand 2, generate(interv)

/* Next, duplicate the original copy (interv = 0) again, and create
another variable 'interv2' to indicate the copy */
expand 2 if interv == 0, generate(interv2)

/* Now, change the value of 'interv' to -1 in one of the copies so that
there are unique values of interv for each copy */
replace interv = -1  if interv2 ==1
drop interv2

/* Check that the data has the structure you want: 
 - there should be 1566 people in each of the 3 levels of interv*/
tab interv

/* Two of the copies will be for computing the standardized result
for these two copies (interv = 0 and interv = 1), set the outcome to
missing and force qsmk to either 0 or 1, respectively.
You may need to edit this part of the code for your outcome and exposure variables */
replace Y = . if interv != -1
replace A = 0 if interv == 0
replace A = 1 if interv == 1

/* Check that the data has the structure you want: 
for interv = -1, some people quit and some do not; 
for interv = 0 or 1, noone quits or everyone quits, respectively */
by interv, sort: summarize A

*ii.Estimation in original sample*
*Now, we do a parametric regression with the covariates we want to adjust for*
*You may need to edit this part of the code for the variables you want.*
*Because the copies have missing Y, this will only run the regression in the
*original copy.*
*The double hash between A & L creates a regression model with A and L and a 
* product term between A and L*
regress Y A##L

*Ask Stata for expected values - Stata will give you expected values for all 
* copies, not just the original ones*
predict predY, xb

*Now ask for a summary of these values by intervention*
*These are the standardized outcome estimates: you can subtract them to get the
* standardized difference*
by interv, sort: summarize predY

*iii.OPTIONAL: Output standardized point estimates and difference*
*The summary from the last command gives you the standardized estimates*
*We can stop there, or we can ask Stata to calculate the standardized difference
* and display all the results in a simple table*
*The code below can be used as-is without changing any variable names*
*The option "quietly" asks Stata not to display the output of some intermediate
* calculations*
*You can delete this option if you want to see what is happening step-by-step*
quietly summarize predY if(interv == -1)
matrix input observe = (-1,`r(mean)')
quietly summarize predY if(interv == 0)
matrix observe = (observe \0,`r(mean)')
quietly summarize predY if(interv == 1)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \., observe[3,2]-observe[2,2]) 

*Add some row/column descriptions and print results to screen*
matrix rownames observe = observed E(Y(a=0)) E(Y(a=1)) difference
matrix colnames observe = interv value
matrix list observe 

*to interpret these results:*
*row 1, column 2, is the observed mean outcome value in our original sample*
*row 2, column 2, is the mean outcome value if everyone had not quit smoking*
*row 3, column 2, is the mean outcome value if everyone had quit smoking*
*row 4, column 2, is the mean difference outcome value if everyone had quit 
* smoking compared to if everyone had not quit smoking*
             ID          L          A          Y
  1. "Rheia"         0 0 0 
  2. "Kronos"        0 0 1 
  3. "Demeter"       0 0 0 
  4. "Hades"         0 0 0 
  5. "Hestia"        0 1 0 
  6. "Poseidon"      0 1 0 
  7. "Hera"          0 1 0 
  8. "Zeus"          0 1 1 
  9. "Artemis"       1 0 1
 10. "Apollo"        1 0 1
 11. "Leto"          1 0 0
 12. "Ares"          1 1 1
 13. "Athena"        1 1 1
 14. "Hephaestus" 1 1 1
 15. "Aphrodite" 1 1 1
 16. "Cyclope"       1 1 1
 17. "Persephone" 1 1 1
 18. "Hermes"        1 1 0
 19. "Hebe"          1 1 0
 20. "Dionysus"      1 1     0 
 21. end

(20 observations created)

(20 observations created)

(20 real changes made)



  Expanded observation |
                  type |      Freq.     Percent        Cum.
-----------------------+-----------------------------------
                    -1 |         20       33.33       33.33
  Original observation |         20       33.33       66.67
Duplicated observation |         20       33.33      100.00
-----------------------+-----------------------------------
                 Total |         60      100.00

(40 real changes made, 40 to missing)

(13 real changes made)

(7 real changes made)


--------------------------------------------------------------------------------------
-> interv = -1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           A |         20         .65    .4893605          0          1

--------------------------------------------------------------------------------------
-> interv = Original

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           A |         20           0           0          0          0

--------------------------------------------------------------------------------------
-> interv = Duplicat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
           A |         20           1           0          1          1

      Source |       SS           df       MS      Number of obs   =        20
-------------+----------------------------------   F(3, 16)        =      1.07
       Model |  .833333333         3  .277777778   Prob > F        =    0.3909
    Residual |  4.16666667        16  .260416667   R-squared       =    0.1667
-------------+----------------------------------   Adj R-squared   =    0.0104
       Total |           5        19  .263157895   Root MSE        =    .51031

------------------------------------------------------------------------------
           Y | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         1.A |   1.05e-16   .3608439     0.00   1.000    -.7649549    .7649549
         1.L |   .4166667    .389756     1.07   0.301    -.4095791    1.242912
             |
         A#L |
        1 1  |  -5.83e-17   .4959325    -0.00   1.000     -1.05133     1.05133
             |
       _cons |        .25   .2551552     0.98   0.342    -.2909048    .7909048
------------------------------------------------------------------------------



--------------------------------------------------------------------------------------
-> interv = -1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |         20          .5     .209427        .25   .6666667

--------------------------------------------------------------------------------------
-> interv = Original

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |         20          .5     .209427        .25   .6666667

--------------------------------------------------------------------------------------
-> interv = Duplicat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |         20          .5     .209427        .25   .6666667












observe[4,2]
               interv      value
  observed         -1  .50000001
 E(Y(a=0))          0  .50000001
 E(Y(a=1))          1  .50000001
difference          .          0

Program 13.3

  • Standardizing the mean outcome to the baseline confounders:
  • Data from NHEFS
  • Section 13.3
use ./data/nhefs-formatted, clear

*i.Data set up for standardization: create 3 copies of each subject*
*first, duplicate the dataset and create a variable 'interv' which indicates
* which copy is the duplicate (interv =1)
expand 2, generate(interv)

*next, duplicate the original copy (interv = 0) again, and create another
* variable 'interv2' to indicate the copy
expand 2 if interv == 0, generate(interv2)

*now, change the value of 'interv' to -1 in one of the copies so that there are
* unique values of interv for each copy*
replace interv = -1  if interv2 ==1
drop interv2 

*check that the data has the structure you want: there should be 1566 people in
* each of the 3 levels of interv*
tab interv

*two of the copies will be for computing the standardized result*
*for these two copies (interv = 0 and interv = 1), set the outcome to missing
* and force qsmk to either 0 or 1, respectively*
*you may need to edit this part of the code for your outcome and exposure variables*
replace wt82_71 = . if interv != -1
replace qsmk = 0 if interv == 0
replace qsmk = 1 if interv == 1

*check that the data has the structure you want: for interv = -1, some people
* quit and some do not; for interv = 0 or 1, noone quits or everyone quits, respectively*
by interv, sort: summarize qsmk

*ii.Estimation in original sample*
*Now, we do a parametric regression with the covariates we want to adjust for*
*You may need to edit this part of the code for the variables you want.*
*Because the copies have missing wt82_71, this will only run the regression in 
* the original copy*
regress wt82_71 qsmk sex race c.age##c.age ib(last).education ///
c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ///
ib(last).exercise ib(last).active c.wt71##c.wt71 qsmk#c.smokeintensity

*Ask Stata for expected values - Stata will give you expected values for all 
* copies, not just the original ones*
predict predY, xb

*Now ask for a summary of these values by intervention*
*These are the standardized outcome estimates: you can subtract them to get the
* standardized difference*
by interv, sort: summarize predY

/* iii.OPTIONAL: Output standardized point estimates and difference
- The summary from the last command gives you the 
standardized estimates
- We can stop there, or we can ask Stata to calculate the 
standardized difference and display all the results 
in a simple table
- The code below can be used as-is without changing any
variable names
- The option `quietly` asks Stata not to display the output of 
some intermediate calculations
- You can delete this option if you want to see what is 
happening step-by-step */
quietly summarize predY if(interv == -1)
matrix input observe = (-1,`r(mean)')
quietly summarize predY if(interv == 0)
matrix observe = (observe \0,`r(mean)')
quietly summarize predY if(interv == 1)
matrix observe = (observe \1,`r(mean)')
matrix observe = (observe \., observe[3,2]-observe[2,2]) 

* Add some row/column descriptions and print results to screen
matrix rownames observe = observed E(Y(a=0)) E(Y(a=1)) difference
matrix colnames observe = interv value
matrix list observe 

/* To interpret these results:
- row 1, column 2, is the observed mean outcome value 
in our original sample
- row 2, column 2, is the mean outcome value 
if everyone had not quit smoking
- row 3, column 2, is the mean outcome value 
if everyone had quit smoking
- row 4, column 2, is the mean difference outcome value 
if everyone had quit smoking compared to if everyone 
had not quit smoking */

/* Addition due to way Statamarkdown works 
i.e. each code chunk is a separate Stata session */
mata observe = st_matrix("observe")
mata mata matsave ./data/observe observe, replace

*drop the copies*
drop if interv != -1
gen meanY_b =.
qui save ./data/nhefs_std, replace
(1,566 observations created)

(1,566 observations created)

(1,566 real changes made)



  Expanded observation |
                  type |      Freq.     Percent        Cum.
-----------------------+-----------------------------------
                    -1 |      1,566       33.33       33.33
  Original observation |      1,566       33.33       66.67
Duplicated observation |      1,566       33.33      100.00
-----------------------+-----------------------------------
                 Total |      4,698      100.00

(3,132 real changes made, 3,132 to missing)

(403 real changes made)

(1,163 real changes made)


--------------------------------------------------------------------------------------
-> interv = -1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        qsmk |      1,566    .2573436    .4373099          0          1

--------------------------------------------------------------------------------------
-> interv = Original

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        qsmk |      1,566           0           0          0          0

--------------------------------------------------------------------------------------
-> interv = Duplicat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
        qsmk |      1,566           1           0          1          1

      Source |       SS           df       MS      Number of obs   =     1,566
-------------+----------------------------------   F(20, 1545)     =     13.45
       Model |   14412.558        20    720.6279   Prob > F        =    0.0000
    Residual |  82763.0286     1,545  53.5683033   R-squared       =    0.1483
-------------+----------------------------------   Adj R-squared   =    0.1373
       Total |  97175.5866     1,565  62.0930266   Root MSE        =     7.319

------------------------------------------------------------------------------------
           wt82_71 | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------------+----------------------------------------------------------------
              qsmk |   2.559594   .8091486     3.16   0.002     .9724486     4.14674
               sex |  -1.430272   .4689576    -3.05   0.002    -2.350132   -.5104111
              race |   .5601096   .5818888     0.96   0.336    -.5812656    1.701485
               age |   .3596353   .1633188     2.20   0.028     .0392854    .6799851
                   |
       c.age#c.age |   -.006101   .0017261    -3.53   0.000    -.0094868   -.0027151
                   |
         education |
                1  |    .194977   .7413692     0.26   0.793    -1.259219    1.649173
                2  |   .9854211   .7012116     1.41   0.160     -.390006    2.360848
                3  |   .7512894   .6339153     1.19   0.236    -.4921358    1.994715
                4  |   1.686547   .8716593     1.93   0.053    -.0232138    3.396307
                   |
    smokeintensity |   .0491365   .0517254     0.95   0.342     -.052323    .1505959
                   |
  c.smokeintensity#|
  c.smokeintensity |  -.0009907    .000938    -1.06   0.291    -.0028306    .0008493
                   |
          smokeyrs |   .1343686   .0917122     1.47   0.143     -.045525    .3142621
                   |
        c.smokeyrs#|
        c.smokeyrs |  -.0018664   .0015437    -1.21   0.227    -.0048944    .0011616
                   |
          exercise |
                0  |  -.3539128   .5588587    -0.63   0.527    -1.450114    .7422889
                1  |  -.0579374   .4316468    -0.13   0.893     -.904613    .7887381
                   |
            active |
                0  |   .2613779   .6845577     0.38   0.703    -1.081382    1.604138
                1  |  -.6861916   .6739131    -1.02   0.309    -2.008073    .6356894
                   |
              wt71 |   .0455018   .0833709     0.55   0.585    -.1180303    .2090339
                   |
     c.wt71#c.wt71 |  -.0009653   .0005247    -1.84   0.066    -.0019945    .0000639
                   |
              qsmk#|
  c.smokeintensity |
Smoking cessation  |   .0466628   .0351448     1.33   0.184    -.0222737    .1155993
                   |
             _cons |  -1.690608   4.388883    -0.39   0.700     -10.2994    6.918188
------------------------------------------------------------------------------------



--------------------------------------------------------------------------------------
-> interv = -1

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |      1,566      2.6383    3.034683  -10.87582   9.876489

--------------------------------------------------------------------------------------
-> interv = Original

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |      1,566    1.756213    2.826271  -11.83737   6.733498

--------------------------------------------------------------------------------------
-> interv = Duplicat

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
       predY |      1,566    5.273587    2.920532  -9.091126    11.0506












observe[4,2]
               interv      value
  observed         -1  2.6382998
 E(Y(a=0))          0  1.7562131
 E(Y(a=1))          1  5.2735873
difference          .  3.5173742


(saving observe[4,2])
file ./data/observe.mmat saved

(3,132 observations deleted)

(1,566 missing values generated)

Program 13.4

  • Computing the 95% confidence interval of the standardized means and their difference: Data from NHEFS
  • Section 13.3
*Run program 13.3 to obtain point estimates, and then the code below*

capture program drop bootstdz

program define bootstdz, rclass
use ./data/nhefs_std, clear

preserve

* Draw bootstrap sample from original observations
bsample 
        
/* Create copies with each value of qsmk in bootstrap sample.
First, duplicate the dataset and create a variable `interv` which
indicates which copy is the duplicate (interv =1)*/
expand 2, generate(interv_b)

/* Next, duplicate the original copy (interv = 0) again, and create
another variable `interv2` to indicate the copy*/
expand 2 if interv_b == 0, generate(interv2_b)

/* Now, change the value of interv to -1 in one of the copies so that
there are unique values of interv for each copy*/
replace interv_b = -1  if interv2_b ==1
drop interv2_b

/* Two of the copies will be for computing the standardized result.
For these two copies (interv = 0 and interv = 1), set the outcome to
missing and force qsmk to either 0 or 1, respectively*/
replace wt82_71 = . if interv_b != -1
replace qsmk = 0 if interv_b == 0
replace qsmk = 1 if interv_b == 1

* Run regression
regress wt82_71 qsmk sex race c.age##c.age ib(last).education ///
  c.smokeintensity##c.smokeintensity c.smokeyrs##c.smokeyrs ///
  ib(last).exercise ib(last).active c.wt71##c.wt71 ///
  qsmk#c.smokeintensity

/* Ask Stata for expected values.
Stata will give you expected values for all copies, not just the
original ones*/
predict predY_b, xb
summarize predY_b if interv_b == 0
return scalar boot_0 = r(mean)
summarize predY_b if interv_b == 1
return scalar boot_1 = r(mean)
return scalar boot_diff = return(boot_1) - return(boot_0)
drop meanY_b

restore

end

/* Then we use the `simulate` command to run the bootstraps as many
times as we want.
Start with reps(10) to make sure your code runs, and then change to
reps(1000) to generate your final CIs.*/
simulate EY_a0=r(boot_0) EY_a1 = r(boot_1) ///
  difference = r(boot_diff), reps(10) seed(1): bootstdz

/* Next, format the point estimate to allow Stata to calculate our
standard errors and confidence intervals*/
  
* Addition: read back in the observe matrix  
mata mata matuse ./data/observe, replace
mata st_matrix("observe", observe)

matrix pe = observe[2..4, 2]'
matrix list pe

/* Finally, the bstat command generates valid 95% confidence intervals
under the normal approximation using our bootstrap results.
The default results use a normal approximation to calcutlate the
confidence intervals.
Note, n contains the original sample size of your data before censoring*/
bstat, stat(pe) n(1629) 
 12. 

      Command: bootstdz
        EY_a0: r(boot_0)
        EY_a1: r(boot_1)
   difference: r(boot_diff)

Simulations (10): .........10 done

(loading observe[4,2])




pe[1,3]
           r2         r3         r4
c2  1.7562131  5.2735873  3.5173742


Bootstrap results                                        Number of obs = 1,629
                                                         Replications  =    10

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             | coefficient  std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
       EY_a0 |   1.756213   .2157234     8.14   0.000     1.333403    2.179023
       EY_a1 |   5.273587   .4999001    10.55   0.000     4.293801    6.253374
  difference |   3.517374    .538932     6.53   0.000     2.461087    4.573662
------------------------------------------------------------------------------