In randomized controlled trials where individuals are randomly assigned to treatment and control conditions, covariates are often included to improve precision by reducing error variance and increasing statistical power. However, when outcomes are binary (e.g., the patient recovers or not), there are several additional concerns that have gone unnoticed by many applied researchers.
Take a simulated example where the true data generating process is (to keep things simple, the intercept and both slopes are set to 1):
\[\log\left(\frac{p}{1 - p}\right) = 1 + 1 \times tr + 1 \times x_2\]
library(summarytools)
library(stargazer)
pp <- function(x) exp(x) / (1 + exp(x)) #convert logit to prob
set.seed(2468)
ns <- 10000
tr <- rbinom(ns, 1, .50) #50% treat / 50% control
x2 <- rnorm(ns, 0 , 2) #uncorrelated x2
ystar <- 1 + 1 * tr + 1 * x2 #all get a unit weight to keep it simple
y <- rbinom(ns, 1, pp(ystar))
ctable(y, tr, prop = 'c')
## Cross-Tabulation, Column Proportions
## y * tr
##
## ------- ---- --------------- --------------- ----------------
## tr 0 1 Total
## y
## 0 1756 ( 35.0%) 1165 ( 23.3%) 2921 ( 29.2%)
## 1 3254 ( 65.0%) 3825 ( 76.7%) 7079 ( 70.8%)
## Total 5010 (100.0%) 4990 (100.0%) 10000 (100.0%)
## ------- ---- --------------- --------------- ----------------
In our example, think of y = 1 as recovered from illness and y = 0 as did not recover. In the treatment group, 77% recovered vs. 65% in the control group. This is a difference of around 12 percentage points, or a higher rate of recovery by a factor of about 1.18 (77/65).
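As a quick check, the same quantities can be computed directly from the simulated vectors:
rates <- tapply(y, tr, mean) #recovery rates for control (0) and treatment (1)
diff(rates) #risk difference: roughly .12
rates[2] / rates[1] #risk ratio: roughly 1.18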
Note that the two predictor variables are not correlated with each other:
df <- data.frame(y, tr, x2)
round(cor(df), 3)
## y tr x2
## y 1.000 0.129 0.569
## tr 0.129 1.000 -0.017
## x2 0.569 -0.017 1.000
When running a logistic regression model, notice how the coefficient for tr changes even though x2 is not correlated with tr.
reduced <- glm(y ~ tr, family = binomial)
full <- update(reduced, . ~ . + x2)
stargazer(reduced, full, star.cutoffs = c(.05, .01, .001),
no.space = T, type = 'text')
##
## ================================================
## Dependent variable:
## ------------------------------
## y
## (1) (2)
## ------------------------------------------------
## tr 0.572*** 1.005***
## (0.045) (0.059)
## x2 1.040***
## (0.023)
## Constant 0.617*** 1.019***
## (0.030) (0.041)
## ------------------------------------------------
## Observations 10,000 10,000
## Log Likelihood -5,956.977 -3,883.476
## Akaike Inf. Crit. 11,917.950 7,772.952
## ================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
The true effect should be around 1.0 (these are in logit units). In this basic example, there are several things to notice:
- The coefficient for tr is underestimated in the first model (remember, I specified that the value should be 1).
- The coefficient for tr (treatment status) changes with the inclusion of x2.
- x2 is not your typical confounder, which would be associated with both y and the treatment status (here it is unrelated to treatment).
- The standard error for tr increases in the second model; it does not decrease as one might expect.
As a result, a few things become problematic, even in the presence of randomization:
- We might conclude that x2 is a suppressor variable, but it is not!
- Comparing results across models, even within the same sample, becomes problematic. Results can change as long as the added variables predict the outcome.
- We will never have all predictors of interest in a model (which is why we do not get an \(R^2\) of 1.00).
Another question that might be raised is whether model 1 is really incorrect. We know that the coefficient is lower than 1, but the proportions it implies actually match the crosstabs shown earlier:
pp(.617) #control:: 65%
## [1] 0.6495359
pp(.617 + .572) #treatment:: 77%
## [1] 0.7665622
If we take the model 2 coefficients (holding x2 at its average of zero), the coefficients are correct (controlling for x2), but the implied proportions no longer match the crosstabs:
pp(1) #control:: 73%
## [1] 0.7310586
pp(1 + 1) #treatment: 88%
## [1] 0.8807971
Why does that happen with LRMs?
In a standard OLS model, we have the intercept, slope, and the residual.
\[y = \beta_0 + \beta_1 x_1 + e\]
The residuals have a mean of 0 and a variance of \(\sigma^2_e\). The residual variance depends on the model, and students of OLS regression know that when a predictor is added to the model, the residual variance decreases (even if it is a useless variable; which is why we at times use the adjusted \(R^2\) instead of \(R^2\)).
However, in a logistic regression model, the variance of the error term is fixed to \(\frac{\pi^2}{3}\) or 3.29, which is the variance of a standard logistic distribution. Adding predictors to the model does not change the residual variance. As a result, the added variability is absorbed by the other parts of the model (at times referred to as an issue of rescaling or unobserved heterogeneity).
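To see where the 3.29 comes from, note that it is simply the variance of a standard logistic distribution; a quick simulation check:
(pi ^ 2) / 3 #the fixed error variance, approximately 3.29
var(rlogis(1e6)) #variance of one million standard logistic draws, also approximately 3.29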
Over three decades ago, Winship and Mare (1984, p. 517) stated:
Adding new independent variables to an equation alters the variance of Y and thus the remaining coefficients in the model, even if the new independent variables are uncorrelated with the original independent variables.
Mood (2010) more recently indicated (based on what Winship and Mare wrote) that what is actually being estimated (\(b_1\)) in the first model is approximately:
\[b_1 \approx \beta_1 \times \frac{\sqrt{3.29}}{\sqrt{3.29 + \beta_2^2 \, var(x_2)}} = 1.0 \times \frac{\sqrt{3.29}}{\sqrt{3.29 + 1^2 (4)}}\] As long as the denominator is greater than \(\sqrt{3.29}\), the coefficient estimated with \(x_2\) omitted will be underestimated. The amount of underestimation depends on the effect and the variability of \(x_2\).
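As a rough check of this formula using the population values from the simulation (\(\beta_2 = 1\), \(var(x_2) = 2^2 = 4\)):
sqrt(3.29) / sqrt(3.29 + 1^2 * 4) #attenuation factor, approximately .67
The approximation suggests an attenuated coefficient of about .67; the actual estimate in model 1 was .572, so the formula captures the direction and rough size of the underestimation rather than the exact value.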
Coefficients may change when variables are added to a model because of:
- real confounding (the control variable is correlated with BOTH the treatment and the outcome) and/or
- rescaling.
In the full model (with both tr and x2), the variance of the latent y variable (which generates y) is \(1^2 \times .25 + 1^2 \times 4 + 3.29 = 7.54\). The .25 is the variance of the treatment variable, which is \(.5 \times (1 - .5)\). In the reduced model (which omits x2), the variance is \(1^2 \times .25 + 3.29 = 3.54\). So even though the same sample is being used, the scale of the latent outcome differs between models. (This is only an approximation; it does not reproduce the variance of the latent y exactly.)
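The same arithmetic in R, together with a model-based analogue based on the predicted logits (which, as noted, will not reproduce 7.54 exactly):
evar <- (pi ^ 2) / 3 #3.29
1^2 * .25 + 1^2 * 4 + evar #latent y variance implied by the full model: about 7.54
1^2 * .25 + evar #latent y variance implied by the reduced model: about 3.54
var(predict(full)) + evar #model-based version for the full model; in the same range, but not exactly 7.54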
How then do we compare results across models?
If we stick with logistic regression, we have a few options.
Y standardization
Y standardization has been around for a while (Winship & Mare, 1984). It involves taking the standard deviation of the latent y and using it as a scaling factor to create standardized coefficients for the predictors. Here is a small function, ystd, that computes the y-standardized coefficients from a fitted model. Basically, it takes the variance of the predicted logits from the estimated model and adds the constant 3.29, which gives the total variance of the latent outcome. The square root of that variance is the standard deviation, and the coefficients are then divided by the SD of y* (y* is the latent y; above some threshold, y = 1, otherwise y = 0).
ystd <- function(x){
evar <- (pi ^ 2) / 3 #3.29 #constant
vr <- var(predict(x)) + evar #variance of predicted logits + evar
sdy <- sqrt(vr) #getting the sd
coef(x) / sdy
}
ystd(reduced)
## (Intercept) tr
## 0.3359345 0.3115061
ystd(full)
## (Intercept) tr x2
## 0.3628906 0.3578878 0.3701445
The resulting coefficients can be interpreted somewhat like standardized betas: a one-unit increase in the predictor is associated with a change of the coefficient, in standard deviation units, of the latent outcome (i.e., .31 and .36 SDs for tr in the two models). Fully standardizing (also scaling the predictors by their SDs) does not make much sense to me for binary predictors (e.g., a one-SD change in gender, race, etc.), since you either have a unit change or nothing.
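The scaling factors themselves show the rescaling at work: the implied standard deviation of y* differs between the two models, which is why the raw logit coefficients are not directly comparable.
sqrt(var(predict(reduced)) + (pi ^ 2) / 3) #SD of y* implied by the reduced model, roughly 1.8
sqrt(var(predict(full)) + (pi ^ 2) / 3) #SD of y* implied by the full model, roughly 2.8
Dividing each model's coefficients by its own SD of y* is exactly what ystd does.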
Karlson, Holm, & Breen (khb) method
Again, this involves comparing the full and reduced models, and there are two ways to think about it. First, we estimate the reduced model but include \(x_2\) without really including it. A way to do that is to regress \(x_2\) on treatment (OLS is fine), take the residuals, and include those residuals as a predictor (which we do not interpret) in the model. The resulting coefficient for tr below (.93) is much closer to 1.0.
res <- resid(lm(x2 ~ tr))
reduced2 <- glm(y ~ tr + res, family = binomial)
stargazer(reduced, reduced2, full,
star.cutoffs = c(.05, .01, .001),
no.space = T, type = 'text')
##
## ==================================================
## Dependent variable:
## --------------------------------
## y
## (1) (2) (3)
## --------------------------------------------------
## tr 0.572*** 0.933*** 1.005***
## (0.045) (0.058) (0.059)
## res 1.040***
## (0.023)
## x2 1.040***
## (0.023)
## Constant 0.617*** 1.034*** 1.019***
## (0.030) (0.041) (0.041)
## --------------------------------------------------
## Observations 10,000 10,000 10,000
## Log Likelihood -5,956.977 -3,883.476 -3,883.476
## Akaike Inf. Crit. 11,917.950 7,772.952 7,772.952
## ==================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Another way to think about it is to take the predicted logits (the estimated y*) from the full model and simply use them as the outcome in a linear regression (the predicted logits are no longer ones and zeroes but continuous). [I am not sure about the standard errors in these models, only the point estimates/coefficients; reduced2 is fine though.]
predy <- predict(full) #gets logits, do not use fitted which returns prob
reduced3 <- glm(predy ~ tr) #gaussian family by default, i.e., a linear model
stargazer(reduced, reduced2, reduced3, full,
star.cutoffs = c(.05, .01, .001),
no.space = T, type = 'text',
omit.stat = c('f', 'll', 'ser', 'rsq'))
##
## ===========================================================
## Dependent variable:
## -----------------------------------------
## y predy y
## logistic normal logistic
## (1) (2) (3) (4)
## -----------------------------------------------------------
## tr 0.572*** 0.933*** 0.933*** 1.005***
## (0.045) (0.058) (0.042) (0.059)
## res 1.040***
## (0.023)
## x2 1.040***
## (0.023)
## Constant 0.617*** 1.034*** 1.034*** 1.019***
## (0.030) (0.041) (0.030) (0.041)
## -----------------------------------------------------------
## Observations 10,000 10,000 10,000 10,000
## Akaike Inf. Crit. 11,917.950 7,772.952 43,161.080 7,772.952
## ===========================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
Use the khb package
Another way is to use the khb package (Studer, 2014). Compare the results with what was computed earlier.
library(khb)
compareModels(reduced, full, method = 'naive') #no adjustments
## Reduced Full perc
## (Intercept) 0.61685 1.0194 -65.253
## tr 0.57199 1.0053 -75.756
## x2 1.0397
##
## Pseudo R2 0.023544 0.49965
## Dev. 11914 7767
## Null 12080 12080
## Chisq 166.47 4313.5
## Sig *** ***
## Dl 1 2
## BIC 11923 7785.4
compareModels(reduced, full, method = 'ystand')
## Reduced Full perc
## (Intercept) 0.33593 0.36289 -8.0242
## tr 0.31151 0.35789 -14.8895
## x2 0.37014
##
## Pseudo R2 0.023544 0.49965
## Dev. 11914 7767
## Null 12080 12080
## Chisq 166.47 4313.5
## Sig *** ***
## Dl 1 2
## BIC 11923 7785.4
compareModels(reduced, full, method = 'khb')
## Reduced Full perc
## (Intercept) 1.03378 1.0194 1.3950
## tr 0.93291 1.0053 -7.7599
## Resid(x2) 1.03973
## x2 1.0397
##
## Pseudo R2 0.49965 0.49965
## Dev. 7767 7767
## Null 12080 12080
## Chisq 4313.5 4313.5
## Sig *** ***
## Dl 2 2
## BIC 7785.4 7785.4
k <- khb(reduced, full) #to compare and get SEs
print(k)
## KHB method
## Model type: glm lm (logit)
## Variables of interest: tr
## Z variables (mediators): x2
##
## Summary of confounding
## Ratio Percentage Rescaling
## tr 0.92799 -7.75989 1.631
## ------------------------------------------------------------------
## tr :
## Estimate Std. Error z value Pr(>|z|)
## Reduced 0.932912 0.058321 15.9963 < 2e-16 ***
## Full 1.005305 0.058673 17.1340 < 2e-16 ***
## Diff -0.072393 0.041898 -1.7278 0.08402 .
The khb results indicate that the difference between the tr coefficients in the full and reduced models is not statistically significant (p = .08; i.e., not different from zero).
Use alternative procedures
NOTE: if we use alternative procedures such as a linear probability model (LPM) or a modified Poisson regression (I'm not adjusting for overdispersion in the example, just focusing on the regression coefficients), these are not subject to the same issues. In particular, the LPM works as we would expect: the coefficient for tr is relatively unchanged and the standard errors decrease when x2 is added. I've also shown this in a prior study (Huang, 2019).
reduced.ols <- glm(y ~ tr, family = gaussian) #normal
full.ols <- update(reduced.ols, . ~ . + x2)
reduced.poisson <- glm(y ~ tr, family = poisson) #Poisson
full.poisson <- update(reduced.poisson, . ~ . + x2)
stargazer(reduced.ols, full.ols,
reduced.poisson, full.poisson,
star.cutoffs = c(.05, .01, .001),
no.space = T, type = 'text')
##
## =============================================================
## Dependent variable:
## -------------------------------------------
## y
## normal Poisson
## (1) (2) (3) (4)
## -------------------------------------------------------------
## tr 0.117*** 0.126*** 0.166*** 0.180***
## (0.009) (0.007) (0.024) (0.024)
## x2 0.129*** 0.182***
## (0.002) (0.006)
## Constant 0.650*** 0.648*** -0.432*** -0.502***
## (0.006) (0.005) (0.018) (0.018)
## -------------------------------------------------------------
## Observations 10,000 10,000 10,000 10,000
## Log Likelihood -6,226.338 -4,209.938 -9,500.248 -9,023.055
## Akaike Inf. Crit. 12,456.670 8,425.875 19,004.500 18,052.110
## =============================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
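As an aside, the "modification" in modified Poisson regression typically refers to pairing the Poisson point estimates with robust (sandwich) standard errors, since the Poisson variance is not correct for a binary outcome. A minimal sketch using the sandwich and lmtest packages (an addition to the example above; the point estimates stay the same, only the standard errors change):
library(sandwich)
library(lmtest)
coeftest(full.poisson, vcov = vcovHC(full.poisson, type = 'HC0')) #Poisson coefficients with robust SEs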
If you compare the results above to the crosstabs shown at the start, we find that the treatment group is around 12 percentage points higher than the control (shown using OLS) and that the rate of recovery in the treatment group is higher by a factor of ~1.2 (exp(.18)).
References
Huang, F. (2019). Alternatives to logistic regression models with binary outcomes. Journal of Experimental Education. doi: 10.1080/00220973.2019.1699769
Karlson, K. B., Holm, A., & Breen, R. (2012). Comparing regression coefficients between same-sample nested models using logit and probit: A new method. Sociological Methodology, 42(1), 286-313.
Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26, 67-82. https://doi.org/10.1093/esr/jcp006
Studer, M. (2014). khb: Comparing nonlinear regression models. R package version 0.1.
Winship, C., & Mare, R. D. (1984). Regression models with ordinal variables. American Sociological Review, 49, 512-525.