Logistic regression is a modeling technique that has attracted a lot of attention, especially from folks interested in classification, machine learning, and prediction using binary outcomes. One of the neat things about using R is that users can revisit commonly used procedures and figure out how they work.
What follows are some notes on logistic regression (these notes are not about interpreting results). Even though I’ve written about how other alternatives might be simpler than logistic regression and about the challenges of comparing coefficients across models, it is still interesting to see how the procedure works.
I show a few things:
- How to use some matrices to get logistic regression results (point estimates and standard errors);
- How to compute cluster robust standard errors;
- How to manually run iteratively reweighted least squares (IRLS) from scratch to get the same results.
- I also added a later section that does this using Fisher scoring.
First, create a (clustered) dataset
set.seed(1234) #clustered example:: just for sim
pp <- function(x) exp(x) / (1 + exp(x))
NG <- 20 #number of groups
GS <- 15 #group size
total <- NG * GS
tr <- rep(sample(c(0,1), NG, 1), each = GS)
w1 <- rep(rnorm(NG), each = GS)
e2 <- rep(rnorm(NG, 0, .5), each = GS)
school <- rep(1:NG, each = GS)
x1 <- rnorm(total)
x2 <- rbinom(total, 1, .5)
ystar <- 1 + tr * .5 + w1 * .3 + x1 * .6 + x2 * -.4 + e2
y <- rbinom(total, 1, pp(ystar))
dat <- data.frame(y, school, tr, w1, x1, x2)
sel <- sample(total, 100) #randomly select data to remove
dat <- dat[-sel, ] #remove to create unbalanced data
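As a quick sanity check (my addition, not in the original code), the hand-rolled pp() function is just the inverse logit, so it should match base R's plogis():
all.equal(pp(ystar), plogis(ystar)) #expect TRUE; plogis() is the built-in inverse logit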
Run a logistic regression model predicting y (tr and w1 are actually level-2 variables but we'll ignore the clustering at first).
summary(dat)
## y school tr w1
## Min. :0.00 Min. : 1.00 Min. :0.000 Min. :-1.44820
## 1st Qu.:0.00 1st Qu.: 6.00 1st Qu.:0.000 1st Qu.:-0.91120
## Median :1.00 Median :11.00 Median :1.000 Median :-0.49069
## Mean :0.65 Mean :10.71 Mean :0.725 Mean :-0.26632
## 3rd Qu.:1.00 3rd Qu.:16.00 3rd Qu.:1.000 3rd Qu.: 0.08187
## Max. :1.00 Max. :20.00 Max. :1.000 Max. : 2.41584
## x1 x2
## Min. :-2.85576 Min. :0.0
## 1st Qu.:-0.63885 1st Qu.:0.0
## Median :-0.09820 Median :0.5
## Mean :-0.02651 Mean :0.5
## 3rd Qu.: 0.51302 3rd Qu.:1.0
## Max. : 2.91914 Max. :1.0
nrow(dat)
## [1] 200
m1 <- glm(y ~ tr + w1 + x1 + x2,
data = dat,
family = binomial)
summary(m1)$coef #results are in logits
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.23069482 0.3482386 0.66246188 0.507675259
## tr 0.93184838 0.3465455 2.68896434 0.007167408
## w1 0.60952221 0.2296481 2.65415706 0.007950681
## x1 0.57411101 0.1852548 3.09903484 0.001941522
## x2 -0.01320503 0.3218338 -0.04103058 0.967271514
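Since the estimates are on the logit scale, they can be exponentiated into odds ratios if needed (just an aside; interpretation isn't the focus here):
exp(coef(m1)) #odds ratios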
Extract matrices
The standard formula for the coefficients in linear regression is \((X'X)^{-1}(X'y)\). In a generalized linear model, the response \(y\) is replaced by the working response \(z\) (see the computation later on) and weights are estimated as well, so the GzLM formula for the betas becomes \((X'WX)^{-1}(X'Wz)\).
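As a quick aside (my own illustration, not part of the original post), the linear regression version of the formula can be verified with lm() using any continuous variable in the simulated data:
Xl <- cbind(1, dat$x1) #design matrix with an intercept
solve(t(Xl) %*% Xl) %*% t(Xl) %*% dat$w1 #should match the lm() coefficients
coef(lm(w1 ~ x1, data = dat))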
What goes into these matrices is explained in the latter part of this post (where the computation is done manually). We'll use the m1 object just created and extract elements from it that can be used to cobble together the same results.
X <- model.matrix(m1) #the X matrix
y <- m1$y #the outcome
eta <- X %*% coef(m1) #predicted values
mu <- as.vector(1 / (1 + exp(-eta))) #transformed predicted values
z <- eta + (y - mu) * (1 / (mu * (1 - mu))) #working response
Ws <- diag(weights(m1, 'working')) #diagonal matrix of working weights
After the matrices are created, we can use the formulas to get the same results (compare below):
### point estimates (similar)
solve(t(X) %*% Ws %*% X) %*%
t(X) %*% Ws %*% z
## [,1]
## (Intercept) 0.23069529
## tr 0.93184794
## w1 0.60952156
## x1 0.57411052
## x2 -0.01320591
coef(m1)
## (Intercept) tr w1 x1 x2
## 0.23069482 0.93184838 0.60952221 0.57411101 -0.01320503
For the standard errors, take the square roots of the diagonal of \((X'WX)^{-1}\).
sqrt(diag(solve(t(X) %*% Ws %*% X)))
## (Intercept) tr w1 x1 x2
## 0.3482386 0.3465455 0.2296481 0.1852548 0.3218338
sqrt(diag(vcov(m1))) #same
## (Intercept) tr w1 x1 x2
## 0.3482386 0.3465455 0.2296481 0.1852548 0.3218338
We can also compute cluster robust standard errors
I’ve written about robust SEs in another post, but in this case the residuals are different, as seen in the syntax. I saw in this post that you can extract the weighted residuals as shown below.
X <- model.matrix(m1) * sqrt(weights(m1, "working"))
u <- residuals(m1, "working") * sqrt(weights(m1, "working")) #residuals
cdata <- data.frame(cluster = dat$school, r = u)
(m <- length(table(cdata$cluster))) #number of clusters
## [1] 20
n <- nobs(m1) #number of observations
k <- m1$rank
dfa <- (m/(m-1)) * ((n-1)/(n-k)) #degrees of freedom adjustment
gs <- names(table(cdata$cluster)) #cluster names
u <- matrix(NA, nrow = m, ncol = k) #clusters x rank:: creating a new u matrix
for(i in 1:m){
u[i,] <- t(cdata$r[cdata$cluster == gs[i]]) %*% X[dat$school == gs[i], 1:k]
}
br <- solve(t(X) %*% X) #the bread
mt <- t(u) %*% u #the meat
(clvc <- (br %*% mt %*% br * dfa)) #cluster robust VCOV manually computed
## (Intercept) tr w1 x1 x2
## (Intercept) 0.161369126 -0.137857215 0.017878858 -0.001318135 -0.054932212
## tr -0.137857215 0.184883609 -0.004615105 0.001738463 -0.004407959
## w1 0.017878858 -0.004615105 0.053945127 0.021829090 -0.012319716
## x1 -0.001318135 0.001738463 0.021829090 0.051276709 0.005750240
## x2 -0.054932212 -0.004407959 -0.012319716 0.005750240 0.129468425
library(sandwich) #to 'automatically' compute
vcovCL(m1, cluster = dat$school, type = 'HC1')
## (Intercept) tr w1 x1 x2
## (Intercept) 0.161369126 -0.137857215 0.017878858 -0.001318135 -0.054932212
## tr -0.137857215 0.184883609 -0.004615105 0.001738463 -0.004407959
## w1 0.017878858 -0.004615105 0.053945127 0.021829090 -0.012319716
## x1 -0.001318135 0.001738463 0.021829090 0.051276709 0.005750240
## x2 -0.054932212 -0.004407959 -0.012319716 0.005750240 0.129468425
all.equal(clvc, vcovCL(m1, cluster = dat$school, type = 'HC1'))
## [1] TRUE
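To actually use the cluster robust variance matrix for inference, one option (not shown above) is coeftest() from the lmtest package, which rebuilds the coefficient table using whatever vcov you supply:
library(lmtest)
coeftest(m1, vcov. = vcovCL(m1, cluster = dat$school, type = 'HC1'))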
Manually performing a logistic regression
A GLM requires several quantities (in this case, the ones shown are for a binomial family with a logit link; these will differ based on both the link and the family)1:
- the link function: \(g(\mu)=log(\frac{\mu}{1 - \mu})\), known as the log odds
- the inverse link: \(g^{-1}(\eta) =\frac{1}{1 + exp(-\eta)}\)
- the variance function: \(V(\mu) = \mu(1 - \mu)\)
- the first derivative of the link: \(g'(\mu) = \frac{1}{\mu(1 - \mu)}\) (a short sketch of these as R functions follows)
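Just to make these pieces concrete, here is a small sketch of them as R functions (the names are mine; the loop below simply inlines these expressions):
linkfun <- function(mu) log(mu / (1 - mu)) #g(mu): the log odds
linkinv <- function(eta) 1 / (1 + exp(-eta)) #g^-1(eta)
varfun <- function(mu) mu * (1 - mu) #V(mu)
dlink <- function(mu) 1 / (mu * (1 - mu)) #g'(mu)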
The generic GLM algorithm uses the following computations (substituting the quantities from above):
- \(diag[\frac{1}{V(\mu)g'(\mu)^2}]\) : the weights \(W\)
- \(z = \eta + (y - \mu)g'(\mu)\) : the working response
- \(\eta = X\beta\) : the linear predictor
- \(\mu = g^{-1}(\eta)\) : the fitted values
X <- model.matrix(~tr + w1 + x1 + x2, data = dat) #just to quickly get this
#this is not the same as the weighted one
tol <- 1e-6 #tolerance: if change is low, then stop
y <- dat$y
mu <- (y + mean(y))/2 #starting values (keeps mu strictly between 0 and 1)
#eta <- log(mu) ## Poisson link
eta <- log(mu / (1 - mu)) #logit
dev <- 0 #deviance
delta.dev <- 1
i <- 1 #iteration number
## run the iterative procedure to get the weights
while(i < 50 & abs(delta.dev) > tol) { #repeat until deviance change is minimal
# or number of iterations is exceeded (e.g., 50 here)
tmp <- mu * (1 - mu) #since I use this over and over again
W <- diag(1 / (tmp * (1/tmp)^2))
z <- eta + (y - mu) * (1 / tmp)
b <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z #these are the coefficients
eta <- X %*% b #update eta
mu <- as.vector( 1 / (1 + exp(-eta)))
dev0 <- dev
#dev <- -2 * sum(y * log(y/mu) + (1 - y) * (1 -log(y/mu)), na.rm = T)
dev <- -2 * sum((y * log(mu)) + ((1 - y) * log(1 - mu))) #deviance
delta.dev <- dev - dev0 #assess change in deviance for logistic reg
cat(i, dev, "::")
i <- i + 1 #increment by 1
}
## 1 234.3943 ::2 233.4394 ::3 233.4389 ::4 233.4389 ::
The solution was reached after four iterations (the change in deviance then fell below the tolerance). Note that the deviance (which is -2 times the log likelihood) is the same using our manual method and the built-in function.
deviance(m1) #using the results from glm
## [1] 233.4389
dev #manually computed
## [1] 233.4389
logLik(m1) * -2 #showing this is -2 x log likelihood
## 'log Lik.' 233.4389 (df=5)
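As a further check (my addition), the same value can be obtained directly from the Bernoulli log likelihood:
-2 * sum(dbinom(y, size = 1, prob = mu, log = TRUE)) #same deviance as above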
To compute the standard errors:
serror <- sqrt(diag(solve(t(X) %*% W %*% X)))
# compare results
data.frame(B.manual = as.numeric(b), se.manual = serror)
## B.manual se.manual
## (Intercept) 0.23069482 0.3482383
## tr 0.93184838 0.3465451
## w1 0.60952221 0.2296474
## x1 0.57411101 0.1852545
## x2 -0.01320503 0.3218335
summary(m1)$coef[,1:2] #using the glm function
## Estimate Std. Error
## (Intercept) 0.23069482 0.3482386
## tr 0.93184838 0.3465455
## w1 0.60952221 0.2296481
## x1 0.57411101 0.1852548
## x2 -0.01320503 0.3218338
Standard errors are the same (to the sixth decimal place) as those estimated using the glm function.
Here’s a final short example
Just using the small, built-in mtcars dataset:
data(mtcars)
table(mtcars$am) #the car transmission; just a toy outcome
##
## 0 1
## 19 13
m2 <- glm(am ~ mpg + wt, family = binomial, data = mtcars)
Computing this ‘manually’ using extracted matrices (the object is now m2 for model 2):
X <- model.matrix(m2)
eta <- X %*% coef(m2)
mu <- as.vector(1 / (1 + exp(-eta)))
zz <- eta + (mtcars$am - mu) * (1 / (mu * (1 - mu)))
Ws <- diag(weights(m2, 'working'))
estimate <- solve(t(X) %*% Ws %*% X) %*% t(X) %*% Ws %*% zz #est
se <- sqrt(diag(solve(t(X) %*% Ws %*% X))) #SE
z <- estimate / se
pv <- 2 * pt(-abs(z), df = Inf) #get the p values
Put it all together and compare results:
data.frame(estimate, se, z, pv)
## estimate se z pv
## (Intercept) 25.8865522 12.193529 2.122975 0.03375598
## mpg -0.3241639 0.239499 -1.353508 0.17589325
## wt -6.4161717 2.546606 -2.519500 0.01175218
summary(m2)$coef #based on GLM function
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 25.8865540 12.193529 2.122975 0.03375597
## mpg -0.3241639 0.239499 -1.353509 0.17589322
## wt -6.4161721 2.546606 -2.519500 0.01175217
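Note that pt() with df = Inf is just the standard normal distribution, so the p values could equivalently be computed with pnorm(); this is the same Wald z test that glm reports:
2 * pnorm(-abs(z)) #identical to the pt(..., df = Inf) version above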
Based on the notes from Dr. Wolfgang Wiedermann’s GLM course↩︎