Logistic regression (from scratch) using matrices

Last updated on Jan 21, 2023 8 min read logistic regression, R, robust standard errors

Logistic regression is a modeling technique that has attracted a lot of attention, especially from folks interested in classification, machine learning, and prediction using binary outcomes. One of the neat things about using R is that users can revisit commonly used procedures and figure out how they work.

What follows are some logistic regression notes (this is not on interpreting results). Even though I’ve written about how other alternatives might be simpler than logistic regression or that there are challenges when comparing coefficients across models, it is interesting to see how the procedure works.

I show a few things:

How to use some matrices for getting logistic regression results (in terms of point estimates and standard errors);
How to compute cluster robust standard errors too;
How to manually run iteratively weighted least squares to get the same results from scratch.
I added a section later on to do this using Fisher Scoring as well.

First, create a (clustered) dataset

set.seed(1234) #clustered example:: just for sim
pp <- function(x) exp(x) / (1 + exp(x))
NG <- 20 #number of groups
GS <- 15 #group size
total <- NG * GS
tr <- rep(sample(c(0,1), NG, 1), each = GS)
w1 <- rep(rnorm(NG), each = GS)
e2 <- rep(rnorm(NG, 0, .5), each = GS)
school <- rep(1:NG, each = GS)
x1 <- rnorm(total)
x2 <- rbinom(total, 1, .5)
ystar <- 1 + tr * .5 + w1 * .3 + x1 * .6 + x2 * -.4 + e2
y <- rbinom(total, 1, pp(ystar))
dat <- data.frame(y, school, tr, w1, x1, x2)
sel <- sample(total, 100) #randomly select data to remove
dat <- dat[-sel, ] #remove to create unbalanced data

Run a logistic regression model predicting y (tr and w1 are actually level-2 variables but we’ll ignore the clustering first).

summary(dat)

##        y            school            tr              w1          
##  Min.   :0.00   Min.   : 1.00   Min.   :0.000   Min.   :-1.44820  
##  1st Qu.:0.00   1st Qu.: 6.00   1st Qu.:0.000   1st Qu.:-0.91120  
##  Median :1.00   Median :11.00   Median :1.000   Median :-0.49069  
##  Mean   :0.65   Mean   :10.71   Mean   :0.725   Mean   :-0.26632  
##  3rd Qu.:1.00   3rd Qu.:16.00   3rd Qu.:1.000   3rd Qu.: 0.08187  
##  Max.   :1.00   Max.   :20.00   Max.   :1.000   Max.   : 2.41584  
##        x1                 x2     
##  Min.   :-2.85576   Min.   :0.0  
##  1st Qu.:-0.63885   1st Qu.:0.0  
##  Median :-0.09820   Median :0.5  
##  Mean   :-0.02651   Mean   :0.5  
##  3rd Qu.: 0.51302   3rd Qu.:1.0  
##  Max.   : 2.91914   Max.   :1.0

nrow(dat)

## [1] 200

m1 <- glm(y ~ tr + w1 + x1 + x2,
          data = dat, 
          family = binomial)
summary(m1)$coef #results are in logits

##                Estimate Std. Error     z value    Pr(>|z|)
## (Intercept)  0.23069482  0.3482386  0.66246188 0.507675259
## tr           0.93184838  0.3465455  2.68896434 0.007167408
## w1           0.60952221  0.2296481  2.65415706 0.007950681
## x1           0.57411101  0.1852548  3.09903484 0.001941522
## x2          -0.01320503  0.3218338 -0.04103058 0.967271514

Extract matrices

The standard formula for the coefficients in linear regression is \((X'X)^{-1}(X'y)\). However, in a generalized linear model, a difference is that \(y\) is now \(z\) (see computation) and weights are estimated as well. The GzLM formula for the betas is now: \((X'WX)^{-1}(X'Wz)\).

The explanation of the matrices is shown at the latter part of this post (when computing this manually) which describes what goes into the computation. We’ll use the m1 object just created and extract elements from it that can be used to cobble together the same results.

X <- model.matrix(m1) #the X matrix
y <- m1$y #the outcome
eta <- X %*% coef(m1) #predicted values
mu <- as.vector(1 / (1 + exp(-eta))) #transformed predicted values
z <- eta + (y - mu) * (1 / (mu * (1 - mu)))
Ws <- diag(weights(m1, 'working'))

After the matrices are created, can use the formulas to get the same results (compare below):

### point estimates (similar)
solve(t(X) %*% Ws %*% X) %*% 
  t(X) %*% Ws %*% z

##                    [,1]
## (Intercept)  0.23069529
## tr           0.93184794
## w1           0.60952156
## x1           0.57411052
## x2          -0.01320591

coef(m1)

## (Intercept)          tr          w1          x1          x2 
##  0.23069482  0.93184838  0.60952221  0.57411101 -0.01320503

For standard errors, this is just the square root of the diagonal of \((X'WX)^{-1}\).

sqrt(diag(solve(t(X) %*% Ws %*% X)))

## (Intercept)          tr          w1          x1          x2 
##   0.3482386   0.3465455   0.2296481   0.1852548   0.3218338

sqrt(diag(vcov(m1))) #same

## (Intercept)          tr          w1          x1          x2 
##   0.3482386   0.3465455   0.2296481   0.1852548   0.3218338

We can also compute cluster robust standard errors

I’ve written about robust SEs in another post but in this case, the residuals are different as seen in the syntax. I saw in this post that you can extract the weighted residuals shown below.

X <- model.matrix(m1) * sqrt(weights(m1, "working"))
u <- residuals(m1, "working")  * sqrt(weights(m1, "working")) #residuals

cdata <- data.frame(cluster = dat$school, r = u)
(m <- length(table(cdata$cluster))) #number of clusters

## [1] 20

n <- nobs(m1) #number of observations
k <- m1$rank
dfa <- (m/(m-1)) * ((n-1)/(n-k)) #degrees of freedom adjustment
gs <- names(table(cdata$cluster)) #cluster names
u <- matrix(NA, nrow = m, ncol = k) #clusters x rank:: creating a new u matrix

for(i in 1:m){
  u[i,] <- t(cdata$r[cdata$cluster == gs[i]]) %*% X[dat$school == gs[i], 1:k]
} 

br <- solve(t(X) %*% X) #the bread
mt <- t(u) %*% u #the meat
(clvc <- (br %*% mt %*% br * dfa)) #cluster robust VCOV manually computed

##              (Intercept)           tr           w1           x1           x2
## (Intercept)  0.161369126 -0.137857215  0.017878858 -0.001318135 -0.054932212
## tr          -0.137857215  0.184883609 -0.004615105  0.001738463 -0.004407959
## w1           0.017878858 -0.004615105  0.053945127  0.021829090 -0.012319716
## x1          -0.001318135  0.001738463  0.021829090  0.051276709  0.005750240
## x2          -0.054932212 -0.004407959 -0.012319716  0.005750240  0.129468425

library(sandwich) #to 'automatically' compute
vcovCL(m1, cluster = dat$school, type = 'HC1')

##              (Intercept)           tr           w1           x1           x2
## (Intercept)  0.161369126 -0.137857215  0.017878858 -0.001318135 -0.054932212
## tr          -0.137857215  0.184883609 -0.004615105  0.001738463 -0.004407959
## w1           0.017878858 -0.004615105  0.053945127  0.021829090 -0.012319716
## x1          -0.001318135  0.001738463  0.021829090  0.051276709  0.005750240
## x2          -0.054932212 -0.004407959 -0.012319716  0.005750240  0.129468425

all.equal(clvc, vcovCL(m1, cluster = dat$school, type = 'HC1'))

## [1] TRUE

Manually performing a logistic regression

A GLM requires several quantities (in this case, shown are the ones for a binomial family using a logit link– this will differ based on both the link and the family)¹:

the link function: \(g(\mu)=log(\frac{\mu}{1 - \mu})\) known as the log odds
The inverse link: \(g^{-1}(\mu) =\frac{1}{1 + exp(-\eta)}\)
the variance function: \(V(\mu) = \mu(1 - \mu)\)
The first derivative: \(g'(\mu) = \frac{1}{\mu(1 - \mu)}\)

The generic GLM algorithm uses the following computations (substitute the values from above):

\(diag[\frac{1}{V(\mu)g'(\mu)^2}]\) : weights
\(z = \eta + (y - \mu)g'(\mu)\) : working response
\(\eta = X\beta\) : linear prediction
\(\mu = g^{-1}(\eta)\) : the fitted values

X <- model.matrix(~tr + w1 + x1 + x2, data = dat) #just to quickly get this
#this is not the same as the weighted one
tol <- 1e-6 #tolerance: if change is low, then stop
y <- dat$y
mu <- (y + mean(y))/2 #for convergence
#eta <- log(mu) ## Poisson link
eta <- log(mu / (1 - mu)) #logit

dev <- 0 #deviance
delta.dev <- 1
i <- 1 #iteration number

## run the iterative procedure to get the weights

while(i < 50 & abs(delta.dev) > tol) { #repeat until deviance change is minimal 
# or number of iterations is exceeded (e.g., 50 here)
  
  tmp <- mu * (1 - mu) #since I use this over and over again
  W <- diag(1 / (tmp * (1/tmp)^2))
  z <- eta + (y - mu) * (1 / tmp) 
  b <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% z #these are the cofficients
  eta <- X %*% b #update eta
  mu <- as.vector( 1 / (1 + exp(-eta)))
  dev0 <- dev
  #dev <- -2 * sum(y * log(y/mu) + (1 - y) * (1 -log(y/mu)), na.rm = T)
  dev <- -2 * sum((y * log(mu)) + ((1 - y) * log(1 - mu))) #deviance
  delta.dev <- dev - dev0 #assess change in deviance for logistic reg
  cat(i, dev, "::")
  i <- i + 1 #increment by 1
}

## 1 234.3943 ::2 233.4394 ::3 233.4389 ::4 233.4389 ::

The solution was reached after five iterations. Note: the deviance (which is -2 log likelihood) is the same using our manual method and the built-in function.

deviance(m1) #using the results from glm

## [1] 233.4389

dev #manually computed

## [1] 233.4389

logLik(m1) * -2 #showing this is -2 x log likelihood

## 'log Lik.' 233.4389 (df=5)

To compute the standard errors:

serror <- sqrt(diag(solve(t(X) %*% W %*% X)))
# compare results
data.frame(B.manual = as.numeric(b), se.manual = serror)

##                B.manual se.manual
## (Intercept)  0.23069482 0.3482383
## tr           0.93184838 0.3465451
## w1           0.60952221 0.2296474
## x1           0.57411101 0.1852545
## x2          -0.01320503 0.3218335

summary(m1)$coef[,1:2] #using the glm function

##                Estimate Std. Error
## (Intercept)  0.23069482  0.3482386
## tr           0.93184838  0.3465455
## w1           0.60952221  0.2296481
## x1           0.57411101  0.1852548
## x2          -0.01320503  0.3218338

Standard errors are the same (to the sixth decimal place) as those estimated using the glm function.

Here’s a final short example

Just using the small, built-in mtcars dataset:

data(mtcars)
table(mtcars$am) #the car transmission; just a toy outcome

## 
##  0  1 
## 19 13

m2 <- glm(am ~ mpg + wt, family = binomial, data = mtcars)

Computing this ‘manually’ using extracted matrices (object is now m2 for model 2):

X <- model.matrix(m2)
eta <- X %*% coef(m2)
mu <- as.vector(1 / (1 + exp(-eta)))
zz <- eta + (mtcars$am - mu) * (1 / (mu * (1 - mu)))
Ws <- diag(weights(m2, 'working'))
estimate <- solve(t(X) %*% Ws %*% X) %*% t(X) %*% Ws %*% zz #est
se <- sqrt(diag(solve(t(X) %*% Ws %*% X))) #SE
z <- estimate / se
pv <- 2 * pt(-abs(z), df = Inf) #get the p values

Put it all together and compare results:

data.frame(estimate, se, z, pv)

##               estimate        se         z         pv
## (Intercept) 25.8865522 12.193529  2.122975 0.03375598
## mpg         -0.3241639  0.239499 -1.353508 0.17589325
## wt          -6.4161717  2.546606 -2.519500 0.01175218

summary(m2)$coef #based on GLM function

##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) 25.8865540  12.193529  2.122975 0.03375597
## mpg         -0.3241639   0.239499 -1.353509 0.17589322
## wt          -6.4161721   2.546606 -2.519500 0.01175217

Based on the notes from Dr. Wolfgang Wiedermann’s GLM course↩︎

logistic regression