Gradient descent example 1

Random notes. Regression-based techniques often involve finding the maximum of a function (e.g., the maximum likelihood) or the minimum (e.g., least squares or mean squared error). Gradient descent is an iterative optimization algorithm for finding the minimum of a function (gradient ascent is used to find the maximum).

The update rule for each parameter \(\theta_j\) looks like:

\[\theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_j}J(\theta)\]

  • \(\alpha\) is the learning rate (a smaller step size requires more iterations)

  • \(J(\theta)\) is the loss or cost function (here, the mean squared error; its partial derivatives are worked out below)
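
Here the cost function is the mean squared error, \(J(b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2\). For the single-predictor model used in steps 1-4, with \(\hat{y}_i = b_0 + b_1 x_i\), the partial derivatives are

\[\frac{\partial J}{\partial b_0} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i), \qquad \frac{\partial J}{\partial b_1} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\,x_i\]

which are exactly the dint and dslope terms computed inside the loop in step 3.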

1. Create a dataset

set.seed(1233)
ns <- 500
x1 <- rnorm(ns)
err <- rnorm(ns)
y <- x1 * .5 + err
dat <- data.frame(y, x1)

2. Initialize the algorithm

Specify the:

  • design matrix (X),
  • learning rate (here .05),
  • tolerance level (tol), and
  • maximum number of iterations (iter).

X <- model.matrix(~x1, data = dat) #design matrix
k <- ncol(X) #number of predictors
lr <- .05 #this is the learning rate
b <- rnorm(k) #random starting values for the coefficients
pred <- X %*% b #predicted values
mse <- sum((y - X %*% b)^2) / ns #initial mean squared error
tol <- 1e-16 #tolerance level
iter <- 1000 #number of max iterations

3. Run the algorithm

Iterate until the MSE doesn’t change much (based on the tolerance level).

for (i in 1:iter){
  
  dint <- (-2 / ns) * sum(y - pred) # partial derivative of the MSE with respect to the intercept
  dslope <- (-2 / ns) * sum((y - pred) * x1) # partial derivative of the MSE with respect to the slope
        
  b[1] <- b[1] - lr * dint #update the intercept
  b[2] <- b[2] - lr * dslope #update the slope
  pred <- X %*% b
  
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
}

4. Compare the results:

The results are the same across all three approaches:

cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n") #based on the iterations
final iteration: 196 coefs: -0.07235628 0.4475963 convergence: 9.312627e-17 
solve(t(X) %*% X) %*% t(X) %*% y #the usual way (normal equations)
                   [,1]
(Intercept) -0.07235628
x1           0.44759628
lm(y ~ x1) #using the built-in function

Call:
lm(formula = y ~ x1)

Coefficients:
(Intercept)           x1  
   -0.07236      0.44760  

5. Using matrix notation (to make it scalable)

set.seed(246)
ns <- 1000
x1 <- rnorm(ns)
err <- rnorm(ns)
y <- x1 * .5 + err
dat <- data.frame(y, x1)
X <- model.matrix(~x1, data = dat)
k <- ncol(X)
lr <- .05
b <- rnorm(k)
pred <- X %*% b
mse <- sum((y - X %*% b)^2) / ns
tol <- 1e-16
iter <- 1000
for (i in 1:iter){
  
  grad <- (-2 / ns) * (t(X) %*% (y - pred))
  b <- b - lr * grad
  pred <- X %*% b
  
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
  
}
cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n")
final iteration: 198 coefs: 0.0324858 0.4794015 convergence: 8.405179e-17 
solve(t(X) %*% X) %*% t(X) %*% y
                 [,1]
(Intercept) 0.0324858
x1          0.4794015
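
To make the matrix version easier to reuse, the loop can be wrapped in a small function. This is just a sketch (the name gd_lm and its defaults are made up here, not part of the original post); the body mirrors the loop above.

gd_lm <- function(X, y, lr = .05, tol = 1e-16, iter = 1000){
  ns <- nrow(X)
  b <- rnorm(ncol(X)) #random starting values
  pred <- X %*% b
  mse <- sum((y - pred)^2) / ns
  for (i in 1:iter){
    grad <- (-2 / ns) * (t(X) %*% (y - pred)) #gradient of the MSE
    b <- b - lr * grad #update all coefficients at once
    pred <- X %*% b
    msenew <- sum((y - pred)^2) / ns
    conv <- abs(mse - msenew)
    mse <- msenew
    if (conv < tol) break
  }
  list(coef = b, iterations = i, convergence = conv)
}
#e.g., gd_lm(X, y)$coef should be (nearly) identical to the normal-equations solution above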

6. Another test…

Using matrix notation and testing it with a more complex formula (e.g., including interactions).

### scaling up

set.seed(12345)
ns <- 1000
x1 <- rnorm(ns)
x2 <- rnorm(ns)
err <- rnorm(ns)
y <- x1 * .5 + .5 * x2 + .5 * x1 * x2 + err
dat <- data.frame(y, x1, x2)
X <- model.matrix(~x1 + x2 + x1 * x2, data = dat)
k <- ncol(X)
lr <- .05
b <- rnorm(k)
pred <- X %*% b
mse <- sum((y - X %*% b)^2) / ns
tol <- 1e-16
iter <- 1000
for (i in 1:iter){
  
  grad <- (-2 / ns) * (t(X) %*% (y - pred))
  b <- b - lr * grad
  pred <- X %*% b
  
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
  
}

cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n")
final iteration: 217 coefs: -0.02600438 0.4901473 0.4951696 0.5210631 convergence: 9.040568e-17 

Just putting it all together: compute the standard errors, t statistics, and p-values by hand and compare them with lm().
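
The standard errors below come from the usual OLS result

\[\widehat{\mathrm{Var}}(\hat{b}) = \hat{\sigma}^2 (X^\top X)^{-1}\]

where \(\hat{\sigma}\) is the residual standard error; the square roots of the diagonal elements give the standard errors of the coefficients.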

u <- pred - y #residual
num <- t(u) %*% u #numerator
den <- ns - k #denominator
rse <- sqrt(num/den) #residual standard error (manually)

se <- sqrt(diag(as.numeric(rse^2) * solve(t(X) %*% X))) #standard errors: square roots of the diagonal of the variance-covariance matrix
tval <- b / se #t statistics (tval avoids masking base t())
pv <- 2 * pt(-abs(tval), df = ns - k) #two-sided p-values
data.frame(b = b, se = se, t = tval, pv = pv)
                      b         se          t           pv
(Intercept) -0.02600438 0.03036152 -0.8564913 3.919321e-01
x1           0.49014733 0.03044746 16.0981346 5.325422e-52
x2           0.49516958 0.03009428 16.4539432 5.398765e-54
x1:x2        0.52106309 0.02960410 17.6010419 1.383041e-60
#solve(t(X) %*% X) %*% t(X) %*% y
summary(lm(y ~ x1 + x2 + x1 * x2))$coef
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -0.02600438 0.03036152 -0.8564913 3.919321e-01
x1           0.49014733 0.03044746 16.0981347 5.325420e-52
x2           0.49516958 0.03009428 16.4539432 5.398765e-54
x1:x2        0.52106309 0.02960410 17.6010419 1.383041e-60

This is of course just an example; we don't really fit regressions this way in practice (e.g., plain gradient descent can be inefficient, and it requires choosing the learning rate hyperparameter).
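
As a quick illustration of the learning rate point (this snippet is a sketch added here; lr_big, b_bad, and pred_bad are new names, and it reuses X, y, ns, and k from step 6): a step size that is too large for this problem makes the updates overshoot, so the loss grows instead of shrinking.

lr_big <- 1.5 #deliberately too large a step size
b_bad <- rnorm(k) #random starting values
pred_bad <- X %*% b_bad
for (i in 1:20){
  grad <- (-2 / ns) * (t(X) %*% (y - pred_bad)) #same gradient as before
  b_bad <- b_bad - lr_big * grad #each update overshoots the minimum
  pred_bad <- X %*% b_bad
}
sum((y - pred_bad)^2) / ns #MSE after 20 iterations; typically far larger than the converged value above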

— END
