Random notes. Regression-based techniques often involve finding a maximum (e.g., of the likelihood) or a minimum (e.g., of a least squares or mean squared error criterion). Gradient descent is an iterative optimization algorithm used to find the minimum of a function (gradient ascent finds the maximum).
The update rule for each parameter \(\theta_j\) looks like:
\[\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)\]
where
- \(\alpha\) is the learning rate (a smaller step size takes more iterations), and
- \(J(\theta)\) is the loss or cost function.
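For the simple regression fit below, \(J(\theta)\) is the mean squared error, \(\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2\). Working out the partial derivatives gives the two terms (dint and dslope) used in step 3:
\[\frac{\partial\,\mathrm{MSE}}{\partial b_0} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i), \qquad \frac{\partial\,\mathrm{MSE}}{\partial b_1} = -\frac{2}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)\,x_i\]
where \(\hat{y}_i = b_0 + b_1 x_i\) is the predicted value.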
1. Create a dataset
set.seed(1233)
ns <- 500 #number of observations
x1 <- rnorm(ns) #predictor
err <- rnorm(ns) #noise
y <- x1 * .5 + err #outcome with a true slope of .5 (and a true intercept of 0)
dat <- data.frame(y, x1)
2. Initialize the algorithm
Specify the:
- design matrix (X),
- learning rate (here .05),
- tolerance level (tol), and
- maximum number of iterations (iter).
X <- model.matrix(~x1, data = dat) #design matrix
k <- ncol(X) #number of predictors
lr <- .05 #this is the learning rate
b <- rnorm(k) #random starting values
pred <- X %*% b #predicted values
mse <- sum((y - pred)^2) / ns #initial mean square error
tol <- 1e-16 #tolerance level
iter <- 1000 #number of max iterations
3. Run the algorithm
Iterate until the MSE doesn’t change much (based on the tolerance level).
for (i in 1:iter){
  dint <- (-2 / ns) * sum(y - pred) #partial derivative of the MSE with respect to the intercept
  dslope <- (-2 / ns) * sum((y - pred) * x1) #partial derivative of the MSE with respect to the slope
  b[1] <- b[1] - lr * dint #update the intercept
  b[2] <- b[2] - lr * dslope #update the slope
  pred <- X %*% b
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
}
4. Compare the results
The gradient descent estimates match the closed-form OLS solution and lm().
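For reference, "the usual way" below is the closed-form OLS solution
\[\hat{\beta} = (X^\top X)^{-1} X^\top y,\]
which is what solve(t(X) %*% X) %*% t(X) %*% y computes.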
cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n") #based on the iterations
final iteration: 196 coefs: -0.07235628 0.4475963 convergence: 9.312627e-17
solve(t(X) %*% X) %*% t(X) %*% y #the usual way
[,1]
(Intercept) -0.07235628
x1 0.44759628
lm(y ~ x1) #using built in function
Call:
lm(formula = y ~ x1)
Coefficients:
(Intercept) x1
-0.07236 0.44760
5. Using matrix notation (to make it scalable)
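Instead of updating each coefficient with its own partial derivative, the whole gradient can be written in matrix form,
\[\nabla_b\, \mathrm{MSE} = -\frac{2}{n} X^\top (y - Xb),\]
so the update \(b = b - \alpha\, \nabla_b\, \mathrm{MSE}\) handles any number of predictors at once; this is what the grad line in the loop below computes.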
set.seed(246)
ns <- 1000
x1 <- rnorm(ns)
err <- rnorm(ns)
y <- x1 * .5 + err
dat <- data.frame(y, x1)
X <- model.matrix(~x1, data = dat)
k <- ncol(X)
lr <- .05
b <- rnorm(k)
pred <- X %*% b
mse <- sum((y - pred)^2) / ns #initial mean square error
tol <- 1e-16
iter <- 1000
for (i in 1:iter){
  grad <- (-2 / ns) * (t(X) %*% (y - pred)) #gradient of the MSE with respect to b
  b <- b - lr * grad #update all coefficients at once
  pred <- X %*% b
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
}
cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n")
final iteration: 198 coefs: 0.0324858 0.4794015 convergence: 8.405179e-17
solve(t(X) %*% X) %*% t(X) %*% y
[,1]
(Intercept) 0.0324858
x1 0.4794015
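As a side note, the matrix version is easy to wrap into a small reusable function. The sketch below just packages the steps above; the function name gd_lm and its default arguments are mine, not part of the original notes.
gd_lm <- function(X, y, lr = .05, tol = 1e-16, iter = 1000) {
  ns <- nrow(X)
  b <- rnorm(ncol(X)) #random starting values
  pred <- X %*% b
  mse <- sum((y - pred)^2) / ns
  for (i in 1:iter) {
    grad <- (-2 / ns) * (t(X) %*% (y - pred)) #gradient of the MSE
    b <- b - lr * grad #update all coefficients at once
    pred <- X %*% b
    msenew <- sum((y - pred)^2) / ns
    conv <- abs(mse - msenew)
    mse <- msenew
    if (conv < tol) break
  }
  list(coef = drop(b), iterations = i, convergence = conv)
}
#gd_lm(X, y) #e.g., with the X and y created above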
6. Another test…
Using matrix notation and testing it with a more complex model (e.g., one that includes an interaction). model.matrix() expands ~x1 + x2 + x1 * x2 into the columns (Intercept), x1, x2, and x1:x2, so the same loop works unchanged.
### scaling up
set.seed(12345)
ns <- 1000
x1 <- rnorm(ns)
x2 <- rnorm(ns)
err <- rnorm(ns)
y <- x1 * .5 + .5 * x2 + .5 * x1 * x2 + err
dat <- data.frame(y, x1, x2)
X <- model.matrix(~x1 + x2 + x1 * x2, data = dat)
k <- ncol(X)
lr <- .05
b <- rnorm(k)
pred <- X %*% b
mse <- sum((y - pred)^2) / ns #initial mean square error
tol <- 1e-16
iter <- 1000
for (i in 1:iter){
  grad <- (-2 / ns) * (t(X) %*% (y - pred)) #gradient of the MSE with respect to b
  b <- b - lr * grad #update all coefficients at once
  pred <- X %*% b
  msenew <- sum((y - pred)^2) / ns
  conv <- abs(mse - msenew)
  mse <- msenew
  if (conv < tol) break
}
cat("final iteration:", i, "coefs:", b, "convergence:", conv, "\n")
final iteration: 217 coefs: -0.02600438 0.4901473 0.4951696 0.5210631 convergence: 9.040568e-17
Just putting it all together: compute the standard errors, t statistics, and p values manually and compare them with lm().
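The standard errors below come from the usual OLS variance estimate
\[\widehat{\mathrm{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1}, \qquad \hat{\sigma}^2 = \frac{u^\top u}{n - k},\]
where \(u\) is the vector of residuals and \(k\) is the number of columns of \(X\); the t statistics are \(\hat{\beta}_j / se_j\), with p values from a t distribution with \(n - k\) degrees of freedom.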
u <- pred - y #residuals
num <- t(u) %*% u #sum of squared residuals (numerator)
den <- ns - k #residual degrees of freedom (denominator)
rse <- sqrt(num / den) #residual standard error (manually)
se <- sqrt(diag(as.numeric(rse^2) * solve(t(X) %*% X))) #standard errors of the coefficients
tstat <- b / se #t statistics (tstat avoids masking base::t)
pv <- 2 * pt(-abs(tstat), df = ns - k) #two-sided p values
data.frame(b = b, se = se, t = tstat, pv = pv)
b se t pv
(Intercept) -0.02600438 0.03036152 -0.8564913 3.919321e-01
x1 0.49014733 0.03044746 16.0981346 5.325422e-52
x2 0.49516958 0.03009428 16.4539432 5.398765e-54
x1:x2 0.52106309 0.02960410 17.6010419 1.383041e-60
#solve(t(X) %*% X) %*% t(X) %*% y
summary(lm(y ~ x1 + x2 + x1 * x2))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.02600438 0.03036152 -0.8564913 3.919321e-01
x1 0.49014733 0.03044746 16.0981347 5.325420e-52
x2 0.49516958 0.03009428 16.4539432 5.398765e-54
x1:x2 0.52106309 0.02960410 17.6010419 1.383041e-60
This is of course just an example; we don't really run our regressions this way (e.g., it can be inefficient, and the learning rate hyperparameter has to be chosen).