# Using maximum likelihood

Notes to self. Often, discussions related to maximum likelihood refer to (A) estimating parameters likely to have come from some (B) probability distribution (PD) by (C) maximizing a likelihood function (from Wikipedia).

In many cases, (A) is the unknown which we want to figure out (e.g., means, variances, regression coefficients). We make an assumption about the PD where the data come from and determine the likelihood or the log likelihood of observing the datum. Get the product of all the likelihoods (multiplicative) or the sum of the log likelihoods (additive) and you will get the total likelihood.

An example will help:

# Simulate some data

set.seed(123)
x <- rnorm(10, mean = 110, sd = 3) #make up some data
x
##  [1] 108.3186 109.3095 114.6761 110.2115 110.3879 115.1452 111.3827 106.2048
##  [9] 107.9394 108.6630
mean(x)
## [1] 110.2239
sd(x)
## [1] 2.861352

We have a vector with 10 elements. We can easily get the mean of this by just typing mean(x) but this is for illustrative purposes.

We create a likelihood function llike based on a normal distribution using:

$f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}$

This take the sums of the log likelihood of whatever data is in x with a specified mean (mu) and standard deviation sigma– the later two we “don’t” know. There is also the built-in function of dnorm which we can use to check if we are getting the same results.

llike <- function(x, mu, sigma){
sum(log((1 / (sigma * sqrt(2 * pi))) * exp(-.5 * ((x - mu) / sigma)^2)))
}

llike(x, 100, 10) #testing function to see if this works
## [1] -37.81005
#just checking if we specify m = 100, sd = 10
sum(dnorm(x, mean = 100, sd = 10, log = T)) #using built-in function
## [1] -37.81005

We get the same result using a specified mean of 100 and an SD of 10.

We can take a series of means (mu below) and see where the sum of the ll is at the highest.

mu <- seq(100, 120, .1) #set of values to test
ll <- sapply(mu, function(z)
llike(x, z, 10) #find the log likelihood
)

plot(ll ~ mu) #plot 

mu[which(ll == max(ll))] #which mu had the highest point
## [1] 110.2

Based on the results, it peaks around 110.

We can use another function to find the maximum point (optimizing involves either finding the maximum or minimum). A lot of these ideas come from the very useful and accessible book of Hilbe and Robinson (2013) Methods of Statistical Model Estimation.

## another function to optimize
bb <- function(p, x){
sum(dnorm(x, mean = p[1], sd = p[2], log = T))
}

## finding the maximum likelihood using 'optim'
fit <- optim(c(mean = 100, sd = 10), #starting values or guesses
fn = bb, x = x,
control = list(fnscale = -1)) #-1 to maximize
fit$par #contain the mean and the SD ## mean sd ## 110.224295 2.715141 mean(x) ## [1] 110.2239 sd(x) #this is the actual variance ## [1] 2.861352 sqrt(sum((x - mean(x))^2) / length(x)) #this is the ML variance ## [1] 2.714517 # (SS / n), not n - 1 # based on ML # the variance is under estimated # What about using ML in a regression? I have not seen examples showing this. I’ve only seen examples of the first kind shown (e.g., getting the mean). First, create some data with two variables (x1 and x2): ## simulate some data set.seed(2468) ns <- 1000 x1 <- rnorm(ns) x2 <- rnorm(ns) e1 <- rnorm(ns) y <- 10 + 1 * x1 + .5 * x2 + e1 dat <- data.frame(y = y, x1 = x1, x2 = x2) Prepare the formula and some matrices: fml <- formula('y ~ x1 + x2') df <- model.frame(fml, dat) y <- model.response(df) X <- model.matrix(fml, df) Create another function to optimize: ## log likelihood for continuous variables ll.cont <- function(p, X, y){ len <- length(p) betas <- p mu <- X %*% betas sum((y - mu)^2) #minimize this, sum of squared residuals } In this case, we want to minimize the sums of squared residuals (error)– that’s what the best fitting line does. start <- ncol(X) #how many start values fit <- optim(p = rep(1, start), #repeat 1 fn = ll.cont, X = X, y = y) #this minimizes the SSR fit$par #compare with results below 
## [1] 9.9952812 1.0151535 0.5098497
## values are for the intercept / Bx1 / Bx2

Compare to using our standard lm function:

t1 <- lm(y ~ x1 + x2, data = df)
summary(t1)$coef ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 9.9950936 0.03144700 317.83937 0.000000e+00 ## x1 1.0153582 0.03116854 32.57638 4.166280e-159 ## x2 0.5096043 0.03093340 16.47424 4.091831e-54 Instead of minimizing the error, what if we want to maximize the likelihoods again? ll.cont.mx <- function(p, X, y){ len <- length(p) betas <- p[-len] mu <- X %*% betas sds <- p[len] sum(dnorm(y, mean = mu, sd = sds, log = T)) #maximize this #sum((y - mu)^2) #minimize this } start <- ncol(X) + 1 #add one w/c is the variance fit <- optim(p = rep(1, start), #starting values fn = ll.cont.mx, #function to use X = X, y = y, #pass this to our hessian = T, #to get standard errors control = list(fnscale = -1, #to maximize reltol = 1e-10) #change in deviance to converge ) b <- fit$par[1:3] #the betas
ll <- fit$value #the log likelihood conv <- fit$convergence #value of zero means convergence
ses <- sqrt(diag(solve(-fit$hessian)))[1:3] #get the SE from the inverse of the Hessian matrix df <- data.frame(b, ses, t = b/ses) rownames(df) <- colnames(X) df #putting it all together  ## b ses t ## (Intercept) 9.9951172 0.03139866 318.32944 ## x1 1.0153670 0.03112063 32.62681 ## x2 0.5095777 0.03088585 16.49874 ll #the log likelihood ## [1] -1411.179 t1 <- lm(y ~ x1 + x2, data = df) summary(t1)$coef
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 9.9950936 0.03144700 317.83937  0.000000e+00
## x1          1.0153582 0.03116854  32.57638 4.166280e-159
## x2          0.5096043 0.03093340  16.47424  4.091831e-54
logLik(t1)
## 'log Lik.' -1411.179 (df=4)

Similar coefficients, standard errors, and the same log likelihood.

— END