Using maximum likelihood

Notes to self. Discussions of maximum likelihood typically refer to (A) estimating the parameters of an assumed (B) probability distribution (PD) by (C) maximizing a likelihood function (paraphrasing Wikipedia).

In many cases, (A) is the unknown we want to figure out (e.g., means, variances, regression coefficients). We make an assumption about the PD the data come from and compute the likelihood or log likelihood of observing each data point. The product of all the individual likelihoods (multiplicative), or equivalently the sum of the log likelihoods (additive), gives the total (log) likelihood of the data.
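As a quick illustration of that equivalence (a minimal sketch with made-up values), the product of the individual likelihoods equals the exponential of the summed log likelihoods:

z <- c(4.8, 5.1, 5.3) #made-up data
prod(dnorm(z, mean = 5, sd = 1)) #product of the individual likelihoods
exp(sum(dnorm(z, mean = 5, sd = 1, log = TRUE))) #same value via summed log likelihoods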

An example will help:

Simulate some data

set.seed(123)
x <- rnorm(10, mean = 110, sd = 3) #make up some data
x
##  [1] 108.3186 109.3095 114.6761 110.2115 110.3879 115.1452 111.3827 106.2048
##  [9] 107.9394 108.6630
mean(x)
## [1] 110.2239
sd(x)
## [1] 2.861352

We have a vector with 10 elements. We could get the mean simply by typing mean(x), but the point here is to illustrate the mechanics.

We create a likelihood function llike based on a normal distribution using:

\(f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x - \mu}{\sigma})^2}\)

This takes the sum of the log likelihoods of whatever data are in x given a specified mean (mu) and standard deviation (sigma); the latter two are the values we "don't" know. There is also the built-in function dnorm, which we can use to check that we are getting the same results.

llike <- function(x, mu, sigma){
  sum(log((1 / (sigma * sqrt(2 * pi))) * exp(-.5 * ((x - mu) / sigma)^2)))
}

llike(x, 100, 10) #testing function to see if this works
## [1] -37.81005
#just checking if we specify m = 100, sd = 10
sum(dnorm(x, mean = 100, sd = 10, log = T)) #using built-in function
## [1] -37.81005

We get the same result using a specified mean of 100 and an SD of 10.

We can try a series of candidate means (mu below), holding the SD fixed at 10, and see where the sum of the log likelihoods is highest.

mu <- seq(100, 120, .1) #set of values to test
ll <- sapply(mu, function(z)
  llike(x, z, 10) #find the log likelihood
)

plot(ll ~ mu) #plot 

mu[which(ll == max(ll))] #which mu had the highest point
## [1] 110.2

Based on the results, the log likelihood peaks at a mean of about 110.
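We could also let both parameters vary over a grid rather than fixing the SD at 10 (a small sketch; the grid values below are arbitrary):

sigma <- seq(1, 6, .1) #candidate SDs
ll2 <- outer(mu, sigma, Vectorize(function(m, s) llike(x, m, s))) #log likelihood for each (mu, sigma) pair
best <- which(ll2 == max(ll2), arr.ind = TRUE) #row = mu index, column = sigma index
c(mu = mu[best[1]], sigma = sigma[best[2]]) #grid point with the highest log likelihood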

We can use another function to find the maximum point (optimizing means finding either a maximum or a minimum). A lot of these ideas come from the very useful and accessible book by Hilbe and Robinson (2013), Methods of Statistical Model Estimation.

## another function to optimize
bb <- function(p, x){
  sum(dnorm(x, mean = p[1], sd = p[2], log = T))
}

## finding the maximum likelihood using 'optim'
fit <- optim(c(mean = 100, sd = 10), #starting values or guesses
      fn = bb, x = x,
      control = list(fnscale = -1)) #-1 to maximize
fit$par #contains the mean and the SD
##       mean         sd 
## 110.224295   2.715141
mean(x)
## [1] 110.2239
sd(x) #the usual sample SD, SS / (n - 1)
## [1] 2.861352
sqrt(sum((x - mean(x))^2) / length(x)) #the ML estimate of the SD, SS / n
## [1] 2.714517
# ML divides the sum of squares by n, not n - 1,
# so the ML estimate of the variance (and SD) is slightly
# smaller: ML underestimates the variance
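The two estimates differ only in the denominator (n versus n - 1), so one can be rescaled into the other:

n <- length(x)
sd(x) * sqrt((n - 1) / n) #rescales the usual SD into the ML (divide-by-n) version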

What about using ML in a regression?

I have not seen many examples showing this; most show only the first kind (e.g., estimating a mean).

First, create some data with two variables (x1 and x2):

## simulate some data
set.seed(2468)
ns <- 1000
x1 <- rnorm(ns)
x2 <- rnorm(ns)
e1 <- rnorm(ns)
y <- 10 + 1 * x1 + .5 * x2 + e1
dat <- data.frame(y = y, x1 = x1, x2 = x2)

Prepare the formula and some matrices:

fml <- formula('y ~ x1 + x2')
df <- model.frame(fml, dat)
y <- model.response(df)
X <- model.matrix(fml, df)
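The model matrix X has a leading column of 1s for the intercept plus the two predictors, which is what the functions below expect (a quick peek):

head(X, 3) #first rows: intercept column of 1s, then x1 and x2
dim(X) #one row per observation, one column per coefficient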

Create another function to optimize:

## sum of squared residuals (least squares criterion)
ll.cont <- function(p, X, y){
  betas <- p #all elements of p are regression coefficients
  mu <- X %*% betas #predicted values
  sum((y - mu)^2) #minimize this, sum of squared residuals
}

In this case, we want to minimize the sum of squared residuals (the error); that is what the best-fitting line does.

start <- ncol(X) #how many start values
fit <- optim(par = rep(1, start), #starting values (all 1s)
      fn = ll.cont, 
      X = X, y = y) #this minimizes the SSR
fit$par #compare with results below 
## [1] 9.9952812 1.0151535 0.5098497
## values are for the intercept / Bx1 / Bx2

Compare to using our standard lm function:

t1 <- lm(y ~ x1 + x2, data = df)
summary(t1)$coef
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 9.9950936 0.03144700 317.83937  0.000000e+00
## x1          1.0153582 0.03116854  32.57638 4.166280e-159
## x2          0.5096043 0.03093340  16.47424  4.091831e-54
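As a further check, the closed-form least squares solution \(\hat{\beta} = (X'X)^{-1}X'y\) gives the same coefficients (lm solves this via a QR decomposition rather than an explicit inverse):

solve(t(X) %*% X) %*% t(X) %*% y #normal equations; same coefficients as above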

Instead of minimizing the error, what if we want to maximize the likelihood again?

ll.cont.mx <- function(p, X, y){
  len <- length(p)
  betas <- p[-len] #all but the last element are the regression coefficients
  mu <- X %*% betas #predicted values
  sds <- p[len] #the last element is the residual SD
  sum(dnorm(y, mean = mu, sd = sds, log = T)) #maximize this
  #sum((y - mu)^2) #minimize this
}

start <- ncol(X) + 1 #one extra parameter for the residual SD
fit <- optim(par = rep(1, start), #starting values
             fn = ll.cont.mx, #function to use
             X = X, y = y, #pass this to our
             hessian = T, #to get standard errors
             control = list(fnscale = -1, #to maximize
             reltol = 1e-10) #relative convergence tolerance
)
b <- fit$par[1:3] #the betas
ll <- fit$value #the log likelihood
conv <- fit$convergence #value of zero means convergence
ses <- sqrt(diag(solve(-fit$hessian)))[1:3] #get the SE from the inverse of the Hessian matrix
res <- data.frame(b, ses, t = b / ses) #new name so we don't overwrite the model frame df
rownames(res) <- colnames(X)
res #putting it all together
##                     b        ses         t
## (Intercept) 9.9951172 0.03139866 318.32944
## x1          1.0153670 0.03112063  32.62681
## x2          0.5095777 0.03088585  16.49874
ll #the log likelihood
## [1] -1411.179
t1 <- lm(y ~ x1 + x2, data = df)
summary(t1)$coef
##              Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 9.9950936 0.03144700 317.83937  0.000000e+00
## x1          1.0153582 0.03116854  32.57638 4.166280e-159
## x2          0.5096043 0.03093340  16.47424  4.091831e-54
logLik(t1)
## 'log Lik.' -1411.179 (df=4)

Similar coefficients, standard errors, and the same log likelihood.
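As a final check (assuming fit and t1 from above are still in the workspace), the last element of fit$par is the ML estimate of the residual SD; it should be slightly smaller than lm's residual standard error because ML divides by n rather than n - k. The standard errors can also be compared against vcov(t1):

fit$par[4] #ML estimate of the residual SD (divides by n)
summary(t1)$sigma #lm's residual standard error (divides by n - k)
sqrt(diag(vcov(t1))) #lm standard errors, comparable to ses above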

— END
