Using FIML in R (Part 2)
A recurring question I get asked is how to handle missing data when researchers want to perform a multiple regression analysis. Many excellent articles, books, and websites discuss the theory and rationale behind the available options; what is often recommended is either full information maximum likelihood (FIML) or multiple imputation (MI), and many excellent articles explain in detail how these work. The purpose of this post, though, is simply to show (again) how to run these models in R (the examples here are for single-level data only). I had already shown some of this before over here, but I am adding to those notes to show comparability with Mplus results. I will use lavaan to get the FIML results.
Read in the data
For comparability, I will use the High School and Beyond demo data (n = 200) found on the UCLA Statistical Computing website, which shows how to use FIML with Mplus. We first read in the complete data, which we can use later as a benchmark when analyzing the dataset with missing data.
library(dplyr) #for selecting and using the pipe
hsbnomiss <- read.csv('https://stats.idre.ucla.edu/wp-content/uploads/2016/02/hsbdemo.dat',
header = F)
#note when data prepared for Mplus, there are no headers
#indicating the variable names
hsbnomiss2 <- select(hsbnomiss, 2, 6:8)
#only select columns we want
names(hsbnomiss2) <- c('female', 'read', 'write', 'math')
#name the columns
head(hsbnomiss2)
## female read write math
## 1 1 34 35 41
## 2 0 34 33 41
## 3 0 39 39 44
## 4 0 37 37 42
## 5 0 39 31 40
## 6 1 42 36 42
str(hsbnomiss2)
## 'data.frame': 200 obs. of 4 variables:
## $ female: int 1 0 0 0 0 1 0 0 1 0 ...
## $ read : int 34 34 39 37 39 42 31 50 39 34 ...
## $ write : int 35 33 39 37 31 36 36 31 41 37 ...
## $ math : int 41 41 44 42 40 42 46 40 33 46 ...
Run the regression (complete data)
In this example, we want to predict write using female, read, and math.
nomiss1 <- lm(write ~ female + read + math, data = hsbnomiss2)
summary(nomiss1)$coef %>% round(3)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.896 2.863 4.155 0
## female 5.443 0.935 5.822 0
## read 0.325 0.061 5.355 0
## math 0.397 0.066 5.986 0
We can also do this using lavaan and the sem function. Note that, in this case, the formula is specified within quotes. lavaan is often used for CFA and SEM, where the interrelationships among observed variables and latent factors are specified. Since this is just a regression with all observed variables, we can specify the model in one line (the regression formula).
library(lavaan)
nomiss2 <- sem('write ~ female + read + math', data = hsbnomiss2)
summary(nomiss2) #lot more information
## lavaan 0.6-5 ended normally after 17 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 4
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## write ~
## female 5.443 0.926 5.881 0.000
## read 0.325 0.060 5.409 0.000
## math 0.397 0.066 6.047 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .write 42.368 4.237 10.000 0.000
These results provide a benchmark of what the results should be when data are not missing.
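As a quick side check (just a sketch using the two fitted objects above), the point estimates from lm() and lavaan line up:
round(coef(nomiss1), 3) #lm: intercept and slopes
round(coef(nomiss2), 3) #lavaan: slopes plus residual variance (no intercept without a mean structure)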
Read in missing data
hsbwmiss <- read.csv('https://stats.idre.ucla.edu/wp-content/uploads/2017/04/hsbmis2.dat',
header = F)
#missing data are coded as -9999, recode to NA
hsbwmiss[hsbwmiss == -9999] <- NA
#I know you can do this in dplyr using some command
#but this is quick and basic
hsbwmiss2 <- dplyr::select(hsbwmiss, 2, 8:10)
names(hsbwmiss2) <- c('female', 'read', 'write', 'math')
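As an aside, the dplyr way hinted at in the comment above could look something like this (just a sketch, assuming dplyr >= 1.0 for across() and na_if(); the result is the same as the base R replacement):
hsbwmiss_alt <- mutate(hsbwmiss, across(everything(), ~ na_if(.x, -9999)))
#recode -9999 to NA in every column, tidyverse style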
We can also quickly look at the patterns of missing data using the mice package.
library(mice)
md.pattern(hsbwmiss2, rotate.names = T)
## write read math female
## 76 1 1 1 1 0
## 39 1 1 1 0 1
## 34 1 1 0 1 1
## 14 1 1 0 0 2
## 19 1 0 1 1 1
## 9 1 0 1 0 2
## 6 1 0 0 1 2
## 3 1 0 0 0 3
## 0 37 57 65 159
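A quick way to confirm the proportion of complete cases (a base R one-liner, not shown in the original output):
mean(complete.cases(hsbwmiss2)) #proportion of rows with no missing values: 76/200 = 0.38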
The missing data patterns show a lot of missingness; only 38% of respondents (76 of 200) have complete data. We can run a naive (listwise deletion) regression and compare the results to the complete-data analysis. NOTE: there is no set threshold for what is considered an unacceptable amount of missing data.
wmiss1 <- lm(write ~ female + read + math, data = hsbwmiss2)
library(stargazer) #to show side-by-side output
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(nomiss1, wmiss1,
star.cutoffs = c(.05, .01, .001),
no.space = T, type = 'text')
##
## ==================================================================
## Dependent variable:
## ----------------------------------------------
## write
## (1) (2)
## ------------------------------------------------------------------
## female 5.443*** 7.247***
## (0.935) (1.620)
## read 0.325*** 0.167
## (0.061) (0.107)
## math 0.397*** 0.447***
## (0.066) (0.117)
## Constant 11.896*** 17.160***
## (2.863) (4.964)
## ------------------------------------------------------------------
## Observations 200 76
## R2 0.526 0.468
## Adjusted R2 0.519 0.446
## Residual Std. Error 6.575 (df = 196) 6.966 (df = 72)
## F Statistic 72.518*** (df = 3; 196) 21.111*** (df = 3; 72)
## ==================================================================
## Note: *p<0.05; **p<0.01; ***p<0.001
The regressions show different results: read is no longer a significant predictor of write, the female coefficient is noticeably larger, and the standard errors are much higher.
Run the model accounting for missing data
We will now use FIML to account for the missing data. We will again use the sem function, but with some additional options:
wmiss2 <- sem('write ~ female + read + math', data = hsbwmiss2,
missing = 'fiml', fixed.x = F)
We specify missing = "fiml", which uses FIML to account for the missing data. In addition, we include fixed.x = F. FIML works by estimating the relationships among the variables, which requires estimating their means, variances, and covariances. If fixed.x = T (the default), the means, variances, and covariances of the exogenous predictors are fixed at their sample values and are not estimated. You can instead specify the means and variances to be estimated directly in the model syntax, but that requires more typing.
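For reference, the "more typing" alternative would be to write the means, variances, and covariances of the predictors directly into the model syntax. A sketch (the model string below is mine, not from the original post; it should reproduce the wmiss2 estimates):
fimlmodel <- '
write ~ female + read + math
#means of the predictors
female ~ 1
read ~ 1
math ~ 1
#variances and covariances of the predictors
female ~~ female + read + math
read ~~ read + math
math ~~ math
'
#depending on your lavaan version, you may also need (or be switched to) fixed.x = F
wmiss2b <- sem(fimlmodel, data = hsbwmiss2, missing = 'fiml')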
View the results:
summary(wmiss2)
## lavaan 0.6-5 ended normally after 46 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 14
##
## Number of observations 200
## Number of missing patterns 8
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Observed
## Observed information based on Hessian
## Standard errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## write ~
## female 5.436 1.130 4.809 0.000
## read 0.298 0.073 4.080 0.000
## math 0.401 0.078 5.117 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## female ~~
## read -0.255 0.462 -0.551 0.582
## math 0.080 0.451 0.177 0.860
## read ~~
## math 68.122 9.634 7.071 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .write 12.949 3.013 4.298 0.000
## female 0.591 0.041 14.384 0.000
## read 51.898 0.776 66.855 0.000
## math 52.724 0.756 69.714 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .write 41.622 4.744 8.773 0.000
## female 0.238 0.029 8.321 0.000
## read 107.510 11.881 9.049 0.000
## math 95.591 10.940 8.737 0.000
The summary presents more information and shows that the analysis is based on all 200 observations (across 8 missing data patterns). The intercept is shown under Intercepts -> .write. The results are much closer to the original results with no missing data (shown again below):
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.896 2.863 4.155 0
female 5.443 0.935 5.822 0
read 0.325 0.061 5.355 0
math 0.397 0.066 5.986 0
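To line up the slope estimates side by side programmatically, a quick sketch using the fitted objects above (the cbind layout is just for display and is not from the original post):
round(cbind(complete = coef(nomiss2)[1:3], fiml = coef(wmiss2)[1:3]), 3)
#the first three free parameters in each fit are the regression slopes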
If you just want the parameter estimates without the rest of the summary output (still showing all the variances and covariances, indicated by ~~, and the intercepts/means, indicated by ~1):
parameterestimates(wmiss2)
## lhs op rhs est se z pvalue ci.lower ci.upper
## 1 write ~ female 5.436 1.130 4.809 0.000 3.221 7.651
## 2 write ~ read 0.298 0.073 4.080 0.000 0.155 0.442
## 3 write ~ math 0.401 0.078 5.117 0.000 0.247 0.554
## 4 write ~~ write 41.622 4.744 8.773 0.000 32.324 50.920
## 5 female ~~ female 0.238 0.029 8.321 0.000 0.182 0.294
## 6 female ~~ read -0.255 0.462 -0.551 0.582 -1.160 0.651
## 7 female ~~ math 0.080 0.451 0.177 0.860 -0.804 0.964
## 8 read ~~ read 107.510 11.881 9.049 0.000 84.223 130.796
## 9 read ~~ math 68.122 9.634 7.071 0.000 49.240 87.004
## 10 math ~~ math 95.591 10.940 8.737 0.000 74.148 117.034
## 11 write ~1 12.949 3.013 4.298 0.000 7.044 18.854
## 12 female ~1 0.591 0.041 14.384 0.000 0.511 0.672
## 13 read ~1 51.898 0.776 66.855 0.000 50.377 53.420
## 14 math ~1 52.724 0.756 69.714 0.000 51.242 54.206
You can compare the results with those derived using Mplus. Results are comparable:
MODEL RESULTS
Two-Tailed
Estimate S.E. Est./S.E. P-Value
WRITE ON
FEMALE 5.435 1.121 4.847 0.000
READ 0.298 0.072 4.168 0.000
MATH 0.401 0.077 5.236 0.000
Intercepts
WRITE 12.950 2.951 4.388 0.000
Residual Variances
WRITE 41.622 4.716 8.825 0.000
Try estimating the model using MI (see my previous post). Are the results similar to the FIML results and to the complete-data results?
imp <- mice(hsbwmiss2, m = 50, seed = 1234) #create 50 imputed datasets
mi1 <- with(imp, lm(write ~ female + read + math)) #fit the regression in each imputed dataset
round(summary(pool(mi1)), 3) #pool the estimates using Rubin's rules