Working with missing data in large-scale assessments (with plausible values)

Working with missing data in large-scale assessments (with plausible values)

Apr 17, 2025 · 6 min read

This is the syntax for accounting for missing data/imputing data with large scale assessments (with plausible values). This accompanies the article:

Huang, F., & Keller, B. (2025). Working with missing data in large-scale assessments. Large-scale Assessments in Education. doi: 10.1186/s40536-025-00248-9

The article is open access but you can get the updated, corrected version here (as of 2025.04.16).

There is a slight adjustment (for the original article) when it comes to parallel processing (an additional package needs to be loaded). rblimp has also been updated to work with factor variables. This example uses the Belgian PISA dataset. Read the article as it explains the steps.

1. Read in the data and load the required libraries

Download the data from the TIMSS website.

comb <- rio::import("https://github.com/flh3/pubdata/raw/refs/heads/main/miscdata/belgium.rds") #combined original dataset
wmiss <- rio::import("https://github.com/flh3/pubdata/raw/refs/heads/main/miscdata/belgiumwmiss.rds") #dataset with additional missing data

library(MLMusingR) # for mixPV and summary for pooling
library(WeMix) # for mix
library(tidyr) # to convert from wide to long
library(dplyr) # data management
library(rblimp) #for creating the imputations

2. Fit the models

Done using the complete (90%) and missing info (82%) datasets. Done for comparison purposes (how do the results differ when we impute the data?)

l1a <- mixPV(pv1math + pv2math + pv3math + pv4math +
  pv5math + pv6math + pv7math + pv8math + #note correction, pv8
  pv9math + pv10math ~ gender + escs + immig2 +
  stubeha + lackstaff + (1|cntschid),
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = comb, mc = TRUE) 
# can add the option mc = TRUE to make this faster (multi core)
l1b <- mixPV(pv1math + pv2math + pv3math + pv4math +
  pv5math + pv6math + pv7math + pv8math + #note correction, pv8
  pv9math + pv10math ~ gender + escs + immig2 +
  stubeha + lackstaff + (1|cntschid),
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = wmiss, mc = TRUE) 

3. Investigate missingness

MLMusingR::nmiss(wmiss) 

Percent missing per variable:
      pv1math       pv2math       pv3math       pv4math       pv5math       pv6math       pv7math 
   0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
      pv8math       pv9math      pv10math       stubeha     teachbeha        gender          escs 
   0.00000000    0.00000000    0.00000000    0.03445428    0.03445428    0.00000000    0.12235988 
       immig2     lackstaff      w_fstuwt w_schgrnrabwt      cntschid 
   0.02595870    0.05699115    0.00000000    0.00000000    0.00000000 

Percent complete cases: 0.8154572 
(Minimum) number to impute: 18 

Perform some necessary data management:

tall <- pivot_longer(wmiss, pv1math:pv10math, values_to = 'math')
m <- 20 #number of datasets to impute
ns <- nrow(wmiss) #count how many observations there are
tall$.pv <- as.numeric(gsub("[^0-9]", "", tall$name)) #extract the 
  # numeric value 
nopv <- length(table(as.character(tall$.pv))) #the number of 
  # plausible values
UPDATE: rblimp used to require all numeric variables only (at the time of acceptance of the manuscript). However, this has been updated (in version 0.2.7) and there is no need to convert factors into numeric variables anymore. We can use the tall dataset as-is and this is used in the imputation.

4. Impute the data

Remember, you have to download and install Blimp separately (which performs the multiple imputations). You can get it here https://www.appliedmissingdata.com/blimp.

Then, you will have to install rblimp:

# install.packages('remotes')
remotes::install_github('blimp-stats/rblimp')

# set_blimp("/mnt/c/Blimp/blimp") ### NOTE:: I need this for linux
mymodel <- rblimp(
  data = tall, #this is different, new version of rblimp
  nominal = 'gender immig2 lackstaff',
  # ordinal = '',
  clusterid = 'cntschid',
  fixed = 'gender w_fstuwt', 
  model = 'math ~ gender escs immig2 
           stubeha lackstaff w_fstuwt', 
  options = 'labels',
  seed = 1234,
  nimps = ) |> 
  by_group('.pv') 

Again, read the article which explains the different options. This will take a while. Suggest using 2 in the nimps option to start and see if everything works properly first. The by_group option is important when there are plausible values.

After imputing, inspect the rhat values (see article)

mymodel |> sapply(\(x) tail(x@psr, 1) |> max(na.rm = TRUE))

     1      2      3      4      5      6      7      8      9     10 
1.0480 1.0394 1.0401 1.0959 1.0312 1.0609 1.0489 1.0337 1.0401 1.0610 

# lapply(mymodel, psr)

5. Extract the data

And perform some data management.

impdat <- lapply(mymodel, as.mitml) |>
  unlist(recursive = FALSE) |>
  do.call(rbind, args = _)

impdat$.implist <- rep(1:(m * nopv), each = ns)
UPDATE: The original manuscript required copying the variable attributes of factors back into the numeric variables. This is not needed anymore with the latest version of rblimp which works with factors.
## Create a list for analysis
alldat <- split(impdat, impdat$.implist)

6. Analyze the data

Specify the model:

modfit <- function(x) {
  mix(math ~ gender + escs + immig2 + stubeha +
  lackstaff + (1|cntschid),
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = x)
}

Fit the model, two ways:

Method 1: Serial operation (slow)

# allres <- lapply(alldat, FUN = modfit)

Method 2: Parallel computation (much faster but still requires around 5 minutes)

library(parallel)
library(doParallel) ### NEED THIS!
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
clusterExport(cl, list('modfit', 'mix'))
allres <- parLapply(cl, alldat, fun = modfit)
stopCluster(cl)

7 Pool results

Need to change the class for the summary function to work:

class(allres) <- 'mixPV'
summary(allres)

Results of multilevel analyses with 200 plausible values.
Number of observations: 8475 

Estimates for random effects: 
                     estimate std.error statistic   df Pr(>t)    
cntschid.(Intercept)  2878.60    377.84      7.62 1782 <2e-16 ***
Residual              4353.26    163.94     26.55  454 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimates for fixed effects: 
                        estimate std.error statistic   df Pr(>t)    
(Intercept)               496.88     10.74     46.28 4252 <2e-16 ***
genderFemale              -19.50      2.31     -8.42  950 <2e-16 ***
escs                       18.88      1.72     10.95  485 <2e-16 ***
immig2Second-Generation   -22.31      3.62     -6.16 1747 <2e-16 ***
immig2First-Generation    -26.35      4.60     -5.73 1330 <2e-16 ***
stubeha                   -37.68      5.54     -6.81 5192 <2e-16 ***
lackstaffVery little       28.12     13.97      2.01 4055  0.044 *  
lackstaffTo some extent     9.23     14.20      0.65 4573  0.516    
lackstaffA lot             21.61     17.98      1.20 2498  0.229    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing results side by side:

source("https://raw.githubusercontent.com/flh3/pubdata/refs/heads/main/mixPV/wemix_modelsummary.R")
library(modelsummary)
modelsummary(list('90% comp' = l1a, '82% comp' = l1b, 'MI' = allres), stars = TRUE)
90% comp82% compMI
(Intercept)499.886***520.934***496.875***
(10.312)(9.069)(10.737)
genderFemale-20.890***-18.840***-19.495***
(2.428)(2.458)(2.315)
escs17.674***16.991***18.878***
(1.510)(1.625)(1.724)
immig2Second-Generation-22.650***-21.346***-22.306***
(3.872)(4.086)(3.622)
immig2First-Generation-27.046***-24.955***-26.349***
(4.877)(4.972)(4.598)
stubeha-36.355***-32.648***-37.677***
(5.522)(4.961)(5.536)
lackstaffVery little27.385*12.01628.116*
(13.242)(12.071)(13.967)
lackstaffTo some extent9.347-2.1669.227
(13.758)(11.475)(14.198)
lackstaffA lot22.9428.45921.613
(15.653)(14.019)(17.975)
cntschid.(Intercept)2714.714***2137.460***2878.599***
(362.636)(370.142)(377.836)
Residual4385.549***3885.879***4353.262***
(179.614)(165.660)(163.942)
Nobs767669118475
N.pv1010200
AICbar1195567.295450461061170.1578461329601.33171994
BICbar1195643.699842891061245.407411861329678.8253552
  • p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

END