Working with missing data in large-scale assessments (without plausible values)

Apr 17, 2025 · 8 min read

This post provides the syntax for accounting for missing data (via multiple imputation) in large-scale assessments when plausible values are not used. It serves as Appendix A and accompanies the article:

Huang, F., & Keller, B. (2025). Working with missing data in large-scale assessments. Large-scale Assessments in Education. doi: 10.1186/s40536-025-00248-9

The syntax for fitting the models with plausible values can be found here. The article is open access but you can get the updated, corrected version here (as of 2025.04.16).

The following example applies when multiple plausible values are not going to be used (e.g., when the outcome is some scale such as school belonging). Researchers may be interested in predicting an outcome like bullying or belonging that does not involve plausible values. For this example, only one plausible value is used, purely as an example of a continuous outcome. If you really want to use a variable that has plausible values as the outcome, ALL of the plausible values should be used properly. See the full article for the syntax for multiple imputation using all the plausible values. To learn more about plausible values, see here.
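For context, a minimal sketch of what an analysis using all ten plausible values might look like with the mixPV() function from MLMusingR (assuming the packages and the comb dataset loaded below; see the full article for the complete workflow):

## hedged sketch only: using ALL ten plausible values as the outcome
allpv <- mixPV(pv1math + pv2math + pv3math + pv4math + pv5math +
               pv6math + pv7math + pv8math + pv9math + pv10math ~
               gender + escs + immig2 + stubeha + lackstaff + (1|cntschid),
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = comb)
summary(allpv) # pools results across the plausible values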

Load in the required packages

library(dplyr) #for data management
library(rblimp) #for imputing
library(WeMix) #for fitting models
library(doParallel) #for parallel computing
library(modelsummary) #to output models side by side
library(tidyr) #to convert to tall format
library(MLMusingR) #for the mixPV function

1. LOAD DATA

The following example uses the Belgian 2018 PISA dataset.

The dataset is saved into the comb object (this has about 90% complete cases):

# source("https://raw.githubusercontent.com/flh3/pubdata/main/mixPV/mixPVv2.R") #if MLMusingR is not installed
comb <- rio::import("https://github.com/flh3/pubdata/raw/refs/heads/main/miscdata/belgium.rds",
 trust = TRUE)
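As an optional check (not part of the original appendix), the amount of missing data in comb can be verified with nmiss() from MLMusingR, which is also used later:

MLMusingR::nmiss(comb) # percent missing per variable and percent of complete cases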

Afterwards, the model of interest is fit:

## using dataset as-is
l1a <- mix(pv1math ~ gender + escs + immig2 + (1|cntschid) +
  stubeha + lackstaff,
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = comb)

Warning in mix(pv1math ~ gender + escs + immig2 + (1 | cntschid) + stubeha +
: There were 799 rows with missing data. These have been removed.

summary(l1a) # minimal missing data

Call:
mix(formula = pv1math ~ gender + escs + immig2 + (1 | cntschid) + 
    stubeha + lackstaff, data = comb, weights = c("w_fstuwt", 
    "w_schgrnrabwt"))

Variance terms:
 Level    Group        Name Variance Std. Error Std.Dev.
     2 cntschid (Intercept)     2548      275.9    50.48
     1 Residual                 4463      112.3    66.81
Groups:
 Level    Group n size mean wgt sum wgt
     2 cntschid    270    5.791    1564
     1      Obs   7676   13.817  106058

Fixed Effects:
                        Estimate Std. Error t value
(Intercept)              499.148     10.483  47.617
genderFemale             -20.906      1.899 -11.008
escs                      18.054      1.216  14.845
immig2Second-Generation  -20.033      3.492  -5.737
immig2First-Generation   -25.541      3.759  -6.795
stubeha                  -35.424      5.414  -6.543
lackstaffVery little      31.221     12.900   2.420
lackstaffTo some extent   11.141     13.851   0.804
lackstaffA lot            19.328     15.533   1.244

lnl= -598666.14 
Intraclass Correlation= 0.3634 

A second dataset, wmiss, which has additional missing values in the escs variable, is then loaded:

# this dataset has more missing data
wmiss <- rio::import("https://github.com/flh3/pubdata/raw/refs/heads/main/miscdata/belgiumwmiss.rds",
 trust = TRUE)
MLMusingR::nmiss(wmiss) # how much missing data?

Percent missing per variable:
      pv1math       pv2math       pv3math       pv4math       pv5math 
   0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
      pv6math       pv7math       pv8math       pv9math      pv10math 
   0.00000000    0.00000000    0.00000000    0.00000000    0.00000000 
      stubeha     teachbeha        gender          escs        immig2 
   0.03445428    0.03445428    0.00000000    0.12235988    0.02595870 
    lackstaff      w_fstuwt w_schgrnrabwt      cntschid 
   0.05699115    0.00000000    0.00000000    0.00000000 

Percent complete cases: 0.8154572 
(Minimum) number to impute: 18 

l1b <- mix(pv1math ~ gender + escs + immig2 + (1|cntschid) +
 stubeha + lackstaff,
 weights = c('w_fstuwt', 'w_schgrnrabwt'),
 data = wmiss)

Warning in mix(pv1math ~ gender + escs + immig2 + (1 | cntschid) + stubeha +
: There were 1564 rows with missing data. These have been removed.

summary(l1b)

Call:
mix(formula = pv1math ~ gender + escs + immig2 + (1 | cntschid) + 
    stubeha + lackstaff, data = wmiss, weights = c("w_fstuwt", 
    "w_schgrnrabwt"))

Variance terms:
 Level    Group        Name Variance Std. Error Std.Dev.
     2 cntschid (Intercept)     1701     200.06    41.24
     1 Residual                 3671      80.92    60.59
Groups:
 Level    Group n size mean wgt sum wgt
     2 cntschid    268    5.729    1535
     1      Obs   6911   13.768   95153

Fixed Effects:
                        Estimate Std. Error t value
(Intercept)              524.990      7.739  67.835
genderFemale             -17.952      1.778 -10.098
escs                      16.946      1.248  13.580
immig2Second-Generation  -18.519      3.288  -5.632
immig2First-Generation   -22.359      3.587  -6.233
stubeha                  -29.170      3.876  -7.526
lackstaffVery little      11.851     10.241   1.157
lackstaffTo some extent   -2.132      9.863  -0.216
lackstaffA lot             2.408     11.286   0.213

lnl= -527776.77 
Intraclass Correlation= 0.3167 

2. IMPUTING

Prior to imputing the data, we need to get our data ready.

ns <- nrow(wmiss) #how many observations 
m <- 20 #number of imputations (at least the minimum of 18 suggested by nmiss above)
UPDATE: At the time the manuscript was accepted, rblimp required all variables to be numeric. This has since changed (as of version 0.2.7), and there is no longer a need to convert factors into numeric variables. We can therefore use the wmiss dataset as-is in the imputation.
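For reference only (this step is no longer required), converting the factors to numeric variables for older versions of rblimp might have looked something like the following sketch, assuming dplyr is loaded:

## no longer needed: factor levels become integer codes
wmiss_num <- wmiss %>%
  mutate(across(where(is.factor), as.numeric))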

We are now ready to impute m = 20 datasets. Because the outcome here does not rely on plausible values, we do not need to repeat the imputation per plausible value.

Remember, you have to download and install Blimp separately (which performs the multiple imputations). You can get it here https://www.appliedmissingdata.com/blimp.

The rblimp package also needs to be installed using: remotes::install_github('blimp-stats/rblimp')

mymodel2 <- rblimp(
  data = wmiss, #this is different from the manuscript
  nominal = 'gender immig2 lackstaff',
  # ordinal = '',
  clusterid = 'cntschid',
  fixed = 'gender w_fstuwt', 
  model = 'pv1math ~ gender escs immig2 
           stubeha lackstaff w_fstuwt', 
  options = 'labels',
  seed = 1234,
  nimps = m
)

This takes some time, and a lot of output is printed (not shown). We can check the potential scale reduction (PSR) values:

mymodel2@psr %>% tail(1) %>% max(., na.rm = TRUE)

[1] 1.048

This seems fine (value is < 1.1). We can then proceed to create the stacked dataset (called tmp).

tmp <- as.mitml(mymodel2) |>
  do.call(rbind, args = _)

We create an imputation counter:

tmp$.implist <- rep(1:m, each = ns)
UPDATE: The original manuscript required copying the variable attributes of factors back into the numeric variables. This is not needed anymore with the latest version of rblimp.
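For reference only (this is no longer required), copying the factor attributes back onto the numeric variables might have looked something like this sketch; the variable names below are just the factors in this example:

## hedged sketch of the old workflow: restore factor levels/labels
for (v in c('gender', 'immig2', 'lackstaff')) {
  attributes(tmp[[v]]) <- attributes(wmiss[[v]])
}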

We can check how many observations there are per imputation (we have 20 imputations):

table(tmp$.implist)


   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 8475 
  16   17   18   19   20 
8475 8475 8475 8475 8475 

Afterwards, we can create a list with the 20 datasets for final analysis:

alld2 <- split(tmp, tmp$.implist)

3. FITTING THE MODELS

We will use parallel processing here to speed up the computation.

First, write out our function. This is where you specify the substantive (or analytic) model of interest:

fnc <- function(x) {
  mix(pv1math ~ gender + escs + immig2 + stubeha +
      lackstaff  + (1|cntschid),
  weights = c('w_fstuwt', 'w_schgrnrabwt'),
  data = x)
}

Here, x is a placeholder for each imputed dataset in the list.
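As an optional check (not in the original appendix), the function can be tried on a single imputed dataset before running everything in parallel:

test <- fnc(alld2[[1]]) # fit the model to the first imputed dataset only
summary(test)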

We use the following statements to fit the fnc function above to our imputed datasets:

## fitting models using multiple cores
st <- Sys.time() #optional
cl <- makeCluster(detectCores() - 1) #use all cores less one so the computer can still be used for other things...
registerDoParallel(cl)
clusterExport(cl, list('fnc', 'mix')) #need to specify the functions to use
mires <- parLapply(cl, alld2, fun = fnc)
Sys.time() - st #optional

Time difference of 49.27839 secs

stopCluster(cl)
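If parallel processing is not available or not wanted, the same models can be fit sequentially (slower but simpler); a minimal alternative would be:

mires <- lapply(alld2, fnc) # fit the model to each imputed dataset in turn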

Afterwards, we need to pool the results. This is done using the summary method for mixPV objects. We need to assign the mixPV class to the output (even though these models were fit with mix) to “fool” the summary function into combining the results using Rubin’s rules:

class(mires) <- 'mixPV'
summary(mires) #this does the pooling 

Results of multilevel analyses with 20 plausible values.
Number of observations: 8475 

Estimates for random effects: 
                     estimate std.error statistic   df Pr(>t)    
cntschid.(Intercept)  2721.34    318.77      8.54 1992 <2e-16 ***
Residual              4423.45    103.34     42.81 7676 <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Estimates for fixed effects: 
                        estimate std.error statistic   df Pr(>t)    
(Intercept)              496.620    10.838    45.821 1205 <2e-16 ***
genderFemale             -19.312     1.848   -10.450 8055 <2e-16 ***
escs                      20.044     1.277    15.701  507 <2e-16 ***
immig2Second-Generation  -19.260     3.244    -5.937 3847 <2e-16 ***
immig2First-Generation   -24.198     3.492    -6.929 6472 <2e-16 ***
stubeha                  -35.947     5.250    -6.847 4331 <2e-16 ***
lackstaffVery little      30.873    13.690     2.255  708  0.024 *  
lackstaffTo some extent    8.356    14.465     0.578 1006  0.564    
lackstaffA lot            17.722    17.473     1.014  495  0.311    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can compare results side by side using modelsummary. A helper function is needed so that mix objects display properly (that is why source() is used below):

source("https://raw.githubusercontent.com/flh3/pubdata/refs/heads/main/mixPV/wemix_modelsummary.R")
modelsummary(list('90% comp' = l1a, '82% comp' = l1b, 'MI' = mires), stars = TRUE)
                          90% comp            82% comp            MI
(Intercept)               499.148***          524.990***          496.620***
                          (10.483)            (7.739)             (10.838)
genderFemale              -20.906***          -17.952***          -19.312***
                          (1.899)             (1.778)             (1.848)
escs                      18.054***           16.946***           20.044***
                          (1.216)             (1.248)             (1.277)
immig2Second-Generation   -20.033***          -18.519***          -19.260***
                          (3.492)             (3.288)             (3.244)
immig2First-Generation    -25.541***          -22.359***          -24.198***
                          (3.759)             (3.587)             (3.492)
stubeha                   -35.424***          -29.170***          -35.947***
                          (5.414)             (3.876)             (5.250)
lackstaffVery little      31.221*             11.851              30.873*
                          (12.900)            (10.241)            (13.690)
lackstaffTo some extent   11.141              -2.132              8.356
                          (13.851)            (9.863)             (14.465)
lackstaffA lot            19.328              2.408               17.722
                          (15.533)            (11.286)            (17.473)
cntschid.(Intercept)      2547.812***         1701.117***         2721.342***
                          (275.856)           (200.061)           (318.770)
Residual                  4463.001***         3670.943***         4423.448***
                          (112.308)           (80.921)            (103.337)
Nobs                      7676                6911                8475
lnl                       -598666.141908678   -527776.771623781
N.grps                    270                 268
icc.x                     0.363411731755597   0.31666002745425
N.pv                                                              20
AICbar                                                            1331427.46293097
BICbar                                                            1331504.95656622

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001

Results of the first and third models are similar. Model 2, with the additional missing data in escs, has different results, particularly for the level-2 coefficients (see the lackstaff results).

– END