Data management

Using FIML in R for Multilevel Data

Using FIML in R with Multilevel Data (Part 3) A recurring question that I get asked is how to use full information maximum likelihood (FIML) when performing a multiple regression analysis, but this time while accounting for a nested (clustered) data structure. For this example, I use the leadership dataset in the mitml package (Grund et al., 2021). We’ll also use lavaan (Rosseel, 2012) to estimate the two-level model. The chapter of Grund et al.

Analyzing international large scale assessments using R

Notes for students. International large scale assessments (ILSAs) have several characteristics that should be accounted for to generate correct results. The characteristics that are important to account for include: 1) the use of plausible values; 2) the use of weights; and 3) the cluster sampling used. Some of the datasets that this applies to include TIMSS, PISA, PIRLS, and even NAEP (not international). Certain packages can also be used (e.
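As a rough sketch of how results are combined across plausible values, the base R snippet below pools five estimates using Rubin's rules. The estimates and standard errors are made-up numbers for illustration only; in a real ILSA analysis each one would come from a weighted analysis of one plausible value, with its sampling variance typically based on replicate weights.

```r
# Hypothetical estimates of a mean, one per plausible value (illustrative numbers)
est <- c(500, 502, 498, 501, 499)
# Hypothetical standard errors of those estimates
se  <- c(2.0, 2.1, 1.9, 2.0, 2.0)

m <- length(est)               # number of plausible values

pooled_est  <- mean(est)       # pooled point estimate
within_var  <- mean(se^2)      # average sampling variance
between_var <- var(est)        # variance of the estimates across plausible values

# Rubin's rules: total variance = within + (1 + 1/m) * between
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

c(estimate = pooled_est, se = pooled_se)
```

The pooled standard error is larger than any single standard error because it also reflects the uncertainty from the plausible values themselves.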

Weights in large scale assessments

In class, I’ve talked about using weights in large scale assessments. I’ve provided a bit of intuition about using weights and why they are important. Here’s some R syntax to go along with the example I discussed. Imagine there are two schools in one school district. You are asked what is the average score on some measure of students in the district. There are only two schools (school A and B) and their size varies (School A = 100, School B = 1,000).
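A minimal base R sketch of the two-school setup follows. The school sizes are the ones given above; the school mean scores (80 and 60) are assumed values chosen only to make the contrast visible.

```r
# School sizes come from the example; the mean scores are hypothetical
schools <- data.frame(
  school     = c("A", "B"),
  size       = c(100, 1000),
  mean_score = c(80, 60)
)

# Treats both schools as equally important, regardless of enrollment
unweighted_mean <- mean(schools$mean_score)

# Weights each school mean by the number of students it represents
weighted_mean <- weighted.mean(schools$mean_score, w = schools$size)

unweighted_mean
weighted_mean
```

Because School B has ten times as many students, the weighted mean sits much closer to School B's score than the simple average of the two school means does.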

Using FIML and MI in R

Using FIML in R (Part 2) A recurring question that I get asked is how to handle missing data when researchers are interested in performing a multiple regression analysis. There are many excellent articles, books, and websites that discuss the theory and rationale behind what can be done. Often, what is recommended is to use either full information maximum likelihood (FIML) or multiple imputation (MI). Many excellent articles explain in detail how these work.
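For a sense of what FIML looks like in practice, here is a small sketch using lavaan on simulated data. The variable names, coefficients, and missingness pattern are all assumptions made for illustration and are not taken from the post.

```r
library(lavaan)

set.seed(2024)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Set 100 values of x1 to missing completely at random (hypothetical mechanism)
dat$x1[sample(n, size = 100)] <- NA

# missing = "fiml" requests full information maximum likelihood;
# fixed.x = FALSE treats the predictors as random so that cases with a
# missing predictor are kept in the analysis
fit_fiml <- sem("y ~ x1 + x2", data = dat, missing = "fiml", fixed.x = FALSE)
summary(fit_fiml)

# Listwise deletion for comparison: lm() drops the incomplete rows
fit_listwise <- lm(y ~ x1 + x2, data = dat)
nobs(fit_listwise)
```

The FIML fit uses all 500 rows, while the listwise model is estimated on only the complete cases.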

Why Weight?

I’ve spoken a bit about how using weights is important when analyzing national/statewide datasets. The weights are used so the sample generalizes to a particular population (note: we are interested in making inferences about the population, not the sample). This is important because at times, in national datasets, certain subpopulations (e.g., Hispanic or Asian students) are oversampled to ensure that the sample size is large enough for subgroup analysis. Without using weights, certain groups may be overrepresented (or underrepresented).
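To make the oversampling point concrete, the sketch below builds sampling weights as population size divided by sample size for two groups. All of the numbers (population sizes, sample sizes, and group means) are hypothetical.

```r
# Hypothetical population: a large group and a small group that is oversampled
pop_size   <- c(majority = 9000, minority = 1000)
samp_size  <- c(majority = 500,  minority = 500)
group_mean <- c(majority = 50,   minority = 70)   # assumed group means

# Each sampled person represents pop_size / samp_size people in the population
wt <- pop_size / samp_size

# Build the sample: every person gets their group's mean score and weight
score  <- rep(group_mean, times = samp_size)
weight <- rep(wt, times = samp_size)

true_mean       <- sum(pop_size * group_mean) / sum(pop_size)
unweighted_mean <- mean(score)
weighted_mean   <- weighted.mean(score, w = weight)

c(true = true_mean, unweighted = unweighted_mean, weighted = weighted_mean)
```

The unweighted mean is pulled toward the oversampled group, while the weighted mean matches the population value.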

Correct standard errors?

The other day in class, while talking about instances (e.g., analyzing clustered data or heteroskedastic residuals) where adjustments to the standard errors of a regression model are required, a student asked: how do we know what the ‘true’ standard error should be in the first place? We need to know this to judge whether an estimated standard error is too high or too low. This short simulation illustrates that, over repeated sampling from a specified population, the standard deviation of the regression coefficients serves as the true standard error.
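A minimal version of such a simulation might look like the following. The population model, sample size, and number of replications are assumptions chosen for illustration; here the errors are independent and homoskedastic, so the usual model-based standard errors should agree with the empirical one.

```r
set.seed(123)
reps <- 2000   # number of repeated samples (assumed)
n    <- 100    # sample size per replication (assumed)

est <- numeric(reps)   # slope estimates
se  <- numeric(reps)   # model-based standard errors of the slope

for (i in seq_len(reps)) {
  x <- rnorm(n)
  y <- 1 + 0.5 * x + rnorm(n)          # specified population model
  cf <- summary(lm(y ~ x))$coefficients
  est[i] <- cf["x", "Estimate"]
  se[i]  <- cf["x", "Std. Error"]
}

empirical_se  <- sd(est)    # SD of the estimates across samples: the 'true' SE
mean_model_se <- mean(se)   # average standard error reported by lm()

c(empirical = empirical_se, model_based = mean_model_se)
```

If the data-generating step were changed to produce clustered or heteroskedastic errors, the two numbers would drift apart, which is the signal that the model-based standard errors need adjusting.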

Missing Data (Rough) Notes

Create some missing data
Impute missing data
Selecting the imputation method manually
Analyze (imputed results)
Pool results (using Rubin’s rules)
Creating nicer output
Example
Others: Extracting datasets
Using Full Information Maximum Likelihood

library(mice) #for imputation
library(summarytools) #for freq
library(dplyr) #other data management
dat <- rio::import("")
summary(dat)
## Price Mileage Make Model
## Min. : 8639 Min. : 266 Length:804 Length:804
## 1st Qu.:14273 1st Qu.:14624 Class :character Class :character
## Median :18025 Median :20914 Mode :character Mode :character
## Mean :21343 Mean :19832
## 3rd Qu.
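The impute, analyze, and pool steps in the outline can be sketched with mice as follows. This uses the small nhanes dataset that ships with mice as a stand-in, not the dataset from these notes, and the regression model is an arbitrary choice for illustration.

```r
library(mice)

# Impute: create 5 completed datasets using predictive mean matching
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Analyze: fit the same regression in each imputed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Pool: combine the 5 sets of results using Rubin's rules
pooled <- pool(fit)
summary(pooled)
```

The pooled summary reports one row per coefficient, with standard errors that reflect both sampling variability and the uncertainty due to the missing values.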