The other day in class, while talking about instances (e.g., analyzing clustered data or heteroskedastic residuals) where adjustments are required to the standard errors of a regression model, a student asked: how do we know what the ‘true’ standard error should be in the first place? Knowing that is necessary to judge whether an estimated standard error is too high or too low.
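One way to answer the question is by simulation: generate many datasets from a known model, fit the regression to each, and take the standard deviation of the estimated coefficients across replications. That empirical standard deviation is the ‘true’ standard error, which can then be compared with the average model-based standard error. A minimal sketch (the data-generating values are assumptions for illustration):

```r
# Recover the 'true' standard error of a slope by simulation.
set.seed(1)
reps <- 2000; n <- 100
res <- replicate(reps, {
  x <- rnorm(n)
  y <- 1 + 0.3 * x + rnorm(n)          # independent, homoskedastic errors
  fit <- summary(lm(y ~ x))
  c(b  = fit$coefficients["x", "Estimate"],
    se = fit$coefficients["x", "Std. Error"])
})
sd(res["b", ])      # empirical ('true') SE of the slope
mean(res["se", ])   # average model-based SE; close when assumptions hold
```

When the model assumptions are met (as here), the two values agree; with clustered or heteroskedastic data, the model-based SE drifts away from the empirical one.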
An illustration showing different flavors of robust standard errors. Load the library and dataset, then recode. Dummy coding is not strictly necessary, but it can make constructing the X matrix easier. We use the High School & Beyond (hsb) dataset.
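The heteroskedasticity-consistent (sandwich) flavors HC0 and HC1 can be computed directly from the X matrix, which is why having it handy is convenient. A sketch using a built-in stand-in dataset rather than the actual hsb recoding:

```r
# HC0/HC1 robust covariance computed by hand from the X matrix.
# mtcars is a placeholder model, not the hsb analysis from the post.
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)                  # the X matrix
e <- resid(fit)
n <- nrow(X); k <- ncol(X)

bread <- solve(crossprod(X))            # (X'X)^-1
meat  <- crossprod(X * e)               # X' diag(e^2) X
hc0   <- bread %*% meat %*% bread       # HC0 sandwich estimator
hc1   <- hc0 * n / (n - k)              # HC1: small-sample correction

sqrt(diag(hc1))                         # robust standard errors
```

HC1 is what Stata reports with `, robust`; the `sandwich` package's `vcovHC()` reproduces these and the HC2/HC3 variants.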
A while back, I wrote a note about how to conduct a multilevel
confirmatory factor analysis (MLCFA) in
R. Part of the
note shows how to set up lavaan
to be able to run the MLCFA model.
NOTE: an important aspect of an MLCFA is that the factor
structure at the two levels may not be the same; that is, the factor
structures need not be invariant across levels. The setup process was
cumbersome, but putting the note together was informative. Testing a 2-1
factor model (i.e., 2 factors at the first level and 1 factor at the
second level) required the following code (see the original note for the
detailed explanation of the setup and what the variables represent).
This is a measure of school engagement; n = 3,894 students in 254
schools.
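In current lavaan (0.6+), a 2-1 model can be specified with the `level:` syntax and a `cluster=` argument. The sketch below uses simulated data with hypothetical indicators y1–y6 and arbitrary cluster sizes, not the engagement items or the 3,894 students from the note:

```r
library(lavaan)

# Simulate simple two-level data consistent with a 2-1 structure
# (all loadings and variances are assumed values for illustration).
set.seed(987)
J <- 120; nj <- 12; N <- J * nj
school <- rep(1:J, each = nj)
fb  <- rnorm(J)                          # between-school factor
f1w <- rnorm(N); f2w <- rnorm(N)         # two within-school factors
dat <- data.frame(school)
for (k in 1:6) {
  w  <- if (k <= 3) f1w else f2w
  ub <- rnorm(J, sd = 0.4)               # between-level residual
  dat[[paste0("y", k)]] <-
    0.7 * fb[school] + ub[school] + 0.7 * w + rnorm(N, sd = 0.5)
}

# 2 factors within (level 1), 1 factor between (level 2)
model <- '
  level: 1
    f1w =~ y1 + y2 + y3
    f2w =~ y4 + y5 + y6
  level: 2
    fb  =~ y1 + y2 + y3 + y4 + y5 + y6
'
fit <- cfa(model, data = dat, cluster = "school")
summary(fit, fit.measures = TRUE)
```

The `level:` blocks replace the lengthier manual setup described in the original note.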
How to simulate multilevel data for a Monte Carlo study.
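One replication of such a simulation generates clustered data from a random-intercept model; repeating the steps many times and saving the estimates is the Monte Carlo part. A minimal sketch with assumed variance components:

```r
# Generate two-level (random-intercept) data for one replication.
set.seed(2468)
J <- 50; nj <- 20                       # 50 clusters of 20 (assumed sizes)
cluster <- rep(1:J, each = nj)
u <- rnorm(J, sd = sqrt(0.25))          # between-cluster variance = .25
x <- rnorm(J * nj)
y <- 1 + 0.5 * x + u[cluster] + rnorm(J * nj, sd = 1)
dat <- data.frame(y, x, cluster)

# Implied intraclass correlation: .25 / (.25 + 1) = .20
```

Wrapping this in a function and calling it inside `replicate()` gives the full Monte Carlo design.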
To start off, the sample variance formula is:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

First of all, $(x_i - \bar{x})$ is a deviation score (deviation from what? deviation from the mean). Summing the deviations will just get us zero, so the deviations are squared and then added together. The numerator of this formula is then called the sum of squared deviations, which is literally what it is. This is not yet what we refer to as the variance ($s^2$). We have to divide the sum of squared deviations by $n - 1$, the sample degrees of freedom.
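The steps above can be checked by hand against `var()` with a small made-up vector:

```r
# Compute the sample variance 'by hand' and compare with var().
x <- c(4, 8, 6, 5, 3)
dev <- x - mean(x)            # deviation scores; sum(dev) is 0
ss  <- sum(dev^2)             # sum of squared deviations
s2  <- ss / (length(x) - 1)   # divide by the degrees of freedom
s2                            # 3.7
all.equal(s2, var(x))         # TRUE
```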
Earlier this year, I wrote an article on using instrumental variables (IV) to analyze data from randomized experiments with imperfect compliance (read the manuscript for full details; the link has been updated and the article is open access). In the article, I described the steps of IV estimation and the logic behind it.
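The two-stage logic can be sketched with simulated noncompliance data (an assumed data-generating process, not the data from the article): randomized assignment z serves as the instrument for actual treatment receipt d, which is confounded with the outcome.

```r
# Two-stage least squares by hand under imperfect compliance.
set.seed(1234)
n <- 20000
z <- rbinom(n, 1, 0.5)                     # randomized assignment
u <- rnorm(n)                              # unobserved confounder
d <- rbinom(n, 1, plogis(-1 + 2 * z + u))  # receipt depends on z and u
y <- 2 * d + u + rnorm(n)                  # true effect of d is 2

b_ols <- coef(lm(y ~ d))[["d"]]            # biased: d correlated with u

stage1 <- lm(d ~ z)                        # stage 1: predict d from z
b_iv   <- coef(lm(y ~ fitted(stage1)))[[2]]  # stage 2: y on predicted d
```

In practice one would use a 2SLS routine (e.g., `ivreg()` from the AER package) to get correct standard errors, but the manual stages show the logic.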
In our module on regression diagnostics, I mentioned 1) that at times (e.g., with clustered data) standard errors may be misestimated and too low, resulting in a greater chance of making a Type I error (i.e., claiming statistically significant results when they are not). In our ANCOVA session, I also indicated 2) that covariates are helpful because they lower the (standard) error in the model and increase power. So it sounds like we would like models with lower standard errors. However, there are cases when the standard error is estimated to be lower than it should be (i.e., the standard error is biased downward).
A primer on using IVs
This issue of omitted variable bias (OVB) plagues many analyses that use secondary or observational data. To illustrate how OVB may affect regression results, we examine some simulated data.
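A minimal simulation (with assumed parameter values) makes the bias visible: a confounder w affects both the predictor x and the outcome y, so leaving w out shifts the coefficient on x away from its true value.

```r
# Omitted variable bias: leaving out confounder w biases the slope on x.
set.seed(42)
n <- 10000
w <- rnorm(n)                        # the (soon-to-be) omitted variable
x <- 0.8 * w + rnorm(n)              # x depends on w
y <- 1 + 0.5 * x + 0.7 * w + rnorm(n)  # true coefficient on x is 0.5

b_full    <- coef(lm(y ~ x + w))[["x"]]  # near 0.5
b_omitted <- coef(lm(y ~ x))[["x"]]      # biased upward
```

The direction of the bias follows the classic OVB formula: it is the coefficient on w in the outcome model times the slope of w on x.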