Instrumental variables within an SEM framework

Earlier this year, I wrote an article on using instrumental variables (IV) to analyze data from randomized experiments with imperfect compliance (read the manuscript for full details; link updated; it’s open access). In the article, I described the steps of IV estimation and the logic behind it.

The sample code using two stage least squares regression (the correct analysis) is shown below (see article for specifics):

library(ivreg)
dat <- read.csv('https://raw.githubusercontent.com/flh3/pubdata/refs/heads/main/IV/ivexample.csv')
head(dat)

  assign takeup y
1      0      0 0
2      0      0 0
3      0      0 0
4      0      0 0
5      0      0 0
6      0      0 0

tail(dat)

    assign takeup  y
195      1      1  9
196      1      1 10
197      1      1 10
198      1      1 12
199      1      1 11
200      1      1  9

summary(dat)

     assign        takeup            y         
 Min.   :0.0   Min.   :0.000   Min.   : 0.000  
 1st Qu.:0.0   1st Qu.:0.000   1st Qu.: 0.000  
 Median :0.5   Median :0.000   Median : 0.000  
 Mean   :0.5   Mean   :0.435   Mean   : 4.375  
 3rd Qu.:1.0   3rd Qu.:1.000   3rd Qu.:10.000  
 Max.   :1.0   Max.   :1.000   Max.   :13.000

iv1 <- ivreg(y ~ takeup, ~assign, data = dat)
summary(iv1)


Call:
ivreg(formula = y ~ takeup | assign, data = dat)

Residuals:
      Min        1Q    Median        3Q       Max 
-3.065942 -0.065942  0.006522  0.006522  2.934058 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.006522   0.085095  -0.077    0.939    
takeup      10.072464   0.153269  65.718   <2e-16 ***

Diagnostic tests:
                 df1 df2 statistic p-value    
Weak instruments   1 198   185.933  <2e-16 ***
Wu-Hausman         1 197     0.018   0.892    
Sargan             0  NA        NA      NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7478 on 198 degrees of freedom
Multiple R-Squared: 0.9782,	Adjusted R-squared: 0.9781 
Wald test:  4319 on 1 and 198 DF,  p-value: < 2.2e-16

The treatment on treated (TOT) effect is 10.072 (SE = 0.153).

However, I indicated that:

Although conceptually, the model is a full mediation model, the effect is not estimated using path analysis or structural equation modeling (SEM) as is commonly done in education or psychology (i.e., the indirect path is not path a x path b).

Using SEM, the results do not match.

library(lavaan)

#incorrect
t1 <- '
y ~ takeup
takeup ~ assign'

res1 <- lavaan::sem(model = t1, data = dat)
summary(res1)

lavaan 0.6-19 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         4

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.019
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.891

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y ~                                                 
    takeup           10.057    0.106   94.774    0.000
  takeup ~                                            
    assign            0.690    0.050   13.704    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 0.554    0.055   10.000    0.000
   .takeup            0.127    0.013   10.000    0.000

Although the path model looks correct, the estimates are off. The TOT (IV) effect is NOT 0.69 $\times$ 10.057 which is 6.939 (vs. 10.072). The difference as well is that $y$ is not regressed on the actual/observed takeup values but the predicted takeup values (read the article).

Later on, I came across this post by Paul Allison who had a solution to get the correct estimate. The solution just involved correlating the error terms of the outcome and the actual takeup values.

### correct
t2 <- '
y ~ takeup
takeup ~ assign
y ~~ takeup'

res2 <- lavaan::sem(model = t2, data = dat)
summary(res2)

lavaan 0.6-19 ended normally after 13 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

  Number of observations                           200

Model Test User Model:
                                                      
  Test statistic                                 0.000
  Degrees of freedom                                 0

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|)
  y ~                                                 
    takeup           10.072    0.153   66.049    0.000
  takeup ~                                            
    assign            0.690    0.050   13.704    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
 .y ~~                                                
   .takeup           -0.004    0.027   -0.137    0.891

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .y                 0.554    0.055    9.998    0.000
   .takeup            0.127    0.013   10.000    0.000

The point estimate and the standard errors of the takeup variable now match: 10.072 (SE = 0.153). However, note that this is still not the same as testing the indirect effect of path (a) $\times$ path(b).

Instrumental variables within an SEM framework

Related