Omitted Variable Bias (OVB) Example
Illustrates why OVB is an issue
This issue plagues much of the analysis based on secondary or observational data
- The data already exist
- Relevant characteristics may be unobserved because they were never collected
To illustrate how OVB may affect regression results, we examine some simulated data.
Create some correlated data
library(stargazer) #to create simpler regression output
library(gendata) #to simulate data
#1 create two correlated variables X1 and X2 (r = .50)
dat <- genmvnorm(k = 2, .5, n = 1000, seed = 123)
cor(dat)
##       X1    X2
## X1 1.000 0.488
## X2 0.488 1.000
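If the gendata package is not available, an equivalent dataset can be simulated with MASS::mvrnorm (a sketch; dat2 is an arbitrary name, and the draws will not match genmvnorm's exactly even with the same seed):
library(MASS) #for mvrnorm()
set.seed(123)
Sigma <- matrix(c(1, .5, .5, 1), nrow = 2) #unit variances, correlation .5
dat2 <- as.data.frame(mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma))
names(dat2) <- c("X1", "X2")
cor(dat2$X1, dat2$X2) #roughly .5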
Plot of the correlated variables
plot(dat)

Create Y through this data generating process
B1 <- 3
B2 <- 5
set.seed(1234) #add seed so the error is the same
dat$Y <- dat$X1 * B1 + dat$X2 * B2 + rnorm(1000) #add error
#coefficient for B1 = 3 and B2 = 5
Run some regressions
B1 should be 3 and B2 should be 5.
m1 <- lm(Y ~ X1 + X2, data = dat)
summary(m1) #correct
##
## Call:
## lm(formula = Y ~ X1 + X2, data = dat)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -3.296 -0.656 -0.012  0.637  3.202
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0265     0.0316   -0.84      0.4
## X1            3.0437     0.0377   80.66   <2e-16 ***
## X2            5.0062     0.0350  143.10   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.997 on 997 degrees of freedom
## Multiple R-squared: 0.981, Adjusted R-squared: 0.98
## F-statistic: 2.51e+04 on 2 and 997 DF, p-value: <2e-16
What happens if we ‘forget’ or omit X1 or X2?
m2 <- lm(Y ~ X1, data = dat)
m3 <- lm(Y ~ X2, data = dat)
#export_summs(m1, m2, m3)
Compare models side by side:
stargazer(m1, m2, m3, type = 'text', no.space = T,
          star.cutoffs = c(.05, .01, .001),
          keep.stat = c("n", "rsq"))
##
## ============================================
##                   Dependent variable:
##              ------------------------------
##                            Y
##                 (1)       (2)       (3)
## --------------------------------------------
## X1            3.040***  5.680***
##               (0.038)   (0.153)
## X2            5.010***            6.380***
##               (0.035)             (0.084)
## Constant       -0.026    0.169     -0.097
##               (0.032)   (0.146)   (0.087)
## --------------------------------------------
## Observations   1,000     1,000     1,000
## R2             0.981     0.580     0.853
## ============================================
## Note:          *p<0.05; **p<0.01; ***p<0.001
Notice
- How strong the bias is when the included and omitted variables are correlated with each other
- How different the X1 and X2 coefficients in models 2 and 3 are from those in model 1
- The size and direction of the bias depend on the strength and direction of that correlation (see the check below)
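A quick way to see where the inflated estimates come from is the omitted-variable-bias decomposition: the coefficient in the short regression equals the full-model coefficient plus the full-model coefficient on the omitted variable times the slope from an auxiliary regression of the omitted variable on the included one. A minimal check using the objects created above (the name delta is new here):
#OVB decomposition: short-model slope = full-model slope + omitted slope * auxiliary slope
delta <- coef(lm(X2 ~ X1, data = dat))["X1"] #auxiliary regression slope
coef(m1)["X1"] + coef(m1)["X2"] * delta #implied coefficient on X1 when X2 is omitted
coef(m2)["X1"] #matches the X1 estimate in model 2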
If the variables are not correlated with each other
- Omitting one will not bias the coefficient on the other
set.seed(541)
X1 <- rnorm(1000)
X2 <- rnorm(1000)
Y <- X1 * B1 + X2 * B2 + rnorm(1000)
By construction, X1 and X2 are not correlated with each other
cor(X1, X2)
## [1] 0.00236
plot(X1, X2)

Results will not be biased
m4 <- lm(Y ~ X1 + X2)
m5 <- lm(Y ~ X1)
m6 <- lm(Y ~ X2)
stargazer(m4, m5, m6, type = 'text', no.space = T,
          star.cutoffs = c(.05, .01, .001),
          keep.stat = c("n", "rsq"))
##
## ============================================
##                   Dependent variable:
##              ------------------------------
##                            Y
##                 (1)       (2)       (3)
## --------------------------------------------
## X1            3.000***  3.010***
##               (0.031)   (0.159)
## X2            4.980***            4.990***
##               (0.032)             (0.101)
## Constant       0.0003    0.301     -0.010
##               (0.031)   (0.159)   (0.100)
## --------------------------------------------
## Observations   1,000     1,000     1,000
## R2             0.972     0.263     0.711
## ============================================
## Note:          *p<0.05; **p<0.01; ***p<0.001
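The same decomposition shows why the estimates barely move here: the auxiliary slope of X2 on X1 is essentially zero, so the implied bias term is negligible (a sketch; delta0 is a new name):
delta0 <- coef(lm(X2 ~ X1))["X1"] #auxiliary slope, close to 0 by construction
coef(m4)["X2"] * delta0 #implied bias in the short model's X1 coefficient, close to 0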
This is also a reason why experiments use random assignment
- Due to randomization, X (the treatment assignment variable) will not be related to the error term
- With successful (‘happy’) randomization, the two groups will be the ‘same’, on average, on both observed and unobserved characteristics (a small simulation sketch follows below)
- Notice as well the differences in the standard errors: they are much smaller in the full model because the omitted predictor no longer inflates the residual variance
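A minimal sketch of that logic, under assumed names (D for a randomly assigned treatment, U for an unobserved characteristic): because D is assigned at random it is unrelated to U, so omitting U leaves the estimated treatment effect essentially unbiased.
set.seed(99)
U <- rnorm(1000) #unobserved characteristic
D <- rbinom(1000, 1, .5) #randomly assigned treatment (0/1)
Y2 <- 2 * D + 4 * U + rnorm(1000) #true treatment effect = 2
coef(lm(Y2 ~ D))["D"] #close to 2 (up to sampling noise) even though U is omitted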