# Why does centering reduce multicollinearity?

Centering often reduces the correlation between the individual variables (x1, x2) and the product term (x1 $\times$ x2). In the example below, r(x1, x1x2) = .80. With the centered variables, r(x1c, x1x2c) = -.15.

NOTE: For examples where centering does not reduce multicollinearity, and may even make it worse, see the EPM article.

```r
set.seed(123)
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 15, 1)
x1x2 <- x1 * x2
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
x1x2c <- x1c * x2c
dat <- data.frame(x1, x2, x1x2, x1c, x2c, x1x2c)
round(cor(dat), 2)
##          x1    x2  x1x2   x1c   x2c x1x2c
## x1     1.00 -0.05  0.80  1.00 -0.05 -0.15
## x2    -0.05  1.00  0.55 -0.05  1.00 -0.17
## x1x2   0.80  0.55  1.00  0.80  0.55 -0.17
## x1c    1.00 -0.05  0.80  1.00 -0.05 -0.15
## x2c   -0.05  1.00  0.55 -0.05  1.00 -0.17
## x1x2c -0.15 -0.17 -0.17 -0.15 -0.17  1.00
```

Why, though, does centering reduce collinearity?

Consider the basic equation for a correlation:

$r_{(X, Y)} = \frac{cov(X,Y)}{\sqrt{(var(X) \cdot var(Y))}}$

For the product score (X1X2) and X1:

$r_{(X1X2, X1)} = \frac{cov(X1X2, X1)}{\sqrt{(var(X1X2) \cdot var(X1))}}$

Focusing only on the numerator and using covariance algebra, the covariance of a product score (AB) with another variable (C) can, assuming multivariate normality, be written as:

$cov(AB, C) = \mathbb{E}(A) \cdot cov(B, C) + \mathbb{E}(B) \cdot cov(A, C)$

Substituting A = X1, B = X2, and C = X1:

$cov(X1X2, X1) = \mathbb{E}(X1) \cdot cov(X2, X1) + \mathbb{E}(X2) \cdot cov(X1, X1)$

$= \mathbb{E}(X1) \cdot cov(X2, X1) + \mathbb{E}(X2) \cdot var(X1)$
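
The identity can be checked numerically. The sketch below is an illustration rather than part of the original analysis: it reuses the seed and distributions from the first example, substitutes sample means for the expectations, and adds the sample third central moment term (stored in a variable named `third`, a name introduced here). The two-term form drops that term, which is zero in expectation for normal data.

```r
# Sketch: compare cov(x1*x2, x1) with its two-term decomposition.
set.seed(123)
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 15, 1)
n <- length(x1)

# Left-hand side: covariance of the product term with x1
lhs <- cov(x1 * x2, x1)

# Two-term form from the text, with sample means standing in for E()
two_term <- mean(x1) * cov(x2, x1) + mean(x2) * var(x1)

# Sample third central moment term omitted by the two-term form
third <- sum((x1 - mean(x1))^2 * (x2 - mean(x2))) / (n - 1)

c(lhs = lhs, two_term = two_term, third = third)

# The two sides should agree up to floating-point error once `third` is added
all.equal(lhs, two_term + third)
```

With means of 10 and 15, the term involving the mean of x2 and the variance of x1 dominates, which is why the raw product term correlates so strongly with x1.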

With mean-centered variables:

$r_{((X1 - \bar{X}1)(X2 - \bar{X}2), (X1 - \bar{X}1))} = \frac{cov((X1 - \bar{X}1)(X2 - \bar{X}2), (X1 - \bar{X}1))}{\sqrt{var((X1 - \bar{X}1)(X2 - \bar{X}2)) \cdot var((X1 - \bar{X}1))}}$

Focusing only on the numerator again:

$= \mathbb{E}(X1 - \bar{X}1) \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + \mathbb{E}(X2 - \bar{X}2) \cdot cov(X1 - \bar{X}1, X1 - \bar{X}1)$ $= \mathbb{E}(X1 - \bar{X}1) \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + \mathbb{E}(X2 - \bar{X}2) \cdot var(X1 - \bar{X}1)$

The expected value of a mean-centered variable, however, is zero, so both terms of the numerator vanish:

$= 0 \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + 0 \cdot var(X1 - \bar{X}1) = 0$

With a numerator of zero, the correlation as a whole is zero (on average).
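
In any one sample the centered numerator is small but not exactly zero. The following sketch, again an illustration using the seed and distributions from the first example, suggests what is left over: the sample means of the centered variables are zero up to floating-point error, and the remaining covariance equals a sample third central moment term (the variable name `third` is an assumption made for this example).

```r
# Sketch: what remains of the numerator after centering.
set.seed(123)
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 15, 1)
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
n <- length(x1)

# Means of the centered variables: zero up to floating-point error
c(mean_x1c = mean(x1c), mean_x2c = mean(x2c))

# Numerator of r(x1c * x2c, x1c)
num_centered <- cov(x1c * x2c, x1c)

# Sample third central moment term; it is zero only in expectation
third <- sum(x1c^2 * x2c) / (n - 1)

# Should agree up to floating-point error
all.equal(num_centered, third)
```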

## Using a short simulation

• Randomly generate 100 observations each for x1 and x2
• Mean center the variables
• Compute the corresponding product terms (x1x2 and x1x2c)
• Get the correlation of each variable with its product term (r1 and r2 are for the raw variables, cr1 and cr2 are for the centered variables)
• Average the correlations over the replications

```r
set.seed(4567)
reps <- 1000
r1 <- r2 <- cr1 <- cr2 <- numeric(reps)
for (i in seq_len(reps)) {
  x1 <- rnorm(100, 10, 1) # mean of 10, SD = 1
  x2 <- rnorm(100, 15, 1) # mean of 15, SD = 1
  x1x2 <- x1 * x2
  x1c <- x1 - mean(x1)
  x2c <- x2 - mean(x2)
  x1x2c <- x1c * x2c
  # correlations with the centered product term
  cr1[i] <- cor(x1c, x1x2c)
  cr2[i] <- cor(x2c, x1x2c)
  # correlations with the raw product term
  r1[i] <- cor(x1, x1x2)
  r2[i] <- cor(x2, x1x2)
}

# x1 and x2 are generated independently, so r(x1, x2) is zero in expectation
res <- data.frame(r1, r2, cr1, cr2)
round(colMeans(res), 3)
##     r1     r2    cr1    cr2
##  0.829  0.551 -0.001  0.008
```
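
As a rough check on the raw-variable averages, the population correlations can be worked out by hand. This derivation is an addition that assumes x1 and x2 are independent, as in the simulation, with means $\mu_1 = 10$, $\mu_2 = 15$ and variances $\sigma_1^2 = \sigma_2^2 = 1$. For independent variables the covariance term $cov(X2, X1)$ is zero, and the variance of the product is:

$var(X1X2) = \mu_1^2 \sigma_2^2 + \mu_2^2 \sigma_1^2 + \sigma_1^2 \sigma_2^2 = 100 + 225 + 1 = 326$

so that:

$r_{(X1X2, X1)} = \frac{\mu_2 \sigma_1^2}{\sqrt{var(X1X2) \cdot \sigma_1^2}} = \frac{15}{\sqrt{326}} \approx 0.831$

$r_{(X1X2, X2)} = \frac{\mu_1 \sigma_2^2}{\sqrt{var(X1X2) \cdot \sigma_2^2}} = \frac{10}{\sqrt{326}} \approx 0.554$

Both values are close to the simulated averages for r1 and r2.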

On average, the correlations between the centered variables and their product term are at or near zero. In any single sample they are not exactly zero, and plotting the distributions shows the range of correlations.

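```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Reshape to long format: one row per replication and correlation type
# (pivot_longer() is the current tidyr replacement for gather())
mm <- pivot_longer(res, cols = everything(), names_to = "vars", values_to = "r")

# One histogram panel per correlation (cr1, cr2, r1, r2)
mm %>%
  ggplot(aes(x = r)) +
  geom_histogram(bins = 60) +
  facet_grid(~vars) +
  theme_bw()
```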
