Why does centering reduce multicollinearity?

Last updated on Jan 21, 2023

Centering often reduces the correlation between the individual variables (x1, x2) and the product term (x1 \(\times\) x2). In the example below, r(x1, x1x2) = .80. With the centered variables, r(x1c, x1x2c) = -.15.

NOTE: Centering does not always reduce multicollinearity and can sometimes make it worse; for examples, see EPM article.

set.seed(123)
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 15, 1) 
x1x2 <- x1*x2
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
x1x2c <- x1c * x2c
dat <- data.frame(x1, x2, x1x2, x1c, x2c, x1x2c)
round(cor(dat), 2)

##          x1    x2  x1x2   x1c   x2c x1x2c
## x1     1.00 -0.05  0.80  1.00 -0.05 -0.15
## x2    -0.05  1.00  0.55 -0.05  1.00 -0.17
## x1x2   0.80  0.55  1.00  0.80  0.55 -0.17
## x1c    1.00 -0.05  0.80  1.00 -0.05 -0.15
## x2c   -0.05  1.00  0.55 -0.05  1.00 -0.17
## x1x2c -0.15 -0.17 -0.17 -0.15 -0.17  1.00

A question may be raised, though: why does centering reduce collinearity?

Consider the basic equation for a correlation:

\[r_{(X, Y)} = \frac{cov(X,Y)}{\sqrt{(var(X) \cdot var(Y))}}\]
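As a quick check of this definition, the sketch below computes a correlation by hand from cov() and var() and compares it with cor(). The simulated x and y are illustrative assumptions for this sketch, not the data used elsewhere in this post.

```r
# Illustrative data (an assumption for this sketch, not the data used elsewhere)
set.seed(1)
x <- rnorm(100, 10, 1)
y <- 0.5 * x + rnorm(100, 0, 1)

# Correlation from its definition: cov(X, Y) / sqrt(var(X) * var(Y))
r_manual <- cov(x, y) / sqrt(var(x) * var(y))

# Should agree with the built-in function up to floating-point error
all.equal(r_manual, cor(x, y))
```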

For the product score (X1X2) and X1:

\[r_{(X1X2, X1)} = \frac{cov(X1X2, X1)}{\sqrt{(var(X1X2) \cdot var(X1))}}\]

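Focusing only on the numerator and using covariance algebra, the covariance of a product score (X1X2) with another variable (X1) can be written as follows, with A = X1, B = X2, and C = X1. (The expansion omits a third-order central moment term, which is zero in expectation when the variables are multivariate normal.)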

\[cov(AB, C) = \mathbb{E}(A) \cdot cov(B, C) + \mathbb{E}(B) \cdot cov(A, C)\] \[= \mathbb{E}(X1) \cdot cov(X2, X1) + \mathbb{E}(X2) \cdot cov(X1, X1)\] \[= \mathbb{E}(X1) \cdot cov(X2, X1) + \mathbb{E}(X2) \cdot var(X1)\]
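The two-term expansion above is exact only when the third-order term, the expected product of the three deviations from their means, is zero; that holds in expectation for normally distributed variables but not exactly in any one sample. The sketch below, using the same generating values as the opening example, checks that the sample covariance equals the two terms plus that leftover piece. The names `two_term` and `third` are made up for this illustration.

```r
set.seed(123)
n  <- 100
x1 <- rnorm(n, 10, 1)
x2 <- rnorm(n, 15, 1)

# Left-hand side: sample covariance of the product term with x1
lhs <- cov(x1 * x2, x1)

# Two-term expansion, with sample means standing in for expectations
two_term <- mean(x1) * cov(x2, x1) + mean(x2) * var(x1)

# Third-order piece that the two-term expansion leaves out
d1 <- x1 - mean(x1)
d2 <- x2 - mean(x2)
third <- sum(d1 * d2 * d1) / (n - 1)

c(lhs = lhs, two_term = two_term, third = third)

# The decomposition holds exactly in the sample (up to floating-point error)
all.equal(lhs, two_term + third)
```

With raw scores the two-term part dominates because it is scaled by the means (10 and 15), which is why the raw product term correlates so strongly with its components.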

With mean-centered variables:

\[r_{((X1 - \bar{X}1)(X2 - \bar{X}2), (X1 - \bar{X}1))} = \frac{cov((X1 - \bar{X}1)(X2 - \bar{X}2), (X1 - \bar{X}1))}{\sqrt{var((X1 - \bar{X}1)(X2 - \bar{X}2)) \cdot var((X1 - \bar{X}1))}}\]

Focusing only on the numerator again:

\[= \mathbb{E}(X1 - \bar{X}1) \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + \mathbb{E}(X2 - \bar{X}2) \cdot cov(X1 - \bar{X}1, X1 - \bar{X}1)\] \[= \mathbb{E}(X1 - \bar{X}1) \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + \mathbb{E}(X2 - \bar{X}2) \cdot var(X1 - \bar{X}1)\]

The expected value of a mean-centered variable, though, is zero. Both terms in the numerator are therefore multiplied by zero, and if the numerator is zero, the whole correlation reduces to zero (on average).

\[= 0 \cdot cov(X2 - \bar{X}2, X1 - \bar{X}1) + 0 \cdot var(X1 - \bar{X}1)\]
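A small sketch of this step, again assuming the generating values from the opening example: the sample means of the centered variables are zero up to floating-point error, so the two-term numerator vanishes, while the actual sample covariance of the centered product term is small but not exactly zero. The names `two_term_c`, `cov_raw`, and `cov_centered` are hypothetical.

```r
set.seed(123)
x1 <- rnorm(100, 10, 1)
x2 <- rnorm(100, 15, 1)
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)

# Sample means of centered variables are zero up to floating-point error
c(mean_x1c = mean(x1c), mean_x2c = mean(x2c))

# So the two-term expansion of the centered numerator is numerically zero
two_term_c <- mean(x1c) * cov(x2c, x1c) + mean(x2c) * var(x1c)

# The actual sample covariances differ sharply in size
cov_raw      <- cov(x1 * x2, x1)
cov_centered <- cov(x1c * x2c, x1c)
c(two_term_c = two_term_c, cov_raw = cov_raw, cov_centered = cov_centered)
```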

Using a short simulation:

1. Randomly generate 100 values each for x1 and x2
2. Mean center the variables
3. Compute the corresponding interactions (x1x2 and x1x2c)
4. Get the correlations of the variables and the product term (r is for the raw variables, cr is for the centered variables)
5. Get the average of the terms over the replications

set.seed(4567)
reps <- 1000
r1 <- r2 <- cr1 <- cr2 <- numeric(reps)
for (i in 1:reps){
  x1 <- rnorm(100, 10, 1) #mean of 10, SD = 1
  x2 <- rnorm(100, 15, 1) #mean of 15, SD = 1 
  x1x2 <- x1*x2
  x1c <- x1 - mean(x1)
  x2c <- x2 - mean(x2)
  x1x2c <- x1c * x2c
  cr1[i] <- cor(x1c, x1x2c)
  cr2[i] <- cor(x2c, x1x2c)
  r1[i] <- cor(x1, x1x2)
  r2[i] <- cor(x2, x1x2)
}

# r(x1,x2) should be zero because generated independently
res <- data.frame(r1, r2, cr1, cr2)
round(colMeans(res), 3)

##     r1     r2    cr1    cr2 
##  0.829  0.551 -0.001  0.008
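As a rough cross-check on the raw-score averages, the population values implied by the simulation settings (means of 10 and 15, SDs of 1, independent draws so cov(X1, X2) = 0) can be plugged into the covariance expansion. The denominator uses a standard result that is not part of the derivation above: for independent variables, var(X1X2) = E(X1)² · var(X2) + E(X2)² · var(X1) + var(X1) · var(X2).

\[cov(X1X2, X1) = 10 \cdot 0 + 15 \cdot 1 = 15\]
\[cov(X1X2, X2) = 10 \cdot 1 + 15 \cdot 0 = 10\]
\[var(X1X2) = 10^2 \cdot 1 + 15^2 \cdot 1 + 1 \cdot 1 = 326\]
\[r_{(X1X2, X1)} = \frac{15}{\sqrt{326 \cdot 1}} \approx .83\]
\[r_{(X1X2, X2)} = \frac{10}{\sqrt{326 \cdot 1}} \approx .55\]

These values are close to the simulated averages of 0.829 and 0.551.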

On average, the correlations between the centered variables and their product term are at or near 0. In any single sample they are not exactly zero, and plotting the distribution shows the range of correlations.

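library(dplyr)
library(tidyr)
library(ggplot2)
# Reshape to long format: one row per replication and correlation type
mm <- pivot_longer(res, cols = everything(), names_to = 'vars', values_to = 'r')
# One histogram panel per correlation (cr1, cr2, r1, r2)
mm %>% ggplot(aes(x = r)) +
  geom_histogram(bins = 60) + facet_grid(~vars) +
  theme_bw()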
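As a numeric complement to the histogram, the spread of the centered correlations can also be summarized directly. The sketch below is self-contained and reruns a simulation with the same settings; the helper `one_rep()` is a hypothetical name introduced here, and only the x1 correlation is tracked to keep it short.

```r
set.seed(4567)
reps <- 1000

# One replication: correlation of centered x1 with the centered product term
one_rep <- function(n = 100) {
  x1 <- rnorm(n, 10, 1)
  x2 <- rnorm(n, 15, 1)
  x1c <- x1 - mean(x1)
  x2c <- x2 - mean(x2)
  cor(x1c, x1c * x2c)
}
cr1 <- replicate(reps, one_rep())

# Center, spread, and central 95% range of the centered correlations
round(c(mean = mean(cr1), sd = sd(cr1), quantile(cr1, c(0.025, 0.975))), 3)
```

The mean should sit near zero, while the standard deviation and quantiles show how far a single sample of 100 can stray from it.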

– END
