The Effective Number of Clusters
Although cluster-robust standard errors (CRSEs) are commonly used when analyzing nested/clustered data, there are many instances where they do not work properly (e.g., the type I error rate will be too high). I wrote about this in Educational and Psychological Measurement, in When cluster-robust inferences fail. These problems can arise when analyzing data with only a few clusters (though this can be an issue even with 100 clusters; see the article), when cluster sizes differ to a large extent, and when the dataset has only a few ‘treated’ higher-level groups.
Although using CRSEs is helpful, the degrees of freedom (df) used when computing p-values also matter (it is not the SE alone that is important). When analyzing clustered data, cluster-level predictors have fewer df available (which should be based on G, the number of groups) compared to level-1 predictors (which can use the total sample size, n).
For example, the High School and Beyond dataset has 7185 students nested in 160 schools:
library(mlmRev)
data(Hsb82)
NROW(Hsb82)
[1] 7185
unique(Hsb82$school) |> length() #number of schools
[1] 160
Computing p-values for level-2 predictors using $n - k - 1$ df (with $k$ predictors) when there are really only 160 schools can be misleading. For level-2 predictors, we can instead use $G - L - 1$, where $L$ is the number of level-2 predictors.
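As a quick illustration (using a hypothetical model with five predictors, two of them at level 2; the exact numbers will depend on your model), compare the critical t values under the two df choices:

critical_n <- qt(.975, df = 7185 - 5 - 1) #df based on total n, close to 1.96
critical_G <- qt(.975, df = 160 - 2 - 1)  #df based on G, a larger critical value
c(critical_n, critical_G)

With 160 schools the difference is modest, but with few clusters the critical value grows quickly.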
However, when cluster sizes vary (i.e., the number of units in each cluster is not the same; the data are unbalanced), this can lower what is referred to as the effective number of clusters. This was written about decades ago in the survey sampling literature (Potthoff et al., 1992). At a basic level, the effective number of clusters (or G*) can be computed using:
$$ G^* = \frac{\left( \sum_{g=1}^{G} N_g \right)^2}{\sum_{g=1}^{G} N_g^2}. $$

Still using the Hsb dataset, we get:
ng <- table(Hsb82$school) #number of students per school
sum(ng)^2 / sum(ng^2) #ecs v1
[1] 149.6367
This is not a big difference though (160 vs. ~150), because the coefficient of variation (cv) of the cluster sizes is not very high. There is some variability (14 to 67 students per school), but not much, judging by the cv, which is just the standard deviation of the cluster sizes divided by their mean:
sd(ng) / mean(ng) #coefficient of variation
[1] 0.2639919
The cv here is only ~0.26. If the cluster sizes were completely balanced, the cv would be 0. As a frame of reference, in cluster randomized controlled trials, we may see a cv of around 0.65 (Eldridge et al., 2006). Values above 1 may be problematic.
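As a rough sketch of how imbalance lowers G*, here is a small simulated example (the specific distribution and numbers are arbitrary and only for illustration):

set.seed(1234)
ng_bal <- rep(45, 160)                     #perfectly balanced sizes, cv = 0
ng_unb <- round(rlnorm(160, log(45), 0.9)) #highly unbalanced sizes
sd(ng_unb) / mean(ng_unb)                  #cv well above the ~0.26 seen here
sum(ng_bal)^2 / sum(ng_bal^2)              #G* equals 160 when balanced
sum(ng_unb)^2 / sum(ng_unb^2)              #G* is noticeably smaller when unbalanced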
Note, however, that the computation of G* here does not take into account the predictors used in the model (indeed, there is no model yet). Based on Carter et al. (2017) and Lee and Steigerwald (2018):
$$ G^* = \frac{G}{1 + \Gamma}, $$

where

$$ \Gamma = \frac{1}{G} \sum_{g=1}^{G} \left( \frac{\gamma_g - \bar{\gamma}}{\bar{\gamma}} \right)^2. $$

Here, $\gamma_g = \mathbf{a}' (X'X)^{-1} (X'_g i_g i'_g X_g) (X'X)^{-1} \mathbf{a}$. The vector $\mathbf{a}$ is a selection vector and $i_g$ is a vector of 1s of length $N_g$ (see Carter et al., 2017). In the manuscript, I indicated that it was a $1 \times N_g$ matrix of 1s, but it should be an $N_g \times 1$ matrix (well, it's really a vector).
A function to compute G*
Here’s some R code I put together to compute G* based on Carter et al. (2017):
effclust <- function(mx, cluster){
X <- model.matrix(mx) #design matrix
n <- nobs(mx) #how many observations
br <- solve(t(X) %*% X) #bread matrix
Xj <- split(data.frame(X), cluster) #split by cluster
js <- lapply(Xj, NROW) #number of observations per cluster
Im <- Yg <- list() #container
G <- length(Xj) #how many clusters
k <- ncol(Xj[[1]]) #how many predictors
# building the i_g i_g' matrices (Ng x Ng matrices of 1s)
for (i in 1:length(js)){
tmp <- matrix(1, nrow = 1, ncol = js[[i]])
Im[[i]] <- t(tmp) %*% (tmp)
}
ecs <- numeric() #container for the effective number of clusters per predictor
initsel <- rep(0, k) #initialize selection vector
for (xx in 1:k){
sel <- initsel
sel[xx] <- 1 #which variable to select
for (i in 1:G){
Yg[[i]] <- t(sel) %*% br %*% as.matrix(t(Xj[[i]])) %*%
Im[[i]] %*% as.matrix(Xj[[i]]) %*%
br %*% sel
}
ybar <- mean(unlist(Yg))
lam <- sum(sapply(Yg, \(x) (x - ybar) / ybar)^2) / G
ecs[xx] <- G / (1 + lam) #effective number of clusters
}
names(ecs) <- colnames(X)
return(ecs)
}
We can compare this with another R package called effClust (that package is more flexible, as you can alter rho). To compute G*, we need a model:
m1 <- lm(mAch ~ sx + minrty + sector + meanses, data = Hsb82)
effclust(m1, Hsb82$school) #my function, this can fail in some instances
(Intercept) sxFemale minrtyYes sectorCatholic
81.90398 36.40511 44.95124 80.93753
meanses
51.56209
effClust::effClust(m1, ~school, rho = 1)
(Intercept) sxFemale minrtyYes sectorCatholic
81.90398 36.40511 44.95124 80.93753
meanses
51.56209
Results are the same.
Note, though, that G* differs per predictor. The level-2 predictors (sector and meanses) have an effective number of clusters much lower than the original 160. Here this might not make a big difference, but it will in other situations.
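One way to use these values (a sketch; conventions for how many df to subtract vary) is to plug the predictor-specific G* into the df of a cluster-robust t test:

library(sandwich)
library(lmtest)
gstar <- effclust(m1, Hsb82$school)
vc <- vcovCL(m1, Hsb82$school) #CR1 cluster-robust vcov
#use G* for meanses (one could also subtract the number of level-2 predictors,
#as done in the simulation below)
coeftest(m1, vc, df = gstar['meanses'])['meanses', ]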
An example using G*
Let’s evaluate the type I error rate of CRSEs using the crct dataset in the CR2 package. In this case, there are only 18 clusters:
library(CR2)
data(crct)
ng <- table(crct$usid) #number of obs in school
ns <- length(ng) #number of clusters
crct <- dplyr::arrange(crct, usid) #just sorting the dataset
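As a quick check, we can look at the imbalance in this dataset the same way as before:

sd(ng) / mean(ng)     #cv of the cluster sizes
sum(ng)^2 / sum(ng^2) #simple effective number of clusters (out of 18)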
We’ll use a small simulation: we create a random cluster-level predictor and evaluate how often it is statistically significant. As it is a random variable, it should be significant only 5% of the time (our alpha level). I will use the default CRSE from the sandwich package, which is the CR1 (or HC1).
library(sandwich)
library(clubSandwich)
library(lmtest)
reps <- 1000 #how many replications
pv1 <- pv2 <- pv3 <- pv4 <- tmp <- numeric()
NN <- nrow(crct)
set.seed(20260126) #for reproducibility
for (i in 1:reps){
crct$wg <- rep(rnorm(ns), ng) #random level-2 predictor
m1 <- lm(odr_post ~ odr_pre + female + stype + trt +
size + race_Black + wg,
data = crct)
vc <- vcovCL(m1, crct$usid) #get the CRSE, HC1 by default
vc2 <- vcovCR(m1, crct$usid, type = 'CR2') #the CR2
df2 <- effclust(m1, crct$usid) #get G*
adjdf <- df2[9] - 5 #subtract the number of level-2 predictors
adjdf <- max(adjdf, 1) #to avoid df < 1
tmp[i] <- adjdf #just to inspect
res1 <- coeftest(m1, vc, df = adjdf) #there are 5 CL predictors
res2 <- coeftest(m1, vc, df = Inf) #a z test
res3 <- coeftest(m1, vc, df = ns - 1) #just using the Stata default
res4 <- coef_test(m1, vc2) #by default uses Satterthwaite df
## saving the p values
pv1[i] <- res1[9, 4] #G*
pv2[i] <- res2[9, 4] #z
pv3[i] <- res3[9, 4] #G-1
pv4[i] <- res4[9, 7] #cr2
# if (i %% 100 == 0) print(i)
}
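Since the G*-based df were saved in tmp, we can inspect how small they get (they can never exceed G - 5 = 13, because G* is at most G):

summary(tmp) #distribution of the adjusted df across replications
hist(tmp)    #values range between 1 and at most 13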
We can now evaluate the type I error rates:
data.frame(
gstar = mean(pv1 < .05),
ztest = mean(pv2 < .05),
stata = mean(pv3 < .05),
cr2 = mean(pv4 < .05)
)
gstar ztest stata cr2
1 0.025 0.172 0.15 0.05
Interestingly:
- The G* df results are closer to the expected .05 error rate (but are now overly conservative).
- The z test, unsurprisingly, had the highest type I error rate, at around 17%. NOTE: this is the type I error rate you would get if you are using SEM software (like lavaan or Mplus), which assumes infinite degrees of freedom (so be careful).
- Using G - 1 is a bit (but not much) better.
- The CR2 (together with Satterthwaite degrees of freedom), as we have shown in other papers, performs the best.
What’s interesting is that these are just the conventional CR1 standard errors; the only difference is the df being used.
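To see this concretely, take a single simulated dataset and compare the same cluster-robust SE under different df; the SE and t statistic are identical, and only the p-value changes:

crct$wg <- rep(rnorm(ns), ng) #one simulated level-2 predictor
m1 <- lm(odr_post ~ odr_pre + female + stype + trt +
           size + race_Black + wg, data = crct)
vc <- vcovCL(m1, crct$usid) #same CR1 vcov for both tests
coeftest(m1, vc, df = Inf)['wg', ]    #z test
coeftest(m1, vc, df = ns - 1)['wg', ] #t test with G - 1 df; larger p-value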
Related
- Accounting for random slopes using cluster robust standard errors
- When Cluster-robust Inferences Fail
- Accounting for random slopes using cluster-robust standard errors in multilevel models
- Cluster-robust standard errors with three-level data
- Using cluster-robust standard errors when analyzing group-randomized trials with few clusters