The Effective Number of Clusters

Feb 2, 2026 · 7 min read

Although cluster-robust standard errors (CRSEs) are commonly used when analyzing nested/clustered data, there are many instances where they do not work properly (e.g., the type I error rate will be too high). I wrote about this in Educational and Psychological Measurement, When cluster-robust inferences fail. These problems can arise when the data have only a few clusters, though this can be an issue even with 100 clusters (see the article). Other issues arise when: a) cluster sizes differ to a large extent and b) the dataset has only a few ‘treated’ higher-level groups.

Although using CRSEs is helpful, when computing p-values it is also important to consider the degrees of freedom (df) used; it is not the standard error alone that matters. When analyzing clustered data, cluster-level predictors have fewer df (these should be based on G, the number of groups) compared to level-1 predictors (which are based on the total sample size, n).

For example, the High School and Beyond dataset has 7185 students nested in 160 schools:

library(mlmRev)
data(Hsb82)
NROW(Hsb82)
[1] 7185
unique(Hsb82$school) |> length() #number of schools
[1] 160

Analyzing level-2 predictors using $n - k - 1$ df (with $k$ predictors) when there are really only 160 schools can be misleading. For level-2 predictors, we can instead use $G - L - 1$, where $L$ is the number of level-2 predictors.
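
To see why the df matters, here is a small illustration of my own (hypothetical numbers, not the HSB data): the same t statistic can give a noticeably different p-value depending on whether the df are based on n or on G.

tstat <- 2.1 #a hypothetical t statistic
2 * pt(abs(tstat), df = 2000 - 5 - 1, lower.tail = FALSE) #n-based df, roughly .036
2 * pt(abs(tstat), df = 20 - 2 - 1, lower.tail = FALSE) #G-based df with 20 clusters, just over .05

With only a handful of clusters, the same coefficient can move from significant to not significant purely because of the df.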

However, when cluster sizes vary (i.e., the number of units in each cluster is not the same; the data are unbalanced), this can lower what is referred to as the effective number of clusters. This was written about decades ago in the survey sampling literature (Potthoff et al., 1992). At a basic level, the effective number of clusters (or G*) can be computed using:

$$ G^* = \frac{\left( \sum_{g=1}^{G} N_g \right)^2}{\sum_{g=1}^{G} N_g^2}. $$

Still using the Hsb dataset, we get:

ng <- table(Hsb82$school) #number of students per school
sum(ng)^2 / sum(ng^2) #ecs v1
[1] 149.6367

This is not a big difference though (160 vs. ~150), partly because the coefficient of variation (cv) of the cluster sizes is not very high. There is some variability (14 to 67 students per school), but not much based on the cv, which is just the standard deviation of the cluster sizes divided by their mean:

sd(ng) / mean(ng) #coefficient of variation
[1] 0.2639919

The cv here is only ~0.26. If the cluster sizes were completely balanced, the cv would be 0. As a frame of reference, in cluster randomized controlled trials we may see a cv of around 0.65 (Eldridge et al., 2006). Values above 1 may be problematic.
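
As a side note (a small check of my own, not taken from the references above), this version of G* depends on the cluster sizes only through the cv: algebraically, $G^* = G / (1 + \text{cv}^2)$ when the cv uses the population standard deviation (dividing by G rather than G - 1).

cv_pop <- sqrt(mean((ng - mean(ng))^2)) / mean(ng) #cv using the population SD
length(ng) / (1 + cv_pop^2) #matches the ~149.6 computed above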

Note, however, that the computation of G* so far does not consider the predictors used in the model (well, there is no model yet). Based on Carter et al. (2017) and Lee and Steigerwald (2018):

$$ G^* = \frac{G}{1 + \Gamma} $$

where

$$ \Gamma = \frac{1}{G} \sum_{g=1}^{G} \left( \frac{\gamma_g - \bar{\gamma}}{\bar{\gamma}} \right)^2. $$

Here, $\gamma_g = \mathbf{a}' (X'X)^{-1} (X'_g i_g i'_g X_g) (X'X)^{-1} \mathbf{a}$. The vector $\mathbf{a}$ is a selection vector (picking out the coefficient of interest) and $i_g$ is a vector of 1s of length Ng (see Carter et al., 2017). In the manuscript, I indicated that it was a 1 $\times$ Ng matrix of 1s, but it should be Ng $\times$ 1 (well, it's really a vector).

A function to compute G*

Here’s some R code I put together to compute G* based on Carter et al. (2017):

effclust <- function(mx, cluster){
  
  X <- model.matrix(mx) #design matrix
  n <- nobs(mx) #how many observations
  br <- solve(t(X) %*% X) #bread matrix
  Xj <- split(data.frame(X), cluster) #split by cluster
  js <- lapply(Xj, NROW) #number of observations per cluster
  Im <- Yg <- list() #container
  G <- length(Xj) #how many clusters
  k <- ncol(Xj[[1]]) #how many columns in the design matrix (incl. intercept)
  
  # building i_g i_g' for each cluster (an Ng x Ng matrix of 1s)
  for (i in 1:length(js)){
    tmp <- matrix(1, nrow = 1, ncol = js[[i]])
    Im[[i]] <- t(tmp) %*% (tmp)
  }
  
  ecs <- numeric() #container for the per-coefficient effective number of clusters
  
  initsel <- rep(0, k) #initialize the selection vector
  
  for (xx in 1:k){
  
    sel <- initsel
    sel[xx] <- 1 #which variable to select
      for (i in 1:G){
        Yg[[i]] <- t(sel) %*% br %*% as.matrix(t(Xj[[i]])) %*% 
          Im[[i]] %*% as.matrix(Xj[[i]]) %*% 
          br %*% sel
      }
    
    ybar <- mean(unlist(Yg))
    lam <- sum(sapply(Yg, \(x) (x - ybar) / ybar)^2) / G
    ecs[xx] <- G / (1 + lam) #effective number of clusters
  
  }
  names(ecs) <- colnames(X)
  return(ecs)
}

We can compare this with another R package, effClust (that package is more flexible because you can alter rho). To compute G*, we need a model:

m1 <- lm(mAch ~ sx + minrty + sector + meanses, data = Hsb82)
effclust(m1, Hsb82$school) #my function, this can fail in some instances
   (Intercept)       sxFemale      minrtyYes sectorCatholic 
      81.90398       36.40511       44.95124       80.93753 
       meanses 
      51.56209 
effClust::effClust(m1, ~school, rho = 1)
   (Intercept)       sxFemale      minrtyYes sectorCatholic 
      81.90398       36.40511       44.95124       80.93753 
       meanses 
      51.56209 

Results are the same.

Note that G* differs per predictor. The level-2 predictors (sector and meanses) have an effective number of clusters much lower than the original 160. Here this might not make a big difference, but it will in other situations.
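
To make the idea concrete, here is a sketch (mine, not part of the original analysis) of how the predictor-specific G* could be used as the df in a cluster-robust test of meanses, subtracting the number of level-2 predictors (sector and meanses, so L = 2) as described earlier. The vcovCL/coeftest calls mirror those used in the simulation below.

library(sandwich)
library(lmtest)
vc <- vcovCL(m1, Hsb82$school) #CR1 cluster-robust vcov
gstar <- effclust(m1, Hsb82$school) #predictor-specific G*
coeftest(m1, vc, df = gstar["meanses"] - 2 - 1)["meanses", ] #G* - L - 1 df
coeftest(m1, vc, df = NROW(Hsb82) - 4 - 1)["meanses", ] #naive n - k - 1 df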

An example using G*

Let’s evaluate the type I error rate of CRSEs using the crct dataset in the CR2 package. In this case, there are only 18 clusters:

library(CR2)
data(crct)
ng <- table(crct$usid) #number of obs in school
ns <- length(ng) #number of clusters
crct <- dplyr::arrange(crct, usid) #just sorting the dataset

We’ll run a small simulation: create a random cluster-level predictor and evaluate how often it is statistically significant. As it is a random variable unrelated to the outcome, it should only be significant 5% of the time (our alpha level). I will use the default CRSE from the sandwich package, which is the CR1 (or HC1).

library(sandwich)
library(clubSandwich)
library(lmtest)
reps <- 1000 #how many replications
pv1 <- pv2 <- pv3 <- pv4 <- tmp <- numeric()
NN <- nrow(crct)

set.seed(20260126) #for reproducibility
for (i in 1:reps){
  
  crct$wg <- rep(rnorm(ns), ng) #random level-2 predictor
  
  m1 <- lm(odr_post ~ odr_pre + female + stype + trt +
             size + race_Black + wg,
           data = crct)
  
  vc <- vcovCL(m1, crct$usid) #get the CRSE, HC1 by default
  vc2 <- vcovCR(m1, crct$usid, type = 'CR2') #the CR2
  df2 <- effclust(m1, crct$usid) #get G*
  adjdf <- df2[9] - 5 #G* for wg minus the number of level-2 predictors
  adjdf <- max(adjdf, 1) #to avoid df < 1
  tmp[i] <- adjdf #just to inspect
  res1 <- coeftest(m1, vc, df = adjdf) #there are 5 CL predictors
  res2 <- coeftest(m1, vc, df = Inf) #a z test
  res3 <- coeftest(m1, vc, df = ns - 1) #just using the Stata default
  res4 <- coef_test(m1, vc2) #by default uses Satterthwaite df
  ## saving the p values
  pv1[i] <- res1[9, 4] #G*
  pv2[i] <- res2[9, 4] #z
  pv3[i] <- res3[9, 4] #G-1
  pv4[i] <- res4[9, 7] #cr2

# if (i %% 100 == 0) print(i)

}

We can now evaluate the type I error rates:

data.frame(
  gstar = mean(pv1 < .05),
  ztest = mean(pv2 < .05),
  stata = mean(pv3 < .05),
  cr2 = mean(pv4 < .05)
)
  gstar ztest stata  cr2
1 0.025 0.172  0.15 0.05

Interestingly:

  • The G* df results are closer to the expected .05 error rate (but are now overly conservative).
  • The z test (unsurprisingly) had the highest type I error rate, at around 17%. NOTE: this is the type I error rate you would get if you are using SEM software (like lavaan or Mplus), which assumes infinite degrees of freedom (so be careful).
  • Using G - 1 is a bit (but not much) better than the z test.
  • The CR2 (together with Satterthwaite degrees of freedom), as we have shown in other papers, performs the best.

What’s interesting is that I am just using the usual, conventional CR1 standard errors; the only difference is the df being used.
