In class, I’ve talked about using weights in large-scale assessments. I’ve provided a bit of intuition about using weights and why they are important. Here’s some R syntax to go along with the example I discussed.

Imagine a school district with only two schools, A and B, which differ in size (School A = 100 students, School B = 1,000 students). You are asked for the district’s average score on some measure. You don’t have enough resources to assess everyone, so you sample a total of 50 students (25 from each school).

First, create data for all the kids. This shows us the true state (which we don’t observe).

```
set.seed(123)
sa <- rnorm(100, 55, 2) #school A
sb <- rnorm(1000, 45, 2) #school B
district <- c(sa, sb) #district scores
hist(district, main = 'District Scores', xlab = 'Scores', breaks = 20)
```

`(ov <- mean(c(sa, sb))) #this is the true average score`

`[1] 45.96022`

Now, imagine that we administer the assessment to 25 kids in each school:

```
sa.s <- sa[sample(100, 25)] #25 in school A
sb.s <- sb[sample(1000, 25)] #25 in school B
```

Here are the mean scores in each school’s sample. Both are close to the true means in each school:

`mean(sa.s)`

`[1] 55.60827`

`mean(sb.s)`

`[1] 45.49553`

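If you merely take the average score of the 50 kids, you do not get the mean of students in the school district. Because each school contributes 25 kids, school A counts for half of this average even though it enrolls only 100 of the district’s 1,100 students: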

`(ov.s <- mean(c(sa.s, sb.s))) #not representative of the pop of interest`

`[1] 50.5519`

But, we can use weights to make each response count appropriately. Remember weights are based on the inverse of the probability of selection. Let’s create a `data.frame` with this information:

```
sa.df <- data.frame(score = sa.s, wt = 1/(25/100), school = 'a')
sb.df <- data.frame(score = sb.s, wt = 1/(25/1000), school = 'b')
comb.df <- rbind(sa.df, sb.df)
psych::headTail(comb.df)
```

```
score wt school
1 54.72 4 a
2 56.56 4 a
3 59.34 4 a
4 55.61 4 a
... ... ... <NA>
47 44.47 40 b
48 47.76 40 b
49 44.03 40 b
50 45.21 40 b
```
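As a quick check on where the 4 and the 40 come from, here is a minimal sketch that computes the weights from the selection probabilities. The object names (`N_school`, `n_sampled`, `p_select`) are made up for illustration and are not part of the example above:

```r
# School sizes and number of students sampled per school (from the example)
N_school  <- c(a = 100, b = 1000)
n_sampled <- c(a = 25,  b = 25)

# Probability that any one student in a school is selected
p_select <- n_sampled / N_school   # a = 0.25, b = 0.025

# The weight is the inverse of the selection probability
wt <- 1 / p_select                 # a = 4, b = 40
wt

# Each school's sampled students, taken together, stand for the whole school
wt * n_sampled                     # a = 100, b = 1000
sum(wt * n_sampled)                # the district total of 1100
```

A student in school B is ten times less likely to be sampled than a student in school A, so each sampled student in school B has to count ten times as much.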

You can see that the weights differ based on the school attended: each sampled student in school A stands for 4 students, while each sampled student in school B stands for 40. Now, if we use the weights, we can estimate the average score for the district:

`weighted.mean(comb.df$score, comb.df$wt)`

`[1] 46.41487`

`sum(comb.df$wt) #summing the weights gives the population N`

`[1] 1100`

Doing the weighted mean manually, with each school’s sample mean weighted by its enrollment:

`((mean(sa.s) * 100) + (mean(sb.s) * 1000)) / 1100`

`[1] 46.41487`
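More generally, a weighted mean is the sum of weight times score, divided by the sum of the weights. Here is a small sketch using four made-up scores (hypothetical values, not the sample drawn above) to show that this formula agrees with `weighted.mean()`:

```r
# Hypothetical toy data: two students from each school
score <- c(56, 54, 46, 44)
wt    <- c(4, 4, 40, 40)   # same weights as in the example: 4 for A, 40 for B

# Weighted mean by hand: sum(w * y) / sum(w)
by_hand <- sum(wt * score) / sum(wt)
by_hand

# Should agree with the built-in function
all.equal(by_hand, weighted.mean(score, wt))
```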

If we use a regression, just add the `weights` option. First, I show the results unweighted (this is just the mean of the 50 kids):

`lm(score ~ 1, data = comb.df) #unweighted, not correct`

```
Call:
lm(formula = score ~ 1, data = comb.df)
Coefficients:
(Intercept)
50.55
```

If the `weights` option is used:

`lm(score ~ 1, data = comb.df, weights = wt) #weighted`

```
Call:
lm(formula = score ~ 1, data = comb.df, weights = wt)
Coefficients:
(Intercept)
46.41
```

This is much closer to the true mean (45.96). Using weights allows us to generalize to the population of interest.
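A single sample could be close to the truth by luck, so one way to check the idea is to repeat the sampling many times and compare the two estimators on average. The sketch below is only an illustration under the same setup as above. The helper `one_draw()` and the choice of 1,000 replications are my own assumptions, not part of the original example:

```r
set.seed(123)
sa <- rnorm(100, 55, 2)    # school A population
sb <- rnorm(1000, 45, 2)   # school B population
true_mean <- mean(c(sa, sb))

# Draw 25 students per school and return both estimates of the district mean
one_draw <- function() {
  sa.s <- sample(sa, 25)
  sb.s <- sample(sb, 25)
  scores <- c(sa.s, sb.s)
  wts <- rep(c(4, 40), each = 25)   # 1/(25/100) and 1/(25/1000)
  c(unweighted = mean(scores),
    weighted   = weighted.mean(scores, wts))
}

# Repeat the sampling 1,000 times; the result is a 2 x 1000 matrix
res <- replicate(1000, one_draw())

# Average estimate minus the true district mean (the bias of each estimator)
bias <- rowMeans(res) - true_mean
bias
```

If the weights are doing their job, the weighted bias should be close to zero, while the unweighted bias should stay several points too high because school A is overrepresented in the sample.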