## 5.6 Cluster Sampling

In practice, collecting sample data can require a tremendous amount of travel. Imagine a survey where the residents of the entire country are the target population, and a range of demographic and health indicators are measured about the units:

• age, height, weight, ethnicity, neighborhood, etc;

• blood pressure, blood cholesterol and mercury levels, body-mass index, etc.

Some of the information can be self-reported by the units (age, ethnicity, etc.), but in many cases (body-mass index, mercury levels, etc.), data collection requires the use of health experts and specialized equipment.

If all the sample units are from the Greater Toronto Area (GTA), say, it may be efficient to move the panel of experts (with all the required equipment in a trailer) from site to site, staying 2 weeks at each site. With about 20 sites in the GTA, data collection would take about a year to complete, but the cost of the survey would be greatly reduced: each night, the interviewers would go home; the cost of moving the equipment would also be minimized because of the small distances involved.

In a national study, where units could be drawn from several jurisdictions and remote locations, this approach is no longer necessarily recommended as it is potentially very expensive. Instead, one could start by taking a first sample of geographic areas (cities, regional municipalities, etc.), and then select a sub-sample of units (residents) in each of these areas.

Such a strategy is called multi-stage sampling (M$$n$$S, see Section 5.7.3). Stratiied sampling, as an example, is a M2S for which the first level sample is a census and the second level sample is a SRS.

As another example, when the first level sample comes from a SRS and the second level sample is a census (all units are selected), we speak of cluster sampling (ClS).

### 5.6.1 Estimators and Confidence Intervals

As it was the case in the second chapter, we are interested in a finite population $$\mathcal{U}=\{u_1,\ldots,u_N\}$$ of expectation $$\mu$$ and variance $$\sigma^2$$.

Suppose we can cover the population with $$M$$ disjoint clusters containing, respectively, $$N_1,\ldots,N_M$$ units, so that $$N_1+\cdots+N_M=N$$: \begin{aligned} \mathcal{G}_1&={\{u_{1,1},\ldots, u_{1,N_1}\}}, \ \cdots \ , \ \mathcal{G}_M={\{u_{M,1},\ldots, u_{M,N_M}\}}, \end{aligned} with cluster expectation, total, and variance given by $\mu_i={\frac{1}{N_i}\sum_{j=1}^{N_i}u_{i,j}},\quad\tau_i={N_i\mu_i},\quad\text{and}\quad \sigma_i^2={\frac{1}{N_i}\sum_{j=1}^{N_i}u_{i,j}^2-\mu_i^2},\quad 1\leq i\leq M.$

A cluster random sample (ClS) $$\mathcal{Y}$$ is a subset of the target population $$\mathcal{U}$$ which is obtained by first drawing a SRS of $$m>1$$ clusters, and then selecting all units in the selected clusters: $\mathcal{G}_{i_1}\cup\cdots\cup \mathcal{G}_{i_m}={\{\underbrace{\strut y_{i_1,1},\ldots, y_{i_1,N_{i_1}}}_{{\text{cluster \mathcal{G}_{i_1}}}},\ldots,\underbrace{\strut y_{i_m,1},\ldots, y_{i_m,N_{i_m}}}_{{\text{cluster \mathcal{G}_{i_m}}}}\}}\subseteq \bigcup_{\ell=1}^M\mathcal{G}_{\ell}=\mathcal{U}.$

When $$\mathcal{G}_{i_k}$$ belongs to the ClS $$\mathcal{Y}$$, we denote its mean, total, and variance by $$\overline{y}_{i_k}$$, $$y_{i_k}$$, and $$s_{i_k}^2$$, respectively, for $$1\leq k\leq m$$.

In a ClS design, each observation has the same probability of being selected, but the sample size may change from one ClS to another (unless the clusters all have the same size in the first place).

#### Estimating the Mean $$\mu$$; Clusters of Equal Size

Let us assume that all clusters have the same size: $$N_1=\cdots = N_M=n \implies N=Mn$$. The cluster mean of the sample observations in $$\mathcal{Y}$$ is an estimator of $$\mu$$: $\overline{y}_G={\frac{1}{mn}\sum_{k=1}^{m}\sum_{j=1}^ny_{i_k,j}=\frac{1}{mn}\sum_{k=1}^{m}y_{i_k}=\frac{1}{m}\sum_{k=1}^m\overline{y}_{i_k}=\frac{1}{m}\sum_{k=1}^m\mu_{i_k}}.$

Therefore, the cluster average is simply the average of the selected cluster averages. This is not surprising since $\mu={\frac{1}{N}\sum_{\ell=1}^M\sum_{j=1}^nu_{\ell,j}=\frac{1}{Mn}\sum_{\ell=1}^M\sum_{j=1}^nu_{\ell,j}=\frac{1}{Mn}\sum_{\ell=1}^M\tau_{\ell}=\frac{1}{M}\sum_{\ell=1}^M\mu_{\ell}}.$

We can easily show that $$\overline{y}_G$$ is an unbiased estimator of $$\mu$$: $\text{E}(\overline{y}_G)={\frac{1}{m}\sum_{k=1}^m\text{E}(\mu_{i_k})=\frac{1}{m}\sum_{k=1}^m\mu=\mu}.$ Furthemore, its sampling variance is $\text{V}(\overline{y}_G)={\frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)},\quad \text{where }\sigma_G^2={\frac{1}{M}\sum_{\ell=1}^M(\mu_{\ell}-\mu)^2},$ since clusters are drawn using an SRS. Indeed, $$\overline{y}_G$$ is the mean of a SRS with $$m$$: $\{\mu_{i_1},\ldots, \mu_{i_m}\}\subseteq \{\mu_1, \ldots,\mu_M\}.$

Central Limit Theorem – ClS: if $$m$$ and $$M-m$$ are sufficiently large, then $\overline{y}_{G}\sim_{\text{approx.}}{\mathcal{N}\left(\text{E}(\overline{y}_{G}),\text{V}(\overline{y}_{G})\right)=\mathcal{N}\left(\mu,\frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)\right)}.$ In a ClS, the bound on the error of estimation is thus $B_{\mu;G}=2\sqrt{\text{V}(\overline{y}_{G})}={2\sqrt{ \frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)}},$ and the corresponding 95% C.I. for $$\mu$$ is simply $\text{C.I.}_{G}(\mu;0.95):\quad {\overline{y}_{G}\pm B_{\mu;G}}.$

In practice, the variance of the cluster means $$\sigma^2_G$$ is rarely known – the empirical variance (and the corresponding correction factor) is used instead: $\hat{\text{V}}(\overline{y}_G)={\frac{s_G^2}{m}\Big(1-\frac{m}{M}\Big)}, \text{ where }s^2_G={\frac{1}{m-1}\sum_{k=1}^m(\overline{y}_{i_k}-\overline{y}_G)^2}.$

The bound on the error of estimation is then approximated by \begin{aligned} B_{\mu;G}\approx \hat{B}_{\mu;G}&=2\sqrt{\hat{\text{V}}(\overline{y}_{G})}={2\sqrt{\frac{s_G^2}{m}\Big(1-\frac{m}{M}\Big)}},\\ \implies &\text{C.I.}_{G}(\mu;0.95): \quad {\overline{y}_{G}\pm \hat{B}_{\mu;G}\equiv \overline{y}_{G}\pm 2\sqrt{\frac{s_G^2}{m}\Big(1-\frac{m}{M}\Big)}}.\end{aligned}

Example: consider a finite population $$\mathcal{U}$$ of size $$N=37,444$$, divided into $$M=44$$ clusters $$\mathcal{G}_{\ell}$$, each of size $$n=851$$. We draw a SRS of $$m=6$$ clusters. The means of these clusters are: $\overline{y}_{1}=120.7, \overline{y}_{2}=75.2, \overline{y}_{3}=116.3, \overline{y}_{4}=111. 1, \overline{y}_{5}=116.9, \overline{y}_{6}=96.6.$ Find a 95% C.I. for the mean $$\mu$$.

Solution: the bound on the error of estimation for $$\mu$$ is $$\approx\hat{B}_{\mu;G}=2\sqrt{\hat{\text{V}}(\overline{y}_{G})}$$; we see that \begin{aligned} \textstyle{\overline{y}_G=\frac{1}{6}\sum_{k=1}^6\overline{y}_k\approx 106.1,\ s^2_G=\frac{1}{6-1}\sum_{k=1}^6(\overline{y}_k-\overline{y}_G)^2=\frac{69089.6-6(106.1)^2}{6-1}\approx 300.8},\end{aligned} from which we have $\text{C.I.}_{G}(\mu;0.95)\approx \textstyle{106.1 \pm 2\sqrt{\frac{300.8}{6}\big(1-\frac{6}{44}\big)}}\equiv (93.0,119.3).\quad \blacksquare$

#### Estimating the Mean $$\mu$$; Clusters of Different Sizes

In practice, the clusters are often all of different sizes. In this case, we can rewrite $\mu={\frac{\displaystyle{\sum_{\ell=1}^M\sum_{j=1}^{N_{\ell}}u_{\ell,j}}}{\displaystyle{\sum_{\ell=1}^MN_{\ell}}}=\frac{\displaystyle{\sum_{\ell=1}^M\tau_{\ell}}}{\displaystyle{\sum_{\ell=1}^MN_{\ell}}}},$ where $$\tau_{\ell}$$ is the sum of the $$u_{\ell,j}$$ for the units in the cluster $$\mathcal{G}_{\ell}$$, $$1\leq \ell \leq M$$.

If we still draw $$m$$ clusters from the population of $$M$$ clusters using an SRS, the form of $$\mu$$ suggests the use of the following estimator: $\overline{y}_G={\frac{\displaystyle{\sum_{k=1}^m\sum_{j=1}^{N_{i_k}}y_{i_k,j}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}=\frac{\displaystyle{\sum_{k=1}^my_{i_k}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}},$ where we are using the notation of Section 5.5.

If the average cluster size is denoted by $$\overline{N}=\frac{N}{M}$$, this is reminiscent of the situation that leads to ratio estimation of the mean. By performing the mapping $$(\overline{y}_G,\mu,\overline{N},\tau_{\ell},N_{\ell})\leftrightsquigarrow (r,R,\mu_X,Y_j,X_j)$$, we can therefore conclude that $$\overline{y}_G$$ is a biased estimator of $$\mu$$, whose sampling variance is approximately $\text{V}(\overline{y}_G)\approx {\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(\frac{M-m}{M-1}\Big)\cdot \frac{1}{M}\sum_{\ell=1}^M(\underbrace{\strut \tau_{\ell}-\mu N_{\ell}}_{=N_{\ell}(\mu_{\ell}-\mu)})^2}.$

Consequently, the bound on the error of estimation is given by $B_{\mu;G}=2\sqrt{\text{V}(\overline{y}_G)}\approx {2\sqrt{\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(\frac{M-m}{M-1}\Big)\cdot \frac{1}{M}\sum_{\ell=1}^M(\tau_{\ell}-\mu N_{\ell})^2}}$ (if $$N_{1}=\cdots=N_M=n$$, the formulas collapse to those seen in the preceding section).

In practice, we often only have access to the sampled clusters – we must then use the empirical variance: \begin{aligned} \hat{\text{V}}(\overline{y}_G) &\approx {\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)\cdot \frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{y}_GN_{i_k})^2} \\ &={\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)(s_Y^2+s_N^2\overline{y}_G^2-2\overline{y}_G\hat{\rho}s_{N}s_Y),} \quad \text{where} \\ s_Y^2&=\frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{\overline{y}})^2, \quad s_N^2={\frac{1}{m-1}\sum_{k=1}^m(N_{i_k}-\overline{N})^2}, \\ \hat{\rho}&=\frac{\sum_{k=1}^m (y_{i_k}-\overline{\overline{y}})(N_{i_k}-\overline{N})}{\sqrt{{\sum_{k=1}^m(y_{i_k}-\overline{\overline{y}})^2\sum_{k=1}^m(N_{i_k}-\overline{N})^2}}} \quad,\quad \overline{\overline{y}}=\frac{1}{m}\sum_{k=1}^my_{i_k}. \end{aligned}

Since it is not always possible to determine the average $$\overline{N}$$ of the clusters in the population $$\mathcal{U}$$, we often use $$\overline{n}$$, the average cluster size in the sample $$\mathcal{Y}$$ instead: $\overline{n}={\frac{N_{i_1}+\cdots+N_m}{m}}.$

The bound on the error of estimation is thus $\hat{B}_{\mu;G}\approx {2\sqrt{\frac{1}{\overline{n}^2}\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)(s_Y^2+s_N^2\overline{y}_G^2-2\overline{y}_G\hat{\rho}s_{N}s_Y)}}$ and the approximate 95% C.I. for $$\mu$$ is $\text{C.I.}_{G}(\mu;0.95):\quad {\overline{y}_{G}\pm \hat{B}_{\mu;G}}.$

Example: consider a finite population $$\mathcal{U}$$ of size $$N=37,444$$, divided into $$M=44$$ clusters $$\mathcal{G}_{\ell}$$. We draw a SRS of $$m=6$$ clusters. The means of the observations in these clusters are:

$$k$$ $$1$$ $$2$$ $$3$$ $$4$$ $$5$$ $$6$$
$$\overline{y}_k$$ $$120.7$$ $$75.2$$ $$116.3$$ $$111.1$$ $$116.9$$ $$96.6$$
$${N}_k$$ $$850$$ $$176$$ $$1011$$ $$1001$$ $$843$$ $$910$$

Find a 95% C.I. for the mean $$\mu$$.

Solution: The bound on the error of estimation is $$\approx\hat{B}_{\mu;G}=2\sqrt{\hat{\text{V}}(\overline{y}_{G})}$$; we see that \begin{aligned} \overline{y}_G&=\frac{{\sum_{k=1}^6 N_k\overline{y}_k}}{{\sum_{k=1}^6N_k}}=\frac{531073.3}{4791}\approx 110.8,\quad \overline{n}=\frac{1}{6}\sum_{k=1}^6N_k=\frac{4791}{6}=798.5 \\ \overline{\overline{y}}&=\frac{\sum_{k=1}^6N_k\overline{y}_k}{6}=\frac{531073.3}{6}=88,512.2,\\ s_N^2&=\frac{1}{6-1}\sum_{k=1}^6(N_k-\overline{n})^2=98,146.7\\ s_Y^2&=\frac{1}{6-1}\sum_{k=1}^6(N_k\overline{y}_k-\overline{\overline{y}})^2=1,465,229,403.4,\\ \hat{\rho}&=\frac{\sum_{k=1}^6(N_k-\overline{n})(N_k\overline{y}_k-\overline{\overline{y}})}{\sqrt{\sum_{k=1}^6(N_k-\overline{n})^2\sum_{k=1}^6(N_k\overline{y}_k-\overline{\overline{y}})^2}}\approx 0.9796 \\ s_Y^2&+s_N^2\overline{y}_G^2-2\overline{y}_G\hat{\rho}s_{N}s_Y=66,814,598.95\end{aligned} from which we conclude that \begin{aligned} \hat{\text{V}}(\overline{y}_G)&=\frac{1}{798.5^2}\cdot \frac{1}{6}\Big(1-\frac{6}{44}\Big)(66,814,598.95)\approx 15.1\end{aligned} and $$\text{C.I.}_{G}(\mu;0.95)\approx 110.8 \pm 2\sqrt{15.1}\equiv (103.1,118.6)$$ $$\blacksquare$$.

Example: Find a 95% C.I. for the average life expectancy by country in 2011 (including India and China), using a ClS of size $$m=8$$, assuming that the $$N=185$$ countries have been grouped into $$M=22$$ clusters determined by geographic regions.

Solution: We re-use the code from the previous sections, with some modifications, in particular with respect to the clusters (region). The cluster sizes in the population are as follows:

gapminder.ClS <- gapminder |>  filter(year==2011) |>
select(life_expectancy, region)
summary(gapminder.ClS,22)
 life_expectancy                       region
Min.   :46.70   Australia and New Zealand: 2
1st Qu.:65.30   Caribbean                :13
Median :73.70   Central America          : 8
Mean   :71.18   Central Asia             : 5
3rd Qu.:77.40   Eastern Africa           :16
Max.   :83.02   Eastern Asia             : 6
Eastern Europe           :10
Melanesia                : 5
Micronesia               : 2
Middle Africa            : 8
Northern Africa          : 6
Northern America         : 3
Northern Europe          :10
Polynesia                : 3
South America            :12
South-Eastern Asia       :10
Southern Africa          : 5
Southern Asia            : 8
Southern Europe          :12
Western Africa           :16
Western Asia             :18
Western Europe           : 7  

We note in passing that the average life expectancy is $$\mu=71.18$$. We can explore the distribution of life expectancy by cluster using the following code:

ggplot(data=gapminder.ClS, aes(x=life_expectancy, y=region, fill=region)) +
geom_point(col="black", alpha=.2,pch=22) +
theme(legend.title = element_blank(), legend.position="none")

We notice a significant variability between some clusters (Southern Africa vs. Southern Europe, for example), but there is still a lot of overlap (which is a good sign). Next, we draw a SRS of $$m=8$$ clusters:

regions = unique(gapminder.ClS[,"region"])
set.seed(12345) # replicability
M=length(regions)
m=8
(sample.reg = sample(1:M,m, replace=FALSE))
[1] 14 19 16 11  2 21  6  7

Then, we provide a summary of the observations in the sampled clusters:

sample.ind = gapminder.ClS$region %in% regions[sample.reg] gapminder.ClS.n = gapminder.ClS[sample.ind,] gapminder.ClS.n$region <- as.factor(gapminder.ClS.n$region) (summ = gapminder.ClS.n |> group_by(region) |> summarise(N=n(), y.barre=mean(life_expectancy), total.y=sum(life_expectancy))) # A tibble: 8 × 4 region N y.barre total.y <fct> <int> <dbl> <dbl> 1 Australia and New Zealand 2 81.5 163 2 Central America 8 75.0 600. 3 Central Asia 5 69.4 347. 4 Melanesia 5 65.5 328. 5 Northern Africa 6 70.7 424. 6 Northern America 3 77.2 232. 7 South-Eastern Asia 10 72.6 726. 8 Western Asia 18 75.8 1364. We also produce a summary of this summary: (summ.final = summ |> summarise(sum.N = sum(N), moy.N = mean(N), y.barre.barre = mean(total.y), sum.y.barre = sum(total.y))) # A tibble: 1 × 4 sum.N moy.N y.barre.barre sum.y.barre <int> <dbl> <dbl> <dbl> 1 57 7.12 523. 4184. We can now calculate the cluster estimator: (est.y.barre.G=summ.final$sum.y.barre/summ.final$sum.N) [1] 73.40316 Next, its sampling variance: s2.Y = var(summ$total.y)
s2.N = var(summ$N) rho = cor(summ$N,summ$total.y) V.est.y.G = 1/summ.final$moy.N^2*1/m*(1-m/M)*
(s2.Y+s2.N*est.y.barre.G^2-2*est.y.barre.G*rho*sqrt(s2.N*s2.Y))

The bound on the error of estimation and the 95% C.I. for $$\mu$$ are:

B = 2*sqrt(V.est.y.G)
c(est.y.barre.G - B,est.y.barre.G + B) 
[1] 71.35310 75.45321

The performance of ClS is generally worse than that of SRS and/or StS – no surprise, given the discussion at the beginning of this section. The nature of the clusters may also play a role (in contrast to StS, ClS is more efficient when the cluster structure is similar from one cluster to another), which is not really the case here. We will discuss this further.

#### Estimating the Total $$\tau$$

Most of the work has already been done: since the total $$\tau$$ can be rewritten as $\tau=\sum_{j=1}^Nu_j=N\mu,$ we can estimate the total with a ClS using the formula $\hat{\tau}_{G}={N\overline{y}_{G}}$. There are two possibilities: either $$N_1=cdots=N_M=n$$, or the clusters are not all the same size.

If $$N_1=\cdots=N_M=n$$, we have an unbiased estimator of $$\tau$$: \begin{aligned} \text{E}(\hat{\tau}_{G})&={\text{E}(N\overline{y}_{G})=N\cdot \text{E}(\overline{y}_{G})=N\mu=\tau}, \\ \text{V}(\hat{\tau}_G)&={N^2\cdot\text{V}(\overline{y}_G)=N^2\cdot\frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)\approx N^2\cdot \hat{\text{V}}(\overline{y}_G)=N^2\cdot \frac{s_G^2}{m}\Big(1-\frac{m}{M}\Big)}.\end{aligned}

If the clusters are of different sizes, we have a biased estimator of $$\tau$$, with sampling variance given by \begin{aligned} \text{V}(\hat{\tau}_{G})&={\text{V}(N\overline{y}_{G})=N^2\cdot \text{V}(\overline{y}_{G})\approx N^2\cdot\hat{\text{V}}(\overline{y}_G)} \\ &={\frac{N^2}{\overline{N}^2}\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)(s_Y^2+s_N^2\overline{y}_G^2-2\overline{y}_G\hat{\rho}s_{N}s_Y)} \\ &={M^2\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)(s_Y^2+s_N^2\overline{y}_G^2-2\overline{y}_G\hat{\rho}s_{N}s_Y)}.\end{aligned}

The estimator follows an approximate normal distribution $\hat{\tau}_G\sim_{\text{approx}}{\mathcal{N}\left(\text{E}(\hat{\tau}_G),\hat{\text{V}}(\hat{\tau}_G)\right)},$ as long as the quantities $$m$$, and $$M-m$$ are both “large enough”.

In both cases, the bound on the error of estimation is $B_{\tau;G}\approx \hat{B}_{\tau;G}={2\sqrt{\hat{\text{V}}(\hat{\tau}_G)}}$ and the 95% C.I. for $$\tau$$ takes the usual form: $\text{C.I.}_G(\tau;0.95):\quad {\hat{\tau}_G\pm \hat{B}_{\tau;G}}.$

Example: consider a finite population $$\mathcal{U}$$ of size $$N=37,444$$, divided into $$M=44$$ clusters $$\mathcal{G}_{\ell}$$, each of size $$n=851$$. We draw a SRS of $$m=6$$ clusters. The means of the observations in these clusters are: $\overline{y}_{1}=120.7, \ \overline{y}_{2}=75.2, \ \overline{y}_{3}=116.3, \ \overline{y}_{4}=111.1, \ \overline{y}_{5}=116.9, \ \overline{y}_{6}=96.6.$ Find a 95% C.I. for the total $$\tau$$ in $$\mathcal{U}$$.

Solution: in a previous example, we already have shown that $$\text{C.I.}_{G}(\mu;0.95)\equiv (93.0,119.3)$$ for this ClS, with $$N_1=\cdots=N_6=851$$. Therefore, $\text{C.I.}_{G}(\tau;0.95) \approx 37444(93.0,119.3)\equiv (3481307.7,4466805.3).\quad \blacksquare$

Example: consider a finite population $$\mathcal{U}$$ of size $$N=37,444$$, divided into $$M=44$$ clusters $$\mathcal{G}_{\ell}$$. We draw a SRS of $$m=6$$ clusters. The mean of the observations in these clusters are:

$$k$$ $$1$$ $$2$$ $$3$$ $$4$$ $$5$$ $$6$$
$$\overline{y}_k$$ $$120.7$$ $$75.2$$ $$116.3$$ $$111.1$$ $$116.9$$ $$96.6$$
$${N}_k$$ $$850$$ $$176$$ $$1011$$ $$1001$$ $$843$$ $$910$$

Find a 95% C.I. for the total $$\tau$$ in $$\mathcal{U}$$.

Solution: we have already seen in a previous example that $$\text{C.I.}_{G}(\mu;0.95)\equiv (103.1,118.6)$$ for this ClS with different cluster sizes. Therefore, $\text{C.I.}_{G}(\tau;0.95) \approx 37444(103.1,118.6)\equiv (3860476, 4440858).\quad \blacksquare$

Warning! How do we do this if the size $$N$$ of the population is unknown? Note that $\tau={\sum_{\ell=1}^M\tau_{\ell}=M\cdot \frac{1}{M}\sum_{\ell=1}^M\tau_{\ell}=M\overline{\tau}},$ where $$\overline{\tau}$$ is mean of the cluster totals in the population.

We could then use the estimator $M\overline{y}_T={M\cdot\frac{1}{m}\sum_{k=1}^my_{i_k}},$ where $$\overline{y}_T$$ is the Mean of the $$m$$ cluster totals in the ClS. In that case, we are dealing with a SRS of size $$m$$, drawn from $$M$$ cluster totals, i.e., this is an unbiased estimator: \begin{aligned} \text{E}(M\overline{y}_T)&={\tau} \\ \text{V}(M\overline{y}_T)&={M^2\cdot \text{V}(\overline{y}_T)=M^2\cdot\frac{\sigma_T^2}{m}\Big(\frac{M-m}{M-1}\Big)} \\ \hat{\text{V}}(M\overline{y}_T)& \approx {M^2\cdot \hat{\text{V}}(\overline{y}_T)=M^2\cdot \frac{s_T^2}{m}\Big(1-\frac{m}{M}\Big)}, \end{aligned} where $\sigma_T^2={\frac{1}{M}\sum_{\ell=1}^M(\tau_{\ell}-\overline{\tau})^2}\quad\text{and}\quad s_T^2={\frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{y}_T)^2}.$

The estimator follows an approximate normal distribution $M\overline{y}_T\sim_{\text{approx}}{\mathcal{N}\left(\tau,\hat{\text{V}}(M\overline{y}_T)\right)},$ as long as the quantities $$m$$, and $$M-m$$ are both “large enough”.

The bound on the error of estimation is then $B_{\tau;T}\approx \hat{B}_{\tau;T}={2\sqrt{\hat{\text{V}}(M\overline{y}_{T})}}$ and the 95% C.I. for $$\tau$$ takes the usual form: $\text{C.I.}_T(\tau;0.95):\quad {M\overline{y}_{T}\pm \hat{B}_{\tau;T}}.$

Example: consider a finite population $$\mathcal{U}$$ of unknown size, divided into $$M=44$$ clusters $$\mathcal{G}_{\ell}$$. We draw a SRS of $$m=6$$ clusters. The mean of the observations in these clusters are:

$$k$$ $$1$$ $$2$$ $$3$$ $$4$$ $$5$$ $$6$$
$$\overline{y}_k$$ $$120.7$$ $$75.2$$ $$116.3$$ $$111.1$$ $$116.9$$ $$96.6$$
$${N}_k$$ $$850$$ $$176$$ $$1011$$ $$1001$$ $$843$$ $$910$$

Find a 95% C.I. for the total $$\tau$$ in $$\mathcal{U}$$.

Solution: Since the population size $$N$$ is unknown, the bound on the error of estimation for $$\tau$$ is $$\approx\hat{B}_{\tau;T}=2\sqrt{\hat{\text{V}}(M\overline{y}_{T})}$$; we see that \begin{aligned} \overline{y}_T=\frac{1}{6}\sum_{k=1}^6 N_k\overline{y}_k=\frac{531073.3}{6}\approx 88512.2,\quad M\overline{y}_T=44(88512.2)=3894537.5\end{aligned} and \begin{aligned} s_T^2&=\frac{1}{6-1}\sum_{k=1}^6(N_k\overline{y}_k-\overline{y}_T)^2=\frac{1}{5}\left(\sum_{k=1}^6N_k^2\overline{y}_k^2-6\overline{y}_T^2\right)=1465229403, \end{aligned} from which we conclude that \begin{aligned} \hat{\text{V}}(M\overline{y}_T)&=(44)^2\cdot \frac{1465229403}{6}\Big(1-\frac{6}{44}\Big) =408310593755.73\end{aligned} and $$\text{C.I.}_{T}(\tau;0.95)\approx 3894537.5 \pm 2\sqrt{408310593755.73}\equiv (2616554,5172521)$$.

The estimator is unbiased, but the confidence interval for $$\tau$$ is much wider than that given by $$\text{C.I.}_{G}(\tau;0.95)\equiv (3860476, 4440858)$$; this is not surprising since we have more information in the latter case (namely, the size $$N$$ of the population). $$\blacksquare$$

Example: Find a 95% C.I. for the world population in 2011 (excluding China and India), using a ClS of size $$m=8$$, drawn from $$M=22$$ clusters determined by geographic regions.

Solution: we re-use the code from the previous sections to create the clusters. The true population total is found below:

gapminder.ClS.pop <- gapminder |> filter(year==2011) |>
select(population, region) |> filter(population < 500000000)
(sum(gapminder.ClS.pop$population)) [1] 4264258312 We start by studying the distribution of population by region: ggplot(data=gapminder.ClS.pop, aes(x=population, y=region, fill=population)) + geom_point(col="black", alpha=.2,pch=22) + theme(legend.title = element_blank(), legend.position="none") The essential statistics are calculated as follows: summ.pop = gapminder.ClS.pop |> group_by(region) |> summarise(N=n(), y.pop=mean(population), tau.pop=sum(population)) Next we draw a SRS of clusters: set.seed(22) # replicability regions = unique(gapminder.ClS.pop[,"region"]) M=length(regions) N=nrow(gapminder.ClS.pop) m=8 (sample.reg = sample(1:M,m, replace=FALSE)) [1] 6 9 10 12 17 5 11 3 The sample is summarized as follows: sample.ind = gapminder.ClS.pop$region %in% regions[sample.reg]
gapminder.ClS.T = gapminder.ClS.pop[sample.ind,]
gapminder.ClS.T$region <- as.factor(gapminder.ClS.T$region)
(summ.T = gapminder.ClS.T |> group_by(region) |>
summarise(N=n(), tau=sum(population)))
# A tibble: 8 × 3
region              N       tau
<fct>           <int>     <int>
1 Central America     8 163510619
2 Eastern Europe     10 294249971
3 Middle Africa       8 134483803
4 Northern Europe    10  99989705
5 South America      12 401182686
6 Southern Asia       7 450825356
7 Western Africa     16 316604189
8 Western Asia       18 237909741

If we assume the number of units in the population to be known ($$N=183$$), the estimator of the average population per country is:

(y.G = sum(summ.T$tau)/sum(summ.T$N))
[1] 23581529

The estimator for the total population (excluding China and India) is thus:

(tau.G = N*y.G)
[1] 4315419784

The bound on the error of estimation and the 95% C.I. for $$\tau$$ are computed below:

s2.G =  1/(m-1)*sum((summ.T$tau-y.G*summ.T$N)^2)
V = M^2*s2.G/m*(1-m/M)
B = 2*sqrt(V)
c(tau.G-B,tau.G+B)
[1] 2441918142 6188921427

If we assume instead that the number of units is unknown, the estimator of the population per cluster is:

(y.T = sum(summ.T$tau)/m) [1] 262344509 The estimator for the total population (excluding China and India) would then be: (tau.T = M*y.T) [1] 5771579192 The bound on the error of estimation and the 95% C.I. for $$\tau$$ in that case are computed below: s2.T = 1/(m-1)*sum((summ.T$tau-y.T)^2)
V = M^2*s2.T/m*(1-m/M)
B = 2*sqrt(V)
c(tau.G-B,tau.G+B)
[1] 2746857662 5883981906

The actual value $$\tau=4,264,258,312$$ is found within the 95% C.I. in both cases (but different SRS of clusters might lead to different outcomes).

#### Estimating a Proportion $$p$$

In a population where $$A_{\ell,j} \in \{0,1\}$$ represents the absence or presence of a characteristic for the $$j$$th unit in the $$\ell$$th cluster, the mean $p={\frac{1}{N}\sum_{\ell=1}^M\sum_{j=1}^{N_{\ell}}A_{\ell,j}=\frac{\displaystyle{\sum_{\ell=1}^MA_{\ell}}}{\displaystyle{\sum_{\ell=1}^MN_{\ell}}}}$ is the proportion of the population units possessing the characteristic, where $$A_{\ell}$$ is the number of units with the characteristic in the $$\ell$$th cluster.

If we are still drawing $$m$$ clusters using a SRS from the $$M$$ clusters in the population, the form taken by $$p$$ suggests the use of the following estimator: $\hat{p}_G={\frac{\displaystyle{\sum_{k=1}^m\sum_{j=1}^{N_{i_k}}a_{i_k,j}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}=\frac{\displaystyle{\sum_{k=1}^ma_{i_k}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}},$ where $$a_{i_k}$$ is the number of units with the characteristic in the $$\k$$th sampled cluster.

Set $$\overline{N}=\frac{N}{M}$$. If $$N$$ is unknown, we use $$\overline{N}\approx {\overline{n}=\frac{1}{m}(N_{i_1}+\cdots+N_{i_m})}$$. There are then two possibilities: either $$N_1=\cdots=N_M=n$$, or the clusters are not all of the same size. If $$N_1=\cdots=N_M=n$$, we have an unbiased estimator of $$p$$: \begin{aligned}\text{E}(\hat{p}_{G})&={p}, \quad \text{V}(\hat{p}_G)={\frac{1}{n^2}\cdot \frac{\sigma_P^2}{m}\Big(\frac{M-m}{M-1}\Big)\approx \frac{1}{n^2}\cdot \frac{s_P^2}{m}\Big(1-\frac{m}{M}\Big)}=\hat{\text{V}}(\hat{p}_G),\end{aligned} where $\sigma_P^2={\frac{1}{M}\sum_{\ell=1}^M(A_{\ell}-pN_{\ell})^2}\quad\text{and}\quad s_P^2={\frac{1}{m-1}\sum_{k=1}^m(a_{i_k}-\hat{p}_GN_{i_k})^2}.$

If the clusters are of different sizes, we have a biased estimator of $$p$$, whose sampling variance is: \begin{aligned} \text{V}(\hat{p}_{G})&\approx{\frac{1}{\overline{N}^2}\cdot \frac{\sigma_P^2}{m}\Big(\frac{M-m}{M-1}\Big)},\quad \hat{\text{V}}(\hat{p}_G)\approx {\frac{1}{\overline{n}^2}\cdot \frac{s_P^2}{m}\Big(1-\frac{m}{M}\Big).}\end{aligned}

The estimator follows an approximate normal distribution $\hat{p}_G\sim_{\text{approx}}{\mathcal{N}\left(\text{E}(\hat{p}_G),\hat{\text{V}}(\hat{p}_G)\right)},$ as long as the quantities $$m$$, and $$M-m$$ are both “large enough”.

In both cases, the bound on the error of estimation is $B_{p;G}\approx \hat{B}_{p;G}={2\sqrt{\hat{\text{V}}(\hat{p}_G)}}$ and the **95% C.I. for $$p$$* takes the usual form: $\text{C.I.}_G(p;0.95):\quad {\hat{p}_G\pm \hat{B}_{p;G}}.$

Example: find a 95% C.I. for the proportion of countries whose life expectancy is above 75 years in 2011, using a ClS with $$m=8$$, assuming that the countries are grouped into $$M=22$$ clusters determined by geographic regions.

Solution: we re-use the code of the previous sections to create the clusters, and we create a new indicator variable for the 75 years life expectancy threshold:

gapminder.ClS$life.75 <- ifelse(gapminder.ClS$life_expectancy>75,1,0)
gapminder.ClS.75 <- gapminder.ClS |> select(life.75,region)
(mean(gapminder.ClS.75$life.75)) # true proportion [1] 0.3945946 OWe begin by examining the proportions in each region: summ.75 = gapminder.ClS.75 |> group_by(region) |> summarise(N=n(), p.hat=mean(life.75)) ggplot(data=summ.75,aes(x=p.hat, y=region, size=N, fill=p.hat)) + geom_point(col="black", alpha=.2,pch=22) + theme(legend.title = element_blank(), legend.position="none") The proportion of countries with a life expectancy of more than 75 years is found to vary greatly from region to region – this may affect the quality of the estimate. First, we draw a SRS of $$m-8$$ clusters: regions = unique(gapminder.ClS[,"region"]) set.seed(0) # replicability M=length(regions) m=8 (sample.reg = sample(1:M,m, replace=FALSE)) [1] 14 4 7 1 2 11 22 18 Then, we provide a summary of the proportions by cluster: sample.ind = gapminder.ClS$region %in% regions[sample.reg]
gapminder.ClS.G = gapminder.ClS[sample.ind,]
gapminder.ClS.G$region <- as.factor(gapminder.ClS.G$region)
(summ.75.n = gapminder.ClS.G |>
group_by(region) |>
summarise(N=n(), p.hat=mean(life.75)))
# A tibble: 8 × 3
region                        N p.hat
<fct>                     <int> <dbl>
1 Australia and New Zealand     2 1
2 Caribbean                    13 0.385
3 Central America               8 0.5
4 Micronesia                    2 0
5 Northern Africa               6 0.333
6 Northern Europe              10 0.8
7 South-Eastern Asia           10 0.2
8 Southern Europe              12 1    

We have enough information to compute the ClS estimator of the proportion:

(p.G = sum(summ.75.n$N*summ.75.n$p.hat)/sum(summ.75.n$N)) [1] 0.5555556 Next, we compute the sampling variance, the margin of error, and the 95% C.I. for $$p$$ (assuming that the average cluster size is not known): taille.moyenne = sum(summ.75.n$N)/m
s2.p.G =  1/(m-1)*sum((summ.75.n$N*summ.75.n$p.hat-p.G*summ.75.nN)^2) V = 1/taille.moyenne^2*s2.p.G/m*(1-m/M) (B = 2*sqrt(V)) c(p.G-B,p.G+B) [1] 0.2025966 [1] 0.3529590 0.7581521 The actual value $$p=0.394$$ is indeed within the 95% confidence interval. We assumed that the average cluster size was unknown; is this also the case if we use the known value $$\overline{N}=\frac{185}{22}\approx 8.41$$? The observations of the Gapmider dataset are probably not that suitable for ClS … at least, not if one uses regions as clusters. ### 5.6.2 Sample Size Depending on whether the clusters are of equal size or not, the variance formulas take different forms; however, they coincide when $$N_i=n$$ for all $$i$$; it is only the nature of the estimator bias and the exactness of its sampling variance that are affected. Consequently, we will only study the situation where the clusters are assumed to be of different sizes. In what follows, we will use the notations $\sigma^2_E={\frac{1}{M}\sum_{\ell=1}^M(\tau_{\ell}-\mu N_{\ell})^2}\quad\text{and}\quad s_E^2={\frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{y}_GN_{i_k})^2}.$ #### Moyenne $$\mu$$ If we want to estimate $$\mu$$ with $$\overline{y}_G$$, we use: \begin{aligned} B_{\mu;G}=2\sqrt{\frac{1}{\overline{N}^2}\cdot\frac{\sigma_E^2}{m}\Big(\frac{M-m}{M-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{\mu;G}\overline{N}^2}{4}}_{=D_{\mu}}=\frac{\sigma_E^2}{m}\Big(\frac{M-m}{M-1}\Big)} \\ & \Longleftrightarrow {\frac{(M-1)D_{\mu}}{\sigma_E^2}=\frac{M-m}{m}=\frac{M}{m}-1} \\ & \Longleftrightarrow {\frac{(M-1)D_{\mu}+\sigma_E^2}{\sigma_E^2}=\frac{M}{m}} \\ & \Longleftrightarrow m_{\mu;G}=\frac{M\sigma^2_E}{(M-1)D_{\mu}+\sigma_E^2}.\end{aligned} Obviously, we can only use this formula if we know the variance $$\sigma_E^2$$ of the cluster totals in the population $$\mathcal{U}$$. If that is not available, we can use the empirical variance $$s_E^2$$ from a preliminary sample, or that from a prior survey. If the average size $$\overline{N}$$ of the clusters of $$\mathcal{U}$$ is unknown, we replace $$\overline{N}$$ by the empirical average size $$\overline{n}=(N_{i_1}+\cdots+N_{i_m})/m$$ from the preliminary sample. Finally, note that this formula allows us to determine the number of clusters $$m$$ to be drawn from a SRS of clusters in order to obtain some margin of error on the estimate; the sample size may change from one realization to another, depending on the size of the sampled clusters. Example: consider a company that wants a cost inventory for theN=625 items in stock. In practice, it might be tedious to obtain a SRS of these items; however, the items are arranged on $$M=100$$ shelves and it is relatively easy to select a SRS of shelves, treating each shelf as a cluster of items. How many shelves would need to be sampled in order to estimate the average value of all items in inventory with a bound on the error of estimation of at most $$B_{\mu;G}=1.25\$$, assuming $$\sigma^2_E\approx 317.53\$$? Solution: set $$D_{\mu}=\frac{B^2_{\mu;G}\overline{N}^2}{4}=\frac{(1.25)^2(6.25)^2}{4}\approx 15.26$$; then ${m_{\mu;G}=\frac{M\sigma_E^2}{(M-1)D_{\mu}+\sigma_E^2}=\frac{100(317.53)}{(100-1)(15.26)+317.53}=17.4\approx 18.} \quad \blacksquare$ #### Total $$\tau$$ If we want to estimate $$\tau$$ with $$N\overline{y}_G$$, we use: \begin{aligned} B_{\tau;G}=2\sqrt{M^2\cdot \frac{\sigma_E^2}{m}\Big(\frac{M-m}{M-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{\tau;G}}{4M^2}}_{=D_{\tau;G}}=\frac{\sigma_E^2}{m}\Big(\frac{M-m}{M-1}\Big)} \\ & \Longleftrightarrow {\frac{(M-1)D_{\tau;G}}{\sigma_E^2}=\frac{M-m}{m}=\frac{M}{m}-1} \\ & \Longleftrightarrow {\frac{(M-1)D_{\tau;G}+\sigma_E^2}{\sigma_E^2}=\frac{M}{m}} \\ & \Longleftrightarrow m_{\tau;G}=\frac{M\sigma_E^2}{(M-1)D_{\tau;G}+\sigma_E^2}.\end{aligned} Example: consider a company that wants a cost inventory for theN=625 items in stock. In practice, it might be tedious to obtain a SRS of these items; however, the items are arranged on $$M=100$$ shelves and it is relatively easy to select a SRS of shelves, treating each shelf as a cluster of items. How many shelves would need to be sampled in order to estimate the total value of all items in inventory with a bound on the error of estimation of at most $$B_{\tau;G}=600\$$, assuming $$\sigma^2_E\approx 317.53\$$? Solution: set $$D_{\tau;G}=\frac{B^2_{\tau;G}}{4M^2}=\frac{(600)^2}{4(100)^2}=9$$; then ${m_{\tau;G}=\frac{M\sigma_E^2}{(M-1)D_{\tau;G}+\sigma_E^2}=\frac{100(317.53)}{(100-1)(9)+317.53}=26.3\approx 27.}\quad \blacksquare$ If we want to estimate $$\tau$$ with $$M\overline{y}_T$$, we use: \begin{aligned} B_{\tau;T}=2\sqrt{M^2\cdot \frac{\sigma_T^2}{m}\Big(\frac{M-m}{M-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{\tau;T}}{4M^2}}_{=D_{\tau}}=\frac{\sigma_T^2}{m}\Big(\frac{M-m}{M-1}\Big)} \\ & \Longleftrightarrow {\frac{(M-1)D_{\tau}}{\sigma_T^2}=\frac{M-m}{m}=\frac{M}{m}-1} \\ & \Longleftrightarrow {\frac{(M-1)D_{\tau}+\sigma_T^2}{\sigma_T^2}=\frac{M}{m}} \\ & \Longleftrightarrow m_{\tau;T}=\frac{M\sigma_T^2}{(M-1)D_{\tau}+\sigma_T^2}.\end{aligned} Example: consider a company that wants a cost inventory for theN=\$625 items in stock. In practice, it might be tedious to obtain a SRS of these items; however, the items are arranged on $$M=100$$ shelves and it is relatively easy to select a SRS of shelves, treating each shelf as a cluster of items. How many shelves would need to be sampled in order to estimate the total value of all items in inventory with a bound on the error of estimation of at most $$B_{\tau;T}=600\$$, assuming $$\sigma^2_T\approx 682.77\$$?

Solution: Set $$D_{\tau;T}=\frac{B^2_{\tau;T}}{4M^2}=\frac{(600)^2}{4(100)^2}=9$$; then ${m_{\tau;T}=\frac{M\sigma_T^2}{(M-1)D_{\tau;T}+\sigma_T^2}=\frac{100(682.77)}{(100-1)(9)+682.77}=43.4\approx 44.}\quad\blacksquare$

#### 5.6.2.1 Proportion $$p$$

If we want to estimate $$p$$ with $$\hat{p}_G$$, we use: \begin{aligned} B_{p;G}=2\sqrt{\frac{1}{\overline{N}^2}\cdot \frac{\sigma_p^2}{m}\Big(\frac{M-m}{M-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{p;G}\overline{N}^2}{4}}_{=D_{p;G}}=\frac{\sigma_p^2}{m}\Big(\frac{M-m}{M-1}\Big)} \\ & \Longleftrightarrow {\frac{(M-1)D_{p;G}}{\sigma^2_P}=\frac{M-m}{m}=\frac{M}{m}-1} \\ & \Longleftrightarrow {\frac{(M-1)D_{p;G}+\sigma^2_P}{\sigma^2_P}=\frac{M}{m}} \\ & \Longleftrightarrow m_{p;G}=\frac{M\sigma_P^2}{(M-1)D_{p;G}+\sigma^2_P}.\end{aligned}

### 5.6.3 Comparison Between SRS and CLS

Consider a ClS $$\mathcal{Y}$$ consisting of $$m$$ clusters drawn from a population $$\mathcal{U}$$ of size $$N$$, distributed in $$M$$ clusters. Let $$\mu$$ be the mean and $$\sigma^2$$ the variance of the population $$\mathcal{U}$$.

If the clusters are all of size $$n$$, we can show that $\text{V}(\overline{y}_G)\approx {\frac{\sigma^2-\overline{\sigma^2}}{m}\Big(1-\frac{m}{M}\Big)},\quad \text{where }\overline{\sigma^2}={\frac{1}{M}\sum_{\ell=1}^M\sigma_{\ell}^2},$ where $$\sigma_{\ell}^2$$ is the variance in the $$\ell$$th cluster.

But we can also consider $$\mathcal{Y}$$ as having arisen from a SRS with size $$mn$$. In that case, we have $\text{V}(\overline{y}_{\text{SRS}})=\frac{\sigma^2}{mn}\Big(\frac{N-mn}{N-1}\Big)\approx {\frac{\sigma^2}{mn}\Big(1-\frac{mn}{N}\Big)=\frac{\sigma^2}{mn}\Big(1-\frac{mn}{Mn}\Big)=\frac{\sigma^2}{mn}\Big(1-\frac{m}{M}\Big)},$ from which we conclude that \begin{aligned}\text{V}(\overline{y}_G)-\text{V}(\overline{y}_{\text{SRS}})&\approx {\frac{1}{m}\Big(1-\frac{m}{M}\Big)\Big(\sigma^2-\overline{\sigma^2}-\frac{\sigma^2}{n}\Big)=\frac{1}{m}\Big(1-\frac{m}{M}\Big)\Big(\frac{n-1}{n}\sigma^2-\overline{\sigma^2}\Big)} \\ &\approx {\frac{1}{m}\Big(1-\frac{m}{M}\Big)(\sigma^2-\overline{\sigma^2}),\quad \text{si }n-1\approx n}. \end{aligned}

Consequently, $$\text{V}(\overline{y}_G)\gg \text{V}(\overline{y}_{\text{SRS}})$$ if and only if $${\sigma^2\gg \overline{\sigma^2}}$$, which is the case when the mean of the cluster variances is smaller than the variance in the population (the moral of the story is that a ClS is effective if the clusters, regardless of their size, are as heterogeneous as the population itself).