5.4 Stratified Random Sampling

The theory we developed in the previous section allows us to determinee the distribution of the three unbiased estimators \(\overline{y}\), \(\hat{tau}\), and \(p\).

For instance, we have shown that if the size \(N\) of a finite population \(\mathcal{U}=\{u_1,\ldots,u_N\}\) of expectation \(\mu\) and variance \(\sigma^2\) and the size \(n\) of the SRS \(\mathcal{Y}\) from which the estimator \(\overline{y}\) is constructed are sufficiently large, and if moreover the responses \(u_j\) are i.i.d. for \(1\leq j\leq N\), then \(\overline{y}\) follows approximately a normal distribution whose parameters are \[\text{E}(\overline{y})=\mu\quad\text{and}\quad \text{V}(\overline{y})=\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big).\]

The higher \(\sigma^2\) is, the more the \(\overline{y}\) values that result from a repeated SRS vary.

In practice, the normal approximation is:

  • often acceptable – see average life expectancy, previous section;

  • but it is not always so, which can lead to some challenges – cf. the \(\text{C.I.}(\mu;0.95)\) for the average population which was in fact only an 80% C.I. for a SRS of size \(n=20\) in the previous section.

In the presence of outliers or when \(n,N\) are too small, the performance of an SRS may leave something to be desired.

Example: consider a finite population with \(N=16\) elements: \[2,2,2,2,0,0,0,0,1,1,1,1,5,5,5,5.\] The population mean and variance are, respectively: \[\begin{aligned} \mu&={\frac{1}{16}(4\cdot 2+4\cdot 0+4\cdot 1+4\cdot 5)=2}; \\ \sigma^2&={\frac{1}{16}(4\cdot 2^2+4\cdot 0^2+4\cdot 1^2+4\cdot 5^2)-2^2=\frac{7}{2}}.\end{aligned}\] Suppose that we draw an SRS of size \(n=4\) from this population, in order to estimate the mean \(\mu\).

From what we discussed in the previous section, the expectation and sampling variance of the estimator \(\overline{y}\) are, respectively: \[\text{E}(\overline{y})={2}\quad\mbox{and}\quad\text{V}(\overline{y})={\frac{\sqrt{7/2}^2}{4}\Big(\frac{16-4}{16-1}\Big)=\frac{7}{10}}.\]

But we could also restrict the sampling structure in the following manner:

  1. we start by separating the population into 4 segments (the strata): \[\begin{aligned} \mathbf{strata\ 1:\ }& 2,2,2,2 \\ \mathbf{strata\ 2:\ }& 0,0,0,0 \\ \mathbf{strata\ 3:\ }& 1,1,1,1 \\ \mathbf{strata\ 4:\ }& 5,5,5,5 \end{aligned}\]

  2. we then draw a SRS of size \(n=4\) by selecting one unit per stratum.

In such a situation (which is NOT a SRS\((n=4,N=16)\)), each achieved sample takes the form \(\{2,0,1,5\}\): the empirical mean is always \(2\), and so the sampling variance is null.

In practice, this artificial situation rarely occurs, but if the units of the population can be grouped into natural strata, i.e., sub-populations for which:

  • the response value is homogeneous within each stratum, but

  • it is heterogeneous from one stratum to another, then

this approach can produce an estimator whose sampling variance is lower than that of the SRS estimator (as a bonus, the sample preserves certain population structures). \(\blacksquare\)

Example: find an approximate 95% C.I. for the average population per country (excluding China and India) in 2011.

Solution: the population distribution in the 2011 Gapminder dataset has the following characteristics:

gapminder.StS <- gapminder |>
    filter(year==2011) |>
    select(population) |>
    filter(population < 1000000000)
summary(gapminder.StS$population)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    56441   2061342   7355231  23301958  22242334 312390368 

The true average population, by country, is \(\mu=23,301,958\). Recall that the population distribution is asymetrical:

N = nrow(gapminder.StS)
ggplot(data=gapminder.StS, aes(population)) +
    geom_histogram(col="black", fill="blue", alpha=.2) + 
    geom_vline(xintercept=mean(gapminder.StS$population), color="red") + 
    geom_rug()

We use the following population strata: from 0 to 10M, from 10M to 25M, from 25M to 50M, from 50M to 100M, and 100M+.

gapminder.StS <- gapminder.StS |> 
      mutate(strata = ifelse(population<10000000,"S1",
                      ifelse(population<25000000,"S2",
                      ifelse(population<50000000,"S3",
                      ifelse(population<100000000,"S4","S5")))))

# sorting the observations from the smallest to the largest 
gapminder.StS <- gapminder.StS[order(gapminder.StS$population),]

# format conversion for the strata variable
gapminder.StS$strata <- as.factor(gapminder.StS$strata)

The number of countries in each stratum is:

strata.N <- tapply(gapminder.StS$population, gapminder.StS$strata, length)
strata.N
 S1  S2  S3  S4  S5 
105  35  21  13   9 

For a sample size of \(n=20\), we use approximately \(n_i\) countries per stratum \(S_i\):

strata.N/sum(strata.N)*20
        S1         S2         S3         S4         S5 
11.4754098  3.8251366  2.2950820  1.4207650  0.9836066 

Some practical considerations might suggest the use of a different allocation (more on this later). The distribution of the population by stratum has the following characteristics:

tapply(gapminder.StS$population, gapminder.StS$strata, summary)
$S1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  56441  622957 2886010 3386819 5411377 9988846 

$S2
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
10027140 11234699 15177280 15682124 20213668 24928503 

$S3
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
25016921 29427631 34499905 36211465 41655616 49356692 

$S4
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
52237272 63268405 73517002 73841185 83787634 94501233 

$S5
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
120365271 143211476 163770669 182154642 200517584 312390368 

In the first attempt, we draw a SRS from each stratum, using the following sizes: \((n_1,n_2,n_3,n_4,n_5)=(11,4,3,1,1)\):

n=c()
n[1] = 11
n[2] = 4
n[3] = 3
n[4] = 1
n[5] = 1
indices = list()
set.seed(12345) # replicability

# draw a SRS of indices in each of the 5 strata
indices[[1]] <- sample(1:strata.N[1],n[1])
indices[[2]] <- sum(strata.N[1:1]) +  sample(1:strata.N[2],n[2])
indices[[3]] <- sum(strata.N[1:2]) + sample(1:strata.N[3],n[3])
indices[[4]] <- sum(strata.N[1:3]) + sample(1:strata.N[4],n[4])
indices[[5]] <- sum(strata.N[1:4]) + sample(1:strata.N[5],n[5])

The average population in the sample is computed as below (this value will change from one StS to another).

ind.StS <-unique(unlist(indices))
sample.StS <- gapminder.StS[ind.StS,]  
mean(sample.StS$population)
[1] 24378331

Despite the relative accuracy of the estimate, this naı̈ve approach is not ideal. The estimator \[\overline{y}_{\text{StS}}=\textstyle{\frac{1}{20}(y_1+\cdots+y_{20})}\] implies that each observation had the same probability of being chosen, which is not the case in reality (\(\neg\)SRS).

In the second attempt, the weight of each selected observation depends on the size of the stratum (we will discuss the theoretical details in the next section).

cumul.n = cumsum(n)
cumul.N = cumsum(strata.N)

set.seed(123456) # replicability
indices = list()
indices[[1]] <- sample(1:strata.N[1],n[1])
for(j in 2:length(n)){
    indices[[j]] <- cumul.N[j-1] + sample(1:strata.N[j],n[j])
}
ind.StS <-unique(unlist(indices))
sample.StS <- gapminder.StS[ind.StS,]
sample.StS = sample.StS[order(sample.StS$population),]

y.bar <- list()
y.bar[[1]] <- mean(sample.StS[1:n[1],c("population")])
for(j in 2:length(n)){
    y.bar[[j]] <- mean(sample.StS[(cumul.n[j-1]+1):cumul.n[j],c("population")])
}

y.bar.StS <- 0
for(j in 1:length(n)){
    y.bar.StS <- y.bar.StS + as.numeric(strata.N[j])*y.bar[[j]]
}
y.bar.StS/N
[1] 22668202

The estimate is very close to the actual value of \(\mu\), but a lone point estimate does not tell the full story.

We repeat this procedure 500 times, using the size allocation \((n_1,n_2,n_3,n_4,n_5)=(9,3,3,3,2)\):

set.seed(12) # replicability 
strata.N <- tapply(gapminder.StS$population, gapminder.StS$strata, length)
cumul.N = cumsum(strata.N)

n=c()
n[1] = 9
n[2] = 3
n[3] = 3
n[4] = 3
n[5] = 2
cumul.n = cumsum(n)

m=500
means <- c()
for(k in 1:m){
    indices = list()
    indices[[1]] <- sample(1:strata.N[1],n[1])
    for(j in 2:length(n)){
      indices[[j]] <- cumul.N[j-1] + sample(1:strata.N[j],n[j])
    }
    ind.StS <-unique(unlist(indices))
    sample.StS <- gapminder.StS[ind.StS,]
    sample.StS = sample.StS[order(sample.StS$population),]
  
    y.bar <- list()
    y.bar[[1]] <- mean(sample.StS[1:n[1],c("population")])
    for(j in 2:length(n)){
      y.bar[[j]] = mean(sample.StS[(cumul.n[j-1]+1):cumul.n[j],c("population")])
    }

    y.bar.StS <- 0
    for(j in 1:length(n)){
      y.bar.StS <- y.bar.StS + as.numeric(strata.N[j])*y.bar[[j]]
    }

    means[k] <- y.bar.StS/N 
  }

For each sample \(1 \leq i \leq 500\), we then compute the empirical mean – their distribution has the following characteristics:

summary(means)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
17608174 21602380 22735650 23179372 24655297 29082447 

Finally, we plot the histogram of the StS means (with their mean in red):

ggplot(data=data.frame(means), aes(means)) + geom_rug() +
    geom_histogram(col="black", fill="blue", alpha=.2) + 
    geom_vline(xintercept=mean(means), color="red") 

Not only is the shape of the distribution closer to a normal distribution, compared to the distribution of \(overline{y}\) obtained using SRS, but its variance is also much lower (compare the following image, on the same scale as the corresponding histogram for SRS in Section 5.3.2).

ggplot(data=data.frame(means), aes(means)) + geom_rug() +
  geom_histogram(col="black", fill="blue", alpha=.2) + xlim(0,60000000) + 
  geom_vline(xintercept=mean(means), color="red") 

5.4.1 Estimators and Confidence Intervals

Assume that we are yet again interested in a finite population \(\mathcal{U}=\{u_1,\ldots,u_N\}\), whose expectation is \(\mu\) and variance is \(\sigma^2\). Suppose that we can cover the population with \(M\) disjoint strata, containing, respectively, \(N_1,\ldots,N_M\) units: \[\mathcal{U}_1=\{u_{1,1},\ldots, u_{1,N_1}\}, \ \cdots \ , \ \mathcal{U}_M=\{u_{M,1},\ldots, u_{M,N_M}\},\] with stratum mean and stratum variance \[\mu_i={\frac{1}{N_i}\sum_{j=1}^{N_i}u_{i,j}}\quad\text{and}\quad \sigma_i^2={\frac{1}{N_i}\sum_{j=1}^{N_i}u_{i,j}^2-\mu_i^2},\quad 1\leq i\leq M. \]

A stratified sample \(\mathcal{Y}\) of size \(n\leq N\) is a subset of the target population \(\mathcal{U}\), with\(n_1+\cdots + n_M=n\) and \(n_i\leq N_i\) for \(1\leq i\leq M\): \[\{\underbrace{\strut y_{1,1},\ldots, y_{1,n_1}}_{{\text{$\in$ strate $\mathcal{U}_1$}}},\ldots,\underbrace{\strut y_{M,1},\ldots, y_{M,n_m}}_{{\text{$\in$ strate $\mathcal{U}_M$}}}\}\subseteq \bigcup_{i=1}^M\mathcal{U}_i=\mathcal{U}.\]

If every sample \(\mathcal{Y}_i=\{y_{i,j}\mid 1\leq j\leq n_i\}\) is drawn from the corresponding stratum \(\mathcal{U}_i\) via a SRS, independently from one stratum to another, we obtain a stratified random sample (StS) of size \(n\). The sample mean and the sample variance (which it is important to remember is not the same thing as the “sampling variance” of an estimator) of \(\mathcal{Y}_i\) are denoted by \(\overline{y}_i\) et \(s_i^2\), respectively.

In a StS design, each observation in a stratum has the same probability of being selected, but this probability may differ from one stratum to another.

Schematics of StS: target population (left) and sample (right).Schematics of StS: target population (left) and sample (right).

Figure 5.10: Schematics of StS: target population (left) and sample (right).

Mean \(\mu\)

In a StS, the sample mean of the observations of the sample \(\mathcal{Y}\) falling in the stratum \(\mathcal{U}_i\) is an estimator of \(\mu_i\) given by \[\overline{y}_i={\frac{1}{n_i}\sum_{\ell=1}^{n_i}y_{i,\ell}},\quad \text{where }n_i=|\mathcal{U}\cap \mathcal{Y}_i| ,\ 1\leq i\leq M.\]

The true mean \(\mu\) and the StS estimator of \(\mu\) are thus: \[\mu=\frac{1}{N}\sum_{j=1}^Nu_{j}={\frac{1}{N}\sum_{i=1}^M\sum_{j=1}^{N_i}u_{i,j}=\frac{1}{N}\sum_{i=1}^MN_i\mu_i}\quad \text{and}\quad \overline{y}_{\text{StS}}={\frac{1}{N}\sum_{i=1}^MN_i\overline{y}_i}.\] For the sake of completeness, the SRS estimator is sometimes denoted by \(\overline{y}_{\text{SRS}}\).

Since \(\mathcal{Y}_i\) is a SRS drawn from \(\mathcal{U}_i\), we have: \[\text{E}(\overline{y}_i)={\mu_i}\quad\text{and}\quad \text{V}(\overline{y}_i)={\frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)},\quad\text{pour } 1\leq i\leq M.\]

The expectation of the StS estimator is thus: \[\text{E}\left(\overline{y}_{\text{StS}}\right)=\text{E}\left({\frac{1}{N}\sum_{i=1}^MN_i\overline{y}_i}\right)={\frac{1}{N}\sum_{i=1}^MN_i\text{E}(\overline{y}_i)=\frac{1}{N}\sum_{i=1}^MN_i\mu_i=\mu},\] which is to say that \(\overline{y}_{\text{StS}}\) is an unbiased estimator of the true mean \(\mu\) for a population of size \(N\) with variance \(\sigma^2\).59

The sampling variance of the estimator \(\overline{y}_{\text{StS}}\) is \[\begin{aligned} \text{V}\left(\overline{y}_{\text{StS}}\right)&=\text{V}\left({\frac{1}{N}\sum_{i=1}^MN_i\overline{y}_i}\right)={\frac{1}{N^2}\sum_{i=1}^M N_i^2\text{V}(\overline{y}_i)+\sum_{i\neq i'}^MN_iN_{i'}\underbrace{\strut \text{Cov}(\overline{y}_i,\overline{y}_{i'})}_{\text{$=0$}}} \\ &= {\frac{1}{N^2}\sum_{i=1}^M N_i^2\text{V}(\overline{y}_i)} = \frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big).\end{aligned}\]

Central Limit Theorem – StS: if \(n\), \(N-n\), \(n_i\), and \(N_i-n_i\) are all sufficiently large, for all i, then \[\overline{y}_{\text{StS}}\sim_{\text{approx.}}{\mathcal{N}\left(\text{E}(\overline{y}_{\text{StS}}),\text{V}(\overline{y}_{\text{StS}})\right)=\mathcal{N}\left(\mu,\frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)\right)}.\]

In a StS, the bound on the error of estimation is \[B_{\mu;\text{StS}}=2\sqrt{\text{V}(\overline{y}_{\text{StS}})}={2\sqrt{\frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}}\] and the corresponding approximate 95% C.I. for \(\mu\) is \[\text{C.I.}_{\text{StS}}(\mu;0.95):\quad {\overline{y}_{\text{StS}}\pm B_{\mu;\text{StS}}}.\]

In practice, the population variance \(\sigma^2\) is rarely known (as is the variance \(\sigma_i^2\) in each stratum \(\mathcal{U}_i\), \(1\leq i\leq M\)). We then use the sample variance (and the corresponding finite population correction factor).

In each stratum, the empirical variance \(s_i^2\) is \[s_i^2={\frac{1}{n_i-1}\sum_{\ell=1}^{n_i}(y_{i,\ell}-\overline{y}_i)^2=\frac{1}{n_i-1}\Big[\sum_{\ell=1}^{n_i}y_{i,\ell}^2-n_i\overline{y}_i^2\Big]},\quad 1\leq i\leq M.\]

We can then approximate the sampling variance in \(\mathcal{U}_i\) as we did for a SRS, using \[\hat{\text{V}}(\overline{y}_i)={\frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}, \quad 1\leq i\leq M.\]

The sampling variance of the estimator \(\overline{y}_{\text{StS}}\) is thus \[\hat{\text{V}}(\overline{y}_{\text{StS}})={\frac{1}{N^2}\sum_{i=1}^M N_i^2\text{V}(\overline{y}_i)=\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot\frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}.\]

The bound of the estimation error is approximated by \[B_{\mu;\text{StS}}\approx \hat{B}_{\mu;\text{StS}}=2\sqrt{\hat{\text{V}}(\overline{y}_{\text{StS}})}={2\sqrt{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot\frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}},\] whence \[\text{C.I.}_{\text{StS}}(\mu;0.95): \quad {\overline{y}_{\text{StS}}\pm \hat{B}_{\mu;\text{StS}}\equiv \overline{y}_{\text{StS}}\pm 2\sqrt{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot\frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}}\] is an approximate 95% C.I. for \(\mu\).

In practice, when the stratum sampling rate \(\frac{n_i}{N_i}\) is below 5%, we can drop the FPCF in the corresponding stratum.

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), separated in two disjoint strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of respective sizes \(N_1=21,123\) and \(N_2=16,321\). A StS sample \(\mathcal{Y}\) of size \(n=132\) is drawn from \(\mathcal{U}\), with \(n_1=82\) and \(n_2=50\). Suppose the empirical mean and standard deviation in \(\mathcal{Y}_1\) and \(\mathcal{Y}_2\) are \(\overline{y}_1=120.7\), \(\overline{y}_2=96.6\), \(s_1=18.99\), and \(s_2=14.31\), respectively. Find a 95% C.I. for the mean \(\mu\) of \(\mathcal{U}\).

Solution: the bound on the error of estimation is \(\approx\hat{B}_{\mu;\text{StS}}=2\sqrt{\hat{\text{V}}(\overline{y}_{\text{StS}})}\): \[2\sqrt{\frac{21123^2}{37444^2}\cdot\frac{18.99^2}{82}\Big(1-\frac{82}{21123}\Big)+\frac{16321^2}{37444^2}\cdot\frac{14.31^2}{50}\Big(1-\frac{50}{16321}\Big)}\approx 2.95,\] whence \(\text{C.I.}_{\text{StS}}(\mu;0.95)\approx \big(\textstyle{\frac{21,123(120.7)}{37,444}}+\textstyle{\frac{16,321(96.6)}{37,444}}\big)\pm 2.95\equiv (107.25,113.14).\) \(\blacksquare\)

Example: find a 95% confidence interval for the average life expectancy by country in 2011 (including India and China), using a StS of size \(n=20\) (stratifying using the country populations, as we did earlier in this section).

Solution: we can basically re-use the same code:

LE.1 <- gapminder |> filter(year==2011) |>
    select(population,life_expectancy)
summary(LE.1)
   population        life_expectancy
 Min.   :5.644e+04   Min.   :46.70  
 1st Qu.:2.064e+06   1st Qu.:65.30  
 Median :7.563e+06   Median :73.70  
 Mean   :3.708e+07   Mean   :71.18  
 3rd Qu.:2.423e+07   3rd Qu.:77.40  
 Max.   :1.348e+09   Max.   :83.02  

The average life expectancy is \(\mu=71.18\). We now prepare the strata according to the population, and we sort the observations from the smallest population to the largest:

LE.1 <- LE.1 |> mutate(strata = ifelse(population<10000000,"S1", ifelse(population<25000000,"S2", ifelse(population<50000000,"S3", ifelse(population<100000000,"S4","S5")))))
LE.1 <- LE.1[order(LE.1$population),]
LE.1$strata <- as.factor(LE.1$strata)

The number of countries in each stratum is given below:

(strata.N <- tapply(LE.1$life_expectancy, LE.1$strata, length))
 S1  S2  S3  S4  S5 
105  35  21  13  11 

Unfortunately, the life expectancy distributions of each stratum overlap to a great extent: this is not a good sign as it suggests that a country’s population is not aligned with its life expectancy (and so the strata are not heterogeneous with respect to life expectancy).

ggplot(LE.1,aes(x=life_expectancy,fill=strata)) + 
    geom_density(alpha=0.5) + geom_rug() 

Since there are \(N=185\) observations in the data set, a sample of size \(n=20\), allocated in such a way as to maintain the relative frequencies of the number of observations in each \(\mathcal{U}_i\) (this is known as proportional allocation), would have the following stratum allocation:

N=sum(strata.N)
strata.N/sum(strata.N)*20
       S1        S2        S3        S4        S5 
11.351351  3.783784  2.270270  1.405405  1.189189 

In practice, we prefer to have at least 2 observations per stratum, so we might use \((n_1,n_2,n_3,n_4,n_5)=(11,3,2,2,2)\).

n=c(11,3,2,2,2)

We select a StS sample \(\mathcal{Y}\) with these characteristics via:

cumul.n = cumsum(n)
cumul.N = cumsum(strata.N)

set.seed(123456) # replicability
indices = list()
indices[[1]] <- sample(1:strata.N[1],n[1])
for(j in 2:length(n)){
    indices[[j]] <- cumul.N[j-1] + sample(1:strata.N[j],n[j])
  }

ind.LE.1 <-unique(unlist(indices))

sam.LE.1 <- LE.1[ind.LE.1,]
sam.LE.1 <- sam.LE.1[order(sam.LE.1$population),]

Next, we compute the mean \(\overline{y}_i\) and the standard deviation \(s_i\) in each bucket \(\mathcal{Y}_i\), \(1\leq i\leq 5\).

y.bar <- list()
std.dev <- list()
y.bar[[1]] <- mean(sam.LE.1[1:n[1],c("life_expectancy")])
std.dev[[1]] <- sd(sam.LE.1[1:n[1],c("life_expectancy")])
for(j in 2:length(n)){
    y.bar[[j]] <- mean(sam.LE.1[(cumul.n[j-1]+1):cumul.n[j],c("life_expectancy")])
    std.dev[[j]] <- sd(sam.LE.1[(cumul.n[j-1]+1):cumul.n[j],c("life_expectancy")])
}

rbind(y.bar,std.dev)
        [,1]     [,2]     [,3]     [,4]     [,5]    
y.bar   70.83636 71.6     67.55    72.15    76.2    
std.dev 7.551327 3.774917 18.45549 2.757716 9.050967

There is not much variation in the means, but the standard deviation values are all over the place: this is due to small sample sizes in some strata, and overlapping distributions of life expectancy by strata.

As we’ve already mentioned, the stratification of countries by population does not align with the estimate of mean life expectancy.60. The estimator \(\overline{y}_{\text{StS}}\) is:

mean.LE.1 <- 0
for(j in 1:length(n)){
    mean.LE.1 <- mean.LE.1 + as.numeric(strata.N[j])*y.bar[[j]]
}
(mean.LE.1 <- mean.LE.1/N)
[1] 71.01902

This is fairly close to the true \(\mu\). The bound on the error of estimation \(\hat{B}_{\mu;\text{StS}}\) is:

B=0
for(j in 1:length(n)){
    B <- B + as.numeric((strata.N[j]/N)^2*std.dev[[j]]^2/n[j]*(1-n[j]/strata.N[j]))
  }
(B <- 2*sqrt(B))
[1] 3.883388

This is fairly large, all things considered. The 95% C.I. is thus:

c(mean.LE.1 - B, mean.LE.1 + B)
[1] 67.13563 74.90241

Compare with the \(\text{C.I.}_{\text{SRS}}(\mu;0.95)\) obtained previously – it was much narrower. This is no doubt due to stratification on the basis of population being a poor choice when dealing with life expectancy.

Example: find a 95% confidence interval for the average life expectancy by country in 2011 (including India and China), using a StS of size \(n=20\) (stratifying using the country life expectations.61

Solution: we make some modifications to the code, this time using the following strata: \[\mathcal{U}_1=\{u_{j} \mid u_j<70\},\quad \mathcal{U}_2=\{u_{j} \mid 70\leq u_j<80\},\quad\mathcal{U}_3=\{u_{j} \mid u_j\geq 80\}.\]

LE.2 <- gapminder |> filter(year==2011) |>  select(life_expectancy)
LE.2 <- LE.2 |> mutate(strata = ifelse(life_expectancy<70,"S1", 
                                        ifelse(life_expectancy<80,"S2","S3")))
LE.2 <- LE.2[order(LE.2$life_expectancy),]
LE.2$strata <- as.factor(LE.2$strata)
(strata.N <- tapply(LE.2$life_expectancy, LE.2$strata, length))
S1 S2 S3 
65 93 27 

By construction, the life expectancy distributions do not overlap from stratum to stratum.

ggplot(LE.2,aes(x=life_expectancy,fill=strata)) + 
    geom_density(alpha=0.5) + geom_rug(aes(color=life_expectancy)) 

Since there are \(N=185\) observations in the data set, with \((N_1,N_2,N_3)=(65,93,27)\), a sample of size \(n=20\) could be drawn according to the followin allocation:

N=sum(strata.N)
strata.N/sum(strata.N)*20
       S1        S2        S3 
 7.027027 10.054054  2.918919 

We use \((n_1,n_2,n_3)=(7,10,3)\).

n=c(7,10,3)

The rest of the code is used as it was in the previous example.

cumul.n = cumsum(n)
cumul.N = cumsum(strata.N)

set.seed(123456) # replicability
indices = list()
indices[[1]] <- sample(1:strata.N[1],n[1])
for(j in 2:length(n)){
    indices[[j]] <- cumul.N[j-1] + sample(1:strata.N[j],n[j])
  }

ind.LE.2 <-unique(unlist(indices))

sam.LE.2 <- LE.2[ind.LE.2,]
sam.LE.2 <- sam.LE.1[order(sam.LE.2$life_expectancy),]

y.bar <- list()
std.dev <- list()
y.bar[[1]] <- mean(sam.LE.2[1:n[1],c("life_expectancy")])
std.dev[[1]] <- sd(sam.LE.2[1:n[1],c("life_expectancy")])
for(j in 2:length(n)){
    y.bar[[j]] <- mean(sam.LE.2[(cumul.n[j-1]+1):cumul.n[j],c("life_expectancy")])
    std.dev[[j]] <- sd(sam.LE.2[(cumul.n[j-1]+1):cumul.n[j],c("life_expectancy")])
  }

With this sample \(\mathcal{Y}\), the strata means and standard deviations are:

rbind(y.bar,std.dev)
        [,1]     [,2]     [,3]    
y.bar   71.5     70.27    74.2    
std.dev 8.469553 7.721838 7.277362

These quantities are more reasonable than with the previous stratification (why?), but they could change from one sample to the next. The values for \(\overline{y}_{\text{StS}}\) and \(\hat{B}_{\mu;\text{StS}}\) are:

mean.LE.2 <- 0
for(j in 1:length(n)){
    mean.LE.2 <- mean.LE.2 + as.numeric(strata.N[j])*y.bar[[j]]
}
(mean.LE.2 <- mean.LE.2/N)

B=0
for(j in 1:length(n)){
    B <- B + as.numeric((strata.N[j]/N)^2*std.dev[[j]]^2/n[j]*(1-n[j]/strata.N[j]))
}
(B <- 2*sqrt(B))
[1] 71.27573
[1] 3.35133

The estimator is farily close to the true value \(\mu=71.18\), but it is when calculating the bound on the error of estimation that the StS approach proves superior: the 95% C.I. for \(\mu\) is:

c(mean.LE.2 - B, mean.LE.2 + B)
[1] 67.92440 74.62706

These examples show that stratified sampling can improve SRS estimation, but that this is not always going to be the case.

Total \(\tau\)

Most of the work has already been done: since the total \(\tau\) can be re-written as \[\tau=\sum_{j=1}^Nu_j=N\mu,\] we can estimate the total with a StS using: \[\hat{\tau}_{\text{StS}}={N\overline{y}_{\text{StS}}=\frac{N}{N}\sum_{i=1}^MN_i\overline{y}_i=\sum_{i=1}^MN_i\overline{y}_i}.\]

It is an unbiased estimator of the total since its expectation is \[\text{E}(\hat{\tau}_{\text{StS}})={\text{E}(N\overline{y}_{\text{StS}})=N\cdot \text{E}(\overline{y}_{\text{StS}})=N\mu=\tau}.\] Its sampling variance is \[\text{V}(\hat{\tau}_{\text{StS}})={\text{V}(N\overline{y}_{\text{StS}})=N^2\cdot \text{V}(\overline{y}_{\text{StS}})=\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)},\] assuming that we know the variance \(\sigma_i^2\) in each strata \(\mathcal{U_i}\), \(1\leq i\leq M\), whence the bound on the error of estimation is \[B_{\tau;\text{StS}}=2\sqrt{\text{V}(\hat{\tau}_{\text{StS}})}={2\sqrt{\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}=N\cdot B_{\mu;\text{StS}}}.\]

Since the variances \(\sigma_i^2\) are usually unknown, we often substitute \(\sigma_i^2\) with the empirical stratum variance \(s_i^2\) calculated from the sample, which is multiplied by the correction factor \({\frac{N_i-1}{N_i}}\), \(1\leq i\leq M\). The approximation of the sampling variance is then \[\hat{\text{V}}(\hat{\tau}_{\text{StS}})={\hat{\text{V}}(N\overline{y}_{\text{StS}})=\sum_{i=1}^MN_i^2\cdot \frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)},\] whence the bound on the error of estimation is \[B_{\tau;\text{StS}}\approx\hat{B}_{\tau;\text{StS}}=2\sqrt{\hat{\text{V}}(\hat{\tau}_{\text{StS}})}={2\sqrt{\sum_{i=1}^MN_i^2\cdot \frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}=N\cdot\hat{B}_{\mu;\text{StS}}},\] and the approximate 95% C.I. for \(\tau\) is \[\text{C.I.}_{\text{StS}}(\tau;0.95):\quad {\hat{\tau}_{\text{StS}}\pm \hat{B}_{\tau;\text{StS}}}.\]

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), split into two strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of sizes \(N_1=21,123\) and \(N_2=16,321\), respectively. A StS \(\mathcal{Y}\) of size \(n=132\) is drawn from \(\mathcal{U}\), with \(n_1=82\) and \(n_2=50\). Suppose the empirical mean and standard deviation in \(\mathcal{Y}_1\) and \(\mathcal{Y}_2\) are \(\overline{y}_1=120.7\), \(\overline{y}_2=96.6\), \(s_1=18.99\), and \(s_2=14.31\), respectively. Find a 95% C.I. of the total \(\tau\) in \(\mathcal{U}\).

Solution: The bound on the error of estimation is \(\approx\hat{B}_{\tau;\text{StS}}=2\sqrt{\hat{\text{V}}(\hat{\tau}_{\text{StS}})}\): \[2\sqrt{21123^2\cdot\frac{18.99^2}{82}\Big(1-\frac{82}{21123}\Big)+16321^2\cdot\frac{14.31^2}{50}\Big(1-\frac{50}{16321}\Big)}\approx 110312.3,\] so \(\text{C.I.}_{\text{StS}}(\tau;0.95)\approx 21123(120.7)+16321(96.6)\pm 110312.3\approx (4015842,4236467).\) \(\blacksquare\)

Proportion \(p\)

If the response \(u_{i,\ell} \in \{0,1\}\) represents the absence or the presence of a certain characteristic for the \(\ell\)th unit in the \(i\)th strata \(\mathcal{U}_i\), the mean \[p=\mu=\frac{1}{N}\sum_{i=1}^M\sum_{\ell=1}^{N_i}u_{i,\ell}\] is the proportion of all units in \(\mathcal{U}\) which posess the characteristic.

This proportion can be estimated with a StS via \[\hat{p}_{\text{StS}}={\frac{1}{N}\sum_{i=1}^MN_i\hat{p}_i,\quad\text{where } \hat{p}_i=\frac{1}{n_i}\sum_{\ell=1}^{n_i}u_{i,\ell}, \ 1\leq i\leq M}.\]

This is an unbiased estimator of \(p\) since \[\text{E}(\hat{p}_{\text{StS}})={\text{E}(\overline{y}_{\text{StS}})=\mu=p}.\]

Its sampling variance is: \[\begin{aligned} \text{V}(\hat{p}_{\text{StS}})&={\text{V}(\overline{y}_{\text{StS}})= \frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}\\&={\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{p_i(1-p_i)}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)},\end{aligned}\] where \(\sigma_i^2=p_i(1-p_i)\) is the variance of the response variable \(u\) in the stratum \(\mathcal{U}_i\).

The bound on the error of estimation is \[B_{p;\text{StS}}=2\sqrt{\text{V}(\hat{p}_{\text{StS}})}={2\sqrt{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{p_i(1-p_i)}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}}.\]

Since the proportions \(p_i\) are not usually known, the approximate sampling variance is used instead: \[\hat{\text{V}}(\hat{p}_{\text{StS}})={\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\hat{p}_i(1-\hat{p}_i)}{n_i-1}\Big(1-\frac{n_i}{N_i}\Big)}.\]

The approximate bound on the error of estimation is thus \[B_{p;\text{StS}}\approx\hat{B}_{p;\text{StS}}=2\sqrt{\hat{\text{V}}(\hat{p}_{\text{StS}})}={\frac{2}{N}\sqrt{\sum_{i=1}^MN_i^2\cdot \frac{\hat{p}_i(1-\hat{p}_i)}{n_i-1}\Big(1-\frac{n_i}{N_i}\Big)}},\] and the corresponding approximate 95% C.I. for \(p\) is \[\text{C.I.}_{\text{StS}}(p;0.95):\quad {\hat{p}_{\text{StS}}\pm \frac{2}{N}\sqrt{\sum_{i=1}^MN_i^2\cdot \frac{\hat{p}_i(1-\hat{p}_i)}{n_i-1}\Big(1-\frac{n_i}{N_i}\Big)}}.\]

If the sample size in a stratum is too small, we can use the conservative estimate \(\hat{p}_i=0.5\).

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), split into two strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of sizes \(N_1=21,123\) and \(N_2=16,321\), respectively. A StS \(\mathcal{Y}\) of size \(n=132\) is drawn from \(\mathcal{U}\), with \(n_1=82\) and \(n_2=50\). Suppose that \(n_1=20\) of the observations from \(\mathcal{Y}_1\) and \(n_2=5\) of the observations from \(\mathcal{Y}_2\) possess a certain characteristic. Find a 95% C.I. for the proportion \(p\) of the units in \(\mathcal{U}\) that possess the characteristic.

Solution: In this case, \(\hat{p}_1=20/82\approx 0.24\) and \(\hat{p}_2=5/50=0.10\), from which we obtain \[\hat{p}_{\text{StS}}=\frac{21123}{37444}(0.24)\frac{16321}{37444}(0.10)=0.181.\] The bound on the error of estimation is thus \[\textstyle{\hat{B}_p=\frac{2}{37444}\sqrt{21123^2\frac{0.24(1-0.24)}{82-1}\Big(1-\frac{82}{21123}\Big)+16321^2\frac{0.1(1-0.1)}{50-1}\Big(1-\frac{50}{16321}\Big)}\approx 0.0654},\] from which we conclude that \[\text{C.I.}(p;0.95)\approx 0.181\pm 0.0654 \equiv (0.116,0.247).\quad \blacksquare \]

5.4.2 Sample Size and Allocation

When determining the size of a StS sample \(\mathcal{Y}\), one must also consider the problem of allocating the number of units \(n_i\) in each stratum \(\mathcal{Y}_i\). If \(|\mathcal{Y}_i|=n_i\), \(1\leq i\leq M\), then \(n=n_1+\cdots+n_M.\) But how are the \(n_i\) determined?

In a StS, the sampling variance of the estimator \(\overline{y}_{\text{StS}}\) is \[\text{V}(\overline{y}_{\text{StS}})=\frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big).\]

When \(N_i\gg 1\), then \(N_i\approx N_i-1\) and \[\text{V}(\overline{y}_{\text{StS}})\approx {\frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i}\Big)=\frac{1}{N^2}\sum_{i=1}^M N_i^2\cdot\frac{\sigma_i^2}{n_i}-\frac{1}{N^2}\sum_{i=1}^M N_i\sigma_i^2}.\] Since the sampling variance \(\text{V}(\overline{y}_{\text{StS}})\) determines the bound on the error of estimation \(\hat{B}_{\mu;\text{StS}}\), we can minimize the bound (and thus the error) by minimizing the sampling variance. The quantities \(N\), \(N_i\), \(\sigma_i^2\), are fixed for \(1\leq i\leq M\); what we minimize against is the sample size \(n\) and the allocation \(n_i\) in each stratum.

The total cost of the survey \(\tilde{C}\) can also affect the allocation. The survey budget includes the overhead cost (indirect costs) \(c_0\) and the cost per response \(c_i\) in each stratum \(\mathcal{U}_i\), \(1\leq i\leq M\). The total cost is thus \[\tilde{C}=c_0+\sum_{i=1}^Mc_in_i,\] which must remain below than available survey budget \(C\).

The allocation problem is an optimization problem: we seek to solve \[\arg_{(n,n_1,\ldots,n_M)}\min \text{V}(\overline{y}_{\text{StS}}),\quad \text{subject to } \tilde{C}\leq C.\] We use the method of Lagrange multipliers. The objective function becomes \[\begin{aligned} f(n_1,\ldots,n_M,\lambda)&=\text{V}(\overline{y}_{\text{StS}})+\lambda (\tilde{C}-C)\\ &={\frac{1}{N^2}\sum_{k=1}^M N_i^2\cdot\frac{\sigma_k^2}{n_k}-\frac{1}{N^2}\sum_{k=1}^M N_k\sigma_k^2+\lambda(c_0+\sum_{k=1}^Mc_kn_k-C)}.\end{aligned}\]

Its critical points solve \[\begin{aligned} 0&=\frac{\partial f(n_1,\ldots,n_M,\lambda)}{\partial n_i}= {\frac{1}{N^2}\sum_{k=1}^M N_k^2\sigma_k^2\frac{\partial (1/n_k)}{\partial n_i}+\lambda\sum_{k=1}^Mc_k\frac{\partial (n_k)}{\partial n_i}}\\ &={-\frac{N_i^2\sigma_i^2}{N^2n_i^2}+\lambda c_i},\quad 1\leq i\leq M,\end{aligned}\] which is to say that \[n_i={\frac{N_i\sigma_i}{N\sqrt{\lambda}\sqrt{c_i}}},\quad 1\leq i\leq M.\]

The strata sampling weights \(w_i\) are \[w_{i}={\frac{n_i}{n_1+\cdots+n_M}},\quad 1\leq i\leq M.\] The general optimal allocation is thus \[w_{i}={\frac{n_i}{n}=\frac{\displaystyle{\frac{N_i\sigma_i}{N\sqrt{\lambda}\sqrt{c_i}}}}{\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{N\sqrt{\lambda}\sqrt{c_k}}}}=\frac{\displaystyle{\frac{N_i\sigma_i}{\sqrt{c_i}}}}{\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}}}},\quad 1\leq i\leq M.\]

Once we have determined the size \(n\) of the sample \(\mathcal{Y}\), we compute the size of the sample \(n_i\) in each \(\mathcal{Y}_i\) using \(w_i \cdot n\), \(1\leq i\leq M\). Since the product \(w_i\cdot n\) is not typically an integer, we allocate \([w_i\cdot n]\) units to each \(\mathcal{Y}_i\), and distribute the remaining \[n-[w_1\cdot n]-\cdots-[w_M\cdot n]\] units using “common sense” (while ensuring that \(\tilde{C}\leq C\)).

If the cost per response in each stratum is constant, \(c_1=\cdots=c_M\), Neyman allocation yields the following stratum sampling weights: \[w_i=\frac{n_i}{n}={\frac{N_i\sigma_i}{\displaystyle{N_1\sigma_1+\cdots+N_M\sigma_M}}},\quad 1\leq i\leq M.\]

If moreover the variance is the same in each stratum, \(\sigma_1^2=\cdots=\sigma_M^2\), proportional allocation yields the following stratum sampling weights: \[w_i=\frac{n_i}{n}={\frac{N_i}{N_1+\cdots+N_M}=\frac{N_i}{N}},\quad 1\leq i\leq M.\]

Once the sample size and allocation have been selected, the methods in the previous section can be used to provide confidence intervals for the mean \(\mu\), for the total \(\tau\), or for a proportion \(p\). When the variances are unknown, the usual approximations can be used.

We may at times use allocation schemes that are not necessarily ideal from a technical perspective, but which facilitate the preparation of reports or the dissemination of results: \[w_i=\frac{n_i}{n}={\frac{f(N_i)}{f(N_1)+\cdots+f(N_M)}},\quad 1\leq i\leq M, \ f\ \text{a random function}.\]

For intance, when studying Canadian populations, we often stratify according to the provinces and use \(f(x)=\sqrt{x}\); the proportional allocation and square root allocation sampling weights for the 13 Canadian jurisdictions (based on 2022 population data) are shown below.

Sampling weights for Canadian provinces, under proportional allocation and square root allocation (racine, in French).

Figure 5.11: Sampling weights for Canadian provinces, under proportional allocation and square root allocation (racine, in French).

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), separated in two disjoint strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of respective sizes \(N_1=21,123\) and \(N_2=16,321\). We seek to estimate the mean \(\mu\) of \(\mathcal{U}\) using a StS. The survey budget allows for a sample size \(n=132\). In a preliminary study, we estimated \(\sigma_1\approx 20\) and \(\sigma_2\approx 15\). If the cost of a response in the first stratum is four times that of the cost of a response in the second stratum, find the general optimal allocation. If the response cost per stratum is constant, determine the Neyman and the proportional allocations.

Solution: in the general case, we have \(c_1=4c_2\), \[\frac{N_1\sigma_1}{\sqrt{c_1}}={\frac{21123(20)}{\sqrt{4c_2}}=\frac{211230}{\sqrt{c_2}}}, \quad \frac{N_2\sigma_2}{\sqrt{c_2}}={\frac{16321(15)}{\sqrt{c_2}}=\frac{244815}{\sqrt{c_2}}},\] and \[\frac{N_1\sigma_1}{\sqrt{c_1}}+\frac{N_2\sigma_2}{\sqrt{c_2}}={\frac{211230}{\sqrt{c_2}}+ \frac{244815}{\sqrt{c_2}}=\frac{456045}{\sqrt{c_2}}},\] from which we conclude that \[n_1={132\left(\frac{211230}{456045}\right)=61.13}\quad \text{and}\quad n_2={132\left(\frac{244815}{456045}\right)=70.87};\] the general optimal allocation is thus \(n_1={61}\) and \(n_2={71}\).

If the cost for a response is the same in both strata, \(c_1=c_2\), then we have \[N_1\sigma_1={21123(20)=422460}, \quad N_2\sigma_2={16321(15)=244815},\] and \[N_1\sigma_1+N_2\sigma_2={422460+ 244815=667275},\] from which we conclude that \[n_1={132\left(\frac{422460}{667275}\right)=83.57}\quad \text{and}\quad n_2={132\left(\frac{244815}{667275}\right)=48.43};\] the Neyman allocation is thus \(n_1={84}\) and \(n_2={48}\).

If we do not trust the study conducted beforehand, and we assume that the variance is constant in each stratum, \(\sigma_1=\sigma_2\), then we have \[N_1={21123}, \quad N_2={16321},\] and \[N_1+N_2={21123+16321=37444},\] from which we conclude that \[n_1={132\left(\frac{21123}{37444}\right)=74.46}\quad \text{and}\quad n_2={132\left(\frac{16321}{37444}\right)=57.54};\] la proportional allocation is thus \(n_1={74}\) and \(n_2={58}\). \(\blacksquare\)

Sample Size, Given a Bound on the Error of Estimation

In theory, only analytical considerations should influence the choice of sample size.

Recall that in a StS of size \(n\), the sampling weight corresponding to the stratum \(\mathcal{U}_i\) is \(w_i=frac{n_i}{n}\), for \(1\leq i\leq M\). When we estimate \(\mu\) via \(\overline{y}_{\text{StS}}\), the bound on the error of estimation can be written \[B_{\mu;\text{StS}}={2\sqrt{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{w_i\cdot n}\Big(\frac{N_i-w_i\cdot n}{N_i-1}\Big)}}.\] We seek to express \(n\) in terms of the parameters \(N_i\), \(\sigma_i\), \(w_i\), and \(B_{\mu;\text{StS}}\).

If \(N_i \gg 1\) (what is hopefully the casein practice), then \(N_i\approx N_i-1\) and so \[\begin{aligned} {\underbrace{\strut \frac{B^2_{\mu;\text{StS}}}{4}}_{=D_{\mu;\text{StS}}}}&{\approx \frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{w_i\cdot n}\Big(\frac{N_i-w_i\cdot n}{N_i}\Big)} \\ & \Longleftrightarrow {N^2D_{\mu;\text{StS}}\approx\frac{1}{n}\left\{\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{w_i}\right\}-\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{w_i}\cdot \frac{w_i}{N_i}} \\ & \Longleftrightarrow {\frac{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}{\displaystyle{\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{w_i}}}\approx \frac{1}{n}} \Longleftrightarrow n_{\mu;\text{StS}}\approx \frac{\displaystyle{\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{w_i}}}{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}.\end{aligned}\]

Under general optimal allocation, the stratum sampling weights are given by \[w_i= \displaystyle{\frac{N_i\sigma_i}{\sqrt{c_i}}}\left(\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}}\right)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M,\] and the sample size is then \[\begin{aligned} n_{\mu;\text{StS}}&\approx{\frac{\displaystyle{\left(\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{N_i\sigma_i/\sqrt{c_i}} \right) \div \left(\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}\right)^{-1}}}{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}} = \frac{\displaystyle{\left(\sum_{i=1}^MN_i\sigma_i\sqrt{c_i}\right) \left(\sum_{i=1}^M\frac{N_i\sigma_i}{\sqrt{c_i}}\right)}}{{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}}\end{aligned}\]

Under Neyman allocation, the stratum sampling weights are given by \[w_i= \displaystyle{N_i\sigma_i}\left(\displaystyle{\sum_{k=1}^MN_k\sigma_k}\right)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M,\] and the sample size is then \[\begin{aligned} n_{\mu;\text{StS}}&\approx{\frac{\displaystyle{\left(\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{N_i\sigma_i} \right) \div \left(\sum_{k=1}^MN_k\sigma_k\right)^{-1}}}{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}} = \frac{\displaystyle{\Big(\sum_{i=1}^MN_i\sigma_i\Big)^2}}{{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}}\end{aligned}\]

In a proportional allocation scenario, the stratum sampling weights are given by \[w_i= \displaystyle{N_i}\left(\displaystyle{\sum_{k=1}^MN_k}\right)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M,\] and the sample size is then \[\begin{aligned} n_{\mu;\text{StS}}&\approx{\frac{\displaystyle{\left(\sum_{i=1}^M\frac{N_i^2\sigma_i^2}{N_i} \right) \div \left(\sum_{k=1}^MN_k\right)^{-1}}}{N^2D_{\mu;\text{StS}}+\displaystyle{\sum_{i=1}^M N_i\sigma_i^2}}} = \frac{\displaystyle{\sum_{i=1}^MN_i\sigma_i^2}}{{ND_{\mu;\text{StS}}+\displaystyle{\frac{1}{N}\sum_{i=1}^M N_i\sigma_i^2}}}\end{aligned}\]

When we try to estimate the total \(\tau\) using the estimator \(\hat{\tau}_{\text{StS}}\), we must substitute \[D_{\mu;\text{StS}}=\frac{B^2_{\mu;\text{StS}}}{4}\quad\text{by} \quad D_{\tau;\text{StS}}=\frac{B^2_{\tau;\text{StS}}}{4N^2}.\]

When we want to estimate a proportion \(p\) using the estimator \(\hat{p}_{\text{StS}}\), the bound remains \[D_{p;\text{StS}}=\frac{B^2_{p;\text{StS}}}{4},\] but we have to substitute the stratum variances \(\sigma_i^2\) by \(p_i(1-p_i)\). The proportions \(p_i\) can be estimated with the help of a previous study, or, conservatively, by using \(p_i=0.5\).

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), separated in two disjoint strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of respective sizes \(N_1=21,123\) and \(N_2=16,321\). We seek to estimate the mean \(\mu\) of \(\mathcal{U}\) using a StS, with a bound on the error of estimation of \(B_{\mu;\text{StS}}=5\). The response costs by stratum are \(c_1=400\$\) and \(c_2=100\$\). In a preliminary study, we estimated \(\sigma_1\approx 20\) and \(\sigma_2\approx 15\). Determine the sample size and allocation in each of the three scenarios: general optimal allocation, Neyman allocation, and proportional allocation (in the last two cases, use \(c_1=c_2=100\$\)).

Solution: in the general case, we have \[\begin{aligned} \frac{N_1\sigma_1}{\sqrt{c_1}}&={\frac{21123(20)}{\sqrt{400}}=21123,} \quad \frac{N_2\sigma_2}{\sqrt{c_2}}={\frac{16321(15)}{\sqrt{100}}=24481.5,} \\ N_1\sigma_1\sqrt{c_1}&={21123(20)\sqrt{400}=8449200,}\quad N_2\sigma_2\sqrt{c_2}={16321(15)\sqrt{100}=2448150}\\ N_1\sigma_1^2&={21123(20)^2=8449200,} \quad N_2\sigma_2^2={16321(15)^2=3672225,} \\ \sum_{i=1}^2\frac{N_i\sigma_i}{\sqrt{c_i}}&={45604.5,} \quad \sum_{i=1}^2N_i\sigma_i\sqrt{c_i}={10897350,} \quad \sum_{i=1}^2N_i\sigma_i^2={12121425,}\\ D_{\mu;\text{StS}}&={\frac{5^2}{4}=6.25,}\quad n={\frac{(10897350)(45604.5)}{(37444)^2(6.25)+12121425}=56.63\approx 57}\\ n_1&={57\left(\frac{21123}{45604.5}\right)=26.4\approx 26},\quad n_2={57\left(\frac{24481.5}{45604.5}\right)=30.6\approx 31.}\end{aligned}\]

If instead the response cost per stratum is constant (\(c_1=c_2=100\)), we have: \[\begin{aligned} N_1\sigma_1&={21123(20)=422460,} \quad N_2\sigma_2={16321(15)=244815,} \\ N_1\sigma_1^2&={21123(20)^2=8449200,} \quad N_2\sigma_2^2={16321(15)^2=3672225,} \\ \sum_{i=1}^2N_i\sigma_i&={667275,} \quad \sum_{i=1}^2N_i\sigma_i^2={12121425,}\\ D_{\mu;\text{StS}}&={\frac{5^2}{4}=6.25,}\quad n={\frac{(667275)^2}{(37444)^2(6.25)+12121425}=50.74\approx 51}\\ n_1&={51\left(\frac{422460}{667275}\right)=32.30\approx 32},\quad n_2={51\left(\frac{244815}{667275}\right)=18.71\approx 19.}\end{aligned}\] It turns out that the exact value of \(c_1=c_2\) does not come into play.

If we look for a proportional allocation, we still have \[\begin{aligned} N_1\sigma_1&={21123(20)=422460,} \quad N_2\sigma_2={16321(15)=244815,} \\ N_1\sigma_1^2&={21123(20)^2=8449200,} \quad N_2\sigma_2^2={16321(15)^2=3672225,} \\ \sum_{i=1}^2N_i\sigma_i&={667275,} \quad \sum_{i=1}^2N_i\sigma_i^2={12121425,}\\ D_{\mu;\text{StS}}&={\frac{5^2}{4}=6.25,}\quad n={\frac{12121425}{37444(6.25)+\frac{12121425}{37444}}=51.72\approx 52}\\ n_1&={52\left(\frac{21123}{37444}\right)=29.33\approx 29},\quad n_2={52\left(\frac{16321}{37444}\right)=22.67\approx 23.}\end{aligned}\] The exact value of \(c_1=c_2\) still does not come into play. \(\blacksquare\)

Sample Size, Given a Budget

In practice, however, it is often budgetary considerations that play the most important role in sample size selection.

In a StS of size \(n\), the stratum sampling weights are \(w_i=\frac{n_i}{n}\), for \(1\leq i\leq M\). In this case, we seek to maximize the size \(n\) allowed by the survey budget \(C\): \[C=c_0+\sum_{i=1}^Mc_in_i=c_0+n\sum_{i=1}^Mc_iw_i \implies n={\frac{C-c_0}{\displaystyle{\sum_{i=1}^Mc_iw_i}}}.\]

In a general optimal allocation scenario, we have \[w_i= \displaystyle{\frac{N_i\sigma_i}{\sqrt{c_i}}}\left(\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}}\right)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M,\] from which we see that \[c_iw_i=c_i\cdot \frac{N_i\sigma_i}{\sqrt{c_i}}\left(\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}}\right)^{-1}\!\!\!\!\!\!\! =N_i\sigma_i\sqrt{c_i}\left(\displaystyle{\sum_{k=1}^M\frac{N_k\sigma_k}{\sqrt{c_k}}}\right)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M;\] the sample size is then \[\begin{aligned} n_{\text{StS}}&={(C-c_0)\left(\displaystyle{\sum_{i=1}^M\frac{N_i\sigma_i}{\sqrt{c_i}}}\right)\left(\displaystyle{\sum_{i=1}^MN_i\sigma_i\sqrt{c_i}}\right)^{-1}}\!\!\!\!\!\! .\end{aligned}\]

In a Neyman allocation or proportional allocation scenario, the sample weights are \[w_i= \displaystyle{N_i\sigma_i}\Big(\displaystyle{\sum_{k=1}^MN_k\sigma_k}\Big)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M,\] from which we see that \[c_iw_i=c\cdot{N_i\sigma_i}\Big(\displaystyle{\sum_{k=1}^M{N_k\sigma_k}}\Big)^{-1}\!\!\!\!\!\!\! , \quad 1\leq i\leq M;\] the sample size is then \[\begin{aligned} n_{\text{StS}}&={(C-c_0)\left(\displaystyle{\sum_{i=1}^MN_i\sigma_i}\right)\left(c\displaystyle{\sum_{i=1}^MN_i\sigma_i}\right)^{-1}\!\!\!\!\!\! =\frac{C-c_0}{c}}.\end{aligned}\]

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\), separated in two disjoint strata \(\mathcal{U}_1\) and \(\mathcal{U}_2\), of respective sizes \(N_1=21,123\) and \(N_2=16,321\). We seek to estimate the mean \(\mu\) of \(\mathcal{U}\) using a StS. The budget for the study is \(C=20,000\$\), minus \(c_0=4,000\$\) for overhead costs. The cost of a response in each stratum are \(c_1=400\$\) and \(c_2=100\$\), respectively. In a preliminary study, we estimate \(\sigma_1=20\) and \(\sigma_2=15\). Determine the sample size and allocation in each of the three scenarios: general optimal allocation, Neyman allocation, and proportional allocation (in the last two cases, use \(c_1=c_2=100\$\)).

Solution: in the general case, we have \[\begin{aligned} \frac{N_1\sigma_1}{\sqrt{c_1}}&={\frac{21123(20)}{\sqrt{400}}=21123,} \quad \frac{N_2\sigma_2}{\sqrt{c_2}}={\frac{16321(15)}{\sqrt{100}}=24481.5,} \\ N_1\sigma_1\sqrt{c_1}&={21123(20)\sqrt{400}=8449200, }\\ N_2\sigma_2\sqrt{c_2}&={16321(15)\sqrt{100}=2448150}\\ \frac{N_1\sigma_1}{\sqrt{c_1}}+\frac{N_2\sigma_2}{\sqrt{c_2}}&={21123+ 24481.5=45604.5,} \\ N_1\sigma_1\sqrt{c_1}+N_2\sigma_2\sqrt{c_2}&={8449200+2448150=10897350,} \\ n&={(20000-4000)\Big(\frac{45604.5}{10897350}\Big)=66.96\approx 66,}\\ n_1={66\left(\frac{21123}{45604.5}\right)}&{=30.56\approx 31},\quad n_2={66\left(\frac{24481.5}{45604.5}\right)=35.43\approx 35.}\end{aligned}\]

If the response cost per stratum is constant (\(c_1=c_2=100\)): \[\begin{aligned} N_1\sigma_1&={21123(20)=422460,} \quad N_2\sigma_2={16321(15)=244815,} \\ N_1\sigma_1+N_2\sigma_2&={422460+ 244815=667275,} \\ n&={\frac{20000-4000}{100}=160,}\\ n_1={160\left(\frac{422460}{667275}\right)}&{=101.3\approx 101},\quad n_2={160\left(\frac{244815}{667275}\right)=58.7\approx 59.}\end{aligned}\]

If we also assume that the variances are equal in the 2 strata, the sample size remains \(n={160}\), but the proportional allocation yields \[\begin{aligned} n_1&={160\left(\frac{21123}{37444}\right)=90.25\approx 90}\quad \text{and}\quad n_2={160\left(\frac{16321}{37444}\right)=69.74\approx 70.}\quad \blacksquare\end{aligned}\]

5.4.3 Comparison Between SRS and StS

Let \(\mathcal{U}=\{u_1,\ldots,u_N\}\) have mean \(\mu\) and variance \(\sigma^2\).

Using a SRS of size \(n\), we can construct the estimator \(\overline{y}_{\text{SRS}}\), with sampling variance \[\text{V}(\overline{y}_{\text{SRS}})={\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)}.\] We have studied the properties of such estimators in section 5.3.

If \(\mathcal{U}\) can be split into \(M\) strata \[\begin{aligned} \mathcal{U}_1&=\{u_{1,1},\ldots, u_{1,N_1}\}, \ \cdots \ , \ \mathcal{U}_M=\{u_{M,1},\ldots, u_{M,N_M}\},\end{aligned}\] with mean and variance \(\mu_i\) and \(\sigma_i^2\), respectively, for \(1\leq i\leq M\).

Using a StS of size \(n=(n_1,\ldots,n_M)\), we can construct the estimator \(\overline{y}_{\text{StS}}\), with sampling variance \[\text{V}(\overline{y}_{\text{StS}})={\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}.\]

Both samples have the same size; is there any way to determine which of the two approaches is preferable before computing the confidence intervals? In general, the sample design for which the sampling variance of the corresponding estimator is smallest is preferred.62

If \(N\gg n\) and \(N_i\gg n_i\) for all \(1\leq i\leq M\), then \({N-n\approx N-1}\) and \(N_i-n_1\approx N_i-1\) for all \(1\leq i\leq M\). Consequently, \[\text{V}(\overline{y}_{\text{SRS}})\approx {\frac{\sigma^2}{n}=\frac{1}{nN}\sum_{i=1}^M\sum_{j=1}^{N_i}(u_{i,j}-\mu)^2}\quad \text{and}\quad \text{V}(\overline{y}_{\text{StS}})\approx {\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2}{n_i}}.\]

In a proportional allocation scenario, \(n_i={n\cdot \frac{N_i}{N}}\) for all \(1\leq i\leq M\), from which we see that \[\text{V}(\overline{y}_{\text{StS}})_{\text{Prop}}\approx{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2\cdot N}{nN_i}=\frac{1}{nN}\sum_{i=1}^MN_i\sigma_i^2}.\]

In a Neyman allocation scenario, \(n_i={n\cdot \frac{N_i\sigma_i}{N_1\sigma_1+\cdots+N_M\sigma_M}}\) for all \(1\leq i \leq M\), from which we see that \[\text{V}(\overline{y}_{\text{StS}})_{\text{Neyman}}\approx{\frac{1}{N^2}\sum_{i=1}^MN_i^2\cdot \frac{\sigma_i^2 \left(\sum_{k=1}^MN_{k}\sigma_k\right)}{nN_i\sigma_i}=\frac{1}{nN^2}\left(\sum_{i=1}^MN_i\sigma_i\right)^2}.\]

But \[\begin{aligned} \text{V}(\overline{y}_{\text{SRS}})&\approx {\frac{1}{nN}\sum_{i=1}^M\sum_{j=1}^{N_i}(u_{i,j}-\mu)^2=\frac{1}{nN}\sum_{i=1}^M\sum_{j=1}^{N_i}(u_{i,j}-\mu_i+\mu_i-\mu)^2} \\ &={\frac{1}{nN}\sum_{i=1}^M\sum_{j=1}^{N_i}\left\{(u_{i,j}-\mu_i)^2+2(u_{i,j}-\mu_i)(\mu_i-\mu)+(\mu_i-\mu)^2\right\}}\\ &={\frac{1}{nN}\left\{\sum_{i=1}^M\underbrace{\strut \sum_{j=1}^{N_i}(u_{i,j}-\mu_i)^2}_{N_i\sigma_i^2}+2\sum_{i=1}^M(\mu_i-\mu)\underbrace{\strut \sum_{j=1}^{N_i}(u_{i,j}-\mu_i)}_{N_i\mu_i-N_i\mu_i=0}+\sum_{i=1}^M(\mu_i-\mu)^2\underbrace{\strut \sum_{j=1}^{N_i}1}_{N_i}\right\}} \\ &={\frac{1}{nN}\left\{\sum_{i=1}^MN_i\sigma_i^2+\sum_{i=1}^MN_i(\mu_i-\mu)^2\right\}=\text{V}(\overline{y}_{\text{StS}})_{\text{Prop}}+\frac{1}{nN}\sum_{i=1}^MN_i(\mu_i-\mu)^2}.\end{aligned}\]

As such, \[\text{V}(\overline{y}_{\text{SRS}})\gg \text{V}(\overline{y}_{\text{StS}})_{\text{Prop}}, \quad \text{whenever } {\frac{1}{nN}\sum_{i=1}^MN_i(\mu_i-\mu)^2\gg 0};\] a StS under proportional allocation is substantially preferable to a SRS when the variance of the stratum means is high.

Similarly, set \[\overline{\sigma}=\frac{1}{N}\sum_{i=1}^MN_i\sigma_i={\sqrt{n\text{V}(\overline{y}_{\text{StS}})_{\text{Neyman}}}}.\] As such, \[\begin{aligned} \text{V}(\overline{y}_{\text{StS}})_{\text{Prop}}-\text{V}(\overline{y}_{\text{StS}})_{\text{Neyman}}&={\frac{1}{nN}\sum_{i=1}^MN_i\sigma_i^2-\frac{\overline{\sigma}^2}{n}=\frac{1}{nN}\left\{\sum_{i=1}^MN_i\sigma_i^2-N\overline{\sigma}^2\right\}}\\ &={\frac{1}{nN}\sum_{i=1}^MN_i(\sigma_i^2-2\sigma_i\overline{\sigma}+\overline{\sigma}^2)=\frac{1}{nN}\sum_{i=1}^MN_i(\sigma_i-\overline{\sigma})^2\geq 0;} \end{aligned}\] a StS under Neyman allocation is substantially preferable to a StS under proportional allocation when the variance of the stratum standard deviations is high.

Combining these, we can conclude that a StS under Neyman allocation is substantially preferable to a SRS when stratum means and standard deviations vary greatly across strata.

Since in practice there are other considerations at play (sampling cost, etc.), one may still decide in favor of a SRS or a StS under proportional allocation, especially if the difference in the corresponding variances is (relatively) small.