5.3 Simple Random Sampling

Let \(\mathcal{U}\) be a population composed of \(N\) units, whose responses are \[\mathcal{U}=\{u_1,\ldots,u_N\}. \] Suppose we are interested in the mean \(\mu\) of this target population \(\mathcal{U}\), where \[\mu=\frac{1}{N}\sum_{j=1}^Nu_j.\] Since the population is of finite size, it is possible to compute \(\mu\) directly… at least, in theory.

In practice, we rarely have access to the response values for the entire population \(\mathcal{U}\), which leads us to use sampling methods.

A sample \(\mathcal{Y}\) of size \(n\) is a subset of the target population \(\mathcal{U}\), \[\mathcal{Y}=\{y_1,\ldots,y_n\}\subseteq \{u_1,\ldots,u_N\}=\mathcal{U}, \] from which we can approximate \(\mu\) using the sample mean \[\overline{y}=\frac{1}{n}\sum_{i=1}^ny_i\] (note that this is not the only estimator of \(\mu\)).

A simple random sample (SRS) of size \(n\) is obtained by randomly selecting \(n\) units from the target population, one at a time, without replacement. In Figure 5.5, a SRS of size \(n=16\) is selected from a population of size \(N=64\).


Figure 5.5: Schematics of SRS: target population (left) and sample (right).

At each stage of the sampling procedure, all units not yet in the sample have the same probability of being added to the sample. In an SRS, each subset of \(n\) units has the same probability of being selected.

How do we choose a random sample? This used to be done “by hand”, using tables of random numbers. Nowadays, we use software (SAS, R, etc.) to obtain (pseudo-)random samples.
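
For instance, a minimal sketch in R, drawing the unit labels for an SRS like the one in Figure 5.5 (\(n=16\) units from \(N=64\)):

set.seed(0) # for replicability
sample(1:64, 16, replace=FALSE) # labels of the sampled units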

Example: What is the average life span, by country, in 2011?

Solution: We use the data available in the Gapminder dataset.

library(dplyr)
gapminder <- read.csv("Data/gapminder_SS.csv", stringsAsFactors=TRUE)
#gapminder <- gapminder[complete.cases(gapminder),]
gapminder <- gapminder[,c("country","year","region","continent",
                          "population","infant_mortality","fertility","gdp",
                          "life_expectancy")]

The structure is provided below:

str(gapminder)
'data.frame':   10545 obs. of  9 variables:
 $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
 $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
 $ population      : int  1636054 11124892 5270844 54681 20619075 1867396 54208 10292328 7065525 3897889 ...
 $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
 $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
 $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
 $ life_expectancy : num  62.9 47.5 36 63 65.4 ...

A now-famous chart displays the relationship between four of the variables for 2011:


Figure 5.6: Health and wealth of nations for the 2011 Gapminder data.

We extract the information of interest:

library(tidyverse)
gapminder.SRS <- gapminder |> 
  filter(year==2011) |>
  select(life_expectancy)
str(gapminder.SRS)
'data.frame':   185 obs. of  1 variable:
 $ life_expectancy: num  77.4 76.1 58.1 75.9 76 ...

In this specific example, we know the true average life expectancy per country in 2011 (at least, for the \(N=185\) countries in the dataset):

summary(gapminder.SRS)
life_expectancy
Min. :46.70
1st Qu.:65.30
Median :73.70
Mean :71.18
3rd Qu.:77.40
Max. :83.02

The distribution of the population \(\mathcal{U}=\{u_1,\ldots,u_{185}\}\) is shown below:

ggplot(data=gapminder.SRS, aes(life_expectancy)) +  
  geom_rug() + 
  geom_histogram(col="black", fill="blue", alpha=.2, 
                 breaks=seq(45, 85, by = 2.5)) +  
  geom_vline(xintercept=mean(gapminder.SRS$life_expectancy),
             color="red") 

Figure 5.7: Distribution of the country life expectancies (in 2011), with mean value (in red).

Let us select a random sample of size \(n=10\). The indices are given by:

set.seed(1234) # for replicability
N = dim(gapminder.SRS)[1]
n = 10
(sample.ind = sample(1:N,n, replace=FALSE))
 [1]  28  80 150 101 111 137 133 166 144 132

The sample \(\mathcal{Y}=\{y_1,\ldots, y_{10}\}\) is obtained via:

(gapminder.SRS.n = gapminder.SRS[sample.ind,])
 [1] 67.60 67.70 76.10 79.97 75.70 79.70 70.20 59.60 78.90 78.50

The empirical mean of this sample is:

(y.bar = mean(gapminder.SRS.n))
[1] 73.397

which is to say that \[\overline{y}={\frac{67.6+67.7+\cdots +78.5}{10}=73.40}.\]

But a different sample may lead to a different estimate. Case in point, consider the following:

set.seed(12345) # replicability
(sample.ind = sample(1:N,n, replace=FALSE))
(gapminder.SRS.n = gapminder.SRS[sample.ind,])
 [1] 142  51 152  58  93  75  96   2  86 180
 [1] 71.0 74.3 63.0 81.6 65.0 75.0 46.7 76.1 78.1 74.8

The empirical mean of this second sample is:

(y.bar = mean(gapminder.SRS.n))
[1] 70.56

It is quite reasonable for the estimates to be different – since each \(y_i\) in a SRS is a random variable, so is the mean \(\overline{y}\).

The sampling variability explains how the estimates vary with the sample. For example, if we prepare \(m=500\) samples, each of size \(n=10\), we could obtain the empirical means below:

set.seed(12) 
N=dim(gapminder.SRS)[1]
n=10
m=500
means <- c()
for(k in 1:m){
 means[k] <- mean(gapminder.SRS[sample(1:N,n, replace=FALSE),])
 }

ggplot(data=data.frame(means), aes(means)) + 
    geom_histogram(aes(y =..density..), 
                   breaks=seq(60, 80, by = 1), 
                   col="black", fill="blue", alpha=.2) + 
    geom_density(col=2) + geom_rug(aes(means)) 

There is some variability, of course, but the sample means seem to congregate around the 72 mark:

summary(means)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  63.03   69.83   71.53   71.44   73.05   78.86 

5.3.1 Basic Notions

The population variance \(\sigma^2\) is a measure of dispersion, i.e., the tendency of the response values to deviate from the population mean \(\mu\): \[\begin{aligned} \sigma^2&=\frac{1}{N}\sum_{j=1}^N(u_j-\mu)^2=\frac{1}{N}\sum_{j=1}^N(u_j^2-2u_j\mu+\mu^2) \\ &=\frac{1}{N}\left(\sum_{j=1}^Nu_j^2-2\mu \sum_{j=1}^Nu_j+N\mu^2\right)=\frac{1}{N}\left(\sum_{j=1}^Nu_j^2- 2N\mu^2+N\mu^2\right) \\ &=\frac{1}{N}\left(\sum_{j=1}^Nu_j^2-N\mu^2\right)=\frac{1}{N}\sum_{j=1}^Nu_j^2-\mu^2.\end{aligned}\]
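
As a quick sketch in R (with the responses stored in a numeric vector u; note that R's var() divides by \(n-1\), so we compute the population variance directly with mean()):

u <- c(2, 0, 1, 5)         # a small toy population (N=4)
mu <- mean(u)              # population mean: 2
sigma2 <- mean((u - mu)^2) # population variance (1/N)*sum((u_j-mu)^2): 3.5
mean(u^2) - mu^2           # equivalent form: 3.5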

The parameters \(\mu\) and \(\sigma^2\) can be interpreted in terms of the expectation and variance of a random variable.

Let \(X\) be a discrete random variable whose probability mass function (p.m.f.) is \(f(x)=P(X=x)\). Thus, \[\begin{aligned} \text{E}[X]&=\sum_x xf(x),\quad \text{V}[X]=\sum_x (x-\text{E}[X])^2f(x),\quad\text{and}\quad \text{SD}[X]=\sqrt{\text{V}[X]}.\end{aligned}\]

For a sample of size \(n=1\) from this population, whose value is represented by the random variable \(Y_1\), we have \(f(u_j)=P(Y_1=u_j)={\frac{1}{N}}\) for \(j=1,\ldots,N\), from which \[\text{E}[Y_1]={\sum_{j=1}^N u_j f(u_j)}={\frac{1}{N}\sum_{j=1}^Nu_j=\mu},\] and \[\begin{aligned} \text{V}[Y_1]&={\sum_{j=1}^N(u_j-\mu)^2f(u_j)=\frac{1}{N}\sum_{j=1}^Nu_j^2-\mu^2=\sigma^2}, \quad \text{SD}[Y_1]={\sqrt{\text{V}[Y_1]}=\sigma}.\end{aligned}\]

In general, however, the estimator \(\overline{y}\) of the population mean \(\mu\) is computed using more than one observation – different sample sizes \(n\) could yield different values of \(\overline{y}\). In order to control the sampling error associated with an SRS, one needs to know the distribution of \(\overline{Y}\); in particular, \(\text{E}[\overline{Y}]\) and \(\text{V}[\overline{Y}]\).

If \(y_1,\ldots,y_n\) are independent and identically distributed (i.i.d.) random variables, the central limit theorem (CLT) implies that \[\overline{Y}\sim_{\text{approx.}}\mathcal{N}(\mu,\sigma^2/n).\]

Example: Consider a finite population with \(N=4\) elements: \[u_1=2,\quad u_2=0,\quad u_3=1,\quad u_4=5.\] The population mean and variance are, respectively, \[\mu={\frac{1}{4}(2+0+1+5)=2\quad\text{and}\quad \sigma^2= \frac{1}{4}(2^2+0^2+1^2+5^2)-2^2=\frac{7}{2}}.\]

Suppose that we draw an SRS of size \(n=3\) without replacement from this population in order to approximate (estimate) the true mean \(\mu\). There are \({\binom{4}{3}=4}\) such samples:

Sample Values \(\overline{y}\) \(P(\overline{Y}=\overline{y})\)
\(u_1,u_2,u_3\) \(2,0,1\) \(1\) \(1/4\)
\(u_1,u_2,u_4\) \(2,0,5\) \(7/3\) \(1/4\)
\(u_1,u_3,u_4\) \(2,1,5\) \(8/3\) \(1/4\)
\(u_2,u_3,u_4\) \(0,1,5\) \(2\) \(1/4\)

Then \[\begin{aligned} \text{E}[\overline{Y}]&=\sum_{\overline{y}}\overline{y}P(\overline{Y}=\overline{y})={\textstyle{\frac{1}{4}\left(1+\frac{7}{3}+\frac{8}{3}+2\right)}=2=\mu} \\ \text{V}[\overline{Y}]&=\sum_{\overline{y}}\overline{y}^2P(\overline{Y}=\overline{y})- \text{E}^2[\overline{Y}]={\textstyle{\frac{1}{4}\left(1^2+\left(\frac{7}{3}\right)^2+\left(\frac{8}{3}\right)^2+2^2\right)-2^2=\frac{7}{18}}}. \end{aligned}\]

This is all great… except that \({\text{V}[\overline{Y}]\neq\frac{\sigma^2}{n}=\frac{7}{6}}\). What is going on? \(\blacksquare\)

Here’s how we can explain this discrepancy. Let \(\mathcal{U}=\{u_1,\ldots,u_N\}\) be a finite population of size \(N\). An SRS \(\mathcal{Y}=\{y_1,\ldots,y_n\}\) of size \(n\) is drawn from \(\mathcal{U}\) without replacement. Let \(Y_i\) be the random variable representing the value of the \(i\)th unit of the sample, for \(i=1,\ldots,n\).

All \(Y_i\) have identical distributions: for any \(u_j\in \mathcal{U}\), we have \[\begin{aligned} P(Y_1=u_j)&=\frac{1}{N}, \\ P(Y_2=u_j)&= \frac{P(Y_2=u_j\mid Y_1\neq u_j)\cdot P(Y_1\neq u_j)}{P(Y_1\neq u_j\mid Y_2=u_j)}=\frac{\frac{1}{N-1}\cdot \frac{N-1}{N}}{1}=\frac{1}{N}, \\ P(Y_3=u_j)&= \frac{P(Y_3=u_j \mid Y_1,Y_2\neq u_j)\cdot P(Y_1,Y_2\neq u_j)}{P(Y_1,Y_2\neq u_j\mid Y_3=u_j)}= \frac{\frac{1}{N-2}\cdot \frac{N-2}{N-1}\cdot \frac{N-1}{N}}{1}=\frac{1}{N}, \end{aligned}\] and so on: \[P(Y_i=u_j)=\frac{1}{N}\] for any \(1\leq i \leq n\), \(1 \leq j \leq N\), and so \(\text{E}[Y_i]=\mu\) and \(\text{V}[Y_i]=\sigma^2\) for any \(i\).

Thus, in the preceding example, we would have \[\text{E}[Y_1]=\text{E}[Y_2]=\text{E}[Y_3]=\mu=2\quad\text{and}\quad \text{V}[Y_1]=\text{V}[Y_2]=\text{V}[Y_3]=\sigma^2=\frac{7}{2}.\]

But the \(\{Y_i\}\) are not independent of each other since (for example) \[\text{E}[\overline{Y}]=\mu=2, \quad \text{but}\quad \text{V}[\overline{Y}]=\text{V}\left[\textstyle{\frac{Y_1+Y_2+Y_3}{3}}\right]=\frac{7}{18}\neq \frac{\sigma^2}{3}=\frac{7/2}{3}=\frac{7}{6}. \] It is in the variance that we observe a difference. The covariance between two (discrete) random variables \(X_1\), \(X_2\) is a measure of the strength of association between \(X_1\) and \(X_2\). If \(\text{E}[X_i]=\mu_i\) and \(\text{V}[X_i]=\sigma^2_i<\infty\) for all \(i\), then \[\text{Cov}[X_1,X_2]= {\text{E}[(X_1-\mu_1)(X_2-\mu_2)]=\text{E}[X_1X_2]-\mu_1\mu_2}.\]

If \(X_1, X_2\) both take values in \(\mathcal{U}=\{u_1,\ldots,u_N\}\), then their joint expectation is \[\text{E}[X_1X_2]={\sum_{j=1}^N\sum_{k=1}^Nu_ju_kP(X_1=u_j,X_2=u_k)}.\] In the case where \(X_1=Y_i\) and \(X_2=Y_{\ell}\) (with the interpretation given before) for \(1\leq i\neq\ell\leq n\), we get \[P(Y_i=u_j, Y_{\ell}=u_k)=P(Y_i=u_j)P(Y_{\ell}=u_k \mid Y_i=u_j)=\begin{cases}\frac{1}{N}\cdot \frac{1}{N-1} & \text{if $j\neq k$} \\ 0 & \text{if $j= k$}\end{cases}\]

But \(\text{E}[Y_i]=\text{E}[Y_{\ell}]=\mu\), hence \[\begin{aligned} \text{Cov}(Y_i,Y_{\ell})&=\begin{cases} \frac{1}{N(N-1)}\Big[\displaystyle{\sum_{j=1 }^N\sum_{k=1}^Nu_ju_k}-\underbrace{\strut \sum_{m=1}^Nu_m^2}_{\text{double counting}}\Big] -\mu^2 & \text{if $i\neq \ell$} \\ \sigma^2 & \text{if $i=\ell$ (by convention)} \end{cases}\end{aligned}\]

We use the properties \(\sum u_{\xi} ={N\mu}\) and \(\sum u_{\xi}^2 = {N(\mu^2+\sigma^2)}\) to simplify the expression when \(i\neq \ell\): \[\begin{aligned} \text{Cov}(Y_i,Y_{\ell})&={\frac{1}{N(N-1)}\Big[\sum_{j=1 }^N\sum_{k=1}^Nu_ju_k-\sum_{m=1}^Nu_m^2 -N(N-1)\mu^2\Big]} \\ &={\frac{1}{N(N-1)}\Big[\sum_{j=1}^Nu_j\Big(\sum_{k=1}^Nu_k\Big)-N(\sigma^2+\mu^2)-N(N-1)\mu^2\Big]} \\ &={\frac{1}{N(N-1)}\Big[N\mu\sum_{j=1}^Nu_j-N\sigma^2-N\mu^2-N^2\mu^2+N\mu^2\Big]} \\ &={\frac{1}{N(N-1)}\Big[N\mu\cdot N\mu-N\sigma^2-N^2\mu^2\Big]}=-\frac{\sigma^2}{N-1}.\end{aligned}\]

Using the formulas of the previous section, we thus obtain \[\begin{aligned} \text{E}[\overline{Y}]&={\text{E}\Big[\frac{Y_1+\cdots+Y_n}{n}\Big]=\frac{1}{n}\text{E}[Y_1+\cdots+Y_n]=\frac{1}{n}\Big(\text{E}[Y_1]+\cdots+\text{E}[Y_n]\Big)} \\ &={\frac{1}{n}(\underbrace{\strut \mu+\cdots+\mu}_{\text{$n$ times}})}=\mu, \quad \text{and}\\ \text{V}[\overline{Y}]&={\text{V}\Big[\frac{Y_1+\cdots+Y_n}{n}\Big]=\frac{1}{n^2}\text{V}[Y_1+\cdots+Y_n]=\frac{1}{n^2}\sum_{i=1}^n\sum_{\ell=1}^n\text{Cov}(Y_i,Y_{\ell})} \\ &={\frac{1}{n^2}\Big[\sum_{i=1}^n\sigma^2+2\sum_{i=1}^n\sum_{\ell=i+1}^n\text{Cov}(Y_i,Y_\ell)\Big]=\frac{1}{n^2}\Big[n\sigma^2-n(n-1)\frac{\sigma^2}{N-1}\Big]} \\ &= {\frac{\sigma^2}{n}\left(1-\frac{n-1}{N-1}\right)}=\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big).\end{aligned}\]

Let’s go back to the above example: we have \(N=4\), \(n=3\), \(\mu=2\), and \(\sigma^2=\frac{7}{2}\). According to what we have just found, we indeed get \[\text{E}[\overline{Y}]= {2}\qquad \text{and} \qquad \text{V}[\overline{Y}]={\frac{7/2}{3}\left(\frac{4-3}{4-1}\right)=\frac{7}{18}}.\]
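
We can double-check these values by brute force in R, enumerating all \(\binom{4}{3}=4\) samples (a quick sketch):

u <- c(2, 0, 1, 5)
samples <- combn(u, 3)          # each column is one possible SRS of size n=3
y.bar <- colMeans(samples)      # the possible sample means: 1, 7/3, 8/3, 2
mean(y.bar)                     # E[Y.bar] = 2 = mu
mean(y.bar^2) - mean(y.bar)^2   # V[Y.bar] = 7/18 ~ 0.389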

The component \(\frac{N-n}{N-1}\) is the finite population correction factor (FPCF); it shows up because the population is not infinite. Since the SRS is constructed without replacing the units in the finite population after they have been drawn into the sample, the presence of a unit in the SRS affects the probability that another unit will also be in the SRS – the random variables \(Y_i\) are not independent. When \(N\) is “large” and the ratio \(\frac{n}{N}\) is “small”, the FPCF \(\approx 1\), in which case the situation is very similar to sampling with replacement.

5.3.2 Estimators and Confidence Intervals

The estimator \(\overline{y}\) is unbiased under SRS. In that case, how do we interpret the sampling variance \(\text{V}(\overline{y})\)? Quite simply, it provides an idea of the typical distance between the empirical mean \(\overline{y}\) and the population mean \(\mu\).

The mean square error of \(\overline{y}\) under SRS is \[\begin{aligned} \text{MSE}(\overline{y})&={\text{V}(\overline{y})+\left(\text{E}(\overline{y})-\mu\right)^2= \text{V}(\overline{y})+0=\text{V}(\overline{y})},\end{aligned}\] which is to say that the estimation error is entirely dominated by \(\text{V}(\overline{y})\).

When we sample with replacement (\(\neg\)SRS), the observations \(y_1,\ldots,y_n\) are considered to be independent from one another. If they are also identically distributed, we then have \(\text{E}(y_i)=\mu\) and \(\text{V}(y_i)=\sigma^2\) for \(1\leq i\leq n\), which is to say that \[\text{E}(\overline{y})={\mu},\quad\text{and}\quad \text{V}(\overline{y})={\frac{\sigma^2}{n}}.\]

When \(n\to \infty\), the central limit theorem states that \(\overline{y}\sim{_{\text{approx.}}\mathcal{N}(\mu,\sigma^2/n)}\), whence \[Z=\frac{\overline{y}-\mu}{\text{SD}(\overline{y})}={\frac{\overline{y}-\mu}{\sigma/\sqrt{n}}\sim_{\text{approx.}} \mathcal{N}(0,1)}.\]

Let \(\alpha\in (0,1)\). Denote the \((1-\alpha)^{\text{th}}\) quantile of a standard normal random variable \(Z\sim \mathcal{N}(0,1)\) by \(z_{\alpha}>0\). According to the frequentist interpretation of probability, we can expect that \(\frac{\overline{y}-\mu}{\sigma/\sqrt{n}}\) will fall in the interval \((-z_{\alpha/2},z_{\alpha/2})\) roughly \(100(1-\alpha)\%\) of the time: \[P(-z_{\alpha/2}\leq Z\leq z_{\alpha/2})=P\left(-z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\leq \overline{y}-\mu\leq z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right)\approx 1-\alpha.\]

The quantity \[B_{\alpha}=z_{\alpha/2}\frac{\sigma}{\sqrt{n}}=z_{\alpha/2}\text{SD}(\overline{y})\] is the bound on the error of estimation, and we can build an approximate \(100(1-\alpha)\%\) confidence interval for the mean \(\mu\): \[\text{C.I.}(\mu;100(1-\alpha)\%):\quad {\overline{y}\pm B_{\alpha} = \overline{y}\pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}}.\]

However, in an SRS scenario, we are NOT dealing with i.i.d. random variables. How must this argument be modified when we sample without replacement from a finite population?

Context

We will illustrate the important concepts of sampling theory with the help of the 2011 Gapminder dataset, as we had done at the start of the section. In addition to the average life expectancy, we will also be interested in the:

  • total population of the planet,

  • average population per country, and

  • proportion of countries with a population of less than 10M.

The population of 185 countries is available – it ranges from \(56,441\) to \(1,348,174,478\), with an average value of \(\mu=37,080,426\).

gapminder.SRS <- gapminder |> 
  filter(year==2011) |>
  select(life_expectancy,population)
str(gapminder.SRS)
'data.frame':   185 obs. of  2 variables:
 $ life_expectancy: num  77.4 76.1 58.1 75.9 76 ...
 $ population     : int  2886010 36717132 21942296 88152 41655616 2967984 101936 22542371 8423559 9227512 ...
summary(gapminder.SRS)
 life_expectancy    population       
 Min.   :46.70   Min.   :5.644e+04  
 1st Qu.:65.30   1st Qu.:2.064e+06  
 Median :73.70   Median :7.563e+06  
 Mean   :71.18   Mean   :3.708e+07  
 3rd Qu.:77.40   3rd Qu.:2.423e+07  
 Max.   :83.02   Max.   :1.348e+09  
ggplot(data=gapminder.SRS, aes(population)) +  
  geom_rug() + 
  geom_vline(xintercept=mean(gapminder.SRS$population),
             color="red") + 
  geom_histogram(col="black", fill="blue", alpha=.2) 

Figure 5.8: Distribution of the country populations (in 2011), with mean value (in red).

The population distribution by country is asymmetric, with a tail that spreads to the right, and two outliers (China and India). These observations will sometimes be removed from the data set.

gapminder.SRS.2 <- gapminder |> 
  filter(year==2011) |>
  select(life_expectancy,population) |> 
  filter(population<500000000)
nrow(gapminder.SRS.2)
[1] 183
summary(gapminder.SRS.2$population)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    56441   2061342   7355231  23301958  22242334 312390368 
ggplot(data=gapminder.SRS.2, aes(population)) +  
  geom_rug() + 
  geom_vline(xintercept=mean(gapminder.SRS.2$population),
             color="red") + 
  geom_histogram(col="black", fill="blue", alpha=.2) 

Figure 5.9: Distribution of the country populations (in 2011), with mean value (in red), without China and India.

The associated distribution has the same shape as the one with all countries, but the 183 populations now all fall at or below \(312,390,368\), with a mean value of \(\mu=23,301,958\).

Estimating the Mean \(\mu\)

In an SRS, we have shown that the empirical mean \(\overline{y}\) computed from a sample of size \(n\) is an unbiased estimator of the mean \(\mu\) of a population of size \(N\) and variance \(\sigma^2\). We have also shown that the sampling variance of the \(\overline{y}\) estimator is \[\text{V}(\overline{y})=\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big).\]

What distribution can we expect \(\overline{y}\) to follow? Let’s go back to the example of the world population (without China and India). We produce 500 SRS samples of \(n=20\) countries from the list of \(N=183\) countries. For each sample \(1 \leq i \leq 500\), we compute the empirical mean \(\overline{y}_i\) – their distribution takes the following form:

set.seed(12) # replicability
N=dim(gapminder.SRS.2)[1]
n=20
m=500

means <- c()
for(k in 1:m){
    means[k] <- mean(gapminder.SRS.2[sample(1:N,n, replace=FALSE),2])
  }

The SRS sample means are listed below:

summary(means)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 5244486 16289930 21986525 23238867 28718720 55152022 

Their distribution (and mean) is:

ggplot(data=data.frame(means), aes(means)) + 
  geom_rug() +  
  geom_histogram(col="black", fill="blue", alpha=.2) +
  geom_vline(xintercept=mean(means), color="red") 

Although the distribution of empirical means \(\overline{y}_i\) is asymmetric with a tail spreading to the right, the density curve still resembles that of a normal distribution.

Central Limit Theorem – SRS: let \(\mathcal{U}=\{u_1,\ldots, u_N\}\) be a finite population with mean \(\mu\) and variance \(\sigma^2\), and let \(\mathcal{Y}=\{y_1,\ldots,y_n\}\subseteq \mathcal{U}\) be a simple random sample. If \(n\) and \(N-n\) are both “sufficiently large”, then \[\overline{y}\sim_{\text{approx.}}{\mathcal{N}\left(\text{E}(\overline{y}),\text{V}(\overline{y})\right)=\mathcal{N}\left(\mu,\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)\right)}.\]

In a SRS, the bound on the error of estimation and the approximate 95% C.I. are given by: \[B_{\mu}={2\sqrt{\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)}}\quad\text{and}\quad P(|\overline{y}-\mu|\leq B_{\mu})\approx {P\left(\left|\textstyle{\frac{\overline{y}-\mu}{\text{SD}(\overline{y})}}\right|\leq 2\right)\approx 0.9544}.\]

In practice, the population variance \(\sigma^2\) is rarely known. We usually approximate it with the empirical variance \[s^2={\frac{1}{n-1}\sum_{i=1}^n(y_i-\overline{y})^2=\frac{1}{n-1}\Big[\sum_{i=1}^ny_i^2-n\overline{y}^2\Big]},\quad \{y_i\}\text{ i.i.d.}\]

Unfortunately, \(s^2\) is a biased estimator of \(\sigma^2\) when the simple random sample is selected without replacement from a finite population. Indeed, \[\begin{aligned} \text{E}(s^2)&=\text{E}\left[\frac{1}{n-1}\sum_{i=1}^n(y_i-\overline{y})^2\right] \\ &= \text{E}\left[\frac{1}{n-1}\sum_{i=1}^n(y_i-\mu+\mu-\overline{y})^2\right] \\ &={\text{E}\left[\frac{1}{n-1}\Big[\sum_{i=1}^n(y_i-\mu)^2-n(\overline{y}-\mu)^2\Big]\right]} \\ &=\frac{1}{n-1}\Big[\sum_{i=1}^n\text{E}\left[(y_i-\mu)^2\right]-n\text{E}\left[(\overline{y}-\mu)^2\right]\Big] \\ &=\frac{1}{n-1}\Big[\sum_{i=1}^n\sigma^2-n\text{V}(\overline{y})\Big] \\ &={\frac{1}{n-1}\Big[n\sigma^2-n\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)\Big]=\frac{\sigma^2}{n-1}\Big[n-\frac{N-n}{N-1}\Big]} \\ &={\frac{\sigma^2}{n-1}\Big[\frac{nN-n-N+n}{N-1}\Big]=\frac{\sigma^2}{n-1}\cdot\frac{N(n-1)}{N-1}}=\frac{N}{N-1}\sigma^2. \end{aligned}\]

The unbiased estimator of \(\sigma^2\) in the SRS context is instead \[\frac{N-1}{N}s^2\] since \[\text{E}\left[\frac{N-1}{N}s^2\right]={\frac{N-1}{N}\text{E}(s^2)=\frac{N-1}{N}\cdot \frac{N}{N-1}\sigma^2=\sigma^2}.\]

We can approximate the sampling variance by replacing \(\sigma^2\) by \(\frac{N-1}{N}s^2\) in the expression for \(\text{V}(\overline{y})\): \[\hat{\text{V}}(\overline{y})={\frac{N-1}{N}\cdot\frac{s^2}{n}\left(\frac{N-n}{N-1}\right)=\frac{s^2}{n}\left(\frac{N-n}{N}\right)}=\frac{s^2}{n}\left(1-\frac{n}{N}\right).\]

The bound on the error of estimation is thus approximated by \[B_{\mu}\approx \hat{B}_{\mu}=2\sqrt{\hat{\text{V}}(\overline{y})}={2\sqrt{\frac{s^2}{n}\Big(1-\frac{n}{N}\Big)}},\] from which we conclude that \[\text{C.I.}(\mu;0.95): \quad {\overline{y}\pm 2 \sqrt{\frac{s^2}{n}\Big(1-\frac{n}{N}\Big)}}\] is an approximate 95% confidence interval for \(\mu\).
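
These formulas are easily wrapped into a short R helper (a sketch; the function name ci.srs.mean is ours, not part of any package):

ci.srs.mean <- function(y, N){ # y: the SRS observations; N: the population size
  n <- length(y)
  B.hat <- 2*sqrt(var(y)/n*(1-n/N)) # approximate bound on the error of estimation
  c(estimate=mean(y), lower=mean(y)-B.hat, upper=mean(y)+B.hat)
}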

If the population variance \(\sigma^2\) is known, the FPCF is \(\frac{N-n}{N-1}\); if it is unknown, the FPCF is \(1-\frac{n}{N}\). In practice, when the sampling rate \(\frac{n}{N}\) is below 5%, we can easily drop the FPCF \(({1-\frac{n}{N}\approx 1})\) without affecting the resulting quantities too greatly.

Example: we draw a SRS sample \(\mathcal{Y}\) of size \(n=132\) from a finite population \(\mathcal{U}\) with \(N=37,444\) units. Let the sample mean and sample standard deviation be \(\overline{y}=111.3\) and \(s=16.35\), respectively. Find an approximate 95% C.I. for the population average \(\mu\).

Solution: the bound on the error of estimation is roughly \[\hat{B}_{\mu}=2\sqrt{\hat{\text{V}}(\overline{y})}=2\sqrt{\frac{16.35^2}{132}\Big(1-\frac{132}{37444}\Big)}\approx 2.8,\] which implies that \[\text{C.I.}(\mu;0.95)\approx 111.3\pm 2.8;\] the outcome is basically the same without the FPCF. \(\blacksquare\)
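
The arithmetic is easy to verify in R:

n <- 132; N <- 37444; y.bar <- 111.3; s <- 16.35
(B.hat <- 2*sqrt(s^2/n*(1-n/N))) # ~ 2.84
y.bar + c(-1,1)*B.hat            # ~ (108.5, 114.1)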

Example: find an approximate 95% C.I. for the average population per country in 2011 (excluding China and India) with a SRS of size \(n=20\).

Solution: we draw such a SRS sample and compute its sample mean \(\overline{y}\) and sample variance \(s^2\) (the outcomes will of course vary from one sample to another).

set.seed(12) # replicability
N = dim(gapminder.SRS.2)[1]
n = 20
SRS = gapminder.SRS.2[sample(1:N,n, replace=FALSE),2]
(y.bar = mean(SRS))
(s.2 = var(SRS))
[1] 35217143
[1] 5.492071e+15

If we do not know the true population variance, the bound \(\hat{B}_{\mu}\) is given by:

(B.hat = 2*sqrt(s.2/n*(1-n/N)))
[1] 31278890

and the corresponding approximate 95% C.I. for \(\mu\) is:

(IC.hat = c(y.bar-B.hat,y.bar+B.hat))
[1]  3938253 66496034

We can compare with the true mean \(\mu\):

(mu = mean(gapminder.SRS.2[,2]))
[1] 23301958

Sure enough, \(\mu\) is in the confidence interval:

mu > IC.hat[1] & mu < IC.hat[2]
[1] TRUE

In this case, however, we also knew the population variance \(\sigma^2\):

(sigma.2 = var(gapminder.SRS.2[,2]))
[1] 1.885224e+15

The bound \(B_{\mu}\) can then be obtained via:

(B = 2*sqrt(sigma.2/n*(N-n)/(N-1)))
[1] 18376153

The corresponding approximate 95% C.I. for \(\mu\) is thus:

(IC = c(y.bar-B,y.bar+B))
[1] 16840990 53593296

Sure enough, \(\mu\) is in the confidence interval:

mu > IC[1] & mu < IC[2]
[1] TRUE

In both cases, the true mean \(\mu=23,301,958\) is contained in the confidence interval. We also notice that the C.I. when the variance \(\sigma^2\) is known is contained in the 95% C.I. when the variance is not known. \(\blacksquare\)

In this case, the true mean was in the confidence interval. But it could be that the 95% C.I. constructed from a sample does not contain the mean \(\mu\).

Example: we repeat this procedure \(m=1000\) times (with different samples each time). If the CLT for SRS applies, how many times would we expect \(\mu\) to be in the approximate 95% C.I. built from the simple random samples? Assume that \(\sigma^2\) is not known.

m = 1000
mu.in.IC = c()
y.bar = c()
for(j in 1:m){
    test = gapminder.SRS.2[sample(1:N,n, replace=FALSE),2]
    s.2 = var(test)
    B.hat = 2*sqrt(s.2/n*(1-n/N))
    y.bar[j] = mean(test)
    mu.in.IC[j] = y.bar[j]-B.hat < mu & mu < y.bar[j]+B.hat
  }
mean(mu.in.IC)
[1] 0.821

This is not the \(\approx\) 95% we expected; but if we increase the sample size, the proportion gets closer to 95% (see Exercises). The long tail of the population distribution for \(N=183\) units probably plays a role – the distribution of the sample means \(\overline{y}\) (with \(m=1000\) samples of size \(n=20\)) does not appear to be normal.

ggplot(data=data.frame(y.bar), aes(y.bar)) + 
  geom_rug() +  
  geom_histogram(col="black", fill="blue", alpha=.2) +
  geom_vline(xintercept=mean(y.bar), color="red") 

Estimating the Total \(\tau\)

Most of the work has already been done: since the total \(\tau\) can be re-written as \[\tau=\sum_{j=1}^Nu_j=N\mu,\] we can approximate \(\tau\) with a SRS through the formula \[\hat{\tau}={N\overline{y}=\frac{N}{n}\sum_{i=1}^ny_i}.\] This estimator is unbiased since its expectation is \[\text{E}(\hat{\tau})={\text{E}(N\overline{y})=N\cdot \text{E}(\overline{y})=N\mu=\tau}.\]

Its sampling variance is given by \[\text{V}(\hat{\tau})={\text{V}(N\overline{y})=N^2\cdot \text{V}(\overline{y})=N^2\cdot \frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)};\] the bound on the estimation error is thus \[B_{\tau}=2\sqrt{\text{V}(\hat{\tau})}={2\sqrt{N^2\cdot \frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)}=N\cdot B_{\mu}}.\]

Since we do not usually know the true population variance \(\sigma^2\) of \(\mathcal{U}\), we approximate it by substituting the sample variance \(s^2\) for \(\sigma^2\), multiplied by the correction factor \({\frac{N-1}{N}}\) that removes its bias. We can thus provide an approximation of the sampling variance using \[\hat{\text{V}}(\hat{\tau})={\hat{\text{V}}(N\overline{y})=N^2\cdot \frac{s^2}{n}\Big(1-\frac{n}{N}\Big)};\] this yields an approximate bound on the estimation error of \[B_{\tau}\approx\hat{B}_{\tau}=2\sqrt{\hat{\text{V}}(\hat{\tau})}={2\sqrt{N^2\cdot \frac{s^2}{n}\Big(1-\frac{n}{N}\Big)}=N\cdot \hat{B}_{\mu}},\] and an approximate 95% C.I. for \(\tau\): \[\text{C.I.}(\tau;0.95):\quad {\hat{\tau}\pm 2 \sqrt{N^2\cdot \frac{s^2}{n}\Big(1-\frac{n}{N}\Big)}}.\]

Example: Consider a sample \(\mathcal{Y}\) of size \(n=132\) drawn from a finite population \(\mathcal{U}\) of size \(N=37,444\). Suppose the empirical mean and standard deviation of the sample are \(\overline{y}=111.3\) and \(s=16.35\), respectively. Give an approximate 95% C.I. for the total \(\tau\) in \(\mathcal{U}\).

Solution: the approximate bound on the error of estimation \[\hat{B}_{\tau}=2\sqrt{N^2\cdot \hat{\text{V}}(\overline{y})}=2\sqrt{37444^2\cdot \frac{16.35^2}{132}\Big(1-\frac{132}{37444}\Big)}\approx 106,383.9643,\] which yields \[\text{C.I.}(\tau;0.95)\approx 37,444 \cdot 111.3\pm 106,383.9643=4,167,517.2 \pm 106,384.0,\] or simply \((4,061,133.2;4,273,901.2)\). \(\blacksquare\)
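
Again, the arithmetic can be verified in R:

n <- 132; N <- 37444; y.bar <- 111.3; s <- 16.35
(B.tau <- 2*N*sqrt(s^2/n*(1-n/N))) # ~ 106384
N*y.bar + c(-1,1)*B.tau            # ~ (4061133, 4273901)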

Example: find an approximate 95% C.I. for the population of the planet in 2011 (excluding China and India), using a SRS of size \(n=20\), assuming that \[\overline{y}=27,396,632\quad\text{and}\quad \text{C.I.}(\mu;0.95)\equiv (6,755,099 ; 48,038,164).\]

Solution: we have \(\hat{B}_{\mu}\approx { 48,038,164-27,396,632=20,641,532}\) and \[\hat{B}_{\tau}\approx N\hat{B}_{\mu}={183\cdot 20,641,532 =3,777,400,356},\] from which we conclude that \[\text{C.I.}(\tau;0.95):\quad N\overline{y}\pm \hat{B}_{\tau}={183(27,396,632)\pm 3,777,400,356},\] or simply, \(\text{C.I.}(\tau;0.95)\equiv {(1,236,183,300;8,790,984,012)}\). \(\blacksquare\)

Estimating a Proportion \(p\)

In a population \(\mathcal{U}\) where \(u_j \in \{0,1\}\) represents a binary response for all \(1\leq j\leq N\) (for example, \(u_j=1\) when the corresponding unit possesses a certain characteristic, and \(u_j=0\) when it does not), the mean takes a particular interpretation: \[p=\mu=\frac{1}{N}\sum_{j=1}^Nu_j\] is the proportion of the units possessing the characteristic in question.

This proportion can be estimated with an SRS via: \[\hat{p}=\overline{y}={\frac{1}{n}\sum_{i=1}^ny_i,\quad y_i\in \{0,1\}}.\]

It is an unbiased estimator of the proportion since its expectation is \[\text{E}(\hat{p})={\text{E}(\overline{y})=\mu=p};\] its sampling variance is \[\text{V}(\hat{p})={\text{V}(\overline{y})= \frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)}.\] But \(U^2=U\) when \(U\) is a binary response, from which we see that \[\begin{aligned} \sigma^2&={\text{E}[U^2]-\text{E}^2[U]=\text{E}[U]-\text{E}^2[U]=p-p^2=p(1-p),}\end{aligned}\] and so \[\text{V}(\hat{p})={\frac{p(1-p)}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

The bound on the error of estimation is thus \[B_{p}=2\sqrt{\text{V}(\hat{p})}={2\sqrt{\frac{p(1-p)}{n}\Big(\frac{N-n}{N-1}\Big) }}.\] When the population variance \(\sigma^2\) is unknown (which is to say, when the true \(p\) is unknown, which is usually the case), the sampling variation approximation is \[\hat{\text{V}}(\hat{p})={\hat{\text{V}}(\overline{y})=\frac{s^2}{n}\Big(1-\frac{n}{N}\Big)}.\]

But recall that \(y_i\) only takes on the values \(0\) and \(1\), so that \(y_i^2=y_i\) for \(1\leq i \leq n\), from which we see that \[\begin{aligned} s^2&=\frac{1}{n-1}\Big(\sum_{i=1}^ny_i^2 - n\overline{y}^2\Big)={\frac{n\overline{y}-n\overline{y}^2}{n-1}=\frac{n(\hat{p}-\hat{p}^2)}{n-1}=\frac{n\hat{p}(1-\hat{p})}{n-1}},\end{aligned}\] and \[\hat{\text{V}}(\hat{p})={\frac{n\hat{p}(1-\hat{p})}{(n-1)n}\Big(1-\frac{n}{N}\Big)=\frac{\hat{p}(1-\hat{p})}{n-1}\Big(1-\frac{n}{N}\Big)}.\]

The approximate estimation error bound becomes \[B_{p}\approx\hat{B}_{p}=2\sqrt{\hat{\text{V}}(\hat{p})}={2\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\Big(1-\frac{n}{N}\Big)}},\] with the corresponding approximate 95% C.I. for \(p\) being \[\text{C.I.}(p;0.95):\quad {\hat{p}\pm 2\sqrt{\frac{\hat{p}(1-\hat{p})}{n-1}\Big(1-\frac{n}{N}\Big)}}.\]

Example: consider a sample \(\mathcal{Y}\) of size \(n=132\) drawn from a finite population \(\mathcal{U}\) of size \(N=37,444\). Suppose that \(25\) of the observations of \(\mathcal{Y}\) have a particular characteristic. Find an approximate 95% C.I. for the proportion \(p\) of the observations of \(\mathcal{U}\) that possess the feature.

Solution: In this case, \(\hat{p}=25/132\approx 0.19\). The required approximate bound is thus \[\hat{B}_p=2\sqrt{\hat{\text{V}}(\hat{p})}=2\sqrt{\frac{0.19(1-0.19)}{132-1}\Big(1-\frac{132}{37444}\Big)}\approx 0.0684,\] from which we get \[\text{C.I.}(p;0.95)\approx 0.19 \pm 0.0684 \equiv (0.121,0.258).\quad \blacksquare\]
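
Once more, a quick check of the arithmetic in R:

n <- 132; N <- 37444; p.hat <- 25/132
(B.p <- 2*sqrt(p.hat*(1-p.hat)/(n-1)*(1-n/N))) # ~ 0.068
p.hat + c(-1,1)*B.p                            # ~ (0.121, 0.258)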

Example: find an approximate 95% C.I. for the proportion of countries for which the 2011 population fell below 10M, using a SRS with sample size \(n=20\).

Solution: let’s draw an SRS sample of size \(n=20\) and compute \(\hat{p}\) (results will of course vary from one sample to the next). The indicator vector thresh.10 below is TRUE when the population of a country is smaller than 10M, and FALSE otherwise.

set.seed(1234) # replicability 
N=dim(gapminder.SRS.2)[1]
n=20
thresh.10 <- gapminder.SRS.2[,2] < 10000000
SRS = thresh.10[sample(1:N,n, replace=FALSE)]

The proportion of countries with a population smaller than 10M in that sample is:

(p.hat = mean(SRS))
[1] 0.6

The true proportion \(p\), amongst the \(N=183\) countries (excluding China and India), is:

(p = mean(thresh.10))
[1] 0.5737705

If we assume that the population variance is unknown, the bound \(\hat{B}_p\) and the approximate 95% C.I. are given by:

(B.p = 2*sqrt(p.hat*(1-p.hat)/(n-1)*(1-n/N)))
(IC = c(p.hat-B.p,p.hat+B.p))
[1] 0.2121422
[1] 0.3878578 0.8121422

The true proportion \(p\approx 0.574\) is indeed in the confidence interval. If we repeat this process \(m=1000\) times, how often is the true proportion found inside the obtained C.I.?

m=1000
p.in.IC = c()
p.hat = c()
for(j in 1:m){
    p.hat[j] = mean(thresh.10[sample(1:N,n, replace=FALSE)])
    B.p = 2*sqrt(p.hat[j]*(1-p.hat[j])/(n-1)*(1-n/N))
    p.in.IC[j] = p.hat[j]-B.p < p & p < p.hat[j]+B.p
  }
mean(p.in.IC)
[1] 0.963

Quite close to 95%, you will agree. The distribution of the \(m=1000\) estimates \(\hat{p}\) is shown below, with the true proportion (red vertical line).

ggplot(data=data.frame(p.hat), aes(p.hat)) + 
    geom_histogram(bins=21, col="black", fill="blue", alpha=.2) + 
    geom_vline(xintercept=mean(gapminder.SRS.2[,2]<10000000), color="red") +
  xlim(0,1)

5.3.3 Sample Size

Selecting an appropriate sample size is a challenge, and there is a bit of a chicken-and-egg scenario at play.

Firstly, there is a practical problem associated with sampling: since obtaining each response can be costly (in terms of time and money), we often seek to minimize the size of the realized sample \(\mathcal{Y}\), given a desired error bound.

Secondly, the SRS error bound is expressed as \[B_{\xi}=2\sqrt{\text{V}(\hat{\xi})}, \quad \xi \in\{\mu,\tau,p\},\] but the variance depends on the sample size \(|\mathcal{Y}|=n\). We must then express \(n\) in terms of the (known) parameters \(N\), \(\sigma^2\), and \(B_{\xi}\).

Mean \(\mu\)

If we are trying to estimate the mean \(\mu\), we have: \[\begin{aligned} B_{\mu}=2\sqrt{\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{\mu}}{4}}_{=D_{\mu}}=\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)} \\ & \Longleftrightarrow {\frac{(N-1)D_{\mu}}{\sigma^2}=\frac{N-n}{n}=\frac{N}{n}-1} \\ & \Longleftrightarrow {\frac{(N-1)D_{\mu}+\sigma^2}{\sigma^2}=\frac{N}{n}} \\ & \Longleftrightarrow n_{\mu}=\frac{N\sigma^2}{(N-1)D_{\mu}+\sigma^2}.\end{aligned}\] However, we can only use this formula if we know the population variance \(\sigma^2\).

We could choose to use the empirical variance \(s^2\) of the sample \(\mathcal{Y}\) as an estimate of \(\sigma^2\), as we did when we estimated the sampling variance, but there is a catch: we haven’t drawn \(\mathcal{Y}\) from \(\mathcal{U}\) yet!

Strategies (to obtain a value for \(\sigma^2\)):

  • use a preliminary sample (not necessarily random),

  • use the empirical variance obtained in a previous study, or

  • for a proportion, use a conservative estimate (\(p=0.5\)).

Example: consider a finite population \(\mathcal{U}\) with size \(N=37,444\). We are interested in the mean \(\mu\) of the response variable in \(\mathcal{U}\). In a preliminary SRS of size \(n=132\), we computed an (empirical) standard deviation of \(s=16.35\). Using \(\sigma=s\), find the minimal SRS sample size \(n_{\mu}\) required to estimate the mean with a bound on the error of estimation at most \(B_{\mu}=1.7\).

Solution: we can use the formula directly to get \[D_{\mu}={\frac{(1.7)^2}{4}= 0.7225} \implies n_{\mu}={\frac{37444(16.35)^2}{(37444-1)(0.7225)+16.35^2}=366.39\approx 367}.\quad \blacksquare\]
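
A short R sketch of the computation (the helper n.srs.mean is ours):

n.srs.mean <- function(N, sigma2, B){ # minimal n for error bound B on mu
  D <- B^2/4
  ceiling(N*sigma2/((N-1)*D + sigma2))
}
n.srs.mean(N=37444, sigma2=16.35^2, B=1.7) # 367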

Total \(\tau\)

If instead, we are seeking to estimate the total \(\tau\), we have: \[\begin{aligned} B_{\tau}=2\sqrt{N^2\cdot \frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{\tau}}{4N^2}}_{=D_{\tau}}=\frac{\sigma^2}{n}\Big(\frac{N-n}{N-1}\Big)} \\ & \Longleftrightarrow {\frac{(N-1)D_{\tau}}{\sigma^2}=\frac{N-n}{n}=\frac{N}{n}-1} \\ & \Longleftrightarrow {\frac{(N-1)D_{\tau}+\sigma^2}{\sigma^2}=\frac{N}{n}} \\ & \Longleftrightarrow n_{\tau}=\frac{N\sigma^2}{(N-1)D_{\tau}+\sigma^2}.\end{aligned}\]

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\). We are interested in the total \(\tau\) of the response variable of \(\mathcal{U}\). In a preliminary SRS of size \(n=132\), we computed an empirical standard deviation of \(s=16.35\). Using \(\sigma=s\), find the minimal SRS sample size \(n_{\tau}\) required to estimate the total response with a bound on the error of estimation at most \(B_{\tau}=10000\).

Solution: we can use the formula directly to obtain \[D_{\tau}={\frac{(10000)^2}{4(37444)^2}\approx 0.01783} \implies n_{\tau}={\frac{37444(16.35)^2}{(37444-1)(0.01783)+16.35^2}\approx 10706}. \quad \blacksquare\]
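
The same sketch, adapted to the total (n.srs.total is ours):

n.srs.total <- function(N, sigma2, B){ # minimal n for error bound B on tau
  D <- B^2/(4*N^2)
  ceiling(N*sigma2/((N-1)*D + sigma2))
}
n.srs.total(N=37444, sigma2=16.35^2, B=10000) # 10706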

Proportion \(p\)

If we are interested in the proportion \(p\), we have: \[\begin{aligned} B_{p}=2\sqrt{\frac{p(1-p)}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{p}}{4}}_{=D_{p}}=\frac{p(1-p)}{n}\Big(\frac{N-n}{N-1}\Big)} \\ & \Longleftrightarrow {\frac{(N-1)D_{p}}{p(1-p)}=\frac{N-n}{n}=\frac{N}{n}-1} \\ & \Longleftrightarrow {\frac{(N-1)D_{p}+p(1-p)}{p(1-p)}=\frac{N}{n}} \\ & \Longleftrightarrow n_{p}=\frac{Np(1-p)}{(N-1)D_{p}+p(1-p)}.\end{aligned}\]

Example: consider a finite population \(\mathcal{U}\) of size \(N=37,444\). We are interested in the proportion \(p\) of units that have a particular feature. In a preliminary SRS of size \(n=132\), we identify \(25\) observations possessing the feature. Using the approximation \(\sigma^2=\frac{25}{132}\cdot \frac{107}{132}\) from the preliminary sample, find the minimal SRS sample size \(n_{p}\) required to estimate the true proportion with a bound on the error of estimation of at most \(B_{p}=0.03\).

Solution: we use the formula directly and obtain \[D_{p}={\frac{(0.03)^2}{4}=0.000225} \implies n_{p}={\frac{37444(0.189)(0.811)}{(37444-1)(0.000225)+(0.189)(0.811)}\approx 671}. \quad \blacksquare\]

Example: consider a situation similar to the previous example. Using the (conservative) approximation \(\sigma^2=(0.5)^2\), find the minimal SRS sample size \(n_{p}\) required to estimate the true proportion with a bound on the error of estimation of at most \(B_{p}=0.03\).

Solution: we use the formula directly and obtain \[D_{p}={\frac{(0.03)^2}{4}=0.000225} \implies n_{p}={\frac{37444(0.5)(0.5)}{(37444-1)(0.000225)+(0.5)(0.5)}\approx 1080}.\quad \blacksquare\]
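
Both computations, as an R sketch (n.srs.prop is ours):

n.srs.prop <- function(N, p, B){ # minimal n for error bound B on p
  D <- B^2/4
  ceiling(N*p*(1-p)/((N-1)*D + p*(1-p)))
}
n.srs.prop(N=37444, p=25/132, B=0.03) # preliminary estimate: 671
n.srs.prop(N=37444, p=0.5, B=0.03)    # conservative estimate: 1080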