5.7 Special Topics

We complete this introduction to survey sampling by discussing a few additional topics, some of which might be called advanced.

5.7.1 Systematic Sampling

With the advent of easy-to-access pseudo-random number generators (Excel, R, SAS, Python, etc.), it is not very arduous to draw a pseudo SRS \(\mathcal{Y}\) of size \(n\) from a population \(\mathcal{U}\) of size \(N\) (assuming that we have an appropriate sampling frame, of course).

However, it remains possible for the obtained sample not to be representative of the population: a SRS of countries that does not include China or India, for example, would not be very useful if we are trying to estimate the average population of the world’s countries. In some cases, a systematic sampling design (SyS) can be used to increase the probability that the random sample \(\mathcal{Y}\) represents the population.

Here is how we draw a \(1-\)in\(-M\) systematic sample of size \(n\) (or \(n+1\)) from an ordered list of size \(N\) (a short R sketch follows the steps):

  1. determine the integer part \(M=\lfloor \frac{N}{n}\rfloor\);

  2. randomly select an integer \(\gamma\) in \(\{1,2,\ldots, M\}\);

  3. the sample \(\mathcal{Y}\) then contains the values corresponding to units \[{\underbrace{\strut \gamma, \gamma+M,\ \gamma+2M,\ \ldots\ , \gamma + (n-1)M}_{n \text{ units}}, \ \underbrace{\strut \gamma+nM}_{\text{if }\gamma+nM\leq N}}.\]
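This procedure is easy to implement; here is a minimal R sketch (the helper name sys_sample is ours):

# draw the indices of a 1-in-M systematic sample from an ordered list of size N
sys_sample <- function(N, n) {
    M <- floor(N / n)          # step size
    gamma <- sample(1:M, 1)    # random start in {1, ..., M}
    seq(gamma, N, by = M)      # gamma, gamma + M, ... (n or n+1 indices <= N)
}

set.seed(0)  # for reproducibility
sys_sample(N = 185, n = 20)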

If the ordering of the units in the sampling frame is fixed, there can only be \(M\) different SyS samples of size \(n\) (or \(n+1\), in some cases).

Figure 5.15: Schematics of SyS: target population (left) and sample (right; \(1-\)in\(-4\)).

Example: The Gapminder dataset contains socio-economic information on \(N=185\) countries in 2011. What are the average life expectancy and population of the world’s countries?

Solution: We modify the code allowing us to access the dataset (we assume that the gapminder data frame – e.g., from the dslabs package – and the dplyr package have already been loaded):

library(dplyr)  # for filter() and select()

gapminder.SyS <- gapminder |> filter(year==2011) |>
    select(country, life_expectancy, population)
N=nrow(gapminder.SyS)

There are 185 units in the data set. If we are interested in a SyS of size \(n=20\), say, the integer \(M\) is:

n=20
(M=floor(N/n))
[1] 9

The vector of index offsets \(0,M,2M,\ldots,nM\) is therefore:

index = M*(0:n)

We construct \(M=9\) samples \(\mathcal{Y}_{i}\), \(i=1,\ldots, 9\), assuming that the units appear in alphabetical order (by country name) in the dataset.

moy.SyS.life_exp = c() # initialization - mean life expectancy per sample
moy.SyS.pop = c()      # initialization - mean population per sample

for(j in 1:M){ # all SyS of size n or n+1, alphabetical order
    index.tmp = j + index
    index.tmp <- index.tmp[index.tmp < N+1]  # keeping indices <= N
    sample.sys = gapminder.SyS[index.tmp,2:3]
    moy.SyS.life_exp[j]=mean(sample.sys$life_expectancy)
    moy.SyS.pop[j]=mean(sample.sys$population)
  }

# charts
par(mfrow=c(1,2))
plot(moy.SyS.life_exp, xlab="sample", ylab="mean life exp")
plot(moy.SyS.pop, xlab="sample", ylab="mean population")

Could you identify the sample that contains China or India? What if we change the order in which the countries are listed in the dataset?

gapminder.SyS <- gapminder.SyS[order(gapminder.SyS$population),] 

for(j in 1:M){ # all SyS of size n or n+1, population order
    index.tmp = j + index
    index.tmp <- index.tmp[index.tmp < N+1]
    sample.sys = gapminder.SyS[index.tmp,2:3]
    moy.SyS.life_exp[j]=mean(sample.sys$life_expectancy)
    moy.SyS.pop[j]=mean(sample.sys$population)
  }

par(mfrow=c(1,2))
plot(moy.SyS.life_exp, xlab="sample", ylab="mean life exp")
plot(moy.SyS.pop, xlab="sample", ylab="mean population")

We obtain similar results when ordering the units in the dataset by life expectancy. \(\blacksquare\)

In general, if there is a correlation between the position (rank) of the unit in the sampling frame and the value of the variable of interest, the sampling variance of the SyS estimator will be lower than that of the SRS estimator (because the sample is more likely to be representative of the population).

If there is no such correlation, the SyS sample is essentially an SRS sample, and the sampling variances are comparable (a SyS is as likely to be representative of the population as an SRS).

Finally, if the step \(M\) is aligned with the periodicity of the values of the variable of interest, it is the opposite: the sampling variance of a SyS is larger than that of an SRS (a SyS is then less representative of the population than an SRS). Some examples illustrating these situations are shown in Figure 5.16.


Figure 5.16: Various populations and systematic samplings: the order in which the population observations are presented may affect the representativity of the SyS sample.

SyS as SRS

If the order in which the units are listed in the sampling frame is random (careful! this is not always easy to demonstrate), we can simply consider that the sample \[\mathcal{Y}_{\text{SyS}}=\underbrace{\{y_1,y_2,y_3,\ldots, y_{n-1},y_n\}}_{{\{u_{\gamma},u_{\gamma+M},\ldots, u_{\gamma+(n-1)M}\}}}\subseteq \mathcal{U}\] of size \(n\approx \frac{N}{M}\) is in fact a SRS of size \(n\). In that case, the theory developed in Section 5.3 for SRS remains valid.

Estimating the Mean \(\mu\)

The mean \[\overline{y}_{\text{SyS}}={\frac{1}{n}\sum_{i=1}^ny_i}\] is an unbiased estimator of the true population mean \(\mu\), with bound on the error of estimation \[B_{\mu;\text{SyS}}\approx \hat{B}_{\mu;\text{SyS}}= 2\sqrt{\hat{\text{V}}(\overline{y}_{\text{SyS}})}={2\sqrt{\frac{s_{\text{SyS}}^2}{n}\Big(1-\frac{n}{N}\Big)}},\ \text{where }s^2_{\text{SyS}}={\frac{1}{n-1}\sum_{i=1}^n(y_i-\overline{y}_{\text{SyS}})^2};\] the corresponding 95% C.I. for \(\mu\) is thus \[\text{C.I.}_{\text{SyS}}(\mu;0.95):\quad {\overline{y}_{\text{SyS}}\pm \hat{B}_{\mu;\text{SyS}}}.\]
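For instance, treating the \(\gamma=1\) gapminder sample above as an SRS (on the assumption that the ordering of the frame can be regarded as random with respect to life expectancy), the estimate and its bound can be computed as follows:

# 95% C.I. for the mean life expectancy, from the 1-in-9 SyS sample with gamma = 1
y <- gapminder.SyS$life_expectancy[seq(1, N, by = M)]
n.sys <- length(y)                                   # 21 units here
ybar.sys <- mean(y)                                  # point estimate
B.hat <- 2 * sqrt(var(y) / n.sys * (1 - n.sys / N))  # bound on the error
c(ybar.sys - B.hat, ybar.sys + B.hat)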

Estimating the Total \(\tau\)

The quantity \[\hat{\tau}_{\text{SyS}}=N\overline{y}_{\text{SyS}}={\frac{N}{n}\sum_{i=1}^ny_i}\] is an unbiased estimator of the true population total \(\tau\), with bound on the error of estimation \[B_{\tau;\text{SyS}}\approx \hat{B}_{\tau;\text{SyS}}= 2N\sqrt{\hat{\text{V}}(\overline{y}_{\text{SyS}})}={2N\sqrt{\frac{s_{\text{SyS}}^2}{n}\Big(1-\frac{n}{N}\Big)}};\] the corresponding 95% C.I. for \(\tau\) is thus \[\text{C.I.}_{\text{SyS}}(\tau;0.95):\quad {\hat{\tau}_{\text{SyS}}\pm \hat{B}_{\tau;\text{SyS}}}.\]

Estimating the Proportion \(p\)

If \(y_i\in \{0,1\}\) denotes the absence or presence of a certain characteristic, the quantity \[\hat{p}_{\text{SyS}}={\overline{y}_{\text{SyS}}}\] is an unbiased estimator of the true proportion \(p\) of units with the characteristic, with bound on the error of estimation \[B_{p;\text{SyS}}\approx \hat{B}_{p;\text{SyS}}= 2\sqrt{\hat{\text{V}}(\hat{p}_{\text{SyS}})}={2\sqrt{\frac{\hat{p}_{\text{SyS}}(1-\hat{p}_{\text{SyS}})}{n-1}\Big(1-\frac{n}{N}\Big)}};\] the corresponding 95% C.I. for \(p\) is thus \[\text{C.I.}_{\text{SyS}}(p;0.95):\quad {\hat{p}_{\text{SyS}}\pm \hat{B}_{p;\text{SyS}}}.\]

SyS as ClS

In practice, SyS is equivalent to a ClS in which a single cluster (\(m=1\)) is drawn, each of the \(M\) possible \(1-\)in\(-M\) SyS samples playing the role of a cluster.

The quantity \[\overline{y}_G={\frac{\displaystyle{\sum_{k=1}^m\sum_{j=1}^{N_{i_k}}y_{i_k,j}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}=\frac{\displaystyle{\sum_{k=1}^my_{i_k}}}{\displaystyle{\sum_{k=1}^mN_{i_k}}}},\] where we use the ClS notation, is thus a biased estimator of the population mean, \(\mu\).

If the average cluster size is denoted by \(\overline{N}=\frac{N}{M}\), its sampling variance is \[\begin{aligned} \text{V}(\overline{y}_G)&\approx {\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(\frac{M-m}{M-1}\Big)\cdot \frac{1}{M}\sum_{\ell=1}^M(\underbrace{\strut \tau_{\ell}-\mu N_{\ell}}_{=N_{\ell}(\mu_{\ell}-\mu)})^2:=\frac{1}{\overline{N}^2}\cdot\frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)},\end{aligned}\] and the corresponding 95% C.I. for \(\mu\) is thus \[\text{C.I.}_{\text{G}}(\mu;0.95):\quad {\overline{y}_G\pm 2\sqrt{\text{V}(\overline{y}_G)}}.\]

If the average cluster size \(\overline{N}\) is unknown, we simply substitute it by \[\overline{n}={\frac{1}{m}\sum_{k=1}^mN_{i_k}}.\]

The estimator of the population total \(\tau\) is thus either:

  • \(N\overline{y}_G\), when the number of units \(N\) in the population is known, or

  • \(M\overline{y}_T\), where \(\overline{y}_T\) is the (empirical) mean of the sampled cluster totals, when only \(M\) is known.

Consequently, the sampling variances are \[\text{V}(N\overline{y}_G)\approx {M^2\cdot\frac{\sigma_G^2}{m}\Big(\frac{M-m}{M-1}\Big)}\quad\text{and}\quad \text{V}(M\overline{y}_T)\approx {M^2\cdot\frac{\sigma_T^2}{m}\Big(\frac{M-m}{M-1}\Big)},\] where \(\sigma_G^2\) and \(\sigma_T^2\) are computed as for a ClS. We can then construct the 95% C.I. for \(\tau\) in the usual manner.

Pretty simple, eh? Not so fast: the sample contains exactly \(m=1\) cluster, so \(\overline{n}=n\). The problem does not end there – since we do not know \(\sigma_G^2\) or \(\sigma^2_T\) in general, we would use the empirical variances \[\begin{aligned} \hat{\text{V}}(\overline{y}_G) &\approx {\frac{1}{\overline{N}^2}\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)\cdot \frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{y}_GN_{i_k})^2} \\ \hat{\text{V}}(M\overline{y}_T)&\approx {M^2\cdot \frac{1}{m}\Big(1-\frac{m}{M}\Big)\cdot \frac{1}{m-1}\sum_{k=1}^m(y_{i_k}-\overline{y}_T)^2}.\end{aligned}\]

But if \(m=1\), these variances do not exist. How do we get out of this mess? If we cannot treat the SyS as if it were a SRS (for whatever reason), the solution is to draw additional SyS samples (replicates) and treat it as a ClS, modifying the value of \(M\) as necessary.

5.7.2 Sampling with Probability Proportional to Size

In practice, the size (whether or not this is a physical characteristic) of the sample units is often quite variable – a SRS is not always effective since it does not take into account the importance that larger population units may have.

Additional information on the unit size can sometimes be used to select a sample that provides a more accurate estimator of the parameters of interest.

One possible way to do this is to assign (potentially) unequal selection probabilities to the different units, based on their size.

Example: to a certain extent (\(\rho={0.46}\)), the larger the area of a country, the larger its population. If we are trying to estimate the population of the planet, it might be desirable to adopt a sampling scheme in which the probability of selecting a country is proportional to its area – in an SRS, it is very likely that neither China nor India will be selected, resulting in an underestimate of the total sought. \(\blacksquare\)

If the variable of interest is (more or less) related to the size of the unit, one can assign a probability of selection proportional to the size of the unit (PPS). Note that in a PPS, previously selected units are replaced in the population, allowing for the multiple selection of a single unit.

Selecting a PPS With Replacement

We consider two selection methods for a PPS sample:

  • cumulative totals, and

  • the Lahiri method.

In both cases, the PPS sample selection procedure consists of associating with each unit a range of numbers (these are often integers, but that is not necessary) related to the size of the unit, and taking the units that correspond to numbers chosen at random from the set of numbers associated with the entire population of \(N\) units.

In the method of cumulative totals, the size of the \(i-\)th unit is denoted by \(x_i\), \(1\leq i \leq N\). We associate a range to each unit in the following way:

Unit Range
\(1\) \(1\) to \(x_1\)
\(2\) \(x_1+1\) to \(x_1+x_2\)
\(3\) \(x_1+x_2+1\) to \(x_1+x_2+x_3\)
\(\vdots\) \(\vdots\)
\(N-1\) \(x_1+\cdots + x_{N-2}+1\) to \(x_1 + \cdots + x_{N-2} + x_{N-1}\)
\(N\) \(x_1+\cdots + x_{N-1}+1\) to \(x_1 + \cdots + x_{N-1} + x_{N}\)

Finally, we draw a PPS sample by choosing \(n\) integers at random between \(1\) and \(X=x_1 + \cdots + x_{N-1} + x_{N}\) (with replacement) and by selecting the units associated with these integers.

Example: In a village, there are 8 orchards, each containing a certain number of apple trees. A sample of \(n=3\) orchards is drawn (with replacement), in proportion to the number of apple trees per orchard.

ID \(i\) Size \(x_i\) Cumulative Totals Associated Range
\(1\) \(50\) \({50}\) \({1 - 50}\)
\(2\) \(30\) \({80}\) \({51 - 80}\)
\(3\) \(25\) \({105}\) \({81 -105}\)
\(4\) \(40\) \({145}\) \({106 -145}\)
\(5\) \(26\) \({171}\) \({146 - 171}\)
\(6\) \(44\) \({215}\) \({172 - 215}\)
\(7\) \(20\) \({235}\) \({216 - 235}\)
\(8\) \(35\) \({270}\) \({236 - 270}\)

We choose \(n=3\) integers at random between \(1\) and \(270\): \(108\), \(140\), and \(201\), for example. The associated units are the \(4\)th, the \(4\)th, and the \(6\)th. \(\blacksquare\)
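This selection is easily reproduced in R (findInterval() locates the range containing each draw):

# PPS selection via cumulative totals (with replacement)
x <- c(50, 30, 25, 40, 26, 44, 20, 35)     # apple trees per orchard
breaks <- c(0, cumsum(x))                  # 0, 50, 80, 105, ..., 270
u <- c(108, 140, 201)                      # the random draws of the example
findInterval(u, breaks, left.open = TRUE)  # associated units: 4 4 6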

In the Lahiri method, we still denote the size of a unit by \(x_i\), \(1\leq i\leq N\), but without having to calculate and report the successive cumulative totals (which can be tedious to accomplish, even with a computer).

The method consists in selecting a pair of integers \((i,j)\), where \(1\leq i\leq N\) and \(1\leq j \leq M=\max\{x_i\mid 1\leq i\leq N\}\). If \(j \leq x_i\), the \(i\)th unit is added to the sample. Otherwise, the pair \((i,j)\) is rejected. We continue in this manner until \(n\) units have been selected.
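A minimal R sketch of the Lahiri method could look as follows (the helper name lahiri.pps is ours):

# Lahiri method: accept unit i if the acceptance draw j does not exceed x_i
lahiri.pps <- function(x, n) {
    N <- length(x); M <- max(x)
    selected <- integer(0)
    while (length(selected) < n) {
        i <- sample(1:N, 1)   # candidate unit
        j <- sample(1:M, 1)   # acceptance draw
        if (j <= x[i]) selected <- c(selected, i)   # otherwise: reject (i,j)
    }
    selected
}

set.seed(42)
lahiri.pps(c(50, 30, 25, 40, 26, 44, 20, 35), n = 3)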

There are other ways to do this, of course; the important thing is to have a mechanism for selecting a PPS sample. In general, we prefer sampling without replacement to sampling with replacement, but the latter provides a reasonable substitute to the former if \({\frac{n}{N}}\) is “sufficiently small”.

Estimation

Let us revisit the orchard example, where \(u_i\) is the yield of all apple trees in the \(i\)th orchard.

ID \(i\) # Trees \(x_i\) \(\pi_i\) Yield
\(1\) \(50\) \(50/270\) \(u_1=2250\)
\(2\) \(30\) \(30/270\) \(u_2=1080\)
\(3\) \(25\) \(25/270\) \(u_3=1300\)
\(4\) \(40\) \(40/270\) \(u_4=1400\)
\(5\) \(26\) \(26/270\) \(u_5=1196\)
\(6\) \(44\) \(44/270\) \(u_6=1716\)
\(7\) \(20\) \(20/270\) \(u_7=820\)
\(8\) \(35\) \(35/270\) \(u_8=1680\)

We are interested in the total apple production of the village, which we know in this case to be \(\tau=11,442\). Since in principle an orchard with more apple trees should produce more apples, we draw a PPS sample of \(n=3\) units (with replacement), where the number of apple trees in the orchard is used as the unit size.

In what follows, we illustrate the concepts using the sample \[{y_1={u_4=1400},y_2={u_4=1400},y_3={u_6=1716}}. \] If the sample \(\mathcal{Y}\), with \(|\mathcal{Y}|=n\), is drawn from \(\mathcal{U}\) using a PPS, the units \(y_1,\ldots, y_n\) are independent and distributed according to

\(y_i\) \(u_1\) \(\cdots\) \(u_j\) \(\cdots\) \(u_N\)
\(p(y_i)\) \(\pi_1\) \(\cdots\) \(\pi_j\) \(\cdots\) \(\pi_N\)

where \({0<\pi_j<1}\) for all \(1\leq j\leq N\) and \({\pi_1+\cdots+\pi_N=1}\).

For all \(1\leq i\leq n\), there is a \(1\leq j\leq N\) such that \(y_i=u_j\). Set \(w_i=\frac{u_j}{\pi_j}\). The sampling weights \(w_i\) are also independent and distributed according to \[P(y_i=u_j)={P\Big(w_i=\frac{u_j}{\pi_j}\Big)=\pi_j},\quad 1\leq i\leq n,\ 1\leq j\leq N.\]

We note that for any \(1\leq i\leq n\), the expected weight is \[\text{E}(w_i)=\sum_{j=1}^Nw_jP(w_i=w_j)={\sum_{j=1}^N \frac{u_j}{\pi_j}\cdot \pi_j=\sum_{j=1}^Nu_j=\tau}.\]

In other words, \[\hat{\tau}_{\text{pps}}=\overline{w}={\frac{1}{n}\sum_{i=1}^nw_i}\] is an unbiased estimator of the total \(\tau\). Its sampling variance is computed as follows: \[\begin{aligned} \text{V}(\hat{\tau}_{\text{pps}})&=\text{V}\left(\frac{1}{n}\sum_{i=1}^nw_i\right)=\underbrace{\strut \frac{1}{n^2}\sum_{i=1}^n\text{V}(w_i)}_{{\text{indep.\ of the }w_i}}={\frac{1}{n^2}\sum_{i=1}^n\Big[\sum_{j=1}^N(w_j-\tau)^2P(w_i=w_j)\Big]} \\ &={\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^N\!\!\Big(\frac{u_j}{\pi_j}-\tau\Big)^2\!\!\pi_j=\frac{1}{n}\sum_{j=1}^N\!\!\Big(\frac{u_j}{\pi_j}-\tau\Big)^2\!\!\pi_j=\frac{1}{n}\sum_{j=1}^N\!\!\Big(\frac{u_j^2}{\pi_j^2}-\frac{2\tau u_j}{\pi_j}+\tau^2\Big)\pi_j} \\ &={\frac{1}{n}\Big(\sum_{j=1}^N\frac{u_j^2}{\pi_j}-2\tau \underbrace{\strut \sum_{j=1}^Nu_j}_{=\tau}+\tau^2\underbrace{\strut \sum_{j=1}^N\pi_j}_{=1} \Big)=\frac{1}{n}\Big(\sum_{j=1}^N\frac{u_j^2}{\pi_j}-\tau^2 \Big)}.\end{aligned}\]

In practice, we do not typically know the true value of \(\tau\), so we use the unbiased estimator \[\hat{\text{V}}(\hat{\tau}_{\text{pps}})={\frac{1}{n(n-1)}\left(\sum_{i=1}^nw_i^2-n\hat{\tau}_{\text{pps}}^2\right)}.\]

Central Limit Theorem – PPS: if \(n\) and \(N-n\) are sufficiently large, then \[\hat{\tau}_{\text{pps}}\sim_{\text{approx.}}{\mathcal{N}\left(\tau,\hat{\text{V}}(\hat{\tau}_{\text{pps}})\right)}.\] The bound on the error of estimation and the 95% C.I. for \(\tau\) are therefore \[\hat{B}_{\tau;\text{pps}}={2\sqrt{\hat{\text{V}}(\hat{\tau}_{\text{pps}})}}\quad\text{and}\quad \text{C.I.}_{\text{pps}}(\tau;0.95)={\hat{\tau}_{\text{pps}}\pm \hat{B}_{\tau;\text{pps}}}.\]

Example: In the orchard dataset, we have \[\begin{aligned} \hat{\tau}_{\text{pps}}&=\frac{1}{3}\Big[\underbrace{\ {\frac{1400}{40/270}}}_{w_1}+\underbrace{\strut {\frac{1400}{40/270}}}_{w_2}+\underbrace{\strut {\frac{1716}{44/270}}}_{w_3}\Big]={9810};\\ \hat{\text{V}}(\hat{\tau}_{\text{pps}})&=\frac{1}{3(2)}\Big[\Big(\underbrace{\strut {\frac{1400}{40/270}}}_{w_1}\Big)^2+ \Big(\underbrace{\strut{\frac{1400}{40/270}}}_{w_2}\Big)^2+\Big(\underbrace{\strut {\frac{1716}{44/270}}}_{w_3}\Big)^2-3\cdot\underbrace{\strut {9810}^2}_{\hat{\tau}^2_{\text{pps}}}\Big] \\ &={129,600}.\end{aligned}\] Consequently, the 95% C.I. for the total apple yield in the village is \[\text{C.I.}_{\text{pps}}(\tau;0.95)={9810\pm 2\sqrt{129,600}\equiv (9090,10530)}.\]
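These computations are easily verified in R:

# PPS estimate and 95% C.I. for the orchard sample (u4, u4, u6)
u <- c(1400, 1400, 1716)
pik <- c(40, 40, 44) / 270                        # inclusion probabilities
w <- u / pik                                      # weights: 9450 9450 10530
(tau.pps <- mean(w))                              # 9810
(V.pps <- (sum(w^2) - 3 * tau.pps^2) / (3 * 2))   # 129,600
tau.pps + c(-2, 2) * sqrt(V.pps)                  # (9090, 10530)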

The actual total yield (\(\tau=11,442\)) does not fall within the confidence interval – why might this be the case? Is this problematic? \(\blacksquare\)

In general, \(\text{V}(\hat{\tau}_{\text{pps}})\leq \text{V}(\hat{\tau}_{\text{SRS}})\). In the orchards example, we can show that \[\begin{aligned} \text{V}(\hat{\tau}_{\text{SRS}})&\approx {8^2\cdot \frac{172981.4375}{3}\Big(\frac{8-3}{8-1}\Big)=2,635,907.619}, \quad \text{and} \\ \text{V}(\hat{\tau}_{\text{pps}})&\approx {\frac{1}{3}\Big[\frac{2250^2}{50/270}+\cdots+\frac{1680^2}{35/270}-{11,442}^2\Big]=723,912 }.\end{aligned}\]

We can also give an estimate of the population average \(\mu\) using \[\hat{\mu}_{\text{pps}}={\frac{\hat{\tau}_{\text{pps}}}{N}},\quad \hat{\text{V}}(\hat{\mu}_{\text{pps}})={\frac{\hat{\text{V}}(\hat{\tau}_{\text{pps}})}{N^2}},\quad \text{C.I.}_{\text{pps}}(\mu;0.95)={\frac{\text{C.I.}_{\text{pps}}(\tau;0.95)}{N}}.\]

A lot more can be said on the topic; it usually provides a springboard to more sophisticated sampling designs and other theoretical considerations [47][49].

5.7.3 Multi-Stage Sampling

By splitting the sampling process into several stages, one can reduce costs and focus the logistical aspects of sampling on a few focal points. In multi-stage sampling (M\(n\)S), a sample of large units (primary units) is drawn, then sub-units (secondary units) are drawn from the large units, and so on.

Example: sampling units in a Canadian province could be decomposed into three steps:

  1. conduct a sample of municipalities (primary units);

  2. sample neighbourhoods in the sampled municipalities (secondary units), and

  3. sample households in the sampled neighbourhoods (tertiary units).

In a M\(n\)S, the sample is concentrated around several pivots: in field studies, for example, this has the advantage of considerably reducing the survey area, which helps to reduce non-sampling errors (in addition to reducing operational costs). In addition, detailed information is often available for groups of sample units, but not for individual units: it is therefore not necessary to obtain a complete sampling frame for all sample units, but only for those belonging to the primary units selected in the first round, for example.

Any probability sampling method can be used at each stage, and they can change from stage to stage (e.g., a municipality SRS, a neighborhood SRS, and a household SRS).

Two-Stage Simple Random Sampling

In a 2-stage process, if sampling is conducted using a SRS for both stages, the method is known as two-stage simple random sampling (SRS2S).

Example: The biomass of a plant species in a forest area can be estimated by drawing a SRS of \(m=8\) compartments (primary units) from the \(M=40\) compartments composing the population under study.

For each of these compartments \(1\leq i\leq m\), we then draw a SRS of \(n_{i}\) plots, and measure the biomass in the plot. Estimates of the average or total amount of biomass in the forest area can be calculated using appropriate formulas. \(\blacksquare\)


Figure 5.17: Schematics of SRS2S: target population (left) and sample (right).

Estimation

Consider a population consisting of \(M\) primary units, with \(N_{\ell}\) secondary units in the \(\ell\)th primary unit. Denote by \(u_{i,j}\) the value of the response variable of the \(j\)th secondary unit in the \(i\)th primary unit.

The population mean is \[\mu={\frac{\displaystyle{\sum_{\ell=1}^M}\sum_{j=1}^{N_\ell}u_{\ell,j}}{\displaystyle{\sum_{\ell=1}^MN_{\ell}}}}.\]

Suppose we draw a SRS of \(m\) primary units, and a SRS of \(n_{i}\) secondary units in the \(i\)th primary unit. The total sample size is thus \(n=n_1+\cdots+n_m\). We obtain an unbiased estimator of \(\mu\) from: \[\overline{y}_{\text{SRS2S}}={\frac{1}{m\overline{N}}\sum_{i=1}^mN_i\overline{y}_i=\frac{1}{m\overline{N}}\sum_{i=1}^m\frac{N_i}{n_i}\sum_{k=1}^{n_i}y_{i,k}=\frac{1}{N}\sum_{i=1}^m\sum_{k=1}^{n_i}\frac{MN_i}{mn_i}y_{i,k}},\] where \[\overline{N}={\frac{1}{M}\sum_{\ell=1}^MN_{\ell}\approx \frac{N_1+\cdots+N_m}{m}}\] (so that \(N=M\overline{N}\)).

The sampling variance is composed of two components:

  • a measure of the variation between the primary units, and

  • a measure of the variation within the primary units.

When \(n_i = N_i\) for all \(1\leq i\leq m\), we are dealing with a ClS and the variance is only given by the first component (see Section 5.6). In the case where \(m=M\), we are dealing with a StS and the variance is only given by the second component (see Section 5.4).

When \(m\neq M\) and \(n_i\neq N_i\) for at least one primary unit \(i\), the variance is a combination of these two extremes: in that case, the second component represents the contribution of sub-sampling (another name for M\(n\)S). We use the law of total variance to estimate the sampling variance: \[\begin{aligned} \text{V}(\overline{y}_{\text{SRS2S}})&={\text{E}[\text{V}(\overline{y}_{\text{SRS2S}}\mid m)] + \text{V}(\text{E}[\overline{y}_{\text{SRS2S}}\mid m])}\\&={\frac{1}{\overline{N}^2}\cdot\frac{\sigma_T^2}{m}\Big(\frac{M-m}{M-1}\Big)+\frac{1}{mM\overline{N}^2}\sum_{i=1}^mN_i^2\cdot\frac{\sigma_i^2}{n_i}\Big(\frac{N_i-n_i}{N_i-1}\Big)}\\ & \approx \underbrace{\frac{1}{\overline{N}^2}\cdot \frac{s_T^2}{m}\Big(1-\frac{m}{M}\Big)}_{\text{between primary units}}+\underbrace{\frac{1}{mM\overline{N}^2}\sum_{i=1}^mN_i^2\cdot \frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)}_{\text{within primary units}},\end{aligned}\] where \[\begin{aligned} s_T^2&= {\frac{1}{m-1}\sum_{i=1}^m\left(N_i\overline{y}_i-\overline{N}\overline{y}_{\text{SRS2S}}\right)^2},\quad s_{i}^2={\frac{1}{n_i-1}\sum_{k=1}^{n_i}(y_{i,k}-\overline{y}_i)^2}.\end{aligned}\]

Example: The biomass of a plant species (kg) is measured in plots of \(0.025\) ha (secondary units) selected from \(m=8\) compartments (primary units), randomly selected themselves among the \(M=40\) compartments of a forested area. The summary of results is shown in the following table:

Comp. \(1\) \(2\) \(3\) \(4\) \(5\) \(6\) \(7\) \(8\)
\(\overline{y}_i\) \(118\) \(107\) \(109\) \(110\) \(120\) \(95\) \(93\) \(90\)
\(s_{i}^2\) \(436\) \(516\) \(586\) \(456\) \(412\) \(497\) \(755\) \(496\)
\(N_i\) \(1760\) \(1975\) \(1615\) \(1785\) \(1775\) \(2050\) \(1680\) \(1865\)
\(n_i\) \(9\) \(10\) \(8\) \(9\) \(9\) \(10\) \(8\) \(9\)

Find a 95% C.I. for the average biomass per plot and per compartment, and for its total in the forested area.

Solution: Since we do not know \(\overline{N}\), we approximate it with the mean \[\overline{N}\approx \frac{1}{8}(1760 + \cdots + 1865)={1813.125}.\] The totals in the selected primary units are then:

Comp. \(1\) \(2\) \(3\) \(4\) \(5\) \(6\) \(7\) \(8\)
\(N_i\overline{y}_i (\times 10^5)\) \({2.077}\) \({2.113}\) \({1.760}\) \({1.964}\) \({2.130}\) \({1.946}\) \({1.562}\) \({1.679}\)

The SRS2S estimators of the mean \(\mu\), of the mean of the totals in the compartments, and of the total are: \[\begin{aligned} \overline{y}_{\text{SRS2S}}&=\frac{1}{8({1813.125})}({2.077}+\cdots+{1.679})\times 10^5={105.01}; \\ \overline{N}\overline{y}_{\text{SRS2S}}&={1813.125}\cdot {105.01}={190,403.75};\quad \tau_{\text{SRS2S}}=M\cdot \overline{N}\overline{y}_{\text{SRS2S}}={7,616,150}.\end{aligned}\]
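These point estimates are quickly verified in R:

# SRS2S point estimates for the biomass example
ybar.i <- c(118, 107, 109, 110, 120, 95, 93, 90)
N.i <- c(1760, 1975, 1615, 1785, 1775, 2050, 1680, 1865)
m <- 8; M <- 40
(N.bar <- mean(N.i))                              # 1813.125
(ybar.srs2s <- sum(N.i * ybar.i) / (m * N.bar))   # ~105.01
M * N.bar * ybar.srs2s                            # total: ~7,616,150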

The variance between compartments (primary units) is thus: \[\begin{aligned} s_T^2&=\frac{1}{8-1}\sum_{i=1}^8(N_i\overline{y}_i-{190,403.75})^2={4.55 \times 10^8}.\end{aligned}\] Finally, we calculate the variance within the compartments:

Comp. \(1\) \(2\) \(3\) \(4\) \(5\) \(6\) \(7\) \(8\)
\(\frac{N_i^2}{\overline{N}^2}\cdot \frac{s_i^2}{n_i}\Big(1-\frac{n_i}{N_i}\Big)\) \({48.2}\) \({51.3}\) \({72.7}\) \({50.4}\) \({45.6}\) \({49.4}\) \({93.9}\) \({54.9}\)

The sampling variance is thus \[\begin{aligned} \hat{\text{V}}(\overline{y}_{\text{SRS2S}})&= \frac{{4.55 \times 10^8}}{8({1813.125})^2}\Big(1-\frac{8}{40}\Big)+\frac{1}{8(40)}({48.2}+\cdots + {54.9})= {14.03}\end{aligned}\]

The variances of the other two estimators are easily calculated: \[\begin{aligned} \hat{\text{V}}(\overline{N}\overline{y}_{\text{SRS2S}})&=\overline{N}^2\hat{\text{V}}(\overline{y}_{\text{SRS2S}}) = ({1813.125})^2\cdot{14.03}={46,141,324.55}; \\ \hat{\text{V}}(\tau_{\text{SRS2S}})&=M^2\overline{N}^2\hat{\text{V}}(\overline{y}_{\text{SRS2S}}) = (40)^2\cdot({1813.125})^2\cdot{14.03}={73,826,119,284};\end{aligned}\] the confidence intervals are thus \[\begin{aligned} \text{C.I.}_{\text{SRS2S}}(\mu;0.95):\quad & {105.01 \pm 2\sqrt{14.03}\equiv (97.5,112.5)} \\ \text{C.I.}_{\text{SRS2S}}(\textstyle{\frac{N}{M}\mu};0.95):\quad & {190,403.75 \pm 2\sqrt{46,141,324.55}\equiv (176818.3,203989.2)} \\ \text{C.I.}_{\text{SRS2S}}(\tau;0.95):\quad & {7,616,150 \pm 2\sqrt{73,826,119,284}\equiv (7072730,8159569)}\end{aligned},\] assuming of course that the central limit theorem remains valid in the context of a SRS2S. \(\blacksquare\)

5.7.4 Multi-Phase Sampling

Multi-phase sampling (M\(n\)P) plays a crucial role in many types of surveys, including those conducted by remote sensing.

In the first phase, a selected number of units are sampled, but only a small number of characteristics are captured for each unit. In each successive phase, a larger number of features is measured on a smaller sub-sample of units.

In this way, the target parameter can be estimated with more accuracy and at lower cost, by studying the relationship between the features measured in the different sampling phases.

Two-Phase Random Sampling

A M\(n\)P with only two phases is called a two-phase sampling (M2P). M2Ps are particularly useful in a situation where enumeration of the feature of interest (or main trait) is expensive (in terms of $$$ or labor), but in which an auxiliary trait correlated to the main trait can easily be observed.

Thus, it is sometimes preferable to draw a large SRS in the first phase in order to analyze the auxiliary variables, which leads to more accurate estimates of \(\tau\) or \(\mu\) for that auxiliary variable (at least, that is the hope). In the second phase, a smaller sample is drawn, usually a sub-sample of the sample obtained in the first phase, in which both the principal characteristic and the auxiliary variable are measured.

Estimates of the main characteristic are then obtained using the information obtained in the first phase, using the ratio method or the regression method, for instance. The precision of the final estimates can be increased by including several correlated auxiliary variables, instead of just the one.

Example: If we want to estimate the total volume of wood \(\tau\) in a forest, we could first measure the circumference \(c_i\) and height \(h_i\) of the trees \(i\) in some sample, then the volume \(v_{i_k}\) of the trees \(i_k\) in a subsample. We then determine the statistical relationship between \(\tau_v\), \(\tau_c\), and \(\tau_h\), and voilà! \(\blacksquare\)

The M\(n\)P sampling method helps to reduce the cost of enumeration and increase the accuracy of estimates. It can also be used to stratify a population: an initial sample is taken based on the auxiliary characteristic, which is used to subdivide the population into strata in which the main characteristic is more or less homogeneous.

As long as the two characteristics are correlated, accurate estimates of the main characteristic are obtained from a second, relatively small sample.

M2P can also be paired with M2S, for example (or with any other sampling design). If both selection steps are performed with SRS, the method is called two-phase simple random sampling (SRS2P).

Schematics of SRS2P: target population (left) and sample (right).Schematics of SRS2P: target population (left) and sample (right).

Figure 5.18: Schematics of SRS2P: target population (left) and sample (right).

In the first phase, the population is divided into well-defined sampling units; a SRS \(\mathcal{Y}_1\) of size \(n_1\) is drawn from these units; the auxiliary variable \(x\) is measured on all units of \(\mathcal{Y}_1\). Next, a sub-SRS \(\mathcal{Y}_2\) of size \(n_2\) is drawn from \(\mathcal{Y}_1\); the main characteristic \(y\) is measured on all units of \(\mathcal{Y}_2\).

We can evaluate \(r_{\mathcal{Y}_2}\) or \(b_{\mathcal{Y}_2}\) from the observations in \(\mathcal{Y}_2\) (using either the ratio method or the regression method), which yields \[\hat{\mu}_{Y;R;\text{SRS2P}}={r_{\mathcal{Y}_2}\cdot \overline{x}_{\mathcal{Y}_1}}\quad\text{or}\quad \hat{\mu}_{Y;L;\text{SRS2P}}={\overline{y}_{\mathcal{Y}_2}+b_{\mathcal{Y}_2}(\overline{x}_{\mathcal{Y}_1}-\overline{x}_{\mathcal{Y}_2})}.\]

Estimation

Due to the double sampling, two terms contribute to the sampling variances of the estimators (the first when going from \(\mathcal{U}\) to \(\mathcal{Y}_1\), and the second from \(\mathcal{Y}_1\) to \(\mathcal{Y}_2\)): \[\begin{aligned} \hat{\text{V}}(\hat{\mu}_{Y;R;\text{SRS2P}})& = {\frac{1}{n_2}(s^2_Y-2r_{\mathcal{Y}_2}s_{XY}+(r_{\mathcal{Y}_2})^2s^2_X) +\frac{1}{n_1}(2r_{\mathcal{Y}_2}s_{XY}-(r_{\mathcal{Y}_2})^2s^2_X)} \\ \hat{\text{V}}(\hat{\mu}_{Y;L;\text{SRS2P}})& = {\frac{1}{n_2}s^2_{XY;L}+\frac{1}{n_1}(s^2_Y-s^2_{XY;L})}\end{aligned}\] where \(s^2_Y\), \(s_{XY}\), and \(s^2_X\) are the usual quantities (in \(\mathcal{Y}_2\)), \[r_{\mathcal{Y}_2}={\frac{\overline{y}_{\mathcal{Y}_2}}{\overline{x}_{\mathcal{Y}_2}}},\ b_{\mathcal{Y}_2}={\frac{s_{XY}}{s^2_X}}, \quad\text{and}\quad s^{2}_{XY;L}={\frac{n_2-1}{n_2-2}\cdot\left\{s_Y^2-b_{\mathcal{Y}_2}^2s_X^2 \right\}}.\]

Example: we are interested in the biomass of a given plant species in a region, which is divided into plots of 0.025 ha each. First, we measure the number \(x\) of groves per unit in a SRS \(\mathcal{Y}_1\) of \(n_1=200\) plots. Then, the biomass \(y\) of the plant in question is calculated in each unit of a sub-SRS \(\mathcal{Y}_2\) of \(n_2=40\) plots: \[\begin{aligned} \overline{x}_{\mathcal{Y}_1}&=374.4; \quad \sum_{i=1}^{40}x_i=15,419;\quad \sum_{i=1}^{40}y_i=2104;\\ \sum_{i=1}^{40}x_i^2&=7,744,481; \quad \sum_{i=1}^{40}x_iy_i=960,320; \quad \sum_{i=1}^{40}y_i^2=125,346. \end{aligned}\] Find a 95% C.I. for the average biomass per plot.

Solution: let us compute the required intermediate quantities: \[\begin{aligned} \overline{x}_{\mathcal{Y}_2}&={\frac{15419}{40}=385.5};\quad \overline{y}_{\mathcal{Y}_2}={\frac{2104}{40}=52.6};\quad r_{\mathcal{Y}_2}=\frac{\overline{y}_{\mathcal{Y}_2}}{\overline{x}_{\mathcal{Y}_2}}={\frac{52.6}{385.5}= 0.14};\\ s_{X}^2&={\frac{1}{39}[7744481-40(385.5)^2]\approx 46175};\ s_{Y}^2={\frac{1}{39}[125346-40(52.6)^2]\approx 376} \\ s_{XY}&={\frac{1}{39}[960320-40(385.5)(52.6)]\approx 3827.7;} \quad b_{\mathcal{Y}_2}=\frac{s_{XY}}{s_X^2}={\frac{3827.7}{46175.4}\approx 0.08};\\ s^2_{XY;L}&={\frac{39}{38}[376.3-0.08^2(46175.4)]\approx 82.9}; \end{aligned}\] which gives us \[\hat{\mu}_{Y;R;\text{SRS2P}}={0.14(374.4)\approx 51.1};\quad \hat{\mu}_{Y;L;\text{SRS2P}}={52.6+0.08(374.4-385.5)\approx 51.7}\] and \[\begin{aligned}\hat{\text{V}}(\hat{\mu}_{Y;R;\text{SRS2P}})& = {\frac{376.3-2(0.14)(3827.7)+(0.14)^246175.4}{40}} \\ &\ \ \ \ \ \ \ {+\ \frac{2(0.14)3827.7-(0.14)^246175.4}{200}\approx 5.67}; \\ \hat{\text{V}}(\hat{\mu}_{Y;L;\text{SRS2P}})& = {\frac{82.9}{40}+\frac{376.3-82.9}{200}\approx 3.54 };\end{aligned}\] from which we conclude that \[\begin{aligned} \text{C.I.}_{R;\text{SRS2P}}(\mu_Y;0.95)&= {51.1\pm 2\sqrt{5.67}\equiv (46.3,55.8)}\\ \text{C.I.}_{L;\text{SRS2P}}(\mu_Y;0.95)&= {51.7\pm 2\sqrt{3.54}\equiv (47.9,55.5)}. \quad \blacksquare\end{aligned}\]
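Note that the hand computations above use rounded intermediate values (\(r\approx 0.14\), \(b\approx 0.08\)); redoing them in R at full precision gives slightly different variance estimates:

# SRS2P ratio and regression estimates for the biomass example
n1 <- 200; n2 <- 40; xbar1 <- 374.4
xbar2 <- 15419 / n2; ybar2 <- 2104 / n2
s2X <- (7744481 - n2 * xbar2^2) / (n2 - 1)
s2Y <- (125346 - n2 * ybar2^2) / (n2 - 1)
sXY <- (960320 - n2 * xbar2 * ybar2) / (n2 - 1)
r <- ybar2 / xbar2; b <- sXY / s2X
s2L <- (n2 - 1) / (n2 - 2) * (s2Y - b^2 * s2X)
r * xbar1                                                # ratio estimate: ~51.1
ybar2 + b * (xbar1 - xbar2)                              # regression estimate: ~51.7
(s2Y - 2*r*sXY + r^2*s2X)/n2 + (2*r*sXY - r^2*s2X)/n1    # ~5.7
s2L/n2 + (s2Y - s2L)/n1                                  # ~3.1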

5.7.5 Miscellaneous

We end the module by briefly discussing a few notions that did not find a natural place in the previous sections:

  • design effects;

  • adjusting for nonresponse;

  • estimating the size of a population;

  • randomized responses, and

  • Bernoulli sampling.

Design Effect

The design effect compares the estimator for a given sampling design and for a SRS. It is the ratio of the sampling variance of the estimator under the given sampling design to the sampling variance of the estimator under a SRS (assuming samples of the same size).

This value is often applied to compare the efficiency of estimators from different sampling designs. If the ratio \(<1\), the sampling design is more efficient than SRS; if it is \(>1\), it is less efficient than SRS.
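For instance, using the two sampling variances computed for the orchard total in Section 5.7.2:

# design effect of PPS relative to SRS for the orchard total
V.pps <- 723912; V.srs <- 2635908
V.pps / V.srs   # ~0.27 < 1: the PPS design is more efficient here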

We directly compared the theoretical variances of several sampling designs in sections 5.4.3, 5.5.4, and 5.6.3, but in practice we compute the design effect using the achieved samples (assuming that they had been drawn under various sampling plans).

Design effects also help to obtain approximate variance estimates for complex sampling designs. If a design effect estimate is available from a previous survey (that used the sampling design we will be using for this survey), it can be used to determine the sample size required to meet some pre-determined condition(s).

Adjusting for Nonresponse

Non-response is a problem in all surveys. Total nonresponse (when all or almost all data from a sampled unit are missing) occurs when:

  • a sample unit refuses to participate in the survey;

  • we cannot establish contact with a sample unit;

  • the sampled unit cannot be found, or

  • the information obtained from the unit is useless/invalid.

The simplest way to deal with such nonresponse is to ignore it; in some exceptional circumstances (when the affected observations are not in any way different from those for whom we have valid and complete measurements), proportions or means that are estimated without adjusting for nonresponse are more or less the same as those produced by applying adjustment for nonresponse.

If one neglects to compensate for nonresponding units, however, the totals are generally underestimated (e.g., the size of a population, total revenue, or total acres harvested, say).

The most common way to deal with total nonresponse is to adjust the base sampling weights by assuming that the responding units represent both responding and nonresponding units. If the nonrespondents are equivalent to the respondents for the characteristics measured in the survey, this is a reasonable approach.

The base weights for nonrespondents are then redistributed among respondents, using an adjustment factor for nonrespondents that is multiplied by the base weight, to obtain an adjusted weight.

Example: if we draw a SRS of size \(n=25\) from a stratum of size \(N=1000\), the probability of inclusion of each of these units and the corresponding base weight are \[\begin{aligned} \pi &=\frac{n}{N}={\frac{25}{1000}=0.025} \\ w&=\frac{1}{\pi}={\frac{1}{0.025}=40}.\end{aligned}\] In other words, each selected unit represents \(40\) units in the stratum.

If we only get a response from \(n_r=20\) of the \(n=25\) selected units, the nonresponse adjustment factor (NRAF) and the adjusted weight (for nonresponse) become: \[\begin{aligned} \text{NRAF} &=\frac{n}{n_r}={\frac{25}{20}=1.25} \\ w_{\text{nr}}&=w\cdot \text{NRAF}={1.25(40)=50};\end{aligned}\] each responding unit then represents \(50\) units in the stratum. This adjusted weighting is what we would end up working with. \(\blacksquare\)

Of course, the adjusted weight may vary from stratum to stratum, depending on the sample design and the sample size/allocation.

When we want to determine the optimal sample size/allocation across various strata, what we obtain is the target sample size (assuming that the target and study populations coincide). We then have to resort to inflation of the sample size to achieve the target.

Example: the allocation of a StS of size \(n=29\) is found to be \((17,9,3)\). In a prior study, the nonresponse rates by stratum were determined to be \((16.2\%, 20.8\%, 31.2\%)\). Which allocation optimizes the likelihood of achieving the target allocation?

Solution: we only need to solve \[n_1(1-0.162)=17,\quad n_2(1-0.208)=9, \quad n_3(1-0.312)=3,\] which gives a practical sample allocation of \((n_1,n_2,n_3)=(20.3,11.3,4.3)\approx (21,12,5)\), and a practical sample size of \(n=38\). \(\blacksquare\)
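In R, the inflated allocation is obtained directly:

# inflating the target allocation for anticipated nonresponse
target <- c(17, 9, 3)
nr.rate <- c(0.162, 0.208, 0.312)
ceiling(target / (1 - nr.rate))   # practical allocation: 21 12 5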

Estimating a Population Size

How do we proceed if the size \(N\) of the population \(\mathcal{U}\) is unknown? When the population is large enough, we can always use the approximation \(N\approx \infty\) in the sampling variance formulas.

But sometimes it is the parameter \(N\) that represents the quantity of interest; as an example, how would we find out the number \(N\) of $5 bills in circulation?

We approach such a problem using the catch-and-release method (compare with the approach used in Module 18):

  1. we capture \(n_1\) bills at random (without replacement) from the population;

  2. we mark them and release them back into circulation;

  3. at a later time, \(n_2\) bills are captured at random (without replacement) from the population;

  4. we count the number \(X\) of marked bills, \(0< X\leq n_2\).

If we wait long enough (to let the marked bills propagate in the population, say), we obtain \[\frac{n_1}{N}\approx \frac{X}{n_2},\quad \text{from which we have }\hat{N}={\frac{n_1n_2}{X}},\] where \(X\) follows a hypergeometric distribution with parameters \(n_1,N-n_1,n_2\), and probability mass function \[\begin{aligned} P(X=x)&={\frac{\displaystyle{\binom{n_1}{x}\binom{N-n_1}{n_2-x}}}{\displaystyle{\binom{N}{n_2}}}}, \quad 0\leq x\leq n_2 \\ \mu_X=\text{E}[X]&={n_2\underbrace{\strut \left( \frac{n_1}{N}\right)}_p=n_2p},\quad \sigma_X^2=\text{V}[X]={n_2p(1-p)\left(\frac{N-n_2}{N-1}\right)}. \end{aligned}\]

If \(\frac{n_2}{N}<0.05\), we can ignore the FPCF term in the variance: \[\sigma_X^2=\text{V}[X]\approx{n_2p(1-p)}.\] We can now develop expressions for \(\text{E}[\hat{N}]\) and \(\text{V}[\hat{N}]\), using a Taylor series of order 2 near \(X\approx \mu_X=n_2p\): \[f(X)\approx {f(\mu_X)+f'(\mu_X)(X-\mu_X)+\frac{f''(\mu_X)}{2}(X-\mu_X)^2}.\] With \(\hat{N}=f(X)=\frac{n_1n_2}{X}\), this yields \[\begin{aligned} \hat{N}&\approx {\frac{n_1n_2}{\mu_X}-\frac{n_1n_2}{\mu_X^2}(X-\mu_X) +\frac{n_1n_2}{\mu_X^3}(X-\mu_X)^2} \\ &={\frac{n_1}{p}-\frac{n_1}{n_2p^2}(X-n_2p)+\frac{n_1}{n_2^2p^3}(X-n_2p)^2}.\end{aligned}\]

Consequently, \[\begin{aligned} \text{E}[\hat{N}]&={\text{E}\left[\frac{n_1n_2}{\mu_X}-\frac{n_1n_2}{\mu_X^2}(X-\mu_X) +\frac{n_1n_2}{\mu_X^3}(X-\mu_X)^2\right]} \\ &={\text{E}\left[\frac{n_1n_2}{\mu_X}\right]-\text{E}\left[\frac{n_1n_2}{\mu_X^2}(X-\mu_X)\right] +\text{E}\left[\frac{n_1n_2}{\mu_X^3}(X-\mu_X)^2\right]} \\ &={\frac{n_1n_2}{\mu_X}-\frac{n_1n_2}{\mu_X^2}(\underbrace{\strut \text{E}[X]}_{\mu_X}-\mu_X) +\frac{n_1n_2}{\mu_X^3}\text{E}\left[(X-\mu_X)^2\right] } \\ &={\frac{n_1n_2}{\mu_X}+\frac{n_1n_2}{\mu_X^3}\text{V}[X]\approx \frac{n_1}{p}+\frac{n_1}{n_2^2p^3}\cdot n_2p(1-p)=\frac{n_1}{p}+\frac{n_1}{n_2p^2}(1-p)} \\ &={\frac{n_1}{p}\left(1+\frac{1-p}{n_2p}\right)=N\left(1+\frac{1-p}{n_2p}\right).}\end{aligned}\]

Since \(\frac{1-p}{n_2p}>0\), \(\text{E}[\hat{N}]\neq N\): the estimator \(\hat{N}\) is biased, but it is asymptotically unbiased, as the bias vanishes when the sample size \(n_2\) increases.

We can provide an approximation of the variance using a Taylor series of order 1 near \(X\approx \mu_X=n_2p\): \[\begin{aligned} \hat{N}&\approx {\frac{n_1n_2}{\mu_X}-\frac{n_1n_2}{\mu_X^2}(X-\mu_X)}= {\frac{n_1}{p}\left(1-\frac{X-n_2p}{n_2p}\right)}={\frac{n_1}{p}\left(2-\frac{X}{n_2p}\right)}.\end{aligned}\]

Putting all this together, we get \[\begin{aligned} \text{V}[\hat{N}]&\approx {\text{V}\left[\frac{n_1}{p}\left(2-\frac{X}{n_2p}\right)\right]}={\frac{n_1^2}{p^2}\cdot \text{V}\left[-\frac{X}{n_2p}\right]}={\frac{n_1^2}{n_2^2p^4}\cdot\text{V}[X]}\\ &\approx {\frac{n_1^2n_2p(1-p)}{n_2^2p^4}}={\frac{n_1^2(1-p)}{n_2p^3}}.\end{aligned}\]

In practice, we do not know the true \(p\), so we use \[\hat{\text{V}}[\hat{N}]= {\frac{n_1^2(1-\hat{p})}{n_2\hat{p}^3}},\quad\text{where }\hat{p}={\frac{X}{n_2}}.\]

Central Limit Theorem – Population Size \(N\): if \(n_2\) and \(N\) are sufficiently large, we have \[\hat{N}\sim_{\text{approx.}}{\mathcal{N}\left(\text{E}[\hat{N}],\hat{\text{V}}[\hat{N}]\right)\approx \mathcal{N}\left(\frac{n_1n_2}{X},\frac{n_1^2(1-\hat{p})}{n_2\hat{p}^3}\right)},\] and the corresponding 95% C.I. for \(N\) is thus \[\text{C.I.}(N;0.95):\quad {\frac{n_1n_2}{X}\pm 2\sqrt{\frac{n_1^2(1-\hat{p})}{n_2\hat{p}^3}}}.\]

Example: say that \(n_1=500\) bills were initially captured, marked, and released; of the \(n_2=300\) bills recaptured at a later date, \(X=127\) were marked. Give a 95% C.I. for the total number of $5 bills in circulation.

Solution: the point estimate is \(\hat{N}=\frac{500\cdot 300}{127}\approx 1181.102\). We also have \(\hat{p}=\frac{X}{n_2}=\frac{127}{300}\approx 0.423\), from which we get the bound on the error of estimation \[2\sqrt{\hat{\text{V}}(\hat{N})}=2\sqrt{\frac{500^2\cdot (1-0.423)}{300\cdot (0.423)^3}}\approx 159.176,\] and \[\text{C.I.}(N;0.95):\quad 1181.102 \pm 159.176\equiv (1021.9,1340.3). \quad \blacksquare\]
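In R, keeping full precision:

# capture-recapture estimate of the number of $5 bills
n1 <- 500; n2 <- 300; X <- 127
(N.hat <- n1 * n2 / X)                               # ~1181.1
p.hat <- X / n2                                      # ~0.423
B <- 2 * sqrt(n1^2 * (1 - p.hat) / (n2 * p.hat^3))   # ~159.2
N.hat + c(-1, 1) * B                                 # ~(1021.9, 1340.3)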

Randomized Response

Let’s say we ask students whether they cheated on a test or an assignment during the pandemic. If the answer is “Yes,” we can likely conclude that it is the true answer. But since there is a social cost associated with such an answer, we can expect that some cheaters will answer “No”. What can we do to reduce the measurement error for sensitive questions?

First approach: with such questions, the skill of the interviewer plays a crucial role – this aspect should not be overlooked.

Second approach: the randomized response technique requires the use of two questions:

  • the sensitive question, and

  • an innocent question,

as well as a random mechanism with known parameters (heads or tails, etc.).

Randomized responses work as follows: the respondent flips a coin (without announcing the result to the interviewer), and answers one of the two questions honestly:

  • “heads”: “Have you ever cheated on a test?”;

  • “tails”: “Were you born in January?”.

Since the interviewer does not know the outcome of the draw, they do not know whether the respondent is answering the sensitive question or the innocent one. In theory, the anonymity provided by the randomized response is freeing (the social cost is diminished, if not eliminated altogether) – therefore, we could expect an honest answer, regardless of the question.

But we have to be careful: this approach can only be successful if we know (or can estimate) the probabilities:

  • \(\theta\) of observing a positive response to the innocent question;

  • \(\rho\) of the question being answered actually being the sensitive question, and

  • \(\phi\) of observing a positive response, whatever the question.

Let \(p\) be the proportion of positive responses to the sensitive question, which is the quantity of interest. According to the Law of Total Probability, we have \[\begin{aligned} \phi&=P(\text{positive response})\\ &={\underbrace{\strut P(\text{positive}\mid \text{sensitive})}_{p}\underbrace{\strut P(\text{sensitive}}_{\rho})+\underbrace{\strut P(\text{positive}\mid \text{innocent})}_{\theta}\underbrace{\strut P(\text{innocent})}_{1-\rho}}, \\ &={p\rho+\theta(1-\rho)}\end{aligned}\] or \[p={\frac{\phi-\theta(1-\rho)}{\rho}}.\]

If \(\hat{\phi}\) is the proportion of positive responses in the achieved sample, then the randomized response estimator is \[\hat{p}_{\text{rr}}={\frac{\hat{\phi}-\theta(1-\rho)}{\rho}}, \quad \theta,\rho \text{ constants},\] whose sampling variance is \[\begin{aligned} \text{V}(\hat{p}_{\text{rr}})&={\text{V}\left(\frac{\hat{\phi}-\theta(1-\rho)}{\rho}\right)=\text{V}\left(\frac{\hat{\phi}}{\rho}\right)=\frac{1}{\rho^2}\cdot \text{V}(\hat{\phi})}.\end{aligned}\]

Since \(\hat{\phi}\) is a SRS proportion estimator obtained from a sample of size \(n\) in a population \(\mathcal{U}\) of size \(N\), its sampling variance is \[\text{V}(\hat{\phi})={\frac{\phi(1-\phi)}{n}\Big(\frac{N-n}{N-1}\Big)},\] from which we conclude that \[\text{V}(\hat{p}_{\text{rr}})={\frac{1}{\rho^2}\cdot \frac{\phi(1-\phi)}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

As the true value of \(\phi\) is typically not known, we instead use the unbiased estimator \[\hat{\text{V}}(\hat{p}_{\text{rr}})={\frac{1}{\rho^2}\cdot \frac{\hat{\phi}(1-\hat{\phi})}{n-1}\Big(1-\frac{n}{N}\Big)},\] and we build a 95% C.I. for \(p\) via \[\text{C.I.}_{\text{rr}}(p;0.95):\quad {\hat{p}_{\text{rr}} \pm 2\sqrt{\hat{\text{V}}(\hat{p}_{\text{rr}})}}.\]

The factor \(1/\rho^2\) penalizes the uncertainty brought by the randomized response – the higher \(\rho\) is, the lower \(\hat{\text{V}}(\hat{p}_{\text{rr}})\) is. But there are practical considerations that limit how high \(\rho\) can get: if it is too large, the anonymity conferred by the approach evaporates, and we risk ruining the study by causing an increase in nonresponse.

Example: we seek to determine the incidence of cheating in online courses among students in the Department of Mathematics and Statistics \((N=442)\), using a SRS with \(n=65\). We use the scheme described in this section with \(\rho=1/2\), and observe \(\theta=\frac{52}{442}\) and \(\hat{\phi}=\frac{21}{65}\). Find a 95% C.I. for the proportion of students who cheated during the pandemic.

Solution: we only need to compute \[\begin{aligned} \hat{p}_{\text{rr}}&=\frac{21/65-52/442(1-1/2)}{1/2}=0.53\\ \hat{\text{V}}(\hat{p}_{\text{rr}})&=\frac{1}{(1/2)^2}\cdot \frac{21/65(1-21/65)}{65-1}\Big(1-\frac{65}{442}\Big)=0.012, \end{aligned}\] which yields \(\text{C.I.}_{\text{rr}}(p;0.95)=0.53\pm 2\sqrt{0.012}\equiv (0.31, 0.74)\). \(\blacksquare\)
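In R:

# randomized response estimate of the cheating proportion
N <- 442; n <- 65; rho <- 1/2
theta <- 52/442; phi.hat <- 21/65
(p.rr <- (phi.hat - theta * (1 - rho)) / rho)                        # ~0.53
(V.rr <- (1/rho^2) * phi.hat * (1 - phi.hat) / (n - 1) * (1 - n/N))  # ~0.012
p.rr + c(-2, 2) * sqrt(V.rr)                                         # ~(0.31, 0.74)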

Bernoulli Sampling

Bernoulli sampling (BS) is a sampling design with a random sample size – we do not know the size of the sample before it is drawn.

Each unit of the population \(\mathcal{U}=\{u_1,\ldots, u_N\}\) is assigned the same probability of inclusion in the sample \(\mathcal{Y}\): \(\pi_j=\pi\in (0,1)\), for all \(j\). We denote the achieved sample size by \(n_a\).

The BS design consists of performing \(N\) independent Bernoulli trials, each with probability of success \(\pi\) (where a success means that the unit is included in the sample, and a failure means that it is rejected). The probability of obtaining a specific sample \(\mathcal{Y}\) of size \(n_a\) is then: \[P(\mathcal{Y})={\pi^{n_a}(1-\pi)^{N-n_a}}.\] There are \(2^N\) possible samples, with sizes varying from \(n_a=0\) to \(n_a=N\).

The sample size follows a binomial distribution \(n_a\sim {B(N,\pi)}\): \[P(n_a=n)={\binom{N}{n}\pi^n(1-\pi)^{N-n}}, \quad \text{E}[n_a]={N\pi},\quad \text{V}[n_a]={N\pi(1-\pi)}.\]

When \(N\) is sufficiently large, this distribution is approximately normal; the 95% C.I. for \(n_a\) is thus \[\text{C.I.}(n_a;0.95):\quad {N\pi \pm 2\sqrt{N\pi(1-\pi)}}.\]

Let \(\pi_{j,k}\) be the probability of inclusion of units \(u_j\) and \(u_k\), \(j\neq k\), in the sample \(\mathcal{Y}\). Since the Bernoulli trials are independent of one another, \[\pi_{j,k}=P(\{u_j,u_k\}\in \mathcal{Y})={P(u_j\in \mathcal{Y})\cdot P(u_k\in \mathcal{Y})=\pi_j\pi_k=\pi^2}.\]

The estimator \[\hat{\tau}_{\text{BS}}=\frac{1}{\pi}\sum_{i=1}^{n_a}y_i\] is an unbiased estimator of the total \(\tau\) in \(\mathcal{U}\): indeed, \[\begin{aligned} \text{E}[\hat{\tau}_{\text{BS}}]&={\frac{1}{\pi}\text{E}[n_a\overline{y}]=\frac{\text{E}[n_a]\text{E}[\overline{y}]}{\pi}=\frac{N\pi\mu}{\pi}=N\mu=\tau},\end{aligned}\] as \(n_a\) and \(\overline{y}\) are independent of each other.

In the same vein, the sampling variance of \(\hat{\tau}_{\text{BS}}\) is approximately \[\begin{aligned} \hat{\text{V}}[\hat{\tau}_{\text{BS}}]&={\frac{1}{\pi}\left(\frac{1}{\pi}-1\right)\sum_{i=1}^{n_a}y_i^2}.\end{aligned}\]

If \(N\) and \(n_a\) are sufficiently large, the Central Limit Theorem comes into play again, and we build a 95% C.I. for \(\tau\) using \[\text{C.I.}_{\text{BS}}(\tau;0.95):\quad {\hat{\tau}_{\text{BS}}\pm 2\sqrt{\hat{\text{V}}[\hat{\tau}_{\text{BS}}]}}.\]

The corresponding estimators for the mean \(\overline{y}_{\text{BS}}\) and the proportion \(\hat{p}_{\text{BS}}\) are obtained in the usual manner.

Example: a teacher has to correct \(600\) exam papers. For each paper, she rolls a die and only corrects it (at this stage) if it shows a \(6\). At the end of the process, she has graded \(90\) papers, of which \(60\) have received a passing grade. Find a 95% C.I. for the total number of passes in her class.

Solution: let \(y_i=1\) if the \(i\)th marked exam received a passing grade, and \(y_i=0\) otherwise. We have \(N=600\), \(\pi=1/6\), \(n_a=90\), \[\begin{aligned} {\sum_{i=1}^{90}y_i}&{=60,\quad \sum_{i=1}^{90}y_i^2=60, \quad \hat{\tau}_{\text{BS}}=\frac{1}{1/6}\sum_{i=1}^{90}y_i=6(60)=360} \\ {\hat{\text{V}}[\hat{\tau}_{\text{BS}}]}&{=\frac{1}{1/6}\left(\frac{1}{1/6}-1\right)\sum_{i=1}^{90}y_i^2=6(5)(60)=1800}.\end{aligned}\] The 95% C.I. is thus \(\text{C.I.}_{\text{BS}}(\tau;0.95)=360\pm 2\sqrt{1800} \equiv (275,445)\). It’s not looking good for the students… \(\blacksquare\)
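In R:

# Bernoulli sampling estimate of the total number of passes
pik <- 1/6                                  # inclusion probability (roll a 6)
y <- c(rep(1, 60), rep(0, 30))              # the 90 graded papers
(tau.bs <- sum(y) / pik)                    # 360
(V.bs <- (1/pik) * (1/pik - 1) * sum(y^2))  # 1800
tau.bs + c(-2, 2) * sqrt(V.bs)              # ~(275, 445)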

References

[47]
R. Latpate, J. Kshirsagar, V. K. Gupta, and G. Chandra, Advanced Sampling Methods. Springer Nature Singapore, 2021.
[49]
S. L. Lohr, Sampling: Design and Analysis. Duxbury Press, 1999.