5.5 Ratio, Regression, and Difference Estimation

In what follows, we present ways to obtain estimates of the mean, the total, or a proportion with the help of auxiliary information. So far, we have only discussed univariate SRS and StS estimators. Can we use more than one response per unit to obtain better approximations?

In the gapminder.csv dataset, there are \(N=168\) countries in 2011 for which the life expectancy \(Y\) and the (logarithm of the) gross domestic product per capita \(X\) are available.

Suppose it is known that \(\text{E}[X]=\mu_X=7.84\). If we draw a sample \(\{(x_1,y_1),\ldots,(x_{10},y_{10})\}\subseteq \mathcal{U}\) for which the mean of \(y_i/x_i\) is 8.67, can we expect that \(\mu_Y\approx 8.67 \mu_X = 68.00\)? (see Figure 5.6).

5.5.1 Ratio Estimation

Let \(\mathcal{U}=\{(X_1,Y_1),\ldots,(X_N,Y_N)\}\) be a finite population of size \(N\) for which each unit \(u_j\) has 2 observed values: \(X_j\) and \(Y_j\). The population ratio \(R\) is the ratio of the means (or totals): \[R={\frac{\displaystyle{\sum_{j=1}^NY_j}}{\displaystyle{\sum_{j=1}^NX_j}}=\frac{\mu_Y}{\mu_X}=\frac{\tau_Y}{\tau_X}, \quad\text{as long as }\mu_X,\tau_X\neq 0}.\] We are interested in such quotients when we try to determine the average wage \(Y\) as a function of years of schooling \(X\) in Canada, for example.

Ratio Estimator

Let \(\mathcal{Y}=\{(x_{i_1},y_{i_1}),\ldots,(x_{i_n},y_{i_n})\}\subseteq \mathcal{U}\) be a bivariate simple random sample of size \(n\). We often simplify the notation by writing \[\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}.\] The sample ratio of means \(r\) is an estimator of \(R\): \[r={\frac{\displaystyle{\sum_{i=1}^ny_i}}{\displaystyle{\sum_{i=1}^nx_i}}=\frac{\overline{y}}{\overline{x}}=\frac{\hat{\tau}_{Y}}{\hat{\tau}_X},\quad \text{as long as }\overline{x},\hat{\tau}_X\neq 0}.\]

Warning: this is a biased estimator!

Example: consider a finite bivariate population with \(N=4\) units: \[u_1=(1,2),\quad u_2=(1,0),\quad u_3=(2,1),\quad u_4=(4,5).\] The population ratio of means \(R\) is simply \[R={\frac{2+0+1+5}{1+1+2+4}=\frac{8}{8}=1}.\] Suppose that we provide an estimate of \(R\) by drawing a SRS of size \(n=3\) from \(\mathcal{U}\). There are \({\binom{4}{3}=4}\) such samples.

Sample \(y\) Values \(\overline{y}\) \(x\) Values \(\overline{x}\) \(r\) \(P(r)\)
\(u_1,u_2,u_3\) \(2,0,1\) \(1\) \(1,1,2\) \(4/3\) \(3/4\) \(1/4\)
\(u_1,u_2,u_4\) \(2,0,5\) \(7/3\) \(1,1,4\) \(2\) \(7/6\) \(1/4\)
\(u_1,u_3,u_4\) \(2,1,5\) \(8/3\) \(1,2,4\) \(7/3\) \(8/7\) \(1/4\)
\(u_2,u_3,u_4\) \(0,1,5\) \(2\) \(1,2,4\) \(7/3\) \(6/7\) \(1/4\)

We can compute the expectation of the estimator \(r\) directly: \[\begin{aligned} \text{E}[r]&=\sum_{r}rP(r)={\frac{1}{4}\left(3/4+7/6+8/7+6/7\right)=\frac{329}{336}\approx 0.979167 \neq R=1.} \quad \blacksquare \end{aligned}\] What is the sampling bias of \(r\) as an estimator of \(R\)? \[\text{E}[r-R]={\text{E}\Big[\frac{\overline{y}}{\overline{x}}-R\Big]=\text{E}\Big[\frac{1}{\overline{x}}(\overline{y}-R\overline{x})\Big]=\text{??}}\]
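We can double-check the enumeration above numerically; the following sketch simply lists all \(\binom{4}{3}\) samples from the toy population and averages the resulting values of \(r\).

# enumerate all SRS of size 3 from the toy population u_j = (X_j, Y_j)
X <- c(1, 1, 2, 4)
Y <- c(2, 0, 1, 5)
samples <- combn(4, 3)                                  # one column per sample
r.all <- apply(samples, 2, function(idx) mean(Y[idx])/mean(X[idx]))
r.all                                                   # 3/4, 7/6, 8/7, 6/7
mean(r.all)                                             # E[r] = 329/336, not R = 1
sum(Y)/sum(X)                                           # R = 1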

Ratio Estimator Bias

In this last expression for the sampling bias, only \(\overline{x}\) and \(\overline{y}\) change when the sample changes: \(R\) remains constant. But there is no simple expression allowing us to compute exactly the expectation of a quotient of random variables; we must use approximations.

Let \(f:[a,b]\to \mathbb{R}\) be \(C^2\) over \([a,b]\) (i.e., \(f,f',f''\) are all continuous over \([a,b]\)). According to Taylor’s theorem, for all \(z,c\in (a,b)\), there exists a \(\xi\) between \(c\) and \(z\) such that \[f(z)=f(c)+f'(c)(z-c)+\frac{f''(\xi)}{2}(z-c)^2.\] Since \(f''\) is continuous over \([a,b]\), \(f''\) is bounded on \([a,b]\): \(\exists M>0\) such that \(|f''(z)|\leq M\) for all \(z\in [a,b]\).

Thus, if \(z\) is sufficiently close to \(c\), \[|f(c)+f'(c)(z-c)|\gg \frac{M}{2}(z-c)^2\geq \left|\frac{f''(\xi)}{2}(z-c)^2\right|,\] from which we conclude that \[f(z)\approx {f(c)+f'(c)(z-c)};\] this is the linear approximation of \(f\) at \(z=c\). If \(f(z)=\frac{1}{z}\), we know that \(f'(z)=-\frac{1}{z^2}\). Set \(z=\overline{x}\) and \(c=\mu_X\).

Since \(f\) is \(C^2\) over any interval \([a,b]\) with \(a>0\), if \(\overline{x}\) is sufficiently close to \(\mu_X\), then the linear approximation becomes \[\frac{1}{\overline{x}}\approx {\frac{1}{\mu_X}-\frac{1}{\mu_X^2}(\overline{x}-\mu_X)}\] (the constant approximation would be \(\frac{1}{\overline{x}}\approx \frac{1}{\mu_X}\)).

But \(\text{E}(\overline{x})=\mu_X\), \(\text{E}(\overline{y})=\mu_Y\) (SRS), and \(\mu_Y=R\mu_X\), so that \[\begin{aligned} \text{E}[r-R]&=\text{E}\Big[\frac{\overline{y}-R\overline{x}}{\overline{x}}\Big]\approx {\text{E}\Big[\Big(\frac{1}{\mu_X}-\frac{1}{\mu_X^2}(\overline{x}-\mu_X)\Big)(\overline{y}-R\overline{x})\Big]} \\ &={\text{E}\Big[\frac{1}{\mu_X}(\overline{y}-R\overline{x})\Big]-\text{E}\Big[\frac{1}{\mu_X^2}(\overline{x}-\mu_X)(\overline{y}-R\overline{x})\Big]} \\ &={\frac{1}{\mu_X}\Big(\text{E}(\overline{y})-R\cdot \text{E}(\overline{x})\Big)-\frac{1}{\mu_X^2}\Big(\text{E}\big[\overline{x}\overline{y}-\mu_X\overline{y}-R\overline{x}^2+R\mu_X\overline{x}\big]\Big)} \\ &={\frac{1}{\mu_X}\underbrace{\strut \Big( \mu_Y-R\mu_X\Big)}_{=0}-\frac{1}{\mu_X^2}\Big(\text{E}(\overline{x}\overline{y})-\mu_X\text{E}(\overline{y})-R\big(\text{E}(\overline{x}^2)-\mu_X\text{E}(\overline{x})\big)\Big)} \\ &={-\frac{1}{\mu_X^2}\Big(\text{E}(\overline{x}\overline{y})-\mu_X\mu_Y-R\big( \text{E}(\overline{x}^2)-\mu_X^2\big)\Big)}\end{aligned}\]

We further simplify the sampling bias \(\text{E}[r-R]\) with the help of \(\text{E}(\overline{x}\overline{y})=\mu_X\mu_Y+\text{Cov}(\overline{x},\overline{y})\), and \(\text{E}(\overline{x}^2)=\mu_X^2+\text{V}(\overline{x})\). Thus, \[\begin{aligned} \text{E}[r-R]&\approx {-\frac{1}{\mu_X^2}\big[\text{Cov}(\overline{x},\overline{y})-R\cdot\text{V}(\overline{x})\big]}.\end{aligned}\]

In an SRS of size \(n\), drawn from a finite population of size \(N\) with variances \(\sigma_X^2\) and \(\sigma_Y^2\), we have already seen that \[V(\overline{x})=\frac{\sigma_X^2}{n}\Big(\frac{N-n}{N-1}\Big) \quad\text{and}\quad V(\overline{y})=\frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big).\]

Consider the random variable \(Z=X+Y\). The SRS estimator of \(\mu_Z=\mu_X+\mu_Y\) is \[\overline{z}=\overline{x}+\overline{y}\] and its sampling variance is \[\begin{aligned} &\text{V}(\overline{z})=\frac{\sigma_Z^2}{n}\Big(\frac{N-n}{N-1}\Big),\quad \text{where} \\ \sigma_Z^2&=\frac{1}{N}\sum_{j=1}^N(z_j-\mu_Z)^2={\frac{1}{N}\sum_{j=1}^N\big\{(x_j+y_j)-(\mu_X+\mu_Y)\big\}^2} \\ &={\frac{1}{N}\sum_{j=1}^N(x_j-\mu_X)^2+\frac{2}{N}\sum_{j=1}^N(x_j-\mu_X)(y_j-\mu_Y)+\frac{1}{N}\sum_{j=1}^N(y_j-\mu_Y)^2} \\ &={\sigma_X^2+2\sigma_{XY}+\sigma_Y^2=\sigma_X^2+2\rho\sigma_X\sigma_Y+\sigma_Y^2},\end{aligned}\] if \(\rho=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}\) is the correlation coefficient between \(X\) and \(Y\).

On the one hand, \[\text{V}(\overline{z})={\frac{\sigma_X^2+2\rho \sigma_X\sigma_Y+\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)};\] on the other, \[\begin{aligned} \text{V}(\overline{z})&=\text{V}(\overline{x}+\overline{y})=\text{V}(\overline{x})+2\text{Cov}(\overline{x},\overline{y})+\text{V}(\overline{y}) \\ &={\frac{\sigma_X^2}{n}\Big(\frac{N-n}{N-1}\Big)+2\text{Cov}(\overline{x},\overline{y})+\frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)};\end{aligned}\] we can thus conclude that \[\text{Cov}(\overline{x},\overline{y})={\frac{\rho\sigma_X\sigma_Y}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

Consequently, \[\begin{aligned} \text{E}[r-R]&\approx {-\frac{1}{\mu_X^2}\big[\text{Cov}(\overline{x},\overline{y})-R\cdot\text{V}(\overline{x})\big]} \\ &={-\frac{1}{\mu_X^2}\left[\frac{\rho\sigma_X\sigma_Y}{n}\Big(\frac{N-n}{N-1}\Big)-R\frac{\sigma_X^2}{n}\Big(\frac{N-n}{N-1}\Big)\right]} \\ &=\frac{1}{\mu_X^2}\cdot\frac{R\sigma_X^2-\rho\sigma_X\sigma_Y}{n} \Big(\frac{N-n}{N-1}\Big) \end{aligned}\]

But the systematic error is not the only way to quantify the magnitude of the error made when using \(r\) to estimate \(R\): the mean square error (\(\text{MSE}\)) of \(r\) is \[\begin{aligned} \text{MSE}(r)&={\text{E}\left((r-R)^2\right)=\text{V}(r)+\left(\text{E}(r)-R\right)^2}. \end{aligned}\]

Ratio Estimator Variability

We can obtain an approximation of \(\text{V}(r)\) using the constant Taylor approximation (of order \(0\)): \[\frac{1}{\overline{x}}\approx {\frac{1}{\mu_X}}.\]

Thus, \[\text{V}(r)={\text{V}(r-R)=\text{V}\Big[\frac{\overline{y}}{\overline{x}}-R\Big]=\text{V}\Big[\frac{\overline{y}-R\overline{x}}{\overline{x}}\Big]\approx \text{V}\Big[\frac{\overline{y}-R\overline{x}}{\mu_X}\Big]}.\]

Consider the random variable \(W=Y-RX\). Since \(\mu_Y=R\mu_X\), \[\mu_W={\mu_Y-R\mu_X=0}.\] The SRS sample mean of \(W\) in \(\mathcal{Y}\) is thus \[\overline{w}=\overline{y}-R\overline{x} \implies \text{V}(r)\approx {\text{V}\Big[\frac{\overline{w}}{\mu_X}\Big]=\frac{1}{\mu_X^2}\text{V}(\overline{w})=\frac{1}{\mu_X^2}\cdot \frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big)},\] where \[\sigma_W^2={\frac{1}{N}\sum_{j=1}^N(W_j- \mu_W)^2=\frac{1}{N}\sum_{j=1}^NW_j^2=\frac{1}{N}\sum_{j=1}^N(Y_j-RX_j)^2}.\]

We thus have \[\text{V}(r)\approx {\frac{1}{\mu_X^2}\cdot\frac{1}{n}\cdot \frac{1}{N}\sum_{j=1}^N(Y_j-RX_j)^2\Big(\frac{N-n}{N-1}\Big)}.\] The ratio between the systematic error \(\text{E}[r-R]\) and the standard deviation of \(r\) is then \[\frac{\text{E}[r-R]}{\text{SD}(r)}\approx{\frac{1}{\sqrt{n}}\cdot \frac{R\sigma_X^2-\rho\sigma_X\sigma_Y}{\mu_X\sigma_W}\sqrt{\frac{N-n}{N-1}}};\] when \(n,N\to \infty\) (while \(N\gg n\)), we must have \[\frac{\text{E}[r-R]}{\text{SD}(r)}\to 0.\] In other words, although it is impossible to get rid of the bias, the estimation error \[\text{MSE}(r)=\text{V}(r)+\left(\text{E}(r)-R\right)^2\] is dominated by the variance \(\text{V}(r)\) if the sample size \(n\) is sufficiently large.
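A short simulation sketch can illustrate this behaviour; the synthetic population below is entirely arbitrary (with \(\text{V}(Y\mid X)\) proportional to \(X\)) and is only meant to show that the bias of \(r\) shrinks faster than its standard deviation as \(n\) grows.

set.seed(1)                                        # arbitrary seed
N.sim <- 5000
x.sim <- runif(N.sim, 1, 10)
y.sim <- 3*x.sim + rnorm(N.sim, sd = sqrt(x.sim))  # variance proportional to x
R.sim <- mean(y.sim)/mean(x.sim)
for (n.sim in c(10, 50, 250)) {
  r.sim <- replicate(20000, {
    i <- sample(N.sim, n.sim)                      # SRS without replacement
    mean(y.sim[i])/mean(x.sim[i])
  })
  cat("n =", n.sim, " bias:", signif(mean(r.sim) - R.sim, 3),
      " sd:", signif(sd(r.sim), 3), "\n")
}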

Example: The list of countries for which both life expectancy and (logarithm of) gross domestic product per capita are available in 2011 contains \(N=168\) observations.

gapminder.RLD <- gapminder |>  filter(year==2011) |>
    select(life_expectancy,gdp,population)

# we keep only the observations that have both  
gapminder.RLD <-  gapminder.RLD[complete.cases(gapminder.RLD),]
gapminder.RLD <- gapminder.RLD |> mutate(lgdppc=log(gdp/population))
(N=nrow(gapminder.RLD))
[1] 168

We draw \(m=500\) SRS samples of size \(n=20\), and for each of these samples we compute the average of the individual quotients \(y_i/x_i\) (the quantity used at the beginning of the section) as an estimate of the ratio \(R\).

set.seed(12) # replicability
n=20
m=500

quotients <- c()
for(k in 1:m){
    samp <- gapminder.RLD[sample(1:N,n, replace=FALSE),c("life_expectancy","lgdppc")]
    quotients[k] <- mean(samp$life_expectancy/samp$lgdppc)
  }

The average of the 500 estimators is shown below:

quotients <- data.frame(quotients)
mean(quotients$quotients)
[1] 9.238648

We already know that \(\mu_X=7.84\). It would be reasonable to expect that \(\mu_Y \approx \overline{r}\mu_X\):

mean(gapminder.RLD$lgdppc)*mean(quotients$quotients)
[1] 72.45559

Is this a better approximation than the one we obtained at the beginning of the section: \(\mu_Y\approx 68.00\)?

This is a question that cannot be answered without knowing the distribution of the estimator \(r\) (it is a random variable, since its value depends on the sample \(\mathcal{Y}\) selected).

ggplot(quotients, aes(quotients)) + geom_rug(aes(quotients)) +
    geom_histogram(breaks=seq(8, 10.5, by = .125), col="black", fill="blue", alpha=.2) + 
    geom_vline(xintercept=mean(quotients$quotients), color="red") 

quotients
Min. : 8.428
1st Qu.: 9.073
Median : 9.246
Mean : 9.239
3rd Qu.: 9.401
Max. :10.002

What do you think?

Ratio Estimator Confidence Intervals

We can show that the estimator \(r\) follows approximately a normal distribution \(\mathcal{N}(\text{E}(r),\text{V}(r)),\) from which we conclude that the bound on the error of estimation is \[B_{R}\approx \hat{B}_{R}=2\sqrt{\hat{\text{V}}(r)}\approx{2\sqrt{\frac{1}{\mu_X^2}\cdot\frac{s_W^2}{n}\Big(1-\frac{n}{N}\Big)}\approx 2\sqrt{\frac{1}{\overline{x}^2}\cdot\frac{s_W^2}{n}\Big(1-\frac{n}{N}\Big)} },\] where \[s_W^2= \frac{1}{n-1}\sum_{i=1}^n(y_i-rx_i)^2.\] Thus \[\text{C.I.}(R;0.95): \quad {r\pm \hat{B}_R}\] is an approximate 95% C.I. for \(R\).

We notice that \[\begin{aligned} \sigma_W^2&=\frac{1}{N}\sum_{j=1}^N W_j^2=\frac{1}{N}\sum_{j=1}^N(Y_j-RX_j)^2 ={\frac{1}{N}\sum_{j=1}^N(Y_j-\mu_Y+\mu_Y-RX_j)^2} \\ &={\frac{1}{N}\sum_{j=1}^N(Y_j-\mu_Y+R\mu_X-RX_j)^2=\frac{1}{N}\sum_{j=1}^N[(Y_j-\mu_Y)-R(X_j-\mu_X)]^2} \\ &={\frac{1}{N}\sum_{j=1}^N(Y_j-\mu_Y)^2 -2R\frac{1}{N}\sum_{j=1}^N(X_j-\mu_X)(Y_j-\mu_Y)+R^2\frac{1}{N}\sum_{j=1}^N(X_j-\mu_X)^2} \\ &={\sigma_Y^2-2R\text{Cov}(X,Y)+R^2\sigma_X^2}=\sigma_Y^2-2R\rho\sigma_X\sigma_Y+R^2\sigma_X^2,\end{aligned}\] where \(\rho=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}.\)

By analogy, we then have \(s_W^2={s_Y^2-2r\hat{\rho}s_Xs_Y+r^2s_X^2}\), where \[\begin{aligned} s_X^2&={\frac{1}{n-1}\left(\sum_{i=1}^nx_i^2-n\overline{x}^2\right)},\quad s_Y^2={\frac{1}{n-1}\left(\sum_{i=1}^ny_i^2-n\overline{y}^2\right)}, \\ s_{XY}&={\frac{1}{n-1}\left(\sum_{i=1}^nx_iy_i-n\overline{x}\overline{y}\right)},\quad \text{and }\hat{\rho}={\frac{s_{XY}}{s_Xs_Y}}. \end{aligned}\] In practice, we can also use the following formula: \[s_W^2={\frac{1}{n-1}\left(\sum_{i=1}^ny_i^2-2r\sum_{i=1}^nx_iy_i+r^2\sum_{i=1}^nx_i^2\right)}.\]

Example: consider a SRS \(\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) of size \(n=132\), drawn from a population \(\mathcal{U}\) of size \(N=37,444\). Find a 95% C.I. for \(R\) if \[\begin{aligned} \sum_{i=1}^nx_i&=9464.6,\quad \sum_{i=1}^ny_i=14691.6,\\ \sum_{i=1}^nx_i^2&=686773.2,\quad \sum_{i=1}^nx_iy_i=1062186,\quad \sum_{i=1}^ny_i^2=1670194.\end{aligned}\]

Solution: with this sample, we have \(r=\frac{14691.6}{9464.6}\approx 1.55\), so that \[s_W^2=\frac{1670194-2(1.55)(1062186)+(1.55)^2(686773.2)}{132-1}\approx 209.2\] and \[{\hat{\text{V}}(r)\approx \frac{132^2}{9464.6^2} \frac{209.2}{132}\Big(1-\frac{132}{37444}\Big)=0.0003 \implies \text{C.I.}(R;0.95)\approx 1.552 \pm 0.035.\quad\blacksquare}\]
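The computations in this solution can be reproduced with a few lines of R; the values below are simply the intermediate sums given in the statement of the example.

n = 132; N.pop = 37444
sum.x = 9464.6; sum.y = 14691.6
sum.x2 = 686773.2; sum.xy = 1062186; sum.y2 = 1670194
(r = sum.y/sum.x)                                    # approx 1.552
(s2.W = (sum.y2 - 2*r*sum.xy + r^2*sum.x2)/(n - 1))  # approx 209.2
(V.hat = n^2/sum.x^2 * s2.W/n * (1 - n/N.pop))       # approx 0.0003
c(r - 2*sqrt(V.hat), r + 2*sqrt(V.hat))              # approx 1.552 +/- 0.035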

Example: find a 95% C.I. for the ratio of life expectancy by the logarithm of the gross domestic product per capita in 2011 with the help of a SRS of size \(n=20\).

Solution: the true ratio is

N=nrow(gapminder.RLD)
(R = mean(gapminder.RLD$life_expectancy)/mean(gapminder.RLD$lgdppc))
[1] 9.046742

We draw a sample of size \(n=20\), and we calculate the intermediate sums:

n=20
set.seed(123456) # replicability
index = sample(1:N,n, replace=FALSE)
samp = gapminder.RLD[index,c("life_expectancy","lgdppc")]

(sum.xi = sum(samp$lgdppc))
(sum.yi = sum(samp$life_expectancy))
(sum.xi.2 = sum(samp$lgdppc^2))
(sum.yi.2 = sum(samp$life_expectancy^2))
(sum.xiyi = sum(samp$lgdppc*samp$life_expectancy))
[1] 167.2794
[1] 1450.82
[1] 1430.912
[1] 106117.4
[1] 12245.93

Finally, we compute the estimator \(r\) and its variance, as well as the desired confidence interval.

r = sum.yi/sum.xi
s2.W = 1/(n-1)*(sum.yi.2-2*r*sum.xiyi+r^2*sum.xi.2)
V = n^2/sum.xi^2*(1/n)*s2.W*(1-n/N)
B = 2*sqrt(V)
c(r-B,r+B)
[1] 8.252515 9.093552

According to the frequentist interpretation of confidence intervals, roughly 95% of intervals built in this manner contain the true quotient \(R\); since \(R=9.046742\), the interval \((8.25,9.09)\) does indeed contain it.

Estimation of the Mean and the Total Using the Ratio Estimator

In practice, we often know \(\tau_X\) and/or \(\mu_X\). It is possible to use the relation \[\mu_Y={R\mu_X,\quad \text{where }R=\frac{\mu_Y}{\mu_X}}\] in order to approximate \(\mu_Y\) (if \(\mu_X\) is unknown, one uses \(\mu_X\approx \overline{x}\)).

Since \(r=\overline{y}/\overline{x}\), the ratio-based estimator \(\hat{\mu}_{Y;R}\) of \(\mu_Y\) is simply: \[\hat{\mu}_{Y;R}={r\cdot\mu_X}.\] But we have already observed that \(r\) is a biased estimator of \(R\), so we expect \(\hat{\mu}_{Y;R}\) to be a biased estimator of \(\mu_Y\), with an approximately normal distribution: \(\hat{\mu}_{Y;R}\sim_{\text{approx}}\mathcal{N}\left(\text{E}(\hat{\mu}_{Y;R}),\text{V}(\hat{\mu}_{Y;R})\right)\).

It is easy to show \[\begin{aligned} \text{E}[\hat{\mu}_{Y;R}-\mu_Y]&={\mu_X\text{E}[r-R]\approx \frac{1}{\mu_X}\cdot\frac{R\sigma_X^2-\rho\sigma_X\sigma_Y}{n} \Big(\frac{N-n}{N-1}\Big)} \\ \text{V}(\hat{\mu}_{Y;R})&={\text{V}(r\cdot\mu_X)=\mu_X^2\text{V}(r)\approx \frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big).} \end{aligned}\]

The bound on the error of estimation of \(\mu_Y\) with \(\hat{\mu}_{Y;R}\) is thus \[B_{\mu_{Y;R}}\approx \hat{B}_{\mu_{Y;R}}=2\sqrt{\text{V}(\hat{\mu}_{Y;R})}\approx {2\sqrt{\frac{s_W^2}{n}\Big(1-\frac{n}{N}\Big)}, \ s_W^2= \frac{1}{n-1}\sum_{i=1}^n(y_i-rx_i)^2,}\] from which we see that \(\text{C.I.}_R(\mu_Y;0.95)\equiv {\hat{\mu}_{Y;R}\pm \hat{B}_{\mu_{Y;R}}}\) is an approximate 95% C.I. for \(\mu_Y\).

It is also possible to use the relationship \[\tau_Y={R\tau_X,\quad \text{where }R=\frac{\mu_Y}{\mu_X}=\frac{\tau_Y}{\tau_X}}\] to approximate \(\tau_Y\) (if \(\tau_X\) is unknown, we use \(\tau_X\approx N\overline{x}\)).

Since \(r=\overline{y}/\overline{x}\), the ratio-based estimator \(\hat{\tau}_{Y;R}\) of \(\tau_Y\) is simply: \[\hat{\tau}_{Y;R}={r\cdot\tau_X}.\] But we have already observed that \(r\) is a biased estimator of \(R\), so we expect \(\hat{\tau}_{Y;R}\) to be a biased estimator of \(\tau_Y\), which follows an approximately normal distribution: \[\hat{\tau}_{Y;R}\sim_{\text{approx}} \mathcal{N}\left(\text{E}(\hat{\tau}_{Y;R}),\text{V}(\hat{\tau}_{Y;R})\right).\]

It is easy to show \[\begin{aligned} \text{E}[\hat{\tau}_{Y;R}-\tau_Y]&={\tau_X\text{E}[r-R]\approx \frac{N}{\mu_X}\cdot\frac{R\sigma_X^2-\rho\sigma_X\sigma_Y}{n} \Big(\frac{N-n}{N-1}\Big)} \\ \text{V}(\hat{\tau}_{Y;R})&={\text{V}(r\cdot\tau_X)=\tau_X^2\text{V}(r)=N^2\mu_X^2\text{V}(r)\approx N^2\cdot \frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big).} \end{aligned}\]

The bound on the error of estimation of \(\tau_Y\) with \(\hat{\tau}_{Y;R}\) is thus \[B_{\tau_{Y;R}}\approx \hat{B}_{\tau_{Y;R}}=2\sqrt{\text{V}(\hat{\tau}_{Y;R})}\approx {2N\sqrt{\frac{s_W^2}{n}\Big(1-\frac{n}{N}\Big)} ,\ s_W^2= \frac{1}{n-1}\sum_{i=1}^n(y_i-rx_i)^2,}\] from which we conclude that \(\text{C.I.}_R(\tau_Y;0.95)\equiv {\hat{\tau}_{Y;R}\pm \hat{B}_{\tau_{Y;R}}}\) is an approximate 95% C.I. for \(\tau_Y\).

Example: consider a SRS \(\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) of size \(n=132\), drawn from a population of size \(N=37,444\). Find a 95% C.I. for \(\mu_Y\) using ratio-based estimation, given that \[\begin{aligned} \sum_{i=1}^nx_i&=9464.6,\quad \sum_{i=1}^ny_i=14691.6,\\ \sum_{i=1}^nx_i^2&=686773.2,\quad \sum_{i=1}^nx_iy_i=1062186,\quad \sum_{i=1}^ny_i^2=1670194.\end{aligned}\]

Solution: with this sample, we have \(r\approx 1.55\), \(s_W^2\approx 209.2\), \(\hat{\text{V}}(r)\approx 0.00031\), and \(\text{C.I.}(R;0.95)\approx 1.552 \pm 0.035.\) Moreover, \(\overline{x}=9464.6/132=71.70\). Thus \[{\text{C.I.}_R(\mu_Y;0.95)=\mu_X\cdot\text{C.I.}(R;0.95)\approx \overline{x}\cdot\text{C.I.}(R;0.95)\equiv 111.29 \pm 2.51.\quad\blacksquare}\]
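Since the interval for \(\mu_Y\) is just \(\mu_X\approx\overline{x}\) times the interval for \(R\), the arithmetic is easily checked in R (a quick sketch, reusing the bound \(\hat{B}_R\approx 0.035\) obtained above).

x.bar = 9464.6/132
r = 14691.6/9464.6
B.R = 0.035                                  # bound on R from the previous example
c(x.bar*(r - B.R), x.bar*(r + B.R))          # approx 111.29 +/- 2.51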

Example: find a 95% C.I. for the average life expectancy by country \(\mu_Y\), in 2011, using ratio estimation and the logarithm of the gross domestic product per capita in 2011 (\(X\)), with a SRS sample of size \(n=20\).

Solution: we use the same sample as in the preceding example on the topic. We have already obtained a confidence interval for the ratio: \[\text{C.I.}(R;0.95)=(8.25,9.09).\] The sample mean of \(X\) was \(\overline{x}=\frac{167.2794}{20}=8.364.\) The 95% confidence interval for the average life expectancy using ratio estimation is thus \[\text{C.I.}_R(\mu_Y;0.95)=\mu_X\cdot \text{C.I.}(R;0.95)\approx \overline{x} \cdot(8.25,9.09)=(69.00, 76.03).\] Recall that the true value is \(\mu_Y=70.95\).

Sample Size

Just as was the case with SRS and StS, we can determine the required sample size assuming that we have some information about the population distribution.

To give an estimate for \(R\), use: \[\begin{aligned} B_{R}\approx 2\sqrt{\frac{1}{\mu_X^2}\cdot\frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{R}\mu_X^2}{4}}_{=D_{R}}=\frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big)} \\ & \Longleftrightarrow {\frac{(N-1)D_{R}}{\sigma_W^2}=\frac{N-n}{n}=\frac{N}{n}-1} \\ & \Longleftrightarrow {\frac{(N-1)D_{R}+\sigma_W^2}{\sigma_W^2}=\frac{N}{n}} \\ & \Longleftrightarrow n_{R}=\frac{N\sigma_W^2}{(N-1)D_{R}+\sigma_W^2}.\end{aligned}\]

To give an estimate for \(\mu_Y\) with ratio estimation, use: \[\begin{aligned} B_{\mu_{Y;R}}\approx 2\sqrt{\frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow n_{\mu_Y}=\frac{N\sigma_W^2}{(N-1)D_{\mu_Y}+\sigma_W^2}, \quad D_{\mu_Y}={\frac{B_{\mu_{Y;R}}^2}{4}}; \end{aligned}\] for \(\tau_Y\), use: \[\begin{aligned} B_{\tau_{Y;R}}\approx 2\sqrt{N^2\cdot\frac{\sigma_W^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow n_{\tau_Y}=\frac{N\sigma_W^2}{(N-1)D_{\tau_Y}+\sigma_W^2}, \quad D_{\tau_Y}={\frac{B_{\tau_{Y;R}}^2}{4N^2}}. \end{aligned}\]

Since we do not typically know \(\sigma_W^2\), we often use a small preliminary sample and use the empirical variance \(s_W^2\) as an estimator of \(\sigma_W^2\).

Example: consider a SRS \(\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) of size \(n\), drawn from a population of size \(N=37,444\). Assume that we have \(\sigma_W^2\approx 209.2\) and \(\mu_X\approx 71.7\), perhaps from a previous study. Determine the minimum sample size required to ensure that the bound on the error of estimation of the:

  1. ratio \(R\) using \(r\) is at most \(0.025\);

  2. mean \(\mu_Y\) using \(\hat{\mu}_{Y;R}\) is at most \(5\), and

  3. total \(\tau_Y\) using \(\hat{\tau}_{Y;R}\) is at most \(25\).

Solution: we simply use the formulas.

  1. since \(D_R=\frac{B_R^2\mu_X^2}{4}=\frac{0.025^2(71.7)^2}{4}\approx 0.8033,\) we have \[n_R=\frac{37444(209.2)}{(37444-1)(0.8033)+209.2}=258.6453 \implies n_R \geq 259;\]

  2. since \(D_{\mu_Y}=\frac{B_{\mu_{Y;R}}^2}{4}=\frac{5^2}{4}\approx 6.25,\) we have \[n_{\mu_Y}=\frac{37444(209.2)}{(37444-1)(6.25)+209.2}=33.443 \implies n_{\mu_Y} \geq 34;\]

  3. since \(D_{\tau_Y}=\frac{B_{\tau_{Y;R}}^2}{4N^2}=\frac{25^2}{4(37444)^2}\approx 1.11\times 10^{-7},\) we have \[n_{\tau_Y}=\frac{37444(209.2)}{(37444-1)(1.11\times 10^{-7})+209.2}=37443.25 \implies n_{\tau_Y} \geq 37444.\] In this last case, the desired bound \(B_{\tau_{Y;R}}\) is far too tight: the required sample is essentially the entire population. \(\blacksquare\)
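These three computations follow the same pattern, so they can be checked with a small helper function (n.needed is a hypothetical name used only for this sketch, not part of any package).

n.needed = function(N, sigma2.W, D) ceiling(N*sigma2.W/((N - 1)*D + sigma2.W))
n.needed(37444, 209.2, D = 0.025^2*71.7^2/4)   # ratio: 259
n.needed(37444, 209.2, D = 5^2/4)              # mean: 34
n.needed(37444, 209.2, D = 25^2/(4*37444^2))   # total: essentially the whole population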

5.5.2 Regression Estimation

Ratio estimation is a special case of a more general method, regression estimation. In the gapminder.csv dataset for 2011, we recognize that there is a more or less linear relationship between the life expectancy \(Y\) and the logarithm of the GDP per capita \(X\) for \(N=168\) countries.

Figure 5.12: Health and wealth of nations for the 2011 Gapminder data, with superimposed line of best fit.

When we compute \[r=\overline{y}/\overline{x}\] using a SRS of size \(n\), we are really assuming that the true relationship between \(Y\) and \(X\) takes the form \(Y=RX\approx r X\), i.e., that it is a straight line of slope \(r\) passing through the origin. But this last condition does not seem to be met. What to do in this case?

Regression Estimator

As above, let \(\mathcal{U}\) be a finite bivariate population of size \(N\), and \(\mathcal{Y}\subseteq \mathcal{U}\) be a finite bivariate random sample of size \(n\). We assume that the relationship between \(Y\) and \(X\) takes the form \[Y-\mu_Y=\beta(X-\mu_X).\]

If \(\mu_X\) is known (as we had assumed was the case for ratio estimation), the regression estimator \(\hat{\mu}_{Y;L}\) of \(\mu_Y\) obtained with the SRS \(\mathcal{Y}\) is \[\hat{\mu}_{Y;L}=\overline{y}+\beta(\mu_X-\overline{x}).\]

For now, we treat \(\beta\) as an unknown constant (since \(\mu_Y\) is also unknown). Since \(\mathcal{Y}\) is drawn in a SRS context, \(\text{E}(\overline{x})=\mu_X\) and \(\text{E}(\overline{y})=\mu_Y\), so that \[\text{E}(\hat{\mu}_{Y;L})={\text{E}(\overline{y})+\beta(\mu_X-\text{E} (\overline{x}))=\mu_Y+\beta(\mu_X-\mu_X)=\mu_Y}.\]

Consider the random variable \(W=Y+\beta(\mu_X-X)\). As \(\beta\) is constant, we have \[\mu_W=\mu_Y+\beta(\mu_X-\mu_X)=\mu_Y.\] The sample mean of \(W\) is thus \[\overline{w}=\overline{y}+\beta(\mu_X-\overline{x})=\hat{\mu}_{Y;L}\implies \text{V}(\hat{\mu}_{Y;L})=\text{V}(\overline{w})={\frac{\sigma_{W;L}^2}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

But \[\begin{aligned} \sigma_{W;L}^2&={\frac{1}{N}\sum_{j=1}^N(W_j- \mu_W)^2=\frac{1}{N}\sum_{j=1}^N(Y_j+\beta(\mu_X-X_j)-\mu_Y)^2} \\& ={\frac{1}{N}\sum_{j=1}^N\left\{(Y_j-\mu_Y)-\beta(X_j-\mu_X)\right\}^2=\sigma_Y^2-2\beta\rho\sigma_X\sigma_Y+\beta^2\sigma^2_X},\end{aligned}\] where \(\rho=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}\). Consequently, \[\text{V}(\hat{\mu}_{Y;L})={\frac{\sigma_Y^2-2\beta\rho\sigma_X\sigma_Y+\beta^2\sigma^2_X}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

In general, for a given systematic error (bias), preference is given to the estimator with the lowest variance. The value of \(\beta\) which minimizes \(\text{V}(\hat{\mu}_{Y;L})\) would then satisfy \[\frac{\partial \text{V}(\hat{\mu}_{Y;L})}{\partial \beta}(\beta^*)={\frac{1}{n}\Big(\frac{N-n}{N-1}\Big)(-2\rho\sigma_X\sigma_Y+2\beta^*\sigma^2_X)=0},\] which is to say that \[\beta^*={\rho\frac{\sigma_Y}{\sigma_X}=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}\cdot\frac{\sigma_Y}{\sigma_X}=\frac{\sigma_{XY}}{\sigma_X^2}},\] from which we conclude that \[\begin{aligned} \text{V}(\hat{\mu}_{Y;L})&={\frac{\sigma_Y^2-2\beta^*\rho\sigma_X\sigma_Y+(\beta^*)^2\sigma^2_X}{n}\Big(\frac{N-n}{N-1}\Big)} \\ &= {\frac{\sigma_Y^2-2\rho\frac{\sigma_Y}{\sigma_X}\rho\sigma_X\sigma_Y+(\rho\frac{\sigma_Y}{\sigma_X})^2\sigma^2_X}{n}\Big(\frac{N-n}{N-1}\Big)} \\ &= {\frac{\sigma_Y^2-2\rho^2\sigma_Y^2+\rho^2\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)}\\ &=\frac{\sigma_Y^2(1-\rho^2)}{n}\Big(\frac{N-n}{N-1}\Big). \end{aligned}\]

If \(\sigma_Y\) and \(\rho\) are known, this formula can be used directly. Otherwise, \(\beta^*\) must be estimated from the sample, which brings us back to simple linear regression.
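We can verify numerically that \(\beta^*=\rho\sigma_Y/\sigma_X\) does minimize the variance, using the 2011 Gapminder population built earlier (a sketch; the search interval passed to optimize() is arbitrary).

x = gapminder.RLD$lgdppc; y = gapminder.RLD$life_expectancy
N = length(x); n = 20
sig.X = sqrt(mean((x - mean(x))^2)); sig.Y = sqrt(mean((y - mean(y))^2))
rho = mean((x - mean(x))*(y - mean(y)))/(sig.X*sig.Y)
V.L = function(beta) (sig.Y^2 - 2*beta*rho*sig.X*sig.Y + beta^2*sig.X^2)/n*(N - n)/(N - 1)
(beta.star = rho*sig.Y/sig.X)
optimize(V.L, interval = c(-50, 50))$minimum    # numerically the same value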

Regression Estimator Bias

The task is to determine the coefficients \(\alpha,\beta\) that “best describe” the linear relationship between \(X\) and \(Y\), \[y_i=\alpha+\beta x_i+\varepsilon_i,\quad i=1,\ldots, n,\] where we assume that \((\varepsilon_1,\ldots,\varepsilon_n)\sim_{\text{approx.}}\mathcal{N}(\mathbf{0},\sigma^2\mathbf{I}_n).\)

There are several ways to interpret the phrase “best describe” – the least squares estimators \(\hat{\alpha}\) and \(\hat{\beta}\) are those that minimize the residual sum of squares \[Q(\alpha,\beta)={\sum_{i=1}^n\varepsilon_i^2=\sum_{i=1}^n(y_i-\hat{y}_ i)^2=\sum_{i=1}^n(y_i-\alpha-\beta x_i)^2}.\]

We solve the system of equations \[\frac{\partial Q}{\partial \alpha}(a,b)={\sum_{i=1}^n-2(y_i-a-bx_i)}=0, \quad \frac{\partial Q}{\partial \beta}(a,b)={\sum_{i=1}^n-2x_i(y_i-a-bx_i)}=0\], which yields \[\hat{\alpha}=a={\overline{y}-b\overline{x}}\quad\text{and}\quad \hat{\beta}=b={\frac{\displaystyle{\sum_{i=1}^n}(x_i-\overline{x})(y_i-\overline{y})}{\displaystyle{\sum_{i=1}^n(x_i-\overline{x})^2}}}.\]

In practice, it is this \(b={\hat{\rho}\frac{s_Y}{s_X}}\) that plays the role of the estimator \(\beta^*\); note that it varies from one SRS to another. Since \(b\) is not constant, we cannot conclude that \[\text{E}(b\overline{x})=\text{E}(b)\text{E}(\overline{x}),\] so that \[\text{E}(\hat{\mu}_{Y;L})={\text{E}(\overline{y})+\mu_X\text{E}(b)-\text{E}(b\overline{x})\neq \mu_Y},\] in general.

However, if the sample size \(n\) is large, it is possible to show that \[\text{E}[\hat{\mu}_{Y;L}-\mu_Y]\] is of order \(\frac{1}{n}\) (as was the case for the systematic error in ratio estimation); \(\hat{\mu}_{Y;L}\) is therefore a biased estimator of \(\mu_{Y}\).

Regression Estimator Variability

The sampling variance of \(\hat{\mu}_{Y;L}\) is also of order \(\frac{1}{n}\), and so the quotient of the bias \(\text{E}[\hat{\mu}_{Y;L}-\mu_Y]\) by the standard deviation of \(\hat{\mu}_{Y;L}\) is of order \(\frac{1}{\sqrt{n}}\).

Thus, when \(n,N\to \infty\) (assuming that \(N\gg n\)), we have \[\frac{\text{E}[\hat{\mu}_{Y;L}-\mu_Y]}{\text{SD}(\hat{\mu}_{Y;L})}\to 0.\]

Although it is impossible to get rid of the bias, the estimation error \[\text{MSE}(\hat{\mu}_{Y;L})=\text{V}(\hat{\mu}_{Y;L})+\left(\text{E}(\hat{\mu}_{Y;L})-\mu_Y\right)^2\] is dominated by the variance \(\text{V}(\hat{\mu}_{Y;L})\) when \(n\) is sufficiently large.

Regression Estimator Confidence Intervals

The regression estimator \(\hat{\mu}_{Y;L}\) follows approximately a normal distribution \(\mathcal{N}\left(\text{E}(\hat{\mu}_{Y;L}),\text{V}(\hat{\mu}_{Y;L})\right)\), from which we conclude that the bound on the error of estimation is \[B_{L}\approx \hat{B}_L=2\sqrt{\hat{\text{V}}(\hat{\mu}_{Y;L})}\approx {2\sqrt{\frac{s_{W;L}^2}{n}\Big(1-\frac{n}{N}\Big)}},\] where \(s_{W;L}^2\) is the regression mean square error, \[s^2_{W;L}={\frac{n-1}{n-2}(s_Y^2-b^2s_X^2)=\frac{n-1}{n-2}\cdot s_Y^2(1-\hat{\rho}^2)}.\] Consequently \(\text{C.I.}_L(\mu_{Y};0.95): {\hat{\mu}_{Y;L}\pm \hat{B}_L}\) is an approximate 95% C.I. for \(\mu_Y\) (we tackle \(\tau_{Y}\) and \(p_{Y}\) in the usual manner).

Example: consider a SRS \(\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}\) with \(n=132\), drawn from a population of size \(N=37,444\). In a preceding study, we have shown that \(\mu_X\approx 70.3\). Find a 95% C.I. for \(\mu_Y\) using regression estimation if \[\begin{aligned} \sum_{i=1}^nx_i&=9464.6,\quad \sum_{i=1}^ny_i=14691.6,\\ \sum_{i=1}^nx_i^2&=686773.2,\quad \sum_{i=1}^nx_iy_i=1062186,\quad \sum_{i=1}^ny_i^2=1670194.\end{aligned}\]

Solution: we must evaluate \(\overline{x}\), \(\overline{y}\), \(s^2_X\), \(s_{XY}\), \(s_Y^2\), and \(\hat{\rho}\). But \[\begin{aligned} \overline{x}&=\frac{9464.6}{132}\approx 71.7,\quad \overline{y}=\frac{14691.6}{132}\approx 111.3,\\ s^2_X&=\frac{686773.2-132(71.7)^2}{132-1}\approx 62.2,\quad s^2_Y=\frac{1670194-132(111.3)^2}{132-1}\approx 267.3 \\ s_{XY}&=\frac{1062186-132(71.7)(111.3)}{132-1}\approx 67.2,\quad \hat{\rho}=\frac{67.2}{\sqrt{(62.2)(267.3)}} \approx 0.521.\end{aligned}\] The estimator for the regression slope is therefore \(b=\hat{\rho}\frac{s_Y}{s_X}=1.08\). Moreover, \[s^2_{W;L}=\frac{131}{130}\cdot 267.3 \cdot (1-0.521^2) \approx 196.77.\] Consequently, \[\hat{\mu}_{Y;L}=111.3+1.08(\underbrace{\strut 70.3}_{\mu_X}-71.7)=109.8\] and \[\hat{B}_L\approx 2\sqrt{\frac{196.77}{132}\Big(1-\frac{132}{37444}\Big)}= 2.43,\] from which we conclude that \[\text{C.I.}_L(\mu_{Y};0.95)\equiv 109.8 \pm 2.43.\quad \blacksquare\]
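The hand computations above can be reproduced in R from the intermediate sums (a sketch; the results differ from the rounded values above only in the last displayed decimal).

n = 132; fpc = 1 - 132/37444; mu.X.known = 70.3
x.bar = 9464.6/n; y.bar = 14691.6/n
s2.X = (686773.2 - n*x.bar^2)/(n - 1)
s2.Y = (1670194 - n*y.bar^2)/(n - 1)
s.XY = (1062186 - n*x.bar*y.bar)/(n - 1)
rho.hat = s.XY/sqrt(s2.X*s2.Y)
b = rho.hat*sqrt(s2.Y/s2.X)
s2.W.L = (n - 1)/(n - 2)*s2.Y*(1 - rho.hat^2)
(hat.mu.Y.L = y.bar + b*(mu.X.known - x.bar))   # approx 109.8
(B.L = 2*sqrt(s2.W.L/n*fpc))                    # approx 2.4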

Of course, if the linearity assumption is not valid, we should not expect the bound on the error of estimation using regression estimation to be substantially tighter than the one obtained in a SRS, say.

Example: find a 95% C.I. for the average life expectancy by country in 2011 using regression estimation against the logarithm of the GDP per capita, with \(n=20\), assuming that it is known that \(\mu_X=7.84\).

Solution: we draw a sample of size \(n=20\), and we calculate the required quantities:

n=20
set.seed(123456) # replicability
index = sample(1:N,n, replace=FALSE)
samp = gapminder.RLD[index,c("life_expectancy","lgdppc")]
mu.X = mean(gapminder.RLD$lgdppc)

The sample means are:

(y.bar = mean(samp$life_expectancy))
(x.bar = mean(samp$lgdppc))
[1] 72.541
[1] 8.363971

The intermediate sums and the correlation coefficient are:

sum.xi = sum(samp$lgdppc)
sum.yi = sum(samp$life_expectancy)
sum.xi.2 = sum(samp$lgdppc^2)
sum.yi.2 = sum(samp$life_expectancy^2)
sum.xiyi = sum(samp$lgdppc*samp$life_expectancy)

s2.X = (sum.xi.2-n*x.bar^2)/(n-1)
s2.Y = (sum.yi.2-n*y.bar^2)/(n-1)
s.XY = (sum.xiyi-n*x.bar*y.bar)/(n-1)

(rho = s.XY/sqrt(s2.X*s2.Y))
[1] 0.667983

Next, we evaluate the MSE:

(s2.W.L = (n-1)/(n-2)*s2.Y*(1-rho^2))
[1] 26.8736

The bound on the error of estimation is thus:

(B = 2*sqrt(s2.W.L/n*(1-n/N)))
[1] 2.175976

and the corresponding 95% C.I. for the mean life expectancy by country is:

(hat.mu.Y.L = y.bar + rho*sqrt(s2.Y/s2.X)*(mu.X-x.bar))
c(hat.mu.Y.L-B,hat.mu.Y.L+B)
[1] 70.71572
[1] 68.53974 72.89170

For comparison’s sake, the true mean is \(\mu_Y=70.95\).

We can also compute the estimate and the confidence interval directly, with the base lm() function.

reg.lin = lm(life_expectancy~lgdppc, data=samp)
summary(reg.lin)

Call:
lm(formula = life_expectancy ~ lgdppc, data = samp)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.2467   0.1592   1.6513   2.6614   5.8812 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  43.2559     7.7768   5.562  2.8e-05 ***
lgdppc        3.5013     0.9194   3.808  0.00129 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.184 on 18 degrees of freedom
Multiple R-squared:  0.4462,    Adjusted R-squared:  0.4154 
F-statistic:  14.5 on 1 and 18 DF,  p-value: 0.001287

The required quantities can be extracted as follows:

(b = as.numeric(reg.lin$coefficients[2]))
[1] 3.501336
(s2.W.L = summary(reg.lin)$sigma^2)
[1] 26.8736

Sample Size

If we seek a regression estimate of \(\mu_Y\), we use: \[\begin{aligned} B_{L}\approx 2\sqrt{\frac{\sigma_{W;L}^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow {\underbrace{\strut \frac{B^2_{L}}{4}}_{=D_{L}}=\frac{\sigma_{W;L}^2}{n}\Big(\frac{N-n}{N-1}\Big)} \\ & \Longleftrightarrow {\frac{(N-1)D_{L}}{\sigma_{W;L}^2}=\frac{N-n}{n}=\frac{N}{n}-1} \\ & \Longleftrightarrow {\frac{(N-1)D_{L}+\sigma_{W;L}^2}{\sigma_{W;L}^2}=\frac{N}{n}}, \\ & \Longleftrightarrow n_{L}=\frac{N\sigma_{W;L}^2}{(N-1)D_{L}+\sigma_{W;L}^2}.\end{aligned}\]

For \(\tau_Y\), we use: \[\begin{aligned} B_{\tau;L}\approx 2N\sqrt{\frac{\sigma_{W;L}^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow n_{\tau;L}=\frac{N\sigma_{W;L}^2}{(N-1)D_{\tau;L}+\sigma_{W;L}^2}, \\ \text{where }D_{\tau; L}&={\frac{B_{\tau;L}^2}{4N^2}}. \end{aligned}\]

Since we do not know usually know \(\sigma_{W;L}^2\), we often draw a small preliminary sample on which we compute the sample \(s_{W;L}^2\), which is used as an estimator of \(\sigma_{W;L}^2\).64

Example: determine the sample size \(n\) required to estimate the average life expectancy \(\mu_Y\) using regression estimation against the logarithm of GDP per capita in 2011, with a bound on the error of estimation of \(B_{L}=1\), if \(\sigma_{W;L}\approx 5.184\) and \(N=168\).

Solution: using the formula, we have: \[{n_{L}=\frac{168(5.184)^2}{167(1^2/4)+(5.184)^2}=65.79 \implies n_{L}\geq 66.}\]

Since there are good reasons to trust that the relationship between life expectancy and log GDP per capita in 2011 is approximately linear (see Figure 5.12), the regression approach is a strong one (assuming, of course, that \(\mu_X\) is known). How does it compare with the example that uses ratio estimation? \(\blacksquare\)
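A one-line check of this sample-size computation (a sketch using the mean square error \(s^2_{W;L}\approx 26.87\) obtained from the fit above):

ceiling(168*26.8736/(167*(1/4) + 26.8736))   # 66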

5.5.3 Difference Estimation

Difference estimation is another special case of regression estimation, where the slope \(\beta\) is now assumed to be \(1\).

If \(\mu_X\) is known, the difference estimator \(\hat{\mu}_{Y;D}\) of \(\mu_Y\) computed from a SRS \(\mathcal{Y}\) is \[\hat{\mu}_{Y;D}=\overline{y}+(\mu_X-\overline{x}).\]

Difference estimation is a good strategy when the relationship between \(X\) and \(Y\) is approximately linear and of slope \(1\) (passing or not through the origin), as long as the variance of \(Y\) along this line is constant for all \(X\).

Since \(\mathcal{Y}\) is drawn according to a SRS, \(\text{E}(\overline{x})=\mu_X\) and \(\text{E}(\overline{y})=\mu_Y\), from which we conclude that \[\text{E}(\hat{\mu}_{Y;D})={\text{E}(\overline{y})+(\mu_X-\text{E}(\overline{x}))=\mu_Y+(\mu_X-\mu_X)=\mu_Y}.\]

Consider the random variable \(D=Y-X\), whose expectation is \[\mu_D=\mu_Y-\mu_X.\] The sample mean of \(D\) is thus \[\overline{d}=\overline{y}-\overline{x} \implies \hat{\mu}_{Y;D}=\mu_X+(\overline{y}-\overline{x})=\mu_X+\overline{d}.\]

Consequently, \[\text{V}(\hat{\mu}_{Y;D})=\text{V}(\mu_X+\overline{d})=\text{V}(\overline{d})={\frac{\sigma_{D}^2}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

But \[\begin{aligned} \sigma_{D}^2&={\frac{1}{N}\sum_{j=1}^N(D_j- \mu_D)^2=\frac{1}{N}\sum_{j=1}^N\left\{(Y_j-X_j)-(\mu_Y-\mu_X)\right\}^2} \\& ={\frac{1}{N}\sum_{j=1}^N\left\{(Y_j-\mu_Y)-(X_j-\mu_X)\right\}^2=\sigma_Y^2-2\rho\sigma_X\sigma_Y+\sigma^2_X},\end{aligned}\] where \(\rho=\frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}=\frac{\sigma_{XY}}{\sigma_X\sigma_Y}\). As such, \[\text{V}(\hat{\mu}_{Y;D})={\frac{\sigma_Y^2-2\rho\sigma_X\sigma_Y+\sigma^2_X}{n}\Big(\frac{N-n}{N-1}\Big)}.\]

The difference estimator \(\hat{\mu}_{Y;D}\) follows approximately a normal distribution \(\mathcal{N}\left(\text{E}(\hat{\mu}_{Y;D}), \text{V}(\hat{\mu}_{Y;D})\right)\), from which we obtain the bound on the error of estimation \[B_{D}\approx \hat{B}_D=2\sqrt{\hat{\text{V}}(\hat{\mu}_{Y;D})}\approx {2\sqrt{\frac{s_{D}^2}{n}\Big(1-\frac{n}{N}\Big)}},\] where \[s^2_{D}={\frac{1}{n-1}\sum_{i=1}^n(d_i-\overline{d})^2=s_Y^2-2\hat{\rho}s_Xs_Y+s_X^2},\] so that \(\text{C.I.}_D(\mu_{Y};0.95): {\hat{\mu}_{Y;D}\pm \hat{B}_D}\) is an approximate 95% C.I. for \(\mu_{Y}\) (we tackle \(\tau_{Y}\) and \(p_{Y}\) in the usual manner).

Example: Auditors are often interested in comparing the audited value \(Y\) of items with their book value \(X\). Suppose that \(N=180\) items in inventory have a total book value of \(\tau_X=13,320\). A SRS of \(n=10\) items yields the following data:

item \(i\) \(1\) \(2\) \(3\) \(4\) \(5\) \(6\) \(7\) \(8\) \(9\) \(10\)
Audit \(y_i\) \(9\) \(14\) \(7\) \(29\) \(45\) \(109\) \(40\) \(238\) \(60\) \(170\)
Book \(x_i\) \(10\) \(12\) \(8\) \(26\) \(47\) \(112\) \(36\) \(240\) \(59\) \(167\)
\(d_i=y_i-x_i\) \(-1\) \(2\) \(-1\) \(3\) \(-2\) \(-3\) \(4\) \(-2\) \(1\) \(3\)

Find a 95% C.I. for the mean audit value \(\mu_Y\) using difference estimation.

Figure 5.13: Scatterplot of \(X\) and \(Y\).

Solution: from the scatterplot, we surmise that the slope of the linear fit of \(Y\) against \(X\) is approximately 1. We must compute \(\overline{d}\) and \(s_D^2\): \[\begin{aligned} \sum_{i=1}^{10}d_i=4,\quad \sum_{i=1}^{10}d_i^2=58,\implies \overline{d}=\frac{4}{10}\quad\text{and}\quad s_D^2=\frac{58-10(0.4)^2}{10-1}=6.27.\end{aligned}\]

Since \(\mu_X=\frac{\tau_X}{N}=\frac{13320}{180}=74\), the difference estimator is \[\hat{\mu}_{Y;D}=\mu_X+\overline{d}=74+0.4=74.4\] and the bound is \[\hat{B}_D\approx 2\sqrt{\hat{\text{V}}(\hat{\mu}_D)}=2\sqrt{\frac{6.27}{10}\Big(1-\frac{10}{180}\Big)}= 1.54,\] from which \[\text{C.I.}_D(\mu_{Y};0.95):\quad 74.4 \pm 1.54\equiv (72.86,75.94).\quad \blacksquare\]
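The same computations can be done in R (a quick sketch; the audit and book values are those of the table above).

audit = c(9, 14, 7, 29, 45, 109, 40, 238, 60, 170)     # y_i
book  = c(10, 12, 8, 26, 47, 112, 36, 240, 59, 167)    # x_i
d = audit - book
mu.X.book = 13320/180
(mu.hat.D = mu.X.book + mean(d))                       # 74.4
(B.D = 2*sqrt(var(d)/length(d)*(1 - length(d)/180)))   # approx 1.54
c(mu.hat.D - B.D, mu.hat.D + B.D)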

Example: consider a bivariate SRS sample \(\mathcal{Y}=\{(x_i,y_i)\}\) of size \(n=132\), drawn from a population of size \(N=37,444\). In a preceding study, we found that \(\mu_X\approx 70.3\). Find a 95% C.I. for \(\mu_Y\) using difference estimation, assuming that \[\begin{aligned} \sum_{i=1}^nx_i&=9464.6,\quad \sum_{i=1}^ny_i=14691.6, \\ \sum_{i=1}^nx_i^2&=686773.2,\quad \sum_{i=1}^nx_iy_i=1062186,\quad \sum_{i=1}^ny_i^2=1670194.\end{aligned}\]

Solution: in a previous example, we have already computed \[\begin{aligned} \overline{x}&= 71.7,\ \overline{y}\approx 111.3,\ s^2_X\approx 62.2,\ s^2_Y\approx 267.3,\ s_{XY}\approx 67.2.\end{aligned}\] The difference estimator is thus \[\hat{\mu}_{Y;D}=\overline{y}+(\mu_X-\overline{x})=111.3+(70.3-71.7)= 109.9,\] so that \[\hat{B}_D\approx 2\sqrt{\frac{267.3-2(67.2)+62.2}{132}\Big(1-\frac{132}{37444}\Big)}=2.427 ,\] and \[\text{C.I.}_D(\mu_{Y};0.95)\equiv 109.9 \pm 2.427.\quad \blacksquare\]

Example: find a 95% C.I. for the average life expectancy by country in 2011 \(\mu_Y\) using the difference method with the logarithm of GDP per capita per country (\(X\)), using a sample of size \(n=20\). Assume that \(\mu_X=7.84\) is known.

Solution: we draw a sample of size \(n=20\) and compute the various required quantities.

(mu.X = mean(gapminder.RLD[,"lgdppc"]))
[1] 7.842661
n=20
set.seed(1234567)
index = sample(1:N,n, replace=FALSE)
samp = gapminder.RLD[index,c("life_expectancy","lgdppc")]
d = samp$life_expectancy - samp$lgdppc
(y.bar = mean(samp$life_expectancy))
(x.bar = mean(samp$lgdppc))
(d.bar = mean(d))
(s2.d = var(d))
[1] 70.105
[1] 7.577646
[1] 62.52735
[1] 47.69057

Note that the regression slope does not seem to be 1 (the least squares fit obtained in the regression estimation example had slope \(b\approx 3.5\)), and the differences \(d_i=y_i-x_i\) are far from constant (\(s_D^2\approx 47.7\)). Difference estimation is not recommended in this case, but we will continue the example nonetheless.

The bound on the error of estimation and the difference estimate are computed below, and the confidence interval is:

B = 2*sqrt(s2.d/n*(1-n/N))
hat.mu.Y.D = y.bar + (mu.X-x.bar)
c(hat.mu.Y.D-B,hat.mu.Y.D+B)
[1] 67.47129 73.26874

In spite of the difference estimation assumptions not being met, the 95% C.I. for \(\mu_Y\) does contain the true value, \(\mu_Y=70.95\)! A happy coincidence, no more.

Sample Size

As with the other methods, we can determine the sample size required to achieve a certain bound on the error of estimation.

In order to estimate \(\mu_Y\) and \(\tau_Y\) via difference estimation, use: \[\begin{aligned} B_{\mu;D}\approx 2\sqrt{\frac{\sigma_{D}^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow n_{\mu;D}={\frac{N\sigma_{D}^2}{(N-1)D_{\mu;D}+\sigma_{D}^2}};\\ B_{\tau;D}\approx 2N\sqrt{\frac{\sigma_{D}^2}{n}\Big(\frac{N-n}{N-1}\Big)} & \Longleftrightarrow n_{\tau;D}={\frac{N\sigma_{D}^2}{(N-1)D_{\tau;D}+\sigma_{D}^2}}, \end{aligned}\] where \[D_{\mu; D}={\frac{B_{\mu;D}^2}{4}}\quad \text{and}\quad D_{\tau; D}={\frac{B_{\tau;D}^2}{4N^2}}.\]

As we do not usually know \(\sigma_{D}^2\), we often draw a small preliminary sample and use the empirical variance \(s_{D}^2\) as an estimator of \(\sigma_{D}^2\).
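For instance, reusing the preliminary variance \(s_D^2\approx 47.69\) computed in the example above, and a hypothetical target bound \(B_{\mu;D}=1\) for the 2011 Gapminder population, the formula gives:

ceiling(168*s2.d/((168 - 1)*(1^2/4) + s2.d))   # approximately 90, using s2.d from the example above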

Warning! Even if formal manipulations can still be performed, the estimate may not be valid if the relationship between the variables \(X\) and \(Y\) is not linear with slope \(\approx 1\).

5.5.4 Comparisons

We have already compared the bounds on the error of estimation for SRS, StS (Prop), and StS (Neyman), and discussed contexts in which one might expect a StS to be preferable to an SRS, or a StS (Neyman) preferable to a StS (Prop).

What can be said about ratio, regression, and difference estimation, both compared to SRS and to each other?

Comparison Between SRS and the Ratio Method

In what context can we expect ratio estimation to perform “well”? Obviously, the relationship between \(Y\) and \(X\) must at least be linear and pass through the origin, i.e., \[y_i=\beta x_i+\varepsilon_i,\quad i=1,\ldots, n.\] It is generally assumed that the observations \(\{x_i>0\}\) are fixed, and that the error terms \(\{\varepsilon_i\}\) are independent of each other, with \[\text{E}(\varepsilon_i)=0\quad\text{and}\quad \text{V}(\varepsilon_i)=f(x_i)\sigma^2>0.\]

The question becomes: what form must \(f(x_i)\) take so that the least squares solution \(\hat{\beta}\) is exactly the estimator \(r\) of the ratio \(R\)?

If we set \[\begin{aligned} \underbrace{\strut \frac{y_i}{\sqrt{f(x_i)}}}_{{y_i'}}&=\beta\underbrace{\strut\frac{x_i}{\sqrt{f(x_i)}}}_{{x_i'}}+\underbrace{\strut \frac{\varepsilon_i}{\sqrt{f(x_i)}} }_{{\varepsilon_i'}},\quad i=1,\ldots, n, \end{aligned}\] we get \[\text{E}(\varepsilon_i')={\frac{1}{\sqrt{f(x_i)}}\text{E}(\varepsilon_i)=0} \quad \text{and}\quad \text{V}(\varepsilon_i')={\frac{1}{f(x_i)}\text{V}(\varepsilon_i)=\frac{f(x_i)\sigma^2}{f(x_i)}=\sigma^2},\] and the assumptions of the least squares problem are satisfied. The estimator of \(\beta\) is obtained by minimizing \[\begin{aligned} Q(\beta)={\sum_{i=1}^n(\varepsilon_i')^2=\sum_{i=1}^n(y_i'-\beta x_i')^2=\sum_{i=1}^n \frac{1}{f(x_i)}\left(y_i-\beta x_i\right)^2};\end{aligned}\] since \[Q'(\beta)={-2\sum_{i=1}^n\frac{x_i}{f(x_i)}(y_i-\beta x_i)},\] this is equivalent to solving \[\begin{aligned} 0&={\sum_{i=1}^n\frac{x_i}{f(x_i)}(y_i-\hat{\beta} x_i)} \Longleftrightarrow 0={\sum_{i=1}^n\left(\frac{x_iy_i}{f(x_i)}-\hat{\beta}\frac{x_i^2}{f(x_i)}\right)} \Longleftrightarrow \hat{\beta} ={\frac{\displaystyle{\sum_{i=1}^n \frac{x_iy_i}{f(x_i)}}}{\displaystyle{\sum_{i=1}^n \frac{x_i^2}{f(x_i)}}}}.\end{aligned}\]

If \(\frac{x_i}{f(x_i)}=k>0\) for all \(i=1,\ldots,n\), the estimator \(\hat{\beta}\) becomes \[\hat{\beta}={\frac{\displaystyle{k\sum_{i=1}^n y_i}}{k\displaystyle{\sum_{i=1}^n x_i}}=\frac{\displaystyle{\sum_{i=1}^n y_i}}{\displaystyle{\sum_{i=1}^n x_i}}=r}.\]

Thus, when the variance of \(Y\) along the line \(Y=\beta X\) is \[\text{V}(y_i)=\text{V}(\beta x_i+\varepsilon_i)=\text{V}(\varepsilon_i)=x_i\sigma^2\] (i.e., the variance of \(Y\) is proportional to \(X\)), the estimator \(r\) of the ratio \(R\) is exactly the least squares solution, \(\hat{\beta}=r\), and we can expect ratio estimation to produce “good” results.

Of course, one can use the ratio estimation method with a SRS \(\mathcal{Y}\) to obtain an estimate \(\hat{\mu}_{Y;R}\) of \(\mu_Y\) even if \(\text{V}(\varepsilon)\neq x\sigma^2\).

We have already determined the variance of this estimator: \[\begin{aligned} \text{V}(\hat{\mu}_{Y;R})&=\text{V}(r\mu_X)=\mu_X^2\text{V}(r)\approx \frac{1}{n}(\sigma_Y^2+R^2\sigma_X^2-2R\rho \sigma_X\sigma_Y)\Big(\frac{N-n}{N-1}\Big) \\& ={\underbrace{\strut \frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)}_{\text{V}(\overline{y}_{\text{SRS}})}+\frac{R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y}{n}\Big(\frac{N-n}{N-1}\Big)}. \end{aligned}\]

Consequently, \(\text{V}(\overline{y}_{\text{SRS}})\gg \text{V}(\hat{\mu}_{Y;R})\) if and only if \({R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y} \ll 0,\) which is to say if \[\rho \gg {\frac{R\sigma_X}{2\sigma_Y}=\frac{\mu_Y\sigma_X}{2\mu_X\sigma_Y}=\frac{1}{2}\cdot \frac{\text{CV}_X}{\text{CV}_Y}}.\]
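For the 2011 Gapminder population, we can check this condition directly (a sketch, assuming the gapminder.RLD data frame built earlier).

x = gapminder.RLD$lgdppc; y = gapminder.RLD$life_expectancy
rho = cor(x, y)                     # the 1/N vs 1/(N-1) factors cancel out in rho
CV.X = sd(x)/mean(x); CV.Y = sd(y)/mean(y)
rho > CV.X/(2*CV.Y)                 # TRUE if ratio estimation should beat the SRS mean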

Comparison Between SRS and the Regression Method

We have already determined the variance of the estimator \(\hat{\mu}_{Y;L}\) of \(\mu_Y\): \[\begin{aligned} \text{V}(\hat{\mu}_{Y;L})&\approx (1-\rho^2)\frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)={\underbrace{\strut \frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)}_{\text{V}(\overline{y}_{\text{SRS}})}-\rho^2\cdot\frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)} \\&={(1-\rho^2)\text{V}(\overline{y}_{\text{SRS}})}.\end{aligned}\]

Consequently, \(\text{V}(\hat{\mu}_{Y;L})\ll \text{V}(\overline{y}_{\text{SRS}})\) when \({(1-\rho^2)\text{V}(\overline{y}_{\text{SRS}})\ll \text{V}(\overline{y}_{\text{SRS}})}\), which is to say that \[{1-\rho^2\ll 1} \Longleftrightarrow {0\ll |\rho|\leq 1}.\]

Comparison Between SRS and the Difference Method

We have already determined the variance of the estimator \(\hat{\mu}_{Y;D}\) of \(\mu_Y\): \[\begin{aligned} \text{V}(\hat{\mu}_{Y;D})&=\frac{\sigma_Y^2-2\rho\sigma_X\sigma_Y+\sigma^2_X}{n}\Big(\frac{N-n}{N-1}\Big) \\ &= {\underbrace{\strut \frac{\sigma_Y^2}{n}\Big(\frac{N-n}{N-1}\Big)}_{\text{V}(\overline{y}_{\text{SRS}})}+\frac{\sigma^2_X-2\rho\sigma_X\sigma_Y}{n}\Big(\frac{N-n}{N-1}\Big)}.\end{aligned}\]

Consequently, \(\text{V}(\hat{\mu}_{Y;D})\ll \text{V}(\overline{y}_{\text{SRS}})\) when \({\sigma_X^2-2\rho\sigma_X\sigma_Y\ll 0} \Longleftrightarrow {\sigma_X^2\ll 2\sigma_{XY}}.\)

Comparison Between the Ratio, Regression, and Difference Methods

For each of the estimators \(\hat{\mu}_{Y;\alpha}\), \(\alpha\in\{R,L,D\}\), we have shown that the sampling variance takes the (approximate) form \[\begin{aligned} \text{V}(\hat{\mu}_{Y;\alpha})&\approx {\text{V}(\overline{y}_{\text{SRS}})+\frac{A_{\alpha}}{n}\Big(\frac{N-n}{N-1}\Big)},\end{aligned}\] where \[A_{\alpha}=\begin{cases}{R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y}, & \alpha=R \\ {-\rho^2\sigma_Y^2}, & \alpha=L \\ {\sigma_X^2-2\rho\sigma_X\sigma_Y}, & \alpha=D \end{cases}\]

In general, \(\text{V}(\hat{\mu}_{Y;\alpha})\ll \text{V}(\hat{\mu}_{Y;\gamma})\) if and only if \(A_{\alpha}\ll A_{\gamma}\); these are the terms that must be compared to one another.

For instance, \[\begin{aligned} \text{V}(\hat{\mu}_{Y;R})\gg \text{V}(\hat{\mu}_{Y;L}) & \Longleftrightarrow { R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y \gg -\rho^2\sigma_Y^2} \\ & \Longleftrightarrow {R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y +\rho^2\sigma_Y^2\gg 0} \\ & \Longleftrightarrow {(R\sigma_X-\rho\sigma_Y)^2 \gg 0} \Longleftrightarrow {|R\sigma_X-\rho\sigma_Y|\gg 0} \\ & \Longleftrightarrow {R\gg \rho \frac{\sigma_Y}{\sigma_X}=\hat{\beta}} \quad \text{or} \quad {R\ll \hat{\beta}}\end{aligned}\]

All things being equal, the regression estimator is preferable to the ratio estimator (according to their bounds on the error of estimation) when the ratio is quite different from the slope of the regression line.

Similarly, \[\begin{aligned} \text{V}(\hat{\mu}_{Y;D})\gg \text{V}(\hat{\mu}_{Y;L}) & \Longleftrightarrow { \sigma_X^2-2\rho\sigma_X\sigma_Y \gg -\rho^2\sigma_Y^2} \\ & \Longleftrightarrow {\sigma_X^2-2\rho\sigma_X\sigma_Y +\rho^2\sigma_Y^2\gg 0} \\ & \Longleftrightarrow {(\sigma_X-\rho\sigma_Y)^2 \gg 0} \Longleftrightarrow {|\sigma_X-\rho\sigma_Y|\gg 0} \\ & \Longleftrightarrow {1\gg \rho \frac{\sigma_Y}{\sigma_X}=\hat{\beta}} \quad \text{or} \quad {1\ll \hat{\beta}}.\end{aligned}\]

All things being equal, the regression estimator is preferable to the difference estimator (according to their bounds on the error of estimation) when the slope of the regression line takes a value far from \(1\).

But the regression estimator is always at least as good as the other two since the latter two are special cases of regression estimation.

Finally, we can also compare the estimators by the ratio and by the difference: \[\begin{aligned} \text{V}(\hat{\mu}_{Y;R})\gg \text{V}(\hat{\mu}_{Y;D}) & \Longleftrightarrow { R^2\sigma_X^2-2R\rho\sigma_X\sigma_Y \gg \sigma_X^2-2\rho\sigma_X\sigma_Y} \\ & \Longleftrightarrow {|R|\neq 1} \quad\text{and}\quad {\sigma_X^2\gg \frac{2}{R+1}\sigma_{XY}} \end{aligned}\] and \[\begin{aligned} \text{V}(\hat{\mu}_{Y;D})\gg \text{V}(\hat{\mu}_{Y;R}) & \Longleftrightarrow {|R|\neq 1} \quad\text{and}\quad {\sigma_X^2\ll \frac{2}{R+1}\sigma_{XY}}\end{aligned}\] Otherwise, the variances are of relatively similar magnitude.
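For a concrete comparison, the three terms \(A_R\), \(A_L\), and \(A_D\) can be computed directly for the 2011 Gapminder population (a sketch; the smallest value identifies the estimator with the smallest approximate variance for that population).

x = gapminder.RLD$lgdppc; y = gapminder.RLD$life_expectancy
sig.X = sqrt(mean((x - mean(x))^2)); sig.Y = sqrt(mean((y - mean(y))^2))
rho = mean((x - mean(x))*(y - mean(y)))/(sig.X*sig.Y)
R = mean(y)/mean(x)
c(A.R = R^2*sig.X^2 - 2*R*rho*sig.X*sig.Y,
  A.L = -rho^2*sig.Y^2,
  A.D = sig.X^2 - 2*rho*sig.X*sig.Y)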