## 5.8 Exercises

1. You are tasked with estimating the annual salary of data scientists in Canada. Determine the:

• populations (target, study, respondent);

• sampling frames;

• samples (target, achieved);

• information about units (units, response variable, attributes);

• sources of error (coverage, nonresponse, sampling, measurement and processing) and variability (sampling, measurement).

2. We seek to estimate the average daily distance travelled by Ontario cars, as well as their daily fuel consumption. Discuss various approaches to be used. What are some of the issues and challenges that could be encountered?

3. We seek estimates of the average daily distance travelled, the average daily fuel consumption, and the proportion of vehicles not in use in Ontario in Winter 2012. An SRS is selected from the Ontario fleet (size $N=7,868,359$); the responses are collected in the file Autos_SRS.xlsx. Discuss issues that may affect the quality of the data. Provide a numerical and visual summary of the data for the sample. Give an approximate 95% C.I. for each population mean sought, along with the corresponding coefficient of variation.

4. We seek estimates of the average daily distance travelled, the average daily fuel consumption, and the proportion of vehicles not in use in Ontario in Winter 2012. An StS is selected from the Ontario fleet (size $N=7,868,359$), with information concerning vehicle type and age; the responses are collected in the file Autos_StS.xlsx. Discuss issues that may affect the quality of the data. Provide a numerical and visual summary of the data for the sample. Give an approximate 95% C.I. for each population mean sought, along with the corresponding coefficient of variation. Conduct the same exercise for each stratum.

5. We seek an estimate of the average daily distance travelled in Winter 2012 in Ontario. An SRS is selected from the Ontario fleet (size $N=7,868,359$). The responses, as well as the corresponding daily fuel consumption, are collected in the file Autos_RLD.xlsx. Give an approximate 95% C.I. for the characteristic of interest using ratio, regression, and difference estimation.

6. Could cluster sampling be used to provide estimates of average daily distance travelled, average daily fuel consumption, and proportion of vehicles not in use in Winter 2012 in Ontario? Treat the vehicle type and age information found in Autos_StS.xlsx as cluster information.

7. Repeat the previous exercise using multi-phase and multi-stage sampling.

8. Draw $$m=1000$$ SRS samples of size $$n$$ from the $$N=183$$ countries (excluding China and India) in the 2011 Gapminder dataset to estimate the average population by country $$\mu$$. For $$n=30,60,90,120$$, what proportion of the $$m$$ samples yield an approximate 95% C.I. containing $$\mu$$? Assume that $$\sigma^2$$ is not known.

9. Find an approximate 95% C.I. for the average life expectancy $$\mu$$ of the $$N=185$$ countries in the 2011 Gapminder dataset using an SRS of size $$n=20$$. Is the true average life expectancy in your confidence interval? Repeat this task $$m=1000$$ times, with different SRS samples. What proportion of the $$m$$ samples yield an approximate 95% C.I. containing $$\mu$$? Assume that $$\sigma^2$$ is not known. Compare with the results of the previous exercise. How do you explain the discrepancy?

10. Find an approximate 95% C.I. for the proportion $$p$$ of countries whose life expectancy fell below 60 years in the 2011 Gapminder dataset ($$N=185$$), using an SRS of size $$n=20$$. Is the true proportion in the confidence interval? Repeat this task $$m=1000$$ times, with different SRS samples. What proportion of the $$m$$ samples yield an approximate 95% C.I. containing the true $$p$$? Assume that $$\sigma^2$$ is not known. Compare with the results of exercises 8 and 9.

11. Find an approximate 95% C.I. for the total population of the planet in the 2011 Gapminder dataset ($$N=185$$), using an StS of size $$n=20$$. What variable will you use to stratify the data? Repeat this task $$m=1000$$ times, with different StS samples. What proportion of the $$m$$ samples yield an approximate 95% C.I. containing the true total $$\tau$$? Is the distribution of the obtained totals (approximately) normal? How do you explain the shape of this distribution?

12. Find an approximate 95% C.I. for the proportion $$p$$ of countries whose life expectancy fell below 60 years in the 2011 Gapminder dataset ($$N=185$$), using an StS of size $$n=20$$. What variable will you use to stratify the data? Is the true proportion in the confidence interval? Repeat this task $$m=1000$$ times, with different StS samples. What proportion of the $$m$$ samples yield an approximate 95% C.I. containing the true $$p$$? Compare with the results of exercise 10.

13. Consider a sample $$\mathcal{Y}=\{(x_1,y_1),\ldots,(x_n,y_n)\}$$ drawn from a population of size $$N=37,444$$. In a preceding study, we have found that $$\sigma_{W;L}^2\approx 188.2$$. Find the minimal $$n$$ which ensures that the bound on the error of (regression) estimation of the mean $$\mu_Y$$ is at most $$5$$. Do the same for the total $$\tau_Y$$ and a bound of at most $$250$$.

14. Find a 95% C.I. for the proportion of countries in the 2011 Gapminder dataset ($$N=185$$) whose life expectancy is above 75 years, using a ClS with $$m=8$$, assuming that the countries are grouped into $$M=22$$ clusters determined by geographic regions. Assume further that the average cluster size is known to be $$\overline{N}=8.41$$.

15. Consider a ClS $$\mathcal{Y}$$ consisting of $$m$$ clusters drawn from a population $$\mathcal{U}$$ of size $$N$$, distributed in $$M$$ clusters. Let $$\mu$$ be the mean and $$\sigma^2$$ the variance of the population $$\mathcal{U}$$. If the clusters are all of size $$n$$, show that $\text{V}(\overline{y}_G)\approx {\frac{\sigma^2-\overline{\sigma^2}}{m}\Big(1-\frac{m}{M}\Big)},\quad \text{where }\overline{\sigma^2}={\frac{1}{M}\sum_{\ell=1}^M\sigma_{\ell}^2}$ and $$\sigma_{\ell}^2$$ is the variance in the $$\ell$$th cluster.
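
The repeated-sampling experiments in Exercises 8 to 10 share one skeleton: draw an SRS without replacement, build an approximate 95% C.I., check whether it contains the true value, and repeat $$m$$ times. The Python sketch below is one possible starting point, not a prescribed solution. It assumes the interval takes the form $$\overline{y}\pm 2\sqrt{\hat{\text{V}}(\overline{y})}$$ with $$\hat{\text{V}}(\overline{y})=\frac{s^2}{n}\big(1-\frac{n}{N}\big)$$, and it runs on a synthetic, right-skewed population of 183 values standing in for the Gapminder data, which is not reproduced here. The function names `srs_mean_ci` and `coverage` are hypothetical.

```python
import math
import random
import statistics


def srs_mean_ci(sample, N):
    """Approximate 95% C.I. for a population mean from an SRS without replacement.

    Assumed form: y_bar +/- 2 * sqrt(s^2 / n * (1 - n / N)).
    """
    n = len(sample)
    y_bar = statistics.fmean(sample)
    s2 = statistics.variance(sample)   # sample variance, since sigma^2 is not known
    v_hat = (s2 / n) * (1 - n / N)     # estimated sampling variance with finite population correction
    bound = 2 * math.sqrt(v_hat)       # bound on the error of estimation
    return y_bar - bound, y_bar + bound


def coverage(population, n, m=1000, seed=0):
    """Proportion of m SRS-based intervals that contain the true population mean."""
    rng = random.Random(seed)
    N = len(population)
    mu = statistics.fmean(population)
    hits = 0
    for _ in range(m):
        sample = rng.sample(population, n)   # sampling without replacement
        lower, upper = srs_mean_ci(sample, N)
        hits += lower <= mu <= upper
    return hits / m


if __name__ == "__main__":
    # Synthetic stand-in population (NOT the Gapminder data): 183 log-normal values.
    rng = random.Random(2011)
    skewed = [rng.lognormvariate(15, 1.5) for _ in range(183)]
    for n in (30, 60, 90, 120):
        print(n, coverage(skewed, n))
```

To work the exercises themselves, replace `skewed` with the relevant Gapminder column. For a proportion (Exercise 10), the same functions can be applied to a list of 0/1 indicators, although the variance estimator used in the course may differ slightly from the one assumed here.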