5.1 Background

To call in the statistician after the experiment is done may be no more than asking them to perform a post-mortem examination: at best, they may be able to say what the experiment died of. [R.A. Fisher, Presidential Address to the First Indian Statistical Congress, 1938]

Data analysis tools and techniques work in conjunction with collected data. The type of data that needs to be collected to carry out such analyses, as well as the priority placed on the collection of quality data relative to other demands, will dictate the choice of data collection strategies.

The manner in which the resulting outputs of these analyses are used for decision support will, in turn, influence appropriate data presentation strategies and system functionality, which is an important aspect of the analytical process. Although analysts should always endeavour to work with representative and unbiased data, there will be times when the available data is flawed and not easily repaired.

Analysts have a professional responsibility to explore the data, looking for potentially fatal flaws before the analysis begins, and to inform their clients and stakeholders of any findings that could halt, skew, or simply hinder the analytical process or its applicability to the situation at hand.

Unless a clause has specifically been put in the contract to allow a graceful exit at this point, consultants will have to proceed with the analysis, flaws and all. It is EXTREMELY IMPORTANT that one does not simply sweep these flaws under the carpet. Address them repeatedly in meetings with the clients, and make sure that the analysis results that are presented or reported on include an appropriate caveat.

Formulating the Problem

The objectives drive all other aspects of quantitative analysis. With a question (or questions) in mind, an investigator can start the process that leads to model selection.

With potential models in tow, the next step is to consider:

  • what variates (fields, variables) are needed,

  • the number of observations required to achieve a pre-determined precision, and

  • how to best go about collecting, storing and accessing the data.

Another important aspect of the problem is to determine whether the questions are being asked of the data in and of itself, or whether the data is used as a stand-in for a larger population. In the latter case, there are other technical issues to incorporate into the analysis in order to obtain generalizable results.

Questions do more than just drive the other aspects of data analysis – they also drive the development of quantitative methods. They come in all flavours, and their variability and breadth make attempts to answer them challenging: no single approach can work for all of them, or even for a majority of them. This leads to the discovery of better methods, which are in turn applicable to new situations, and so on.

Not every question is answerable, of course, but a large proportion of them may be answerable partially or completely; quantitative methods can provide insights, estimates, and ranges for possible answers, and they can point the way towards possible implementations of the solutions.

As an illustration, consider the following questions:

  • Is cancer incidence higher for second-hand smokers than it is for smoke-free individuals?

  • Using past fatal collision data and economic indicators, can we predict future fatal collision rates given a specific national unemployment rate?

  • What effect would moving a central office to a new location have on average employee commuting time?

  • Is a clinical agent effective in the treatment against acne?

  • Can we predict when border-crossing traffic is likely to be higher than usual, in order to appropriately schedule staff rotations?

  • Can personalized offers be provided to past clients to increase the likelihood of them becoming repeat customers?

  • Has employee productivity increased since the company introduced mandatory language training?

  • Is there a link between early marijuana use and heavy drug use later in life?

  • How do selfies from around the world differ in everything from mood to mouth gape to head tilt?

The next step nearly always requires obtaining relevant data.

Data Types

Data has attributes and properties. Fields are classified as response, auxiliary, demographic or classification variables; they can be quantitative or qualitative; categorical, ordinal or continuous; text-based or numerical.

Furthermore, data is collected through experiments, interviews, censuses, surveys, sensors, web scraping, etc. Collection methods are not always sophisticated, but new technologies usually improve the process in many ways (while introducing new issues and challenges): modern data collection can occur in one pass, in batches, or continuously.

How does one decide which data collection method to use? The type of question to answer obviously has an effect, as do the required precision, cost and timeliness. Statistics Canada’s Survey Methods and Practices [42] provides a wealth of information on probabilistic sampling and questionnaire design, which remain relevant in this day of big (and real-time) data.

The importance of this step cannot be overstated: without a well-designed plan to collect meaningful data, and without safeguards to identify flaws (and possible fixes) as the data comes in, subsequent steps are likely to prove a waste of time and resources. As an illustration of the potential effect that data collection can have on the final analysis results, contrast the two following “ways” to collect similar data.

The Government of Québec has made public its proposal to negotiate a new agreement with the rest of Canada, based on the equality of nations; this agreement would enable Québec to acquire the exclusive power to make its laws, levy its taxes and establish relations abroad – in other words, sovereignty – and at the same time to maintain with Canada an economic association including a common currency; any change in political status resulting from these negotiations will only be implemented with popular approval through another referendum; on these terms, do you give the Government of Québec the mandate to negotiate the proposed agreement between Québec and Canada? [1980 Québec sovereignty referendum question]

Should Scotland be an independent country? [2014 Scotland independence referendum question]

The end result was the same in both instances (no to independence), but an argument can easily be made that the 2014 Scottish ‘No’ was a much clearer ‘No’ than the Québec ‘No’ of 34 years earlier, in spite of the smaller 2014 victory margin (55.3%-44.7% in the Scotland referendum, as opposed to 59.6%-40.4% in the Québec referendum).

Data Storage and Access

Data storage is also strongly linked with the data collection process, in which decisions need to be made to reflect how the data is being collected (one pass, batch, continuously), the volume of data that is being collected, and the type of access and processing that will be required (how fast, how much, by whom).

Stored data may go stale (e.g., people move, addresses are no longer accurate, etc.), so it may be necessary to implement regular updating collection procedures. Until very recently, the story of data analysis has only been written for small datasets: useful collection techniques yielded data that could, for the most part, be stored on personal computers or on small servers.

The advent of Big Data has introduced new challenges vis-à-vis the collection, capture, access, storage, analysis and visualisation of datasets; some effective solutions have been proposed and implemented, and intriguing new approaches are on the way (such as DNA storage [43], to name but one).

We shall not discuss those challenges in detail in this module, but we urge analysts and consultants alike to be aware of their existence.

5.1.1 Survey Sampling Generalities

The latest survey shows that 3 out of 4 people make up 75% of the world’s population. [David Letterman]

While the World Wide Web does contain troves of data, web scraping (see Module 17) does not address the question of data validity: will the extracted data be useful as an analytical component? Will it suffice to provide the quantitative answers that clients and stakeholders are seeking?

A survey [42] is any activity that collects information about characteristics of interest:

  • in an organized and methodical manner;

  • from some or all units of a population;

  • using well-defined concepts, methods, and procedures, and

  • compiles such information into a meaningful summary form.

A census is a survey where information is collected from all units of a population, whereas a sample survey uses only a fraction of the units.

Sampling Model

When survey sampling is done properly, we may be able to use various statistical methods to make inferences about the target population by sampling a (comparatively) small number of units in the study population.

The relationship between the various populations (target, study, respondent) and samples (sample, intended, achieved) is illustrated in Figure 5.1.

Figure 5.1: Various populations and samples in the sampling model.

  • Target population: population for which we want to obtain information;

  • Study population (survey population): population covered by the survey (it may be different from the target population, but ideally the two are very similar); conclusions drawn from the survey results only apply to the study population;

  • Respondent population: units of the study population that would participate in the survey if they were asked to do so; it may be different from the study population if the respondents are not representative of the study population;

  • Survey frame: provides the means to identify and communicate with the units in the survey population; it takes the form of a list, which is linked to the population under study;

  • Intended sample: subset of the study population targeted by the survey;

  • Achieved sample: subset of the study population whose characteristics were in fact measured.

In general, a survey is preferred to a census if it is expensive/laborious to measure the characteristics of interest for each unit, or if the units are destroyed by measuring the characteristics.

Deciding Factors

In some instances, information about the entire population is required in order to solve the client’s problem, whereas in others it is not necessary. How does one determine which type of survey must be conducted to collect data? The answer depends on multiple factors:

  • the type of question that needs to be answered;

  • the required precision;

  • the cost of surveying a unit;

  • the time required to survey a unit;

  • the size of the population under investigation, and

  • the prevalence of the attributes of interest.

Once a choice has been made, each survey typically follows the same general steps:

  1. statement of objective

  2. selection of survey frame

  3. sampling design

  4. questionnaire design

  5. data collection

  6. data capture and coding

  7. data processing and imputation

  8. estimation

  9. data analysis

  10. dissemination and documentation

The process is not always linear, in that preliminary planning and data collection may guide the implementation (selection of a frame and of a sampling design, questionnaire design), but there is a definite movement from objective to dissemination.

5.1.2 Survey Frames

The frame provides the means of identifying and contacting the units of the study population. It is generally costly to create and to maintain (in fact, there are organisations and companies that specialize in building and/or selling such frames).

Useful frames contain:

  • identification data,

  • contact data,

  • classification data,

  • maintenance data, and

  • linkage data.

The ideal frame must minimize the risk of undercoverage or overcoverage, as well as the number of duplications and misclassifications (although some issues that arise can be fixed at the data processing stage).

Unless the selected frame is relevant (which is to say, it corresponds, and permits accessibility to, the target population), accurate (the information it contains is valid), timely (it is up-to-date), and competitively priced, the statistical sampling approach is contraindicated.

5.1.3 Fundamental Sampling Concepts

In general, a survey is conducted to estimate certain attributes of a population (statistics), such as:

  • a mean;

  • a total, or

  • a proportion.

A population (either target, study, or respondent) has a finite number \(N\) of members, called units or items. The response associated with the \(j-\)th unit of the population is represented by \(u_j\).

Let \(\mathcal{U}=\{u_1,\ldots,u_N\}\) be a population of size \(N<\infty\). If \(u_j\) represents a numerical variable (e.g., the salary of the \(j-\)th unit in the population), the mean, variance, and total of the response in the population are respectively \[\mu={\frac{1}{N}\sum_{j=1}^Nu_j}, \quad \sigma^2={\frac{1}{N}\sum_{j=1}^N(u_j-\mu)^2},\quad \mbox{and}\quad \tau = {\sum_{j=1}^Nu_j=N\mu}.\]

If \(u_j\) represents a binary variable (e.g., \(1\) if the \(j-\)th unit earns more than \(\$70K\) per year, \(0\) otherwise), the proportion of the response in the population is \[p={\frac{1}{N}\sum_{j=1}^Nu_j}.\]
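As a quick illustration, here is a minimal sketch computing these population attributes for a small hypothetical population (the salary values and the \(\$70K\) threshold are invented for the example):

```python
import numpy as np

# Hypothetical population of N = 8 salaries, in $K (values invented for illustration).
u = np.array([52, 63, 71, 45, 88, 95, 60, 74], dtype=float)
N = u.size

mu     = u.mean()                # population mean
sigma2 = np.mean((u - mu) ** 2)  # population variance (denominator N)
tau    = u.sum()                 # population total, equal to N * mu
p      = np.mean(u > 70)         # proportion of units earning more than $70K

print(f"mu = {mu:.2f}, sigma^2 = {sigma2:.2f}, tau = {tau:.1f}, p = {p:.3f}")
```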

We seek to estimate \(\mu\), \(\tau\), \(\sigma^2\) and/or \(p\) using the values of the response variable for the units in the achieved sample \(\mathcal{Y}=\{y_1,\ldots,y_n\}\subseteq \mathcal{U}\).

The relationship between \(\mathcal{Y}\) and \(\mathcal{U}\) is simple: in general, \(n\ll N\) and \(\forall i\in \{1,\ldots, n\}\), \(\exists! j\in \{1,\ldots, N\}\) such that \(y_i=u_j\).

The empirical mean (or empirical proportion, for a binary response), the empirical variance, and the empirical total are respectively \[\overline{y}\ (\text{or } \hat{p})=\frac{1}{n}\sum_{i=1}^ny_i, \quad S^2=\frac{1}{n-1}\sum_{i=1}^n(y_i-\overline{y})^2,\quad\text{and}\quad \hat{\tau}=\frac{N}{n}\sum_{i=1}^ny_i=N\overline{y}.\]
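A matching sketch for the sample-based estimators (the population and the sample size \(n\) are invented; in practice, \(\mathcal{Y}\) would be obtained through an actual sampling design):

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([52, 63, 71, 45, 88, 95, 60, 74], dtype=float)  # hypothetical population (in $K)
N = u.size

n = 4
y = rng.choice(u, size=n, replace=False)  # achieved sample (idealized: full response)

y_bar   = y.mean()             # empirical mean (or proportion, for 0/1 data)
S2      = y.var(ddof=1)        # empirical variance (denominator n - 1)
tau_hat = (N / n) * y.sum()    # empirical total, equal to N * y_bar

print(f"y_bar = {y_bar:.2f}, S^2 = {S2:.2f}, tau_hat = {tau_hat:.1f}")
```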

Let \(X_1,\ldots,X_n\) be random variables, \(b_1,\ldots,b_n\in \mathbb{R}\), and \(\text{E}\), \(\text{V}\), and \(\text{Cov}\) be the expectation, variance and covariance operators, respectively. Recall that \[\begin{aligned} \text{E} \left(\sum_{i=1}^nb_iX_i\right) &=\sum_{i=1}^nb_i\text{E}(X_i) \\ \text{V}\left(\sum_{i=1}^nb_iX_i\right)&=\sum_{i=1}^n b_i^2\text{V}(X_i)+\sum_{1\leq i\neq j}^nb_ib_j\text{Cov}(X_i,X_j) \\ \text{Cov}(X_i,X_j)&=\text{E}(X_iX_j)-\text{E}(X_i)\text{E}(X_j)\\ \text{V}(X_i)&=\text{Cov}(X_i,X_i)=\text{E}\left(X_i^2\right)-\text{E}^2(X_i).\end{aligned}\]

The bias in an error component is the average of that error component if the survey is repeated many times independently under the same conditions. The variability in an error component is the extent to which that component would vary about its average value in the ideal scenario described above.

The mean square error of an error component is a measure of the size of the error component: \[\begin{aligned} \text{MSE}(\hat{\beta})&=\text{E}\left((\hat{\beta}-\beta)^2\right)=\text{E}\left((\hat{\beta}-\text{E}(\hat{\beta})+\text{E}(\hat{\beta})-\beta)^2\right)\\&=\text{V}(\hat{\beta})+\left(\text{E}(\hat{\beta})-\beta\right)^2=\text{V}(\hat{\beta})+\text{Bias}^2(\hat{\beta}), \end{aligned}\] where \(\hat{\beta}\) is an estimate of \(\beta\). Incidentally, the unusual denominator \(n-1\) in the sample variance ensures that it is an unbiased estimator of the population variance.
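The decomposition \(\text{MSE}=\text{V}+\text{Bias}^2\) is easy to verify by simulation; the sketch below (the population, sample size, and the deliberately biased "shrunken mean" are all invented for the illustration) compares the unbiased sample mean with a biased competitor over repeated samples:

```python
import numpy as np

rng = np.random.default_rng(1)
u = rng.normal(50, 10, size=1000)   # hypothetical population
mu = u.mean()                       # true value of the attribute (the "beta" above)

n, reps = 25, 10000
means = np.empty(reps)
for r in range(reps):
    y = rng.choice(u, size=n, replace=False)
    means[r] = y.mean()

for name, est in [("sample mean", means), ("shrunken mean", 0.9 * means)]:
    bias2 = (est.mean() - mu) ** 2          # squared bias over the repetitions
    var = est.var()                         # variability of the estimator
    mse = np.mean((est - mu) ** 2)          # mean square error
    print(f"{name}: bias^2 + variance = {bias2 + var:.4f}, MSE = {mse:.4f}")
```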

Finally, if the estimate is unbiased, then an approximate 95% confidence interval (95% C.I.) for \(\beta\) is given by \[\hat{\beta}\pm 2\sqrt{\hat{\text{V}}(\hat{\beta})},\] where \(\hat{\text{V}}(\hat{\beta})\) is a sampling design-specific estimate of \(\text{V}(\hat{\beta})\).
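For instance, under simple random sampling without replacement (see Section 5.3), a common design-based variance estimate for the sample mean is \(\hat{\text{V}}(\overline{y})=\left(1-\tfrac{n}{N}\right)\tfrac{S^2}{n}\); a minimal sketch, using an invented population:

```python
import numpy as np

rng = np.random.default_rng(2)
u = rng.normal(50, 10, size=1000)         # hypothetical population of size N
N = u.size

n = 50
y = rng.choice(u, size=n, replace=False)  # simple random sample, without replacement

y_bar = y.mean()
S2 = y.var(ddof=1)
V_hat = (1 - n / N) * S2 / n              # design-based variance estimate under SRS

half_width = 2 * np.sqrt(V_hat)
print(f"estimate = {y_bar:.2f}, "
      f"approx. 95% C.I. = ({y_bar - half_width:.2f}, {y_bar + half_width:.2f})")
```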

Survey Error

One of the strengths of statistical sampling is in its ability to provide estimates of various quantities of interest in the target population, and to provide some control over the total error (TE) of the estimates. The TE of an estimate is the amount by which it differs from the true value for the target population: \[\begin{aligned} \text{Total Error} & = \text{Measurement Error} + \text{Sampling Error} + \text{Non-response Error} + \text{Coverage Error}, \end{aligned}\] where the

  • coverage error is due to differences in the study and target populations;

  • nonresponse error is due to differences in the respondent and study populations;

  • sampling error is due to differences in the achieved sample and the respondent population;

  • measurement error is due to the true value in the achieved sample not being assessed correctly;

  • processing error is due to the fact that the real value of the characteristic of interest can be affected by the data transformations performed throughout the analysis.

If we let

  • \(\overline{x}\) be the computed attribute value in the achieved sample;

  • \(\overline{x}_{\mathrm{true}}\) be the true attribute value in the achieved sample under perfect measurement;

  • \(x_{\mathrm{resp}}\) be the attribute value in the respondent population;

  • \(x_{\mathrm{study}}\) be the attribute value in the study population, and

  • \(x_{\mathrm{target}}\) be the attribute value in the target population,

then \[\begin{aligned} \underbrace{\overline{x} - x_{\mathrm{target}}}_{\text{total error (TE)}} = \underbrace{(\overline{x} - \overline{x}_{\mathrm{true}})}_{\text{meas. \& proc. error}} + \underbrace{(\overline{x}_{\mathrm{true}}-x_{\mathrm{resp}})}_{\text{sampling error}} + \underbrace{(x_{\mathrm{resp}}-x_{\mathrm{study}})}_{\text{nonresponse error}} + \underbrace{(x_{\mathrm{study}}-x_{\mathrm{target}})}_{\text{coverage error}}.\end{aligned}\] In an ideal scenario, \(\text{TE}=0\). In practice, there are two main contributions to the total error: sampling errors (which are this module’s main concern) and nonsampling errors, which include every contribution to survey error that is not due to the choice of sampling scheme.
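A toy numerical check of this telescoping identity (all attribute values below are invented solely to illustrate the arithmetic):

```python
# Hypothetical attribute values, invented solely to illustrate the identity.
x_bar, x_true, x_resp, x_study, x_target = 52.0, 51.4, 50.9, 50.2, 50.0

meas_proc   = x_bar  - x_true     # 0.6 : measurement & processing error
sampling    = x_true - x_resp     # 0.5 : sampling error
nonresponse = x_resp - x_study    # 0.7 : nonresponse error
coverage    = x_study - x_target  # 0.2 : coverage error

total_error = x_bar - x_target    # 2.0 : total error, the sum of the four components
assert abs(total_error - (meas_proc + sampling + nonresponse + coverage)) < 1e-12
print(total_error, meas_proc + sampling + nonresponse + coverage)
```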

These nonsampling errors can be controlled, to some extent:

  • coverage error can be minimized by selecting a high quality, up-to-date survey frame;

  • nonresponse error can be minimized by careful choice of the data collection mode and questionnaire design, and by using “call-backs” and “follow-ups”;

  • measurement error can be minimized by careful questionnaire design, pre-testing of the measurement apparatus, and cross-validation of answers.

These suggestions are perhaps less useful than one could hope in modern times: survey frames based on landline telephones are quickly becoming irrelevant in light of an increasingly large and younger population who eschew such phones, for instance, while response rates for surveys that are not mandated by law are surprisingly low. This explains, in part, the impetus towards automated data collection and the use of non-probabilistic sampling methods.

5.1.4 Data Collection Basics

How is data traditionally captured, then? There are paper-based approaches, computer-assisted approaches, and a suite of other modes.

  • Self-administered questionnaires are used when the survey requires detailed information that allows the units to consult personal records (which reduces measurement errors); they are useful for measuring responses to sensitive issues as they provide an extra layer of privacy, and they are typically not as costly as other collection modes, but they tend to be associated with high nonresponse rates since there is less pressure to respond.

  • Interviewer-assisted questionnaires use trained interviewers to increase the response rate and overall quality of the data. Face-to-face personal interviews achieve the highest response rates, but they are costly (both in training and in salaries). Furthermore, the interviewer may be required to visit a selected respondent many times before contact is established. Telephone interviews, on the other hand, produce “reasonable” response rates at a reasonable cost and are safer for the interviewers, but they are limited in length due to respondent phone fatigue. With random dialing, 4-6 minutes of the interviewer’s time is spent on out-of-scope numbers for each completed interview.

  • Computer-assisted interviews combine data collection and data capture, which saves valuable time, but the drawback is that not every sampling unit may have access to a computer/data recorder (although this is becoming less of an issue). All paper-based modes have a computer-assisted equivalent: computer-assisted self-interview (CASI), computer-assisted interview (CAI), computer-assisted telephone interview (CATI), and computer-assisted personal interview (CAPI).

  • Other approaches include unobtrusive direct observation; diaries to be filled (paper or electronic); omnibus surveys; email, Internet (e.g., surveymonkey.com), social media, etc.

5.1.5 Types of Sampling Methods

There exists a large variety of methods to select sampling units from the target population.

Non-Probabilistic Sampling

Those that use subjective, non-random approaches are called Non-Probabilistic Sampling (NPS) methods; these methods tend to be quick, relatively inexpensive and convenient in that a survey frame is not needed.

NPS methods are ideal for exploratory analysis and survey development. Unfortunately, they are sometimes used instead of probabilistic sampling designs, which is problematic: the associated selection bias makes NPS methods unsound when it comes to inferences, as they cannot be used to provide reliable estimates of the sampling error (the only component of the total error \(\text{TE}\) on which the analyst has direct control).

Automated data collection often falls squarely in the NPS camp, for instance. While we can still analyse data collected with an NPS approach, we may not generalize the results to the target population (except in rare, census-like situations).

NPS methods include

  • Haphazard sampling, also known as “person on the street” sampling; it assumes that the population is homogeneous, but the selection remains subject to interviewer biases and the availability of units;

  • Volunteer sampling in which the respondents are self-selected; there is a large selection bias since the silent majority does not usually volunteer; this method is often imposed upon analysts due to ethical considerations; it is also used for focus groups or qualitative testing;

  • Judgement sampling is based on the analysts’ ideas of the target population composition and behaviour (sometimes using a prior study); the units are selected by population experts, but inaccurate preconceptions can introduce large biases in the study;

  • Quota sampling is very common (and is used in exit polling to this day in spite of the infamous “Dewey Defeats Truman” debacle of 1948 [44]); sampling continues until a specific number of units have been selected for various sub-populations; it is preferable to other NPS methods because of inclusion of sub-populations, but it ignores nonresponse bias;

  • Modified sampling starts out using probability sampling (more on this later), but turns to quota sampling in its last stage, in part as a reaction to high nonresponse rates;

  • Snowball sampling asks sampled units to recruit other units among their acquaintances; this NPS approach may help locate hidden populations, but it is biased in favour of units with larger social circles and units that are charming enough to convince their acquaintances to participate.

There are contexts where NPS methods might fit a client’s need (and that remains their decision to make, ultimately), but the analyst MUST still inform the client of the drawbacks, and present some probabilistic alternatives.

Probabilistic Sampling

The inability to make sound inferences in NPS contexts is a monumental strike against their use. While probabilistic sample designs are usually more difficult and expensive to set up (due to the need for a quality survey frame), and take longer to complete, they provide reliable estimates for the attribute of interest and the sampling error, paving the way for small samples to be used to draw inferences about larger target populations (in theory, at least; the non-sampling error components can still affect results and generalisation).

In this module, we take a deeper look at the traditional probability sample designs:

  • simple random sampling (SRS), see Section 5.3;

  • stratified random sampling (StS), see Section 5.4;

  • systematic random sampling (SyS), see Section 5.7.1;

  • cluster random sampling (ClS), see Section 5.6;

  • sampling with probability proportional to size (PPS), see Section 5.7.2, and

  • more advanced designs, see Section 5.7.

Figure 5.2: Schematics of various sampling designs (from left to right, top to bottom): simple random sampling, stratified sampling, systematic sampling, cluster sampling, multi-stage sampling, multi-phase sampling.
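As a rough preview of the first two designs (a minimal sketch only; the frame, strata, and allocation are invented, and the estimation details are left to Sections 5.3 and 5.4):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical survey frame: 1000 units spread over three strata (e.g., regions).
N = 1000
frame = np.arange(N)
strata = rng.choice(["A", "B", "C"], size=N, p=[0.5, 0.3, 0.2])

# Simple random sampling (SRS): every subset of n units is equally likely.
n = 100
srs_sample = rng.choice(frame, size=n, replace=False)

# Stratified random sampling (StS): an independent SRS within each stratum,
# here with (approximately) proportional allocation.
sts_sample = np.concatenate([
    rng.choice(frame[strata == h], size=int(round(n * np.mean(strata == h))), replace=False)
    for h in ["A", "B", "C"]
])

print(len(srs_sample), len(sts_sample))
```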

In this module, the analysis is made easier by assuming that the sampling error dominates the survey error, i.e., that

  • the study population is representative of the target population \((x_{\mathrm{study}} \approx x_{\mathrm{target}})\);

  • the respondent population and the study population coincide, as do the achieved sample and the intended sample \((x_{\mathrm{resp}}\approx x_{\mathrm{study}})\), and

  • the response is measured without error in the achieved sample \((\overline{x} \approx \overline{x}_{\mathrm{true}})\).

Our objective is thus to control and evaluate the sampling error \((\overline{x}_{\mathrm{true}} - x_{\mathrm{resp}})\) for various random sampling designs.

References

[42]
Survey Methods and Practices, Catalogue no. 12-587-X. Statistics Canada.
[43]
[44]