7.4 Analytics Workflows

An overriding component of the discussion so far has been the importance of context. And although the reader may be eager at this point to move into data analysis proper, there is one more bit of context that should be considered first – the project context.

We have alluded to the idea that data science is much more than merely data analysis, and this is apparent when we look at the typical steps involved in a data science project. Inevitably, data analysis pieces take place within this larger project context, as well as in the context of a larger technical infrastructure or pre-existing system.

7.4.1 The “Analytical” Method

As with the scientific method, there is a “step-by-step” guide to data analysis:

  1. statement of objective

  2. data collection

  3. data clean-up

  4. data analysis/analytics

  5. dissemination

  6. documentation

Notice that data analysis only makes up a small segment of the entire flow.
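If the process really were linear, it could be pictured as a simple chain of steps. The following minimal Python sketch is purely illustrative – every function is a placeholder standing in for project-specific work, not part of any established library:

```python
# Hypothetical skeleton of the idealized (linear) analytic flow.
# Each function is a stand-in for a project-specific implementation.

def state_objective():              # 1. statement of objective
    return "reduce customer churn by identifying at-risk accounts"

def collect_data(objective):        # 2. data collection
    return [{"id": 1, "tenure": 34, "churned": False}]   # stand-in records

def clean_data(raw):                # 3. data clean-up
    return [r for r in raw if r.get("id") is not None]

def analyze(clean):                 # 4. data analysis/analytics
    return {"n_records": len(clean)}

def disseminate(results):           # 5. dissemination
    print("Findings:", results)

def document(objective, results):   # 6. documentation
    with open("analysis_log.txt", "a") as f:
        f.write(f"{objective} -> {results}\n")

if __name__ == "__main__":
    objective = state_objective()
    raw = collect_data(objective)
    clean = clean_data(raw)
    results = analyze(clean)
    disseminate(results)
    document(objective, results)
```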

In practice, the process often ends up being a bit of a mess, with steps taken out of sequence, steps added in, and repetitions and re-takes (see Figure 7.6).

Figure 7.6: The reality of the analytic workflow – it is definitely not a linear process! [personal file].

And yet… it tends to work on the whole, if conducted correctly.

J. Blitzstein and H. Pfister (who teach a well-rated data science course at Harvard) provide their own workflow diagram, but the similarities are easy to spot (see Figure 7.7).

Figure 7.7: Blitzstein and Pfister’s data science workflow [reference lost].

The Cross-Industry Standard Process for Data Mining (CRISP-DM) is another such framework, with projects consisting of 6 steps:

  1. business understanding

  2. data understanding

  3. data preparation

  4. modeling

  5. evaluation

  6. deployment

The process is iterative and interactive – the dependencies are highlighted in Figure 7.8.

Figure 7.8: Theoretical (on the left) and corrupted (on the right) CRISP-DM processes [134].

In practice, data analysis is often corrupted by:

  1. lack of clarity;

  2. mindless rework;

  3. blind hand-off to IT, and

  4. failure to iterate.

CRISP-DM has a definite old-hat flavour (as exemplified by the use of the outdated expression “data mining”), but it can be useful to check off its sub-components, if only as a sanity check.

Business Understanding
  • understanding the business goal

  • assessing the situation

  • translating the goal into a data analysis objective

  • developing a project plan

Data Understanding
  • considering data requirements

  • collecting and exploring data

Data Preparation
  • selection of appropriate data

  • data integration and formatting

  • data cleaning and processing

Modeling
  • selecting appropriate techniques

  • splitting into training/testing sets

  • exploring alternative methods

  • fine-tuning model settings

Evaluation
  • evaluation of model in a business context

  • model approval

Deployment
  • reporting findings

  • planning the deployment

  • deploying the model

  • distributing and integrating the results

  • developing a maintenance plan

  • reviewing the project

  • planning the next steps

 
All these approaches have a common core: data science projects are iterative and (often) non-sequential.

Helping the clients and/or stakeholders recognize this central truth will make it easier for analysts and consultants to plan the data science process and to obtain actionable insights for organizations and sponsors.

The main take-away from this section, however, is that there is a great deal to consider in advance of modeling and analysis – once more, data science is not solely about data analysis.

7.4.2 Data Collection, Storage, Processing, and Modeling

Data enters the data science pipeline by first being collected. There are various ways to do this:

  • data may be collected in a single pass;

  • it may be collected in batches, or

  • it may be collected continuously.

The mode of entry may have an impact on the subsequent steps, including how frequently models, metrics, and other outputs are updated.
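To make the three modes of entry concrete, the sketch below contrasts a single-pass load, a batched load, and a continuous feed; the file paths, batch sizes, and polling intervals are illustrative assumptions only:

```python
import csv
import time
from itertools import islice

def load_single_pass(path):
    """Read the whole dataset at once (one-shot collection)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_in_batches(path, batch_size=1000):
    """Yield the dataset in fixed-size batches."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            batch = list(islice(reader, batch_size))
            if not batch:
                break
            yield batch

def stream_continuously(poll_fn, interval_s=60):
    """Poll a source indefinitely (continuous collection)."""
    while True:
        for record in poll_fn():
            yield record
        time.sleep(interval_s)
```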

Once it is collected, data must be stored. Choices related to storage (and processing) must reflect:

  • how the data is collected (mode of entry);

  • how much data there is to store and process (small vs. big), and

  • the type of access and processing that will be required (how fast, how much, by whom).

Unfortunately, stored data may go stale, both figuratively (addresses that are no longer accurate, names that have changed, etc.) and literally (physical decay of the data and of the storage medium); regular data audits are recommended.
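A very basic staleness audit can be as simple as flagging records whose last update precedes a freshness threshold; the sketch below assumes a hypothetical schema in which each record carries a last_updated timestamp:

```python
from datetime import datetime, timedelta

def stale_records(records, max_age_days=365, now=None):
    """Return the records whose 'last_updated' field is older than the threshold."""
    now = now or datetime.now()
    cutoff = now - timedelta(days=max_age_days)
    return [r for r in records if r["last_updated"] < cutoff]

# Example usage with toy records
records = [
    {"id": 1, "last_updated": datetime(2015, 3, 1)},
    {"id": 2, "last_updated": datetime.now()},
]
print(stale_records(records))   # only record 1 is flagged for review
```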

The data must be processed before it can be analyzed. This is discussed in detail in Data Preparation, but the key point is that raw data has to be converted into a format that is amenable to analysis (a short illustration follows the list below), by:

  • identifying invalid, unsound, and anomalous entries;

  • dealing with missing values, and

  • transforming the variables and the datasets so that they meet the requirements of the selected algorithms.
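Here is a minimal pandas illustration of these three steps on a made-up table; the column names, the imputation rule (median), and the scaling choice are assumptions made for the sake of the example:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 37, -4, 51, np.nan],         # -4 is an invalid entry
    "income": [48000, 62000, 55000, np.nan, 71000],
})

# 1. identify invalid, unsound, and anomalous entries
df.loc[df["age"] < 0, "age"] = np.nan            # negative ages are impossible

# 2. deal with missing values (here: median imputation)
df = df.fillna(df.median(numeric_only=True))

# 3. transform variables to meet the requirements of the chosen algorithm
#    (e.g., standardize so that each column has mean 0 and unit variance)
df_scaled = (df - df.mean()) / df.std()
print(df_scaled)
```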

In contrast, the analysis step itself is almost anti-climactic – simply run the selected methods/algorithms on the processed data. The specifics of this procedure depend, of course, on the choice of method/algorithm.
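For instance, once the data has been prepared, actually running a chosen method can take only a few lines; the sketch below fits a scikit-learn logistic regression on simulated data purely as a stand-in for whatever method the project calls for:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# stand-in for the processed data
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```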

We will not yet get into the details of how to make that choice124, but data science teams should be familiar with a fair number of techniques and approaches:

  • data cleaning

  • descriptive statistics and correlation

  • probability and inferential statistics

  • regression analysis (linear and other variants)

  • survey sampling

  • Bayesian analysis

  • classification and supervised learning

  • clustering and unsupervised learning

  • anomaly detection and outlier analysis

  • time series analysis and forecasting

  • optimization

  • high-dimensional data analysis

  • stochastic modeling

  • distributed computing

  • etc.

 
These only represent a small slice of the analysis pie. It is difficult to imagine that any one analyst/data scientist could master all of them (or even a majority) at any given moment, but that is one of the reasons why data science is a team activity (more on this in Roles and Responsibilities).

7.4.3 Model Assessment and Life After Analysis

Before applying the findings from a model or an analysis, one must first confirm that the model is reaching valid conclusions about the system of interest.

All analytical processes are, by their very nature, reductive – the raw data is eventually transformed into a small(er) numerical outcome (or summary) by various analytical methods, which we hope is still related to the system of interest (see Conceptual Frameworks for Data Work).

Data science methodologies include an assessment (evaluation, validation) phase. This does not solely provide an analytical sanity check (i.e., are the results analytically compatible with the data?); it can also be used to determine when the system and the data science process have stepped out of alignment. Note that past successes can lead to reluctance to re-assess and re-evaluate a model (the so-called tyranny of past success); even if the analytical approach has been vetted and has given useful answers in the past, it may not always do so.

At what point does one determine that the current data model is out-of-date? At what point does one determine that the current model is no longer useful? How long does it take a model to react to a conceptual shift?125 This is another reason why regular audits are recommended – as long as the analysts remain in the picture, the only obstacle to performance evaluation might be the technical difficulty of conducting said evaluation.
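One crude but serviceable way to operationalize such an audit is to compare the model's performance on fresh labelled data against the score it achieved when it was approved, and to flag it for re-evaluation once the gap exceeds some tolerance. The sketch below is a hypothetical illustration; the tolerance and the scores are invented:

```python
def needs_reassessment(baseline_score, recent_scores, tolerance=0.05):
    """Flag the model for review if recent performance drifts below the
    score recorded when the model was first approved (the baseline)."""
    recent_avg = sum(recent_scores) / len(recent_scores)
    return recent_avg < baseline_score - tolerance

# Example: accuracy at deployment was 0.86; the last four periodic audits
# returned the scores below.
if needs_reassessment(0.86, [0.84, 0.81, 0.79, 0.77]):
    print("Model performance has drifted - schedule a re-evaluation.")
```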

When an analysis or model is ‘released into the wild’ or delivered to the client, it often takes on a life of its own. When it inevitably ceases to be current, there may be little that (former) analysts can do to remedy the situation.

Data analysts and scientists rarely have full (or even partial) control over model dissemination. Consequently, results may be misappropriated, misunderstood, shelved, or left to go stale, all without their knowledge. Can conscientious analysts do anything to prevent this?

Unfortunately, there is no easy answer short of advocating that analysts and consultants not only focus on data analysis, but also recognize the opportunity that arises during a project to educate clients and stakeholders on the importance of these auxiliary concepts.

Finally, because of analytic decay, it is crucial not to view the last step in the analytical process as a static dead end, but rather as an invitation to return to the beginning of the process.

7.4.4 Automated Data Pipelines

In the service delivery context, the data analysis process is typically implemented as an automated data pipeline, so that the analysis can be repeated on a schedule without manual intervention.

Data pipelines usually consist of 9 components (5 stages and 4 transitions between them, as in Figure 7.13); the stages are:

  1. data collection

  2. data storage

  3. data preparation

  4. data analysis

  5. data presentation

Each of these components must be designed and then implemented. Typically, at least one pass of the data analysis process has to be done manually before the implementation is completed.
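As an illustration, a bare-bones, hand-rolled pipeline might chain the five stages as follows; real deployments would typically rely on a scheduler or an orchestration tool, and every function here is a placeholder:

```python
import time

def collect():   return [{"x": 1.0}, {"x": 2.5}]        # 1. data collection
def store(d):    return d                                # 2. data storage (stub)
def prepare(d):  return [r for r in d if r["x"] > 0]     # 3. data preparation
def analyze(d):  return sum(r["x"] for r in d) / len(d)  # 4. data analysis
def present(r):  print(f"mean of x: {r:.2f}")            # 5. data presentation

def run_pipeline():
    present(analyze(prepare(store(collect()))))

if __name__ == "__main__":
    while True:                 # re-run on a fixed schedule (here: daily)
        run_pipeline()
        time.sleep(24 * 60 * 60)
```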

We will return to this topic in Structuring and Organizing Data.

References

[134]
J. Taylor, "Four problems in using CRISP-DM and how to fix them," KDnuggets.com, 2017.