10.1 Background and Context

When data science became a sought-after career path in the 2010s, most of the focus was on developing algorithms that could find patterns in the massive amounts of data generated by digital services, or by technology able to continuously monitor existing services.

This was a departure from the traditional application of data analysis techniques, which were frequently intended to operate on small datasets in a scientific context. In traditional statistical learning, specifically, primary data collection was (and still is) dominated by surveys (see Survey Sampling Methods for an in-depth discussion of the topic), or by methods somewhat divorced from the actual actions users were taking. For example, after a user logs off a website, they might be given a short survey asking “What was the purpose of your visit?” and “How easy was it to find the information you were looking for/accomplish the desired task(s)?”.149

Such approaches lead to theoretical challenges (mainly related to dealing with small-ish samples), but the goal remains simple: given a series of reasonable assumptions, can we confirm with any confidence the existence of relationships between variables (features) and actions (outcomes)?

Research and isolated applications typically run on disjointed and clunky infrastructure, which is mostly adequate; for repeated, automated, or larger tasks, or for tasks that use datasets generated in real time, however, the one-and-done approach is not recommended.150

In the present applied environment, the volume of available data is much greater: most integrated digital services can record all interactions a user has with a service. For example, a cross-sectional dataset on a user might consist of some of the words they were speaking out loud at home (via a smart speaker like Amazon’s Alexa), a series of Google searches (possibly spread over days), frequent visits to view a product (and derivatives) on different websites, and (eventual) transaction details from purchase(s).

Instead of sampling or only collecting data when a user delivers it, all interactions are recorded. Quite apart from the ethical issues associated with such indiscriminate use of a user’s personal data (see Ethics in the Data Science Context), there are technical difficulties involved with processing and summarizing the data in meaningful ways, as well as with receiving these summaries and making “useful” decisions based on them. The volume of data is also accompanied by higher levels of messiness and less constrained collection environments.

With such data, questions tend to live in reporting (“what happened?”), real-time analytics (“what is happening?”), and predictive modeling (“what is going to happen?”) spaces, as opposed to causal inference (“how did it happen?”).

Many of the challenges facing data scientists and analysts involve putting these large troves of data into formats that can be read by algorithms.

Because of this, it is arguable that a major task of present-day data engineering, as will be discussed further in subsequent sections, is processing an ever-increasing supply of data.

Once that is achieved, data scientists or statisticians use machine learning (ML) methods to develop proofs-of-concept; AI/ML engineers can then translate these proofs-of-concept into deployable models within the context of data pipelines, which is the domain of data engineering more broadly.

Although data and AI/ML engineering have at this point been around for a while (software companies have been processing log files and building data models for about as long as software has existed), with the rise of cloud computing, it could be argued that expertise in these fields is becoming even more sought after than expertise in data science (at least, in some circles).

Organizations with low data maturity use Excel (or similar software) to cobble together software elements that can carry out some typical data pipeline tasks. These ad-hoc pipelines may be good enough for their intended purposes,151 but spreadsheet software breaks down when the data’s scale grows “too” large (and we are not necessarily talking about very “big” datasets).

Organizations with somewhat more data maturity will instead use a combination of SQL queries to a warehouse and R/Python scripts: they aggregate data over the full population to obtain counts and conduct reporting tasks, sample from that population to build proof-of-concept solutions on a local machine, and then often generate predictions manually.
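As a rough illustration, the following Python sketch mimics that kind of workflow, with a local SQLite file standing in for the warehouse; the database file, the events table, and its columns (duration, n_pages, converted) are all hypothetical, invented purely for this example.

```python
import sqlite3

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical warehouse connection: a local SQLite file stands in
# for an actual data warehouse in this sketch.
conn = sqlite3.connect("warehouse.db")

# Reporting task: aggregate counts over the full population.
report = pd.read_sql(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    conn,
)
print(report)

# Proof-of-concept task: pull a random sample to model locally
# (ORDER BY RANDOM() is acceptable in a sketch, but slow at scale).
sample = pd.read_sql(
    "SELECT duration, n_pages, converted FROM events ORDER BY RANDOM() LIMIT 10000",
    conn,
)

X = sample[["duration", "n_pages"]]
y = sample["converted"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# Predictions are then often generated "manually", e.g. exported to a
# file and shared by hand rather than served by an automated pipeline.
pd.DataFrame({"prediction": model.predict(X_test)}).to_csv(
    "predictions.csv", index=False
)
```

The point is not the modelling itself, but the shape of the process: full-population aggregation for reporting, sampling for local experimentation, and manual hand-offs at the end.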

None of these approaches truly leverages the space of possibilities opened up by modern tools and data stacks.

By contrast, data engineering focuses on collecting, storing, and analyzing data at scale.152

For these latter tasks, it is worth the investment in time and tools to develop data engineering components. We discuss some of them below.

In smaller companies, data engineering and data science may be blended into the same role, particularly if the company has a greater need for a data engineer than for a data scientist. Some larger companies have dedicated data engineers on staff, building data pipelines or managing data warehouses (populating them with data and creating table schemas to keep track of the stored data).
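To make that last point a bit more concrete, here is a minimal, hypothetical sketch of such a table schema, reusing the invented events table from the earlier workflow example (a real warehouse would use its own DDL dialect and richer types):

```python
import sqlite3

# Hypothetical schema for user interaction events; the table name,
# column names, and types are invented for illustration only.
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_id    INTEGER PRIMARY KEY,
        user_id     TEXT NOT NULL,
        event_type  TEXT NOT NULL,   -- e.g. 'search', 'view', 'purchase'
        duration    REAL,            -- seconds spent on the interaction
        n_pages     INTEGER,         -- pages visited in the session
        converted   INTEGER,         -- 1 if the session ended in a purchase
        occurred_at TEXT NOT NULL    -- ISO 8601 timestamp
    )
""")
conn.commit()
conn.close()
```

Keeping such schemas documented and versioned is a large part of what “keeping track of the stored data” means in practice.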