10.2 Data Engineering

Data engineering can be viewed as an offshoot of computer engineering with a focus on designing, implementing, and maintaining computer systems that are created to collect, ingest, process, analyse, and provide or present data. Due to data engineering’s origin in computer engineering, it may be helpful to understand some key elements of computer engineering before getting into the details of data engineering.

Fundamentally, computers consist of two elements:

  • memory, which can be thought of as labelled boxes storing ones and zeros, and

  • circuits that treat the contents of memory as inputs, and then generate new outputs, which subsequently can also get stored in memory.

The collection of circuits available in a computer processor can be viewed as an instruction set. The instructions in this set are carried out in a particular order when a computer program is run.

In this context, data are specific patterns stored in memory. These patterns can be:

  • copied to new locations in memory,

  • moved by copying and then deleting the original, and

  • transformed, by using the patterns as inputs to a series of instructions that in turn produce new patterns, also stored in memory.

It is worth noting that computer programs themselves also exist as data stored in memory. They are loaded into the processor and turned into a set of more basic instructions that are hard-coded into this computer processor. These basic instructions then carry out the actions described in the program.

Computer software engineering, also known as software engineering, focuses on the software aspect of computers rather than the hardware. IEEE defines it as: “The systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software” [192].

Fundamentally, the goal of computer engineering is to create programs that manipulate patterns of 0s and 1s (move them, copy them, transform them) by means of appropriate sets of instructions.

Data Engineering and IT

Where does information technology (IT) fit into this picture? The term information technology typically refers to technology that focuses on the manipulation of data and information. The scope of IT is broader than just computer systems – for example, communications systems can also be viewed as a type of IT, as can television. In this sense, computer engineering and data engineering could be viewed as subfields of IT.

However, in day-to-day usage, IT tends to refer to the use of existing software and hardware to manage data and information. IT professionals typically work to assemble these existing technologies into larger systems that provide particular information processing functionality.

Consequently, data engineers have some overlap with software engineers and IT professionals: they may be responsible for creating customized software applications and designing customized architectures to work with particular types of data, but they will also do this in the context of using pre-built applications to create data pipeline infrastructure.

Given the prevalence of data in today’s world, data engineering is a broad field with applications in just about every industry. Organizations can now collect massive amounts of data, and need to invest in the right people and technology to ensure it is in a highly usable state by the time it reaches data scientists and analysts.

Data Team Roles

As part of the data team, data engineers allow an organization to efficiently and effectively collect data from various sources; database analysts then manage the collected data and make it available for analysis and inclusion in data solutions. Let us now look at some data team roles in more depth (contrast with Roles and Responsibilities):

Data Engineers
Receive data from the source (such as paper tax forms manually entered into a database, or real-time data from online tax software that also streams into a database), and then structure, distribute, and store the data in data lakes and warehouses (SQL for storage, etc.). They create tools and data models that data scientists can use to query the data.
Data Scientists
Receive data procured and provided by data engineers, extract value from the data, build proof-of-concept predictive models, measure and improve results, and build data models. Data scientists typically work with languages such as Python or R, and inside analytics notebooks such as those provided by RMarkdown (in which this book is written) or Jupyter. The notebooks run against a cluster to translate queries into big data platform-specific engines (Apache Spark, etc.).
ML Engineers
Apply and deploy data models, bridge gaps between data engineers and data scientists, take proof-of-concept ideas to a large scale. They create feedback loops so data scientists can view aggregate performance and adjust proof-of-concept solutions when there are errors or issues.

10.2.1 Data Pipelines

The day-to-day work of data engineers centres around data pipelines, which carry out chains or sequences of common data manipulation tasks, preferably in an automated fashion. In this respect, the work of data engineers has interesting similarities to the work of chemical systems and process engineers, but with data transformation being the focus instead of chemical transformation.

Common data pipeline tasks include:

  1. acquiring datasets that align with business needs;

  2. developing algorithms to transform data into useful, actionable insight;

  3. building, testing, and maintaining database pipeline architectures;

  4. collaborating with management to better understand company objectives;

  5. creating new data validation methods and data analysis tools, and

  6. ensuring compliance with data governance and security policies.

A working data pipeline consists of a set of interfaces and mechanisms that support the flow and access of information. Data engineers set up and operate this data infrastructure, using it to prepare data for analysis by data analysts and scientists, as well as to serve the results of this data transformation and analysis to the data consumers waiting at the end of the pipeline.

As we have seen, data can arise from many sources (and types of sources), and in a variety of formats and sizes. Setting up the process that transforms this mass of raw data into something data scientists can use and derive meaning from is known as building a data pipeline.

Roughly speaking, pipelines require the following phases:

  1. ingestion (collection): gathering data (from multiple sources);

  2. processing (preparation): cleaning and moving the data to an appropriate data store;

  3. storage: storing the data in an accessible location and developing a data model (or a set of logic and documentation to access the data),

  4. access: enabling a tool or user to access cleaned data, for analysis and presentation.

The number of steps and their order can change from one framework to another (see Automated Data Pipelines for a slightly different treatment and Figure 10.1 for an illustration, for instance), as long as they are consistent within a program.


Figure 10.1: Example of a conceptual data pipeline, with 9 components: the boxes and the transitions between boxes.
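
To make these four phases concrete, here is a minimal Python sketch that chains them into a single pipeline. This is an illustrative skeleton only: the raw_data.csv source, the observations table, and the cleaning rules are hypothetical placeholders.

```python
# A minimal, illustrative pipeline skeleton (ingestion -> processing -> storage -> access).
import sqlite3
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    """Collection: gather raw data from a source (here, a CSV file)."""
    return pd.read_csv(path)

def process(raw: pd.DataFrame) -> pd.DataFrame:
    """Preparation: basic cleaning before the data reaches the store."""
    clean = raw.drop_duplicates().dropna(how="all")
    clean.columns = [c.strip().lower() for c in clean.columns]
    return clean

def store(clean: pd.DataFrame, db: str = "pipeline.db") -> None:
    """Storage: persist the cleaned data in an accessible location."""
    with sqlite3.connect(db) as conn:
        clean.to_sql("observations", conn, if_exists="replace", index=False)

def access(query: str, db: str = "pipeline.db") -> pd.DataFrame:
    """Access: let a tool or user query the cleaned data."""
    with sqlite3.connect(db) as conn:
        return pd.read_sql_query(query, conn)

if __name__ == "__main__":
    # Create a tiny demo source file so the sketch runs end to end.
    pd.DataFrame({" Meter_ID ": ["A1", "A2", "A2"],
                  "Reading": [12.3, 9.8, 9.8]}).to_csv("raw_data.csv", index=False)
    store(process(ingest("raw_data.csv")))
    print(access("SELECT COUNT(*) AS n FROM observations"))
```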

The main data engineering challenge is building a pipeline that can run in (close to) real-time whenever it is requested, so that users get up-to-date information from the source with minimal delays.153

Data engineers may begin by designing and building a working proof-of-concept pipeline. Once this has been tested, a more robust and substantial pipeline may be designed and the design passed on to engineers for deployment to production.

Some of the work surrounding this process includes:

  • data quality checks;

  • optimizing query performance;

  • creating a continuous integration continuous delivery (CI/CD) ecosystem around changes to the model;

  • ingesting, aggregating, and storing data from various sources, structured according to a pre-defined data model, and

  • bringing machine learning and data science techniques to distributed systems.

Let us start with a short motivating use case. The Canada Revenue Agency (CRA) wants to identify how many individuals in a certain region are not filing taxes and consequently missing out on net positive benefits,154 and identify which of them likely missed the filing date due to lack of awareness (as opposed to negligence).

The data pipeline components might include processes that:

  • ingest, aggregate, and store data from third party reporting (and possibly other third party datasets) that highlight how many individuals should be filing taxes in the region;

  • determine how many people filed taxes in that region historically;

  • perform predictive modeling to compare the characteristics of known non-filers to determine the likelihood of missing filing deadlines in good faith;

  • display the results via a dashboard.

Data Pipeline Connections

In our data pipeline framework, the connections between data pipeline components are used to:

  • move from the collection technology (e.g., sensors, surveys, or other user interactions with a product) to an efficient storage space for tools that need to query the data;

  • go from storage to preparation stages where data is transformed;

  • transport transformed data to an analytics, analysis, or modeling step, or

  • use the output of the modeling step as input in a presentation.

Common building challenges include:

  • moving data from the source into a data lake can be time-consuming, especially when repetitive data ingestion tasks are hand-coded;

  • data platforms are always changing, so data engineers expend a lot of resources building, maintaining, and rebuilding complex infrastructure in a never-ending cycle,

  • with increasing demand for real-time data, low latency pipelines155 are required, for which it is more challenging to establish Service Level Agreements (SLA);156 SLA improvements require constant performance checks and tuning of the pipeline and architecture.

Without proper planning, these issues can quickly get out of hand.

Data Pipeline Operations

As can be seen from the above descriptions of data pipeline components and processes, a data pipeline is effectively an automated chain of operations performed on data. The chain can be as simple as moving data from location A to location B, or more complicated, such as a workflow that aggregates data from multiple sources, sends it to a data warehouse, conducts a clustering analysis, and presents the results in a dashboard.

Some common elements (operations/tasks/sources) for each of the steps include:

  • data sources: applications, mobile apps, microservices, Internet of Things (IoT) devices, websites, instrumentation, logging, sensors, external data, user generated content, etc.;

  • data integration: ETL, stream data integration, etc.;

  • data store: Master Data Management (MDM), warehouse, data lake, etc.;

  • data analysis: machine learning, predictive analytics, A/B testing, experiments, artificial intelligence (AI), deep learning, etc.,

  • delivery and presentations: dashboards, reports, microservices, push notifications, email, SMS, etc.

In addition, pipelines allow users to split a large task into a series of smaller sequential steps, which can help optimize each step.157 For instance, it could be that using a different language or framework for one of the pipeline steps would be advantageous (using TensorFlow for the analysis component of a deep learning pipeline, for instance); if the pipeline consists of a single large script, then everything from data collection to presentation has to be done with TensorFlow, even if some of the other components (ETL, say) are not optimized with that framework.

A better approach, which is implemented in most data pipeline tools, is to select the best framework or language for each pipeline component/task.


Figure 10.2: A data visualization pipeline, with component options.

ETL Framework

When designing data pipelines, it’s useful to be aware of the ETL framework: Extracting, Transforming, and Loading the data. The concept of ETL (and, similarly, ELT: Extract, Load, Transform) predates modern data pipelines; it originated as a framework in the context of data marts and data warehouses. Nonetheless, it remains a crucial aspect of the basic operations involved in current data pipelines.

Data always has to be extracted from a source (or temporary storage) in some manner. Systems that need raw data extracted from multiple sources also typically need a loading step after extraction, so that various processes and systems can process data from the same extraction.

When joining data from a variety of systems and sources, it is beneficial to co-locate the data and store everything in a single site before transforming it. Sometimes transformed data is loaded again into another location for consumption (such as in a data warehouse). In contrast, transformations may also be conducted on each independent data source before loading the results into a file system.

Once the data has been collected from relevant up-stream systems, a data engineer can determine how to optimally join the datasets. This is done by building data pipeline elements that allow data to flow out of the source systems, the results of which are stored in a separate, highly available format for various business intelligence tools to query.
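
As a rough illustration of this extract/transform/load flow, the sketch below co-locates two small source extracts, joins them on a shared key, and loads the result into a single queryable store. The source systems, the taxpayer_id and amount columns, and the filer_payments target table are all made up for the example.

```python
# A hedged ETL sketch: two hypothetical source extracts are co-located,
# joined, and loaded into a single store for downstream BI tools to query.
import sqlite3
import pandas as pd

def extract() -> dict[str, pd.DataFrame]:
    """Extract step: in practice these would be reads from upstream systems
    (pd.read_csv, pd.read_sql, an API call, etc.); here they are tiny stand-ins."""
    filers = pd.DataFrame({"taxpayer_id": [1, 2, 3], "region": ["ON", "QC", "ON"]})
    payments = pd.DataFrame({"taxpayer_id": [1, 3], "amount": [250.0, 90.0]})
    return {"filers": filers, "payments": payments}

def transform(sources: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Join the co-located extracts on a shared key and derive a simple flag."""
    joined = sources["filers"].merge(sources["payments"], on="taxpayer_id", how="left")
    joined["has_payment"] = joined["amount"].notna()
    return joined

def load(result: pd.DataFrame, db: str = "warehouse.db") -> None:
    """Load the transformed result into a queryable target."""
    with sqlite3.connect(db) as conn:
        result.to_sql("filer_payments", conn, if_exists="replace", index=False)

load(transform(extract()))
```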

Data engineers are also responsible for ensuring these data pipelines have correct inputs and outputs. This frequently involves data reconciliation or additional data pipelines to validate against source systems. Engineers also have to ensure that data pipelines flow continuously and keep information up to date, utilizing various monitoring tools and site reliability engineering (SRE) practices.

Most user-centric products generate data from multiple sources, including at times multiple systems, and related third party integrations or vendors. The right analysis and data must be available to end-users, perform to their requirements, and be integral (i.e., accurate and consistent).

This is problematic, from a performance point of view, as data for each batch or stream is required from dependent or related systems. On each pipeline execution, the system must be re-queried, which increases the number of load steps and adds to the pipeline run time, a fair amount of which is spent waiting for data to become available. While this moves the user one step further from the raw data, the ETL framework can also simplify subsequent data pipelines.

Business units frequently want up-to-date information as soon as possible; without a good idea of the moving pieces and stops along the data journey, data engineers cannot provide reasonable estimates to (and manage the expectations of) their clients regarding data pipeline performance. In particular, data engineers must take into account how frequently new data is received, the run-time of the transformation steps, and how long it takes to update the data target destination.
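
As a back-of-the-envelope illustration of such an estimate, the freshness of data at the destination can be bounded by summing those moving pieces; the numbers below are invented for illustration.

```python
# Rough freshness estimate for a batch pipeline (illustrative numbers only).
ingestion_interval_min = 60   # new source data arrives hourly
transform_runtime_min  = 25   # measured run time of the transformation steps
load_time_min          = 5    # time needed to update the target destination

# In the worst case, a record waits a full ingestion interval before being picked up.
worst_case_freshness = ingestion_interval_min + transform_runtime_min + load_time_min
print(f"Worst-case data age at destination: {worst_case_freshness} minutes")
```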

Data Architecture

Best outcomes occur when there is a common understanding of how data is organized on the platform and how data moves about the platform.

Data architecture defines both, among other things, and it is an important piece of documentation that provides direction for so much of the data platform. A best-in-class data architecture would include at least the following for each data repository on the platform:

  • storage layout: a view of how and where data is stored – this would include standards applied to file paths and file type details for file/object storage, or database, schema, view, and table naming standards for database storage;

  • data landscape: a view of how data is categorized within a repository;

  • data abstractions: a detailed breakdown of the individual components of any data abstractions that are a part of the platform (including diagrams) – these are key to automation as they are the blueprints for building the ‘things’ that are a part of the data platform (user workspaces, team workspaces, data products, etc.) using the constructs provided by the individual repositories (databases, schemas, tables, views, roles, privileges, etc.),

  • data access: a view of the authorization setup for the data repository, including how roles map to users, the setup of the role hierarchy, and how privileges are assigned to roles.

Additionally, the data architecture should detail how data moves about the different data repositories within the data platform. This provides answers to the questions “What are the allowed sources and destinations of data?” and “What tooling is used to perform a movement?”.

Data Governance Considerations and Self-Serviceability

Data governance158 might also play a role. Impactful data platforms exist to accelerate innovation, which is achieved by liberating data to make it accessible to the users that need it for strategic business purposes. But it is not safe to assume that all users should be able to access all data on the platform, because different datasets will have different levels of sensitivity, derived from the nature of the data.

For example, sales and HR typically have sensitive datasets to which a company typically wants to restrict access. Furthermore, customer and health data is often protected by compliance requirements, restricting how the data might be used and who is allowed to access it.

For end-users hoping to access a dataset on the data platform, means must exist to request access to that dataset, and workflows must exist through which the request can be scrutinized by interested parties before being fulfilled. This is a self-serviceability use case and one that is central to liberating data on any platform.

Another primary self-serviceability situation to consider is the ability to request the creation of objects or structures within the data warehouse. As new use cases for the platform come up, new “workspaces” can enable data engineering teams to build new data transformations and datasets to support those use cases. Beyond these examples, there are many situations where self-serviceability can help the autonomy of the data platform.

10.2.2 Automatic Deployment and Operations

Automating data pipelines can be as straightforward as implementing processes that streamline moving data from one location to another, or as complex as creating automated processes that aggregate data from multiple sources, transform it, and store it in multiple destinations.

It is now quite feasible to automatically ingest petabytes of data with constantly changing schemas. This allows pipelines to deliver data for analytics, data science, and machine learning in a fast, reliable, scalable, and automated manner.

Automatic operations include:

  • incrementally processing data as it arrives from files or streaming sources (Kafka, DBMS, NoSQL, etc.), as sketched after this list;

  • auto-inferring schema/column changes for structured/unstructured data formats;

  • auto-tracking data as it arrives,

  • auto-backup/rescue to avoid loss, etc.
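
A minimal sketch of the first of these operations, incremental processing of newly arrived files, might look as follows; the landing_zone directory and the ledger file used to track processed files are hypothetical.

```python
# A minimal sketch of incremental ingestion: only files that have not been
# seen on a previous run are processed; directory and ledger names are made up.
import json
from pathlib import Path

LEDGER = Path("processed_files.json")
DATA_DIR = Path("landing_zone")

def load_ledger() -> set[str]:
    return set(json.loads(LEDGER.read_text())) if LEDGER.exists() else set()

def save_ledger(seen: set[str]) -> None:
    LEDGER.write_text(json.dumps(sorted(seen)))

def process_new_files() -> None:
    DATA_DIR.mkdir(exist_ok=True)         # ensure the landing zone exists
    seen = load_ledger()
    for path in sorted(DATA_DIR.glob("*.csv")):
        if path.name in seen:
            continue                       # already ingested on a previous run
        print(f"processing {path.name}")   # placeholder for the real processing step
        seen.add(path.name)
    save_ledger(seen)

process_new_files()
```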

As mentioned in the previous section, ETL provides the necessary decision points to build data pipelines. With modern tools, data engineers can reduce development time and effort and instead focus on implementing business logic and data quality checks within the data pipeline using SQL, Python, R, etc. This can be achieved by:

  • using intent-driven declarative development to define “what” to solve and simplify “how” to do it;

  • automatically creating high-quality lineage and managing table dependencies across the data pipeline, and

  • automatically checking for missing dependencies or syntax errors, and managing data pipeline recovery.

We can also improve data reliability by:

  • defining data quality and integrity controls within the pipeline via data expectations (a sketch follows this list);

  • addressing data quality errors with predefined policies (fail, drop, alert, quarantine, etc.),

  • leveraging the data quality metrics that are captured, tracked and reported for the entire data pipeline.
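
As a tool-agnostic sketch of such expectations and policies, the following snippet applies two illustrative rules, quarantines the rows that fail them, and captures simple quality metrics that could be tracked and reported; the column names and rules are made up.

```python
# A hedged sketch of in-pipeline data quality expectations with a "quarantine" policy.
import pandas as pd

df = pd.DataFrame({
    "meter_id": ["A1", "A2", None, "A4"],
    "reading":  [12.3, -5.0, 7.1, 9.8],
})

expectations = {
    "meter_id is present": df["meter_id"].notna(),
    "reading is non-negative": df["reading"] >= 0,
}

passed = pd.concat(expectations, axis=1).all(axis=1)
good, quarantined = df[passed], df[~passed]

# Capture simple quality metrics for monitoring and reporting.
metrics = {name: int((~mask).sum()) for name, mask in expectations.items()}
print(f"rows kept: {len(good)}, rows quarantined: {len(quarantined)}")
print("violations by expectation:", metrics)
```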

Automated Pipeline Deployment

The older approach to software deployment frequently resulted in:

  • running a build;

  • copying and pasting the result onto a production server, and

  • performing a manual “test” to see if the application was working as expected.

The problem is that such an approach does not scale, and the manual components introduce risk. The major goal of automated pipeline development and subsequent deployment to production is the ability to design and test scalable pipeline components prior to deployment, as well as to support a continuous development and deployment cycle. Agile processes are frequently used in this context.

Testing

When live-testing in a production environment, any bugs or issues that have been missed in the testing phase (or any environment-specific influences on the code) will result in a poor customer experience since these bugs or errors will be presented to the end-user. The best code promotion practice is to put in place automated processes that verify that the code works as expected in different scenarios.

This is frequently done with unit and integration tests. Unit testing verifies that individual pieces of code, given a set of inputs, produce expected outputs independently of the other code that uses them. Unit tests add value by verifying the logic within each individual piece of code and by demonstrating that it executes as expected.

The level above unit testing is integration testing. This ensures that pieces of code work together and produce the expected output(s) for a given set of inputs. This is often the most critical layer of testing, as it ensures that systems integrate as expected with each other.
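
A minimal unit test sketch (in pytest style) for a hypothetical column-normalization function might look as follows; the function under test and the expectations are purely illustrative.

```python
# Unit tests for a small, hypothetical transformation step (run with pytest).
import pandas as pd

def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """The piece of code under test: trims and lower-cases column names."""
    out = df.copy()
    out.columns = [c.strip().lower() for c in out.columns]
    return out

def test_normalize_columns_renames_and_preserves_rows():
    raw = pd.DataFrame({" Meter_ID ": ["A1"], "Reading": [12.3]})
    result = normalize_columns(raw)
    assert list(result.columns) == ["meter_id", "reading"]
    assert len(result) == len(raw)            # no rows gained or lost

def test_normalize_columns_does_not_mutate_input():
    raw = pd.DataFrame({"Reading": [1.0]})
    normalize_columns(raw)
    assert list(raw.columns) == ["Reading"]   # the original frame is untouched
```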

By combining unit testing and integration testing with modern deployment strategies such as blue-green deployments, the probability of negative impact to customers and business when new code is introduced is significantly reduced.

Disaster Recovery

Before changes can be promoted to an environment, “everything” must be validated by the established tests. It is also critical to ensure that there is a plan in place in the event of a system failure. This means that systems must be designed to tolerate a critical system failure. Disaster recovery in data engineering is generally measured by two metrics: the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO).

In the event of a disaster recovery scenario, businesses need to have standards in place to understand the impact to their customers and how long their systems will be unavailable. Data engineers are responsible for putting processes in place to ensure that data pipelines, databases, and data warehouses meet acceptable recovery thresholds.
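
A hedged sketch of such a check, comparing measured recovery times against (invented) RTO and RPO targets, is shown below.

```python
# Comparing achieved recovery metrics against agreed thresholds (illustrative values).
from datetime import datetime, timedelta

RPO_TARGET = timedelta(minutes=15)   # maximum tolerable data loss
RTO_TARGET = timedelta(hours=1)      # maximum tolerable downtime

failure_time     = datetime(2023, 5, 1, 10, 30)
last_good_backup = datetime(2023, 5, 1, 10, 20)
service_restored = datetime(2023, 5, 1, 11, 10)

achieved_rpo = failure_time - last_good_backup   # data written after this point is lost
achieved_rto = service_restored - failure_time   # elapsed downtime

print(f"RPO met: {achieved_rpo <= RPO_TARGET} (lost {achieved_rpo})")
print(f"RTO met: {achieved_rto <= RTO_TARGET} (down {achieved_rto})")
```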

Pipeline Development Best Practices

With so much data flowing into data platforms, it is becoming necessary to embrace development best practices that instill confidence in data systems built on ever-evolving platforms. Common best practices include:

  • using Source Code Management (SCM) tools;

  • Continuous Integration (CI);

  • Continuous Delivery (CD);

  • using multiple deployment environments;

  • testing and data quality;

  • Infrastructure as Code (IaC);

  • using database change management;

  • implementing rollback strategy, and

  • continuous monitoring and alerting.

The underlying themes are automation, testing, and monitoring:

  • automate the building and testing of artifacts;

  • create deployment pipelines that deploy artifacts;

  • test the deployment and promote the artifacts to the next environments, and

  • build data quality checks into data pipelines and alert on anomalies.

When a test fails, rollback procedures are triggered automatically; code should never be a “set it and forget it” type of solution, as data governance requirements, tooling, best practices, security procedures, and business requirements are constantly changing. This means that deployments need to be automated and verifiable.

10.2.3 Scheduled Pipelines and Workflows

There are three main data pipeline architectures with respect to supporting the scheduling of automated tasks and workflows:

  • batch data pipelines move large amounts of data at a specific time; this is a common architecture, mostly used in situations where tables are updated daily or weekly for reporting or dashboarding purposes;

  • streaming data pipelines move data from source to destination as it is created; these typically populate data lakes, are part of the integration of data warehouses, or can be used to publish data for messaging or streaming (e.g., updating stock prices, feeding data to a fraud detection algorithm in real time, etc.),

  • change data capture pipelines refresh big datasets and maintain consistency across systems, an especially important task when two (or more) systems share the same datasets.

Building Efficient Pipelines

Conceptually, building an efficient data pipeline is a simple six-step process:

  1. cataloging and governing data, enabling access to trusted and compliant data at scale across an enterprise;

  2. efficiently ingesting data from various sources such as on-premises databases or data warehouses, software as a service (SaaS) applications, IoT sources, and streaming applications into a cloud data lake;

  3. integrating data (cleaning, enriching, and transforming it) by creating zones such as landing zones, enrichment zones, and enterprise zones;

  4. applying data quality rules to clean and manage data while making it available across the organization to support DataOps;159

  5. preparing data to ensure that refined and cleaned data moves to a cloud data warehouse for enabling self-service analytics and data science use cases, and

  6. stream processing to derive insights from real-time data coming from streaming sources such as Kafka and then moving it to a cloud data warehouse for analytics consumption.

To support ML/AI and process big data at reasonable service level objectives, an efficient pipeline should also:

  • seamlessly deploy and process any data on any cloud ecosystem, such as Amazon Web Services (AWS), Microsoft Azure, Google Cloud, and Snowflake for both batch and real-time processing;

  • efficiently ingest data from any source, such as legacy on-premises systems, databases, change data capture (CDC) sources, applications, or IoT sources into any target, such as cloud data warehouses and data lakes;

  • detect schema drift in relational database management system (RDBMS) schemas [191], such as a column being added to or resized in a table in the source database, and automatically replicate the changes to the target in real time for data synchronization and real-time analytics use cases;

  • provide a simple wizard-based interface with no hand coding for a unified experience;

  • incorporate automation and intelligence capabilities such as auto-tuning, auto-provisioning, and auto-scaling at design time and run time, and

  • deploy in a fully managed, serverless environment to improve productivity and operational efficiency.

Performance Level and SLO

An important measure of performance is how well the pipeline meets the business requirements. Service level objectives (SLO) provide tangible definitions of performance that can be compared against acceptable thresholds.

For example, we might define the following SLO for a system:

  • data freshness – 90% of product recommendations should be generated from user website activity that occurred no more than 3 minutes ago;

  • data correctness – within a calendar month, fewer than 0.5% of customer invoices should contain errors,

  • data isolation/load balancing – within a business day, all high-priority payments should be processed within 10 minutes of being lodged, and standard-priority payments should be completed by the next business day.

Data freshness refers to the usability of data in relation to its age. Common data freshness SLO formats include:

  • \(x\)% of data processed within \(y\) units of time [seconds, days, minutes] – this SLO refers to the percentage of data that is processed in a given period of time, and is commonly used for batch pipelines that process bounded data sources. The metrics are the input and output data sizes at key processing steps relative to the elapsed pipeline run-time. We can choose a step that reads an input dataset and another step that processes each item of the input;

  • oldest data no older than \(y\) units of time [seconds, days, minutes] – this SLO refers to the age of data produced by the pipeline, and is commonly used for streaming pipelines that process data from unbounded sources. The metrics indicate how long the pipeline takes to process data, such as the age of the oldest unprocessed item (how long an unprocessed item has been in the queue) or the age of the most recently processed item,

  • pipeline job completed successfully within \(y\) units of time [seconds, days, minutes] – this SLO sets a deadline for successful completion and is commonly used for batch pipelines that process data from bounded data sources. It requires the total pipeline-elapsed time and job-completion status, in addition to other signals that indicate the success of the job (for example, the percentage of processed elements that result in errors).
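
As a minimal illustration of the second format, the sketch below computes the age of the oldest unprocessed item in a (hypothetical) queue and compares it against a freshness threshold; the timestamps and the 3-minute target are made up.

```python
# An "oldest data no older than y" freshness check for a streaming pipeline (illustrative).
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=3)

# Event timestamps of items still waiting to be processed (hypothetical).
unprocessed = [
    datetime.now(timezone.utc) - timedelta(seconds=45),
    datetime.now(timezone.utc) - timedelta(minutes=2),
]

oldest_age = max(datetime.now(timezone.utc) - ts for ts in unprocessed)
print(f"oldest unprocessed item is {oldest_age} old; "
      f"SLO {'met' if oldest_age <= FRESHNESS_SLO else 'violated'}")
```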

Data correctness refers to data being free of errors. We can determine data correctness through different means. One method is to check whether the data is consistent by using a set of validation rules, such as rules that use regular expressions (regexps). Another method is to have a domain expert verify that the data is correct, perhaps by checking it against reference data. One challenge is that reference data for validating correctness might not always be available. Therefore, there might be a need to generate reference data using automated tools, or even manually. These reference datasets can then be stored and used for different pipeline tests.

With reference datasets, we can verify data correctness in the following contexts:

  • unit and integration tests, which are automated through continuous integration;

  • end-to-end pipeline tests, which can be executed in a pre-production environment after the pipeline has successfully passed unit and integration tests, and is automated via continuous delivery, and/or

  • pipelines running in production, when using monitoring to observe metrics related to data correctness.

For running pipelines, defining a data correctness target usually involves measuring correctness over a period of time, such as:

  • on a per-job basis, fewer than \(x\)% of input items contain data errors – this SLO can be used to measure data correctness for batch pipelines. As an example, consider: “For each daily batch job to process electricity meter readings, fewer than 3% of readings contain data entry errors”;

  • over a \(y\)-minute moving window, fewer than \(x\)% of input items contain data errors – this SLO can be used to measure data correctness for streaming pipelines. As an example, consider: “Fewer than 2% of electricity meter readings over the last hour contain negative values.”

To measure these SLOs, we can use metrics over a suitable period of time to accumulate the number of errors by type, such as the data being incorrect due to a malformed schema, or the data being outside a valid range.
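
The following sketch illustrates a per-job correctness check of this kind: records are validated with simple rules (including a regular expression), errors are tallied by type, and the resulting error rate is compared against an illustrative SLO threshold. The identifier format and the 3% target are invented for the example.

```python
# A hedged per-job data correctness check with errors accumulated by type.
import re

SLO_MAX_ERROR_RATE = 0.03                        # "fewer than 3% of readings contain errors"
METER_ID_PATTERN = re.compile(r"^[A-Z]\d{3}$")   # hypothetical identifier format rule

readings = [
    {"meter_id": "A123", "value": 12.4},
    {"meter_id": "B99",  "value": 8.1},    # malformed identifier
    {"meter_id": "C456", "value": -3.0},   # outside the valid range
]

errors = {"malformed_schema": 0, "out_of_range": 0}
for r in readings:
    if not METER_ID_PATTERN.match(r["meter_id"]):
        errors["malformed_schema"] += 1
    elif r["value"] < 0:
        errors["out_of_range"] += 1

error_rate = sum(errors.values()) / len(readings)
print(f"errors by type: {errors}, rate: {error_rate:.1%}, "
      f"SLO {'met' if error_rate < SLO_MAX_ERROR_RATE else 'violated'}")
```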

10.2.4 Data Engineering Tools

While it is unlikely that any one data engineer could achieve mastery over all possible data engineering tools, it would be beneficial for data teams to have competencies in a fair number of the following:

  • analytical databases (Big Query, Redshift, Synapse, etc.)

  • ETL (Spark, Databricks, DataFlow, DataPrep, etc.)

  • scalable compute engines (GKE, AKS, EC2, DataProc, etc.)

  • process orchestration (AirFlow / Cloud Composer, Bat, Azure Data Factory, etc.)

  • platform deployment and scaling (Terraform, custom tools, etc.)

  • visualization tools (Power BI, Tableau, Google Data Studio, D3.js, ggplot2, etc.)

  • programming (tidyverse, numpy, pandas, matplotlib, scikit-learn, scipy, Spark, Scala, Java, SQL, T-SQL, H-SQL, PL/SQL, etc.)


Figure 10.3: An open-source data analysis pipeline.


Figure 10.4: An unfortunately far-too-common data analysis pipeline.

Here are some currently popular pipeline tools [190]:

  1. Luigi (Spotify) builds long-running pipelines (thousands of tasks stretching across days or weeks); it is an open-source Python module released under the Apache license. It addresses the “plumbing” issues typically associated with long-running batch processes, where many tasks need to be chained together (Hadoop jobs, dumping data to/from databases, running machine learning algorithms, etc.). Luigi builds pipelines around three task methods: requires(), which defines the dependencies between tasks; output(), which defines the target of the task; and run(), which defines the computation performed by each task. Luigi tasks are intricately connected with the data that feeds into them, making it difficult to create, modify, and test a single task, but relatively easy to string tasks together.

  2. Airflow (AirBnB) is used to build, monitor, and retrofit data pipelines. It is a very general system, capable of handling flows for a variety of tools and highly complex pipelines; it is a good tool for pipeline orchestration and monitoring. It connects well with other systems (databases, Spark, Kubernetes, etc.). Airflow defines workflows as Directed Acyclic Graphs (DAG), and tasks are instantiated dynamically. Airflow is built around: hooks (high-level interfaces for connections to external platforms), operators (predefined tasks that become DAG nodes), executors (which run jobs remotely, handle message queuing, and decide which worker will execute each task), and schedulers (which trigger scheduled workflows and submit tasks to the executors).

  3. scikit-learn pipelines: scikit-learn pipelines are not used to orchestrate big tasks from different services; rather, they help make code cleaner and easier to reproduce and re-use. They are found in scikit-learn, a popular Python data science module. The pipelines allow users to concatenate a series of transformers, followed by a final estimator; this is useful for model training and data processing, for instance. With scikit-learn pipelines, data science workflows are easy to read and understand, which also makes it easier to spot issues such as data leakage (information from outside the training set, such as test data, inadvertently influencing the model during training). The pipelines only work with scikit-learn transformers and estimators, however, and they must all run within the same run-time, which makes it impossible to run different pipeline parts on different worker nodes while keeping a single control point.

  4. Pandas (Python) or Tidyverse (R) Pipes: pandas and the tidyverse are popular data analysis and manipulation libraries. When data analysis becomes very sophisticated, the underlying code tends to become messier. Pandas and tidyverse pipes keep the code clean by allowing users to concatenate multiple tasks using a simple framework, similar to scikit-learn pipelines. These pipes have one criterion, the “data frame in, data frame out” principle: every step consists of a function with a data frame and other parameters as arguments, and a data frame as output. Users can add as many steps as needed to the pipe, as long as the criterion is satisfied.
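
To illustrate the last two items, the short sketch below uses a pandas pipe that respects the “data frame in, data frame out” principle to clean a toy dataset, and then fits a scikit-learn Pipeline that chains a transformer and a final estimator; the data and column names are made up.

```python
# A pandas pipe feeding a scikit-learn Pipeline (toy data, illustrative only).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def drop_missing(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

def add_ratio(df: pd.DataFrame, num: str, den: str) -> pd.DataFrame:
    return df.assign(ratio=df[num] / df[den])

raw = pd.DataFrame({
    "income":   [50_000, 62_000, None, 41_000, 75_000],
    "expenses": [30_000, 40_000, 25_000, 39_000, 50_000],
    "filed":    [1, 1, 0, 0, 1],
})

# pandas pipe: each step takes a data frame and returns a data frame.
clean = raw.pipe(drop_missing).pipe(add_ratio, num="expenses", den="income")

# scikit-learn pipeline: a transformer followed by a final estimator.
model = Pipeline([("scale", StandardScaler()),
                  ("clf", LogisticRegression())])
model.fit(clean[["income", "ratio"]], clean["filed"])
print(model.predict(clean[["income", "ratio"]]))
```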

References

[190]
A. Dutrée, “Data pipelines: What, why and which ones.” Towards Data Science, 2021.
[191]
A. Watt, Database Design. BCCampus, 2014.
[192]