Module 10 Data Engineering and Management

by Aditya Maheshwari


In this chapter, we briefly explain some of the basic concepts that help data scientists go beyond theoretical or small-scale projects (mostly used for experiments, local research, or conceptual solutions), and we introduce the concepts and frameworks that allow data scientists, in conjunction with data teams, to build data science products that process and deliver results at scale. We do so by exploring the role of data engineering in data projects and by providing an overview of some of the types of data pipeline infrastructure commonly involved in these projects.

In the current data ecosystem, most data scientists are still not required to understand the inner workings of data engineering and data management; however, as modeling tools become increasingly automated and as machine learning solutions move from the conceptual to the practical, most data project requirements become engineering-focused.

We provide only a cursory look at the topic in this module; in-depth information is available in [184], [185], [186], [187], and [188], while shorter overviews can be found in [189] and [190]. Learners interested in database design should consult [191].

Contents

10.1 Background and Context

10.2 Data Engineering
     10.2.1 Data Pipelines
     10.2.2 Automatic Deployment and Operations
     10.2.3 Scheduled Pipelines and Workflows
     10.2.4 Data Engineering Tools

10.3 Data Management
     10.3.1 Databases
     10.3.2 Data Modeling
     10.3.3 Data Storage

10.4 Reporting and Deployment
     10.4.1 Reports and Products
     10.4.2 Cloud and On-Premise Architecture

References

[184]
[185]
[186]
J. Kunigk, I. Buss, P. Wilkinson, and L. George, Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale. O’Reilly Media, 2018.
[187]
T. Malaska and J. Seidman, Foundations for Architecting Data Solutions: Managing Successful Data Projects. O’Reilly Media, 2018.
[188]
M. Kleppmann, Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. O’Reilly Media, 2017.
[189]
[190]
A. Dutrée, “Data pipelines: What, why and which ones.” Towards Data Science, 2021.
[191]
A. Watt, Database Design. BCCampus, 2014.