Module 10 Data Engineering and Management
by Aditya Maheshwari
In this module, we briefly explain some of the basic concepts that help data scientists move beyond theoretical or small-scale projects (mostly used for experiments, local research, or conceptual solutions), and we introduce the concepts and frameworks that allow data scientists, in conjunction with data teams, to build data science products that process data and deliver results at scale. We do so by exploring the role of data engineering in data projects and by providing an overview of some of the types of data pipeline infrastructure commonly involved in these projects.
In the current data ecosystem, most data scientists are still not required to understand the inner workings of data engineering and data management; however, as modeling tools become increasingly automated and as machine learning solutions move from the conceptual to the practical, most data project requirements become engineering-focused.
We provide only a cursory look at the topic in this module; in-depth information is available in [184], [185], [186], [187], and [188], while shorter overviews can be found in [189] and [190]. Learners interested in database design should consult [191].
Contents
10.1 Background and Context
10.2 Data Engineering
10.2.1 Data Pipelines
10.2.2 Automatic Deployment and Operations
10.2.3 Scheduled Pipelines and Workflows
10.2.4 Data Engineering Tools
10.3 Data Management
10.3.1 Databases
10.3.2 Data Modeling
10.3.3 Data Storage
10.4 Reporting and Deployment
10.4.1 Reports and Products
10.4.2 Cloud and On-Premise Architecture