10.3 Data Management
As covered in the previous section, a major element of data engineering (that is, developing data pipelines) involves moving data from storage to storage as it is processed. In this sense, data storage can be viewed as the metaphorical heart of the data pipeline, while the machine learning model components of the pipeline could be thought of as the brains. In this section we will focus on this metaphorical heart, and consider management of the brains (analytics and machine learning models) in the subsequent section (Reporting and Deployment).
Computers have advanced significantly in their ability to store large amounts of data. In this section, we will cover databases, data modeling, and data storage. For more details, consult  (and Getting Insight From Data).
Historically, computers relied on a file-based system (i.e., they manipulate data files). File-based systems face a number of shortcomings:
data redundancy: files and applications are created by different programmers from various departments over long periods of time, which can lead to duplication (the same information being stored in multiple files), inconsistencies in data format, conflicting copies, and fields that must be updated in more than one place;
data isolation: it can prove difficult for new applications to retrieve the appropriate data, which might be stored in various files;
data integrity: maintenance may be required to ensure that data in a database are correct and consistent;
security: it can be difficult to enforce access constraints (if needed) when application requirements are added to the system in an ad-hoc manner, and
concurrency: if multiple users access the same file at the same time, there can be issues with file locking.
Spreadsheets were originally designed for a single user, which is reflected in their characteristics. They are adequate for single users or for small teams of users who have no need for complicated data manipulations.
Databases, on the other hand, hold massive amounts of information, and allow multiple concurrent users to quickly and securely access and query data using highly complex logic and language. They only need to be defined once before being accessed by various users.
Databases consist of a representation of some aspect of the real world or a collection of data elements representing real world information.160 They are:
logical, coherent, and internally consistent;
designed, built, and populated with data for a specific purpose;
made up of data items, which are stored in fields,
populated with tables, which are combinations of fields.
A database management system (DBMS) is a collection of programs that enables users to create and maintain databases and control all access to them. The primary goal of a DBMS is to provide an environment for users to retrieve and store information in a convenient and efficient manner.
Data management is “simply” care-taking for the data so that it works for its users and remains useful for tasks. Managing information using a database allows data scientists to become strategic users of the data at their disposal. The processing power in a database can be used to manipulate the data it houses, namely: sort, match, link, aggregate, filter, compute contents, etc. Because of the versatility of databases, we find them powering all sorts of projects.
While databases might be overkill for small datasets, they have many benefits (especially for larger projects):
self-describing nature of a database system: a database contains the data and the metadata, which describes and defines relationships between tables in the database. This separation of data and information about the data makes a database system entirely different from the traditional file-based system in which the data definition is part of the application program;
insulation between program and data (also called program-data independence): in a file-based system, the structure of data files is defined in the application programs, so if a user wants to change the structure of a file, all programs that access it need to be changed as well. In a database system, the data structure is stored in the system catalogue and not in the programs. Therefore, one change is all that is needed to change the structure of a file;
support for multiple views: a database supports multiple views, or subsets, of the database. Each view contains data that is only of interest to the group of users subscribed to the particular view;
sharing of data and multi-users: many users can access data at the same time, through features called concurrency control strategies. The design of modern multi-user database systems is a great improvement over those of the past, which restricted usage to one user at a time,
control of redundancy: ideally, each data item is only found in one location, but redundancy can sometimes improve query performance (even though it should be kept to a minimum wherever possible).
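The "multiple views" benefit above can be sketched with SQLite: a view exposes only the subset of data that a given user group should see. The `staff` table, its columns, and the `public_staff` view below are hypothetical names chosen for illustration only.

```python
import sqlite3

# In-memory database: a base table plus a restricted view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.executemany("INSERT INTO staff (name, salary) VALUES (?, ?)",
                 [("Ada", 95000.0), ("Grace", 105000.0)])

# The view exposes only the columns of interest to a given user group;
# the salary column stays hidden from subscribers of this view.
conn.execute("CREATE VIEW public_staff AS SELECT id, name FROM staff")

rows = conn.execute("SELECT * FROM public_staff ORDER BY id").fetchall()
print(rows)  # -> [(1, 'Ada'), (2, 'Grace')]
```

Note that the view is defined once, in the database itself, rather than re-implemented in every application program that needs the restricted subset.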
Types of Databases
Databases come in various flavours:
the most common (as of 2022) are relational databases, in which data items are organized as a set of tables with columns and rows;
data in object-oriented databases is represented in the form of objects, as in object-oriented programming (OOP);
in distributed databases, two or more files are located in different sites – such databases may be stored on multiple computers located in the same physical location, or scattered over different networks, etc.;
data warehouses are central repositories for data, designed specifically for fast query and analysis;
NoSQL databases are non-relational databases that allow for unstructured and semi-structured data to be stored and manipulated (in contrast with relational databases, which define how all the data inserted into the database must be composed) – NoSQL has grown popular as web apps have become more common and more complex,
graph databases store data in terms of entities and the relationships between entities, and
online transaction processing (OLTP) databases are fast, transaction-oriented databases designed for large numbers of transactions performed by multiple concurrent users.
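The relational flavour, being the most common, is worth a small illustration: data items are organized as tables with named columns and rows, and queried declaratively. This is a minimal sketch using Python's built-in `sqlite3` module; the `orders` table and its contents are hypothetical.

```python
import sqlite3

# Rows-and-columns organization of a relational database (SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer TEXT NOT NULL,
    amount   REAL NOT NULL)""")
conn.executemany("INSERT INTO orders (customer, amount) VALUES (?, ?)",
                 [("Alice", 12.50), ("Bob", 8.00), ("Alice", 30.00)])

# Declarative query: total spent per customer, with no file handling
# or manual aggregation logic in the application.
totals = conn.execute("""SELECT customer, SUM(amount)
                         FROM orders
                         GROUP BY customer
                         ORDER BY customer""").fetchall()
print(totals)  # -> [('Alice', 42.5), ('Bob', 8.0)]
```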
Today’s large enterprise databases often support very complex queries and are expected to deliver nearly instant responses to those queries. As a result, database administrators are constantly called upon to employ a wide variety of methods to help improve performance.
Some common database challenges that they face include:
absorbing significant increases in data volume: the explosion of data coming in from sensors, connected machines, and dozens of other sources keeps database administrators scrambling to manage and organize their companies’ data efficiently;
ensuring data security: data breaches are happening at an ever-increasing rate, and hackers are getting more and more inventive – it is more important than ever to ensure that data is secure … yet also easily accessible to users;
keeping up with demand: in today’s fast-moving business environment, companies need real-time access to their data to support timely decision-making and to take advantage of new opportunities;
managing and maintaining the database and infrastructure: database administrators must continually watch the database for problems and perform preventative maintenance, as well as apply software upgrades and patches; as databases become more complex and data volumes grow, companies are faced with the expense of hiring additional talent to monitor and tune their databases;
removing limits on scalability: some claim that businesses need to grow if they are going to survive, and so must their data management; but it is nearly impossible for database administrators to predict how much capacity the company will need, particularly with on-premises databases,
ensuring data residency, data sovereignty, or latency requirements: some organizations have use cases that are better suited to run on-premises; in those cases, engineered systems that are pre-configured and pre-optimized for running the database are ideal.
Addressing all of these challenges can be time-consuming and can prevent database administrators from performing more strategic functions.
10.3.2 Data Modeling
Data modeling is a process used to define and analyze data requirements needed to support the business processes within the scope of corresponding information systems. This includes both data elements and structures/relationships between them.
Requirements are put into a conceptual model (technology-independent specifications), which describes the semantics of a domain and the scope of the model – for example, a model of the area of interest of an organization or industry. It consists of entity classes, representing the kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial ‘language’ with a scope that is limited by the scope of the model.
The structure of the database data is put into a logical model, which describes the model semantics, as represented by a particular data manipulation technology. This consists of descriptions of tables and columns, object-oriented classes, and XML tags, among other things. The implementation of a single conceptual model may require multiple logical models. The logical models are then incorporated into a physical data model that organizes data into tables, which accounts for access, performance, and storage details.
The physical data model describes the physical means by which data is stored, including partitions, CPUs, tablespaces, and the like.
A database model, then, is a specification describing how a database is structured and used.
The flat (table) model may not strictly qualify as a data model; it consists of a single, two-dimensional array of data elements, where all members of a given column are assumed to be roughly similar values, and all members of a row are assumed to be related to one another.
The network model organizes data using two fundamental constructs: records and sets. Records contain fields, and sets define one-to-many relationships between records: one owner, many members. The network data model is an abstraction of the design concept used in the implementation of databases.
The hierarchical model is similar to the network model except that links in the hierarchical model form a tree structure, while the network model allows arbitrary graphs.
The relational model is a database model based on first-order predicate logic. Its core idea is to describe a database as a collection of predicates over a finite set of predicate variables, describing constraints on the possible values and combinations of values. The power of the relational data model lies in its mathematical foundations and its simple user-level paradigm.
The object-relational model is similar to a relational database model, but objects, classes and inheritance are directly supported in database schemas and in the query language.
Object-role modeling is an approach that has been defined as “attribute free” and “fact-based”. The result is a verifiably correct system, from which other common artifacts, such as ERD, UML, and semantic models may be derived. Associations between data objects are described during the database design procedure, leading to inevitable database normalization.161
The star schema is the simplest of the data warehouse schemas; it consists of a few “fact tables” (possibly only one, justifying the name) referencing any number of “dimension tables”. The star schema is considered an important special case of the snowflake schema.
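A star schema can be made concrete with a few lines of SQL: one central fact table referencing dimension tables, queried by aggregating the facts along a dimension. This sketch uses SQLite via Python; the `fact_sales`, `dim_product`, and `dim_date` tables are hypothetical names for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One central fact table referencing two dimension tables (the "star").
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL);
INSERT INTO dim_product VALUES (1, 'widget'), (2, 'gadget');
INSERT INTO dim_date    VALUES (1, 2021), (2, 2022);
INSERT INTO fact_sales  VALUES (1, 1, 1, 10.0), (2, 1, 2, 20.0), (3, 2, 2, 5.0);
""")

# Typical warehouse query: aggregate the fact table along a dimension.
result = conn.execute("""
    SELECT d.year, SUM(f.amount)
    FROM fact_sales f JOIN dim_date d ON f.date_id = d.date_id
    GROUP BY d.year ORDER BY d.year""").fetchall()
print(result)  # -> [(2021, 10.0), (2022, 25.0)]
```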
Data modeling can also be phrased as a high-level abstract design phase used to describe:
the data contained in the database;
the relationships between data items, and
constraints on data.
The data items, relationships and constraints are all expressed using concepts provided by the high-level data model. Because these concepts do not include the implementation details, the result of the data modeling process is a semi-formal representation of the database structure. Database design includes logical design which is the definition of a database in a data model of a specific DBMS, and physical design which defines the internal database storage structure, file organization, and indexing techniques.
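The three ingredients above – data items, relationships, and constraints – all show up in a table definition. A minimal sketch in SQLite (the `employee` and `department` tables and the age rule are hypothetical): the DBMS itself, not the application code, rejects data that violates the declared constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE employee (
    emp_id  INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,                            -- data items
    age     INTEGER CHECK (age >= 16),                -- constraint on data
    dept_id INTEGER REFERENCES department(dept_id));  -- relationship
INSERT INTO department VALUES (1, 'Analytics');
""")

conn.execute("INSERT INTO employee VALUES (1, 'Ada', 36, 1)")  # valid row

# A row violating the CHECK constraint is rejected by the DBMS.
try:
    conn.execute("INSERT INTO employee VALUES (2, 'Kid', 9, 1)")
    violated = False
except sqlite3.IntegrityError:
    violated = True
print(violated)  # -> True
```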
In database design, the first step is to identify business rules. The design is created and implemented using a DBMS.
In an external model, the user’s view of a database (multiple different external views) is closely related to the real world as perceived by each user.
Conceptual models provide flexible data-structuring capabilities; they offer a “community view” of the entire database (logical structure). This contains the data stored in the database and it shows relationships including: constraints, semantic information (e.g., business rules), security and integrity information, etc. Conceptual models consider a database as a collection of entities (objects) of various kinds, but they avoid detailed descriptions of the main data objects, in effect being independent of the eventual database implementation model.
Internal models are relational, network, and/or hierarchical data models. They consider the database as a collection of fixed-size records, closer to the physical level or the file structure. Internal models offer a representation of the database as seen by the DBMS and require the database designer to match the conceptual model’s characteristics and constraints to those of the selected implementation model; this may involve mapping entities in the conceptual model to tables in the relational model, say.
Physical models are physical representations of the database, its lowest level of abstraction. The focus is on how to deal with run-time, storage utilization and compression, file organization and access, and data encryption. The physical level is managed by the operating system; it provides concepts that describe how the data is stored in computer memory, in detail.
We have already mentioned schemas, which are database descriptions represented by an entity relationship diagram (ERD, see Structuring and Organizing Data). The most popular data models today are relational data models, although hierarchical and network data models are also often used on mainframe platforms. Relational data models describe the world as “a collection of inter-related relations (or tables)”.
Fundamental concepts include:
relations (tables or files), which are subsets of the Cartesian product of a list of domains, characterized by a name. Within each table, a row represents a group of related data values; the row is known as a record or a tuple. Table columns are known as fields or attributes. Attributes are used to define a record, and a record contains a set of attributes;
tables and columns house the basic data components, into which content can be broken down. Columns are combined into tables. Tables must have distinct names, no duplicate rows, and atomic entries (values that cannot be divided) in their columns;
a column’s domain, the range of values found in the column, and finally
records, which contain related fields, and degree, which refers to the number of attributes. Records and fields form the basis of all databases. A simple table provides the clearest picture of how records and fields work together in a database storage project.
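These concepts can be seen directly in a small table: rows are the records (tuples), columns are the fields (attributes), and the degree is the number of attributes. A sketch using SQLite's system catalogue (the `book` table and its contents are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (isbn TEXT PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany("INSERT INTO book VALUES (?, ?, ?)",
                 [("978-0", "SQL Basics", 2019), ("978-1", "Data Modeling", 2021)])

# Fields (attributes) of the relation, read from the system catalogue:
fields = [row[1] for row in conn.execute("PRAGMA table_info(book)")]
degree = len(fields)  # degree = number of attributes
records = conn.execute("SELECT COUNT(*) FROM book").fetchone()[0]  # tuples

print(fields, degree, records)  # -> ['isbn', 'title', 'year'] 3 2
```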
10.3.3 Data Storage
Data storage refers to the collection and retention of digital information: the bits and bytes behind applications, network protocols, documents, media, address books, user preferences, and so on.
For computers, short-term information is handled in random-access memory (RAM), and long-term information is held on storage volumes. Computers also organize stored data by type. Text-based formats have become popular for digital file storage: XML, JSON, CSV, YAML, etc.
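The same record can be written to several of these text-based formats; each trades off structure against simplicity. A minimal sketch with Python's standard library (the `sensor-1` record is hypothetical):

```python
import csv
import io
import json

record = {"name": "sensor-1", "reading": 21.5, "unit": "C"}

# JSON: self-describing text format that can also hold nested structures.
as_json = json.dumps(record)

# CSV: flat, tabular text format (header row + data rows).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "reading", "unit"])
writer.writeheader()
writer.writerow(record)
as_csv = buf.getvalue()

print(as_json)
print(as_csv)
```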
Data storage basically boils down to:
the different ways to store different files;
how to store them in the right kind of structures based on data type, and
how those structures link together in a database.
Data engineers and database analysts (data managers) are responsible for storing collected and transformed data in various locations, depending on the business requirements. Each tool and location may store and access the data in different ways; the limitations, benefits, and use cases of each location and set of data must be taken into account as part of good data management.
For instance, let us assume a company is ingesting a million records a day from a particular data source. If the data is stored on a disk, we cannot simply append the daily updates to a single file!162 Any report or question needing a particular piece of information found on the disk could take prohibitively long to produce/answer.
Instead, the company’s data engineers would:
know that the data needs to be partitioned across different files and directories within the file system to separate the data;
evaluate the data and how it is loaded and consumed to determine the appropriate way to split it,
determine how to update specific pieces of data as changes are applied to the data source.
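The partitioning step above can be sketched in a few lines: instead of one ever-growing file, each day's records land in their own directory, so updates and queries only touch the relevant partition. The `date=` directory naming and the records below are hypothetical, chosen for illustration.

```python
import csv
import os
import tempfile
from collections import defaultdict

# Incoming daily records (normally ingested from the source system).
records = [
    {"date": "2022-03-01", "value": "10"},
    {"date": "2022-03-01", "value": "12"},
    {"date": "2022-03-02", "value": "7"},
]

root = tempfile.mkdtemp()

# Group records by partition key (here, the date).
by_date = defaultdict(list)
for rec in records:
    by_date[rec["date"]].append(rec)

# Write one directory per day, instead of appending to a single file.
for day, rows in by_date.items():
    path = os.path.join(root, f"date={day}")
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "part-0000.csv"), "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["date", "value"])
        w.writeheader()
        w.writerows(rows)

partitions = sorted(os.listdir(root))
print(partitions)  # -> ['date=2022-03-01', 'date=2022-03-02']
```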
At a more meta level, there are other factors to consider, such as:
is the data key-value based? (see Structuring and Organizing Data)
are there complex relationships in the data?
does the data need to be processed or joined with other datasets?
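When the answer to the first question is yes – the data is key-value based, with no complex relationships or joins – a simple key-value store may be a better fit than a relational layout. A minimal sketch using Python's standard-library `dbm` module (the user-preference keys are hypothetical):

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "prefs")

# Key-value storage: lookups by key only, no joins or complex queries.
with dbm.open(path, "c") as db:  # "c" creates the store if needed
    db[b"user:42:theme"] = b"dark"
    db[b"user:42:lang"] = b"en"

# Reopen read-only and fetch a value by its key.
with dbm.open(path, "r") as db:
    theme = db[b"user:42:theme"].decode()

print(theme)  # -> dark
```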
Data warehousing is the term used to refer to the storage process of structured data. Data storage is transforming rapidly, since files can be compressed to take up less memory space, and computers can hold more files locally and in RAM.
Cloud-based data warehousing solutions like Snowflake, AWS Redshift, Azure Synapse, and Google BigQuery allow for pay-per-use data warehouses too, giving seemingly infinite storage.163
For on-premise data warehousing solutions, the investment is all up-front. The customers pay for the data warehousing solution, but do not get to see any return on investment while the hardware is set up, configured, and operationalized. It might take months, with millions of dollars already invested, just to be able to start to implement a solution for the first use case. The initial consequence, then, is that businesses are left with a severely under-utilized piece of hardware, making such a move a high-risk leap of faith for anyone but the biggest players.
At some point in the warehouse’s lifetime, enough use cases exist to eat up the available hardware’s compute power or storage. When this occurs, either more hardware must be acquired (at another large hit to the budget), or the organization must identify which existing use cases can be scaled back (and to what extent) to create the required “space”. Purchasing more hardware at this stage is not as much of a leap of faith as the initial commitment was, but it will once again leave the organization with an under-utilized data platform as new use cases are prioritized and solutions built for them.
In comparison, cloud-based data warehouse solutions use a pay-per-use cost model, where there is an opportunity to prove the value of a use case using an iterative approach. The initial step is to implement a use case solution with very light requirements to help gauge cost estimates and to understand how valuable that solution might be. Future iterations can expand on the solution, modifying the complexity of data transformation or how data flows through it, and even remove it to focus on another use case, if appropriate.
At no point is there a need to consider purchasing and installing additional hardware, as new warehouses or clusters can be created on-demand. Using a cloud-based data warehouse allows costs to scale according to the number of use cases and their complexity. However, cloud tools require a certain level of expertise, and they come with a lack of control over any changes to prices or policies.
It is also important to consider who has access to which pieces of stored information (data governance). In practice, rules and regulations define who should have access to particular pieces of information within your organization. For a shipping company, as an example, we may need to separate the data that suppliers and customers can see at any given time, or ensure that different suppliers cannot see information about other suppliers.
This requires data classification, tagging, and access constraints. When gathering data from various systems, a data engineer is responsible for applying the classification and tagging rules upon collection. This might include adding additional data points to the collected data or storing data separately on disk.
Then, when the data is aggregated or transformed, the end result must include this same information. When setting up access constraints to the data, the data engineer also has to enforce the required policies.
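The tagging-on-collection idea can be sketched as follows. This is an illustrative sketch only: the classification rules, field names, sensitivity levels, and helper functions below are hypothetical, not a real governance tool's API.

```python
# Hypothetical classification rules, applied as records are collected.
RULES = {"email": "pii", "salary": "confidential"}

def classify(record):
    """Attach classification tags to a collected record, based on its fields."""
    tags = sorted({RULES[field] for field in record if field in RULES})
    return {**record, "_tags": tags}

raw = {"name": "Ada", "email": "ada@example.com", "salary": 95000}
tagged = classify(raw)
print(tagged["_tags"])  # -> ['confidential', 'pii']

# Downstream, access constraints can key off the tags: a user sees a
# record only if their clearance covers every tag on it.
def visible_to(record, clearance):
    return set(record["_tags"]) <= clearance

print(visible_to(tagged, {"pii", "confidential"}))  # -> True
print(visible_to(tagged, {"pii"}))                  # -> False
```

Crucially, the tags travel with the record, so aggregations and transformations downstream can preserve and enforce the same classification.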
As more organizations are obtaining additional data from ever-growing new sources, they are faced with new problems:
securing the data;
ensuring regulatory compliance, and
general management of the data.
These are also problems that data governance exists to solve. Unfortunately, data governance is not achieved by using a specific tool or set of tools. There are tools that are designed to support some of the aspects of data governance, but they only enhance existing data governance practices. Data governance is very much a “people and process” oriented discipline, intending to make data secure, usable, available, and of high quality.