10.4 Reporting and Deployment

Currently, in industry, the two main applications of data science are reporting and the deployment of machine learning models. In the context of data engineering, these machine learning models are embedded in the data pipeline. Data pipelines do not need to contain machine learning models; they may instead focus on business intelligence functionality, for example. Here, however, the focus is on data pipelines with machine learning models as a key component. As noted in a previous section, these models can be viewed metaphorically as the brains of the pipeline.

The field associated with managing machine learning models in production is typically known as MLOps.

MLOps breaks away from the traditional research cycle of training an AI model, which often involves only a single pass of the following steps:

  1. preparing the training data;

  2. training the model, and

  3. evaluating the model.

These elements are still present in MLOps, but there is an increased focus on the ongoing monitoring and management of the ML models embedded in the pipeline. For example, MLOps processes will monitor models for drift in the context of the automated data stream, track model performance relative to the volume of data, and iteratively and automatically re-train and improve models over time based on the feedback received from this monitoring.
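As an illustration, the monitoring half of this monitoring-and-re-training loop can be as simple as comparing the distribution of each incoming feature against its training distribution, and triggering a re-training job when they diverge. The sketch below is a minimal, hypothetical version of such a check built on a two-sample Kolmogorov-Smirnov test; the `retrain_model` hook and the drift threshold are assumptions, not a prescribed MLOps API.

```python
import numpy as np
from scipy.stats import ks_2samp

DRIFT_P_VALUE = 0.01  # assumed threshold; tune per application


def has_drifted(train_col: np.ndarray, live_col: np.ndarray) -> bool:
    """Flag drift when the live feature distribution differs from training."""
    _statistic, p_value = ks_2samp(train_col, live_col)
    return p_value < DRIFT_P_VALUE


def monitor(train_data: dict, live_batch: dict, retrain_model) -> None:
    """Compare each monitored feature; trigger re-training on any drift."""
    drifted = [name for name, col in live_batch.items()
               if has_drifted(train_data[name], col)]
    if drifted:
        # a production pipeline would enqueue a training job rather than call it inline
        retrain_model(drifted_features=drifted)
```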

This iterative or interactive approach often includes automated machine learning (AutoML) capabilities; by contrast, the traditional definition of the research cycle says nothing about what happens outside the scope of the trained model.

In modern data science contexts, MLOps may also refer to the entire data science process, from ingestion of the data to a live application that runs in a business environment and makes an impact at the business level. In this respect, there is overlap with DataOps and DevOps more generally.

10.4.1 Reports and Products

In the research-first approach to data science, which still dominates many industry applications, machine learning models are used to generate static or interactive reports for business analysts; data science is handled as a silo, running batch predictions on historical data and returning the results for someone else to incorporate manually into applications.

In those conditions, there is little demand for resiliency, scale, real-time access, or continuous integration and deployment (CI/CD); the results are of limited value, in and of themselves, and are used more as proof-of-concept. Most data science solutions and platforms today still start with a research workflow but fail to move past the proof-of-concept stage.

Even the concept of a CI/CD pipeline is often used to refer only to the training/re-training loop of a model, and does not extend to the full reporting and deployment pipeline, much less the entire operational pipeline.

In most AI projects, the starting point is the development of a model:

  1. data scientists receive data, which may be extracted manually from many sources;

  2. the data is then joined and cleaned in an interactive way (using notebooks, perhaps), and

  3. training and experiments are conducted while tracking results.

The model is generated and tested/validated until the results “look good” (meet a certain performance threshold), at which point different teams take the results and attempt to integrate them into real-world applications.
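In code, this research loop rarely amounts to more than the following sketch, in which the file names, features, and target are placeholders; the point is that everything happens in a single interactive pass, with results logged by hand.

```python
import json

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. data received and joined interactively (file names are placeholders)
orders = pd.read_csv("orders.csv")
users = pd.read_csv("users.csv")
data = orders.merge(users, on="user_id").dropna()

# 2. a manual split and a single training run (assumed numeric features)
X = data.drop(columns=["churned"])
y = data["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)

# 3. evaluate, then record the experiment by hand
accuracy = accuracy_score(y_test, model.predict(X_test))
with open("experiments.log", "a") as log:
    log.write(json.dumps({"model": "rf-200", "accuracy": accuracy}) + "\n")
```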

Modern tools make it possible to serialize a trained model to a file (with pickle or joblib, say) and wrap it in a lightweight web service (such as Flask) that loads the file and serves predictions. However, the full process of monitoring, creating feedback loops, then re-training and updating the model still requires an underlying architecture.
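A minimal sketch of that serving pattern, assuming a scikit-learn estimator has already been saved to model.joblib:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # assumed: saved earlier with joblib.dump


@app.route("/predict", methods=["POST"])
def predict():
    # expects a JSON payload such as {"features": [[5.1, 3.5, 1.4, 0.2]]}
    features = request.get_json()["features"]
    return jsonify({"predictions": model.predict(features).tolist()})


if __name__ == "__main__":
    app.run(port=5000)
```

Note that this serves predictions, and nothing more: monitoring, feedback loops, and re-training all fall outside the script, which is precisely the gap the surrounding architecture must fill.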

In most cases, the original data science product/model is eventually set aside and re-implemented in a robust and scalable way that fits production, but which may no longer be what the data scientist originally intended.

A production pipeline starts with automated data collection and preparation, continues with automated training and evaluation pipelines, and incorporates real-time application pipelines, data quality and model monitoring, feedback loops, etc.

As demand grows for applications that make real-time recommendations, prevent fraud, predict failures, and make decisions, more engineering effort is required to make them feasible. Business needs have forced data science components to be robust, performant, highly scalable, and aligned with agile software and DevOps practices.

Unfortunately, it is all too often the case that operationalizing machine learning comes as an afterthought, making it all the more difficult to create real business value with AI.

Instead of this siloed, complex, and manual process, the ML elements of the pipeline should be designed from the start using a modular strategy, in which the different parts of the ML component provide a continuous, automated, and far simpler way to move from research and development to scalable production pipelines, without the need to refactor code, add glue logic, or spend significant effort on data and ML engineering.

ML-focused production-ready pipelines have four key components:

  1. feature store: collects, prepares, catalogues, and serves data features for development (offline) and real-time (online) usage;

  2. machine learning CI/CD pipeline: automatically trains, tests, optimizes, and deploys or updates models using a snapshot of the production data (generated by the feature store) and code from the source control (Git);

  3. real-time/event-driven application pipeline: includes the API handling, data preparation/enrichment, model serving, ensembles, driving and measuring actions, etc., and

  4. real-time data and model monitoring: monitors data, models, and production components, and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering re-training jobs, measuring business impact, etc.
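To make the division of labour concrete, the sketch below shows a hypothetical request handler that touches three of these components: it reads precomputed features from an (assumed) online feature store client, serves a model produced by the CI/CD pipeline, and emits events for monitoring. Interfaces such as `feature_store.get` and `event_log.append` are illustrative, not any particular product's API.

```python
import time

import joblib

model = joblib.load("model.joblib")  # artifact produced by the ML CI/CD pipeline (component 2)


def handle_request(user_id: str, feature_store, event_log) -> dict:
    """Real-time application pipeline (component 3)."""
    # component 1: fetch precomputed online features for this entity
    features = feature_store.get(entity_id=user_id)

    started = time.monotonic()
    score = float(model.predict([features])[0])
    latency_s = time.monotonic() - started

    # component 4: emit the data and prediction for drift/quality monitoring
    event_log.append({"user_id": user_id, "features": features,
                      "score": score, "latency_s": latency_s})
    return {"user_id": user_id, "score": score}
```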

10.4.2 Cloud and On-Premise Architecture

Organizations have to decide how much of their data architecture to build in-house, and how much to build with off-the-shelf tools. Additionally, there are trade-offs between building infrastructure on the cloud (renting external resources, with the potential to publish results for anyone in the world to see and build on) and building solutions on premise (which depend heavily on local capacity and hardware).

Many companies, such as Spotify, build their own pipelines from scratch to analyze data and understand user preferences (mapping customers to music preferences, say). The main challenges in developing in-house pipelines are that different data sources provide different application programming interfaces (APIs) and involve different kinds of technologies.

Developers must write new code for every data source, and may need to rewrite it if a vendor changes its API or if the organization adopts a different data warehouse destination. Data engineers must also address speed and scalability: for time-sensitive analysis or business intelligence applications, ensuring low latency can be crucial to providing data that drives decisions.
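One common way to contain the first problem is to hide each vendor API behind a small shared interface, so that an API change is absorbed in one adapter instead of rippling through the pipeline. A minimal sketch, with hypothetical source classes standing in for real connectors:

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Source(ABC):
    """Common contract that every data-source adapter must satisfy."""

    @abstractmethod
    def fetch(self, since: str) -> Iterator[dict]: ...


class RestApiSource(Source):
    """Hypothetical vendor API; all vendor-specific details live here."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def fetch(self, since: str) -> Iterator[dict]:
        # a real adapter would page through the vendor's REST API here
        yield {"source": self.endpoint, "since": since}


class CsvDumpSource(Source):
    """Hypothetical nightly file drop from a legacy system."""

    def __init__(self, path: str):
        self.path = path

    def fetch(self, since: str) -> Iterator[dict]:
        # a real adapter would parse the dropped file here
        yield {"source": self.path, "since": since}


def ingest(sources: list[Source], since: str) -> list[dict]:
    """The rest of the pipeline depends only on the Source interface."""
    return [record for src in sources for record in src.fetch(since)]
```

When a vendor changes its API, only the corresponding adapter is rewritten; the `ingest` step, and everything downstream of it, is untouched.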

Data solutions need to be able to dynamically access more resources as data volume grows. Therefore, in-house pipelines can be expensive to build and maintain.

On-premise, rudimentary data pipelines ingest data in pre-scheduled batches (twice every hour or every night, say), and are not ideal for real-time analytics solutions. Such pipelines may be all that is required in certain cases, such as establishing a proof-of-concept for business processes that require less frequent, manual decision-making. For example, a retailer can use them to decide the order in which items are recommended in an online store, but may miss the opportunity to recommend a product in real time to an individual on a short-term buying spree.
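Such a pre-scheduled batch pipeline often boils down to a job runner along the following lines; the third-party schedule package is one illustrative option (cron or an orchestrator such as Airflow would play the same role), and ingest_batch is a placeholder for the actual extract-transform-load work.

```python
import time

import schedule  # third-party package: pip install schedule


def ingest_batch():
    # placeholder: extract, transform, and load one batch of accumulated records
    print("running batch ingest")


# mirror the cadences mentioned above: twice every hour, and every night
schedule.every(30).minutes.do(ingest_batch)
schedule.every().day.at("02:00").do(ingest_batch)

while True:
    schedule.run_pending()
    time.sleep(60)
```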

ETL tools that work with in-house data warehouses do as much preparation work as possible, including transformation, prior to loading the data. Cloud data warehouses like Amazon Redshift, Google BigQuery, Azure SQL Data Warehouse, and Snowflake, by contrast, can scale up and down in seconds or minutes, so developers can replicate raw data from disparate sources, define transformations in SQL, and run them in the warehouse after loading or at query time.
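For example, with Google BigQuery this load-then-transform pattern can be driven from Python; the bucket, project, dataset, and table names below are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# 1. load the raw data as-is, with no upfront transformation
load_job = client.load_table_from_uri(
    "gs://example-bucket/raw_orders.csv",   # placeholder source file
    "example-project.staging.raw_orders",   # placeholder destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    ),
)
load_job.result()  # wait for the load to finish

# 2. transform inside the warehouse, after loading
client.query(
    """
    CREATE OR REPLACE TABLE `example-project.analytics.daily_orders` AS
    SELECT DATE(order_ts) AS day, SUM(amount) AS revenue
    FROM `example-project.staging.raw_orders`
    GROUP BY day
    """
).result()
```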

Just as there are cloud-native data warehouses, there are also ETL services built for the cloud. Organizations can set up a cloud-first platform for moving data in minutes, and data engineers can rely on the solution to monitor and handle unusual scenarios and failure points. Even without larger, more specialized tools, simple desktop tools such as Tableau, Looker, or Microsoft’s Power BI can still be used to run queries and reports, and with a modern real-time pipeline the results will be current and immediately actionable.

Overall, cloud tools are becoming increasingly popular for hosting data pipelines and, by extension, data science solutions.