Data Flow

Data Flow is a template for understanding and designing the sequence of data movement through a Machine Learning system.

Data is used by Machine Learning functional group experts, as discussed in the Functional Groups section below.

Data Flow Layers

Data passes through layers of processing as it is stored, refined, and prepared for use in Machine Learning Models and Applications.

Sources

Data sources include:

  • Company Internal Databases

  • Company Internal Files

  • Websites

  • Public Data

  • Smartphone Apps

  • IoT Devices

  • Commercial Data Aggregators

  • Point of Sale

  • Corporate Internal Processes

  • Social Media

  • Data Streams

Capture

Capture mechanisms include:

  • Website Scraping (see the sketch after this list)

  • Website and Smartphone Chat Dialogues

  • Website and Smartphone Form Submissions

  • IoT Device Interfaces

  • Commercial Data Aggregator Feeds

  • Corporate Internal Process Feeds
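
The sketch below illustrates the first of these mechanisms, website scraping. It is a minimal example, assuming the third-party requests and beautifulsoup4 packages are available; the URL is a placeholder, and a production scraper would also need rate limiting, retries, and robots.txt handling.

  # Minimal website-scraping capture sketch (illustrative only).
  # Assumes third-party packages `requests` and `beautifulsoup4`.
  import requests
  from bs4 import BeautifulSoup

  def capture_page(url: str) -> dict:
      """Fetch a page and capture its title and outbound links."""
      response = requests.get(url, timeout=10)
      response.raise_for_status()
      soup = BeautifulSoup(response.text, "html.parser")
      return {
          "url": url,
          "title": soup.title.string if soup.title else None,
          "links": [a["href"] for a in soup.find_all("a", href=True)],
      }

  record = capture_page("https://example.com")  # placeholder source URL
  print(record["title"], "-", len(record["links"]), "links captured")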

Pipeline

Pipeline processes include:

  • Data Ingestion

  • Temporary Data Storage

  • Data Subscription

  • Data Publication
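
A minimal sketch of these four pipeline stages, using only the Python standard library; the MiniPipeline class and its in-memory queue are illustrative stand-ins for real ingestion and publish/subscribe infrastructure such as a message broker.

  # Minimal pipeline sketch: ingestion -> temporary storage -> publication
  # to subscribers. An in-memory queue stands in for broker infrastructure.
  from collections import defaultdict
  from queue import Queue

  class MiniPipeline:
      def __init__(self):
          self.staging = Queue()                # temporary data storage
          self.subscribers = defaultdict(list)  # topic -> subscriber callbacks

      def ingest(self, topic: str, record: dict) -> None:
          """Data ingestion: accept a record and stage it."""
          self.staging.put((topic, record))

      def subscribe(self, topic: str, callback) -> None:
          """Data subscription: register interest in a topic."""
          self.subscribers[topic].append(callback)

      def publish_all(self) -> None:
          """Data publication: drain staging and deliver to subscribers."""
          while not self.staging.empty():
              topic, record = self.staging.get()
              for callback in self.subscribers[topic]:
                  callback(record)

  pipeline = MiniPipeline()
  pipeline.subscribe("sales", lambda r: print("model feed received:", r))
  pipeline.ingest("sales", {"sku": "A1", "qty": 3})
  pipeline.publish_all()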

Databases

Databases include:

  • Relational (SQL) Databases

  • NoSQL Document and Key-Value Stores

  • Data Warehouses

  • Data Lakes

ETLs

ETL (Extract, Transform, Load) functions include:

  • Extract Functions: pulling data from selected sources

  • Transform Functions: normalization, regularization, aggregation

  • Load Functions: saving data in formats for use in modeling processes
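
A minimal sketch of these three functions, using only the Python standard library; the file names and the min-max normalization step are illustrative choices, not a prescribed implementation.

  # Minimal ETL sketch: extract rows from a CSV source, transform by
  # normalizing one numeric column to the 0-1 range, load as JSON for
  # a downstream modeling process. File names are hypothetical.
  import csv
  import json

  def extract(path: str) -> list[dict]:
      """Extract: pull rows from a selected CSV source."""
      with open(path, newline="") as f:
          return list(csv.DictReader(f))

  def transform(rows: list[dict], column: str) -> list[dict]:
      """Transform: min-max normalize one numeric column."""
      values = [float(r[column]) for r in rows]
      lo, hi = min(values), max(values)
      span = (hi - lo) or 1.0  # guard against a constant column
      for r, v in zip(rows, values):
          r[column] = (v - lo) / span
      return rows

  def load(rows: list[dict], path: str) -> None:
      """Load: save the prepared data in a model-friendly format."""
      with open(path, "w") as f:
          json.dump(rows, f, indent=2)

  rows = transform(extract("sales.csv"), "amount")  # hypothetical source
  load(rows, "sales_prepared.json")                 # hypothetical output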

Models

Model type category examples include:

  • Supervised Learning (classification, regression)

  • Unsupervised Learning (clustering, dimensionality reduction)

  • Reinforcement Learning

  • Deep Learning / Neural Networks

Applications

Application examples include:

  • Medical Diagnosis

  • Autonomous Vehicles

  • Chatbot Dialog

  • Image Recognition

  • Face Recognition

  • Product Recommendations

  • Churn Prediction

  • Malware Detection

  • Search Refinement

Functional Groups

Functional Groups are those organizations and clusters of professionals that participate in Machine Learning.

Key Factors

Flow Continuity

Efficient and accurate Machine Learning processes require a data flow that is continuous and well-managed. Reasons for this include:

  • environment change: the world, its population, its technology, etc. are in a state of constant change, which must be reflected in the data used for Machine Learning

  • constant testing and evaluation: Machine Learning models and predictions must be continually tested and evaluated to determine when and how to modify/update them to reflect environment changes

  • new applications: Machine Learning is evolving very rapidly and new applications require new data
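
The sketch below illustrates the environment-change point: a model fit once on old data loses accuracy as the world drifts, which is why the data feeding evaluation must keep flowing. The drifting distribution and the fixed-threshold "model" are synthetic stand-ins.

  # Sketch of accuracy decay under environment drift (synthetic data).
  import random

  def make_batch(shift: float, n: int = 1000) -> list[tuple[float, int]]:
      """Generate (feature, label) pairs whose class boundary drifts by `shift`."""
      batch = []
      for _ in range(n):
          x = random.gauss(0, 1) + shift
          batch.append((x, 1 if x > shift else 0))  # true boundary moves
      return batch

  def accuracy(threshold: float, batch: list[tuple[float, int]]) -> float:
      return sum((x > threshold) == bool(y) for x, y in batch) / len(batch)

  random.seed(0)
  threshold = 0.0  # "trained" when the true boundary was at 0
  for month, shift in enumerate([0.0, 0.5, 1.0, 1.5]):
      print(f"month {month}: accuracy = {accuracy(threshold, make_batch(shift)):.2f}")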

Historical Data

It’s critical to retain historical data related to Machine Learning models, training, predictions, alerts, etc. in order to:

  • measure accuracy: knowledge of how models perform outside of training and testing is a critical element in performance evaluation

  • detect model degradation: as the environment changes, model performance can decay, requiring model changes and upgrades

  • demonstrate performance: reporting and visualizing performance data is important for justifying the investments needed to maintain Machine Learning growth
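
A minimal sketch of using a retained prediction history to measure accuracy per period and flag degradation; the records and the alert threshold are illustrative.

  # Sketch of a historical prediction log used to detect model degradation.
  from collections import defaultdict

  history = [
      # (period, prediction, actual) -- illustrative retained records
      ("2024-01", 1, 1), ("2024-01", 0, 0), ("2024-01", 1, 1), ("2024-01", 0, 1),
      ("2024-02", 1, 0), ("2024-02", 0, 0), ("2024-02", 1, 0), ("2024-02", 0, 1),
  ]
  ALERT_THRESHOLD = 0.60  # assumed acceptable accuracy floor

  by_period = defaultdict(list)
  for period, predicted, actual in history:
      by_period[period].append(predicted == actual)

  for period in sorted(by_period):
      hits = by_period[period]
      acc = sum(hits) / len(hits)
      status = "OK" if acc >= ALERT_THRESHOLD else "DEGRADED - consider retraining"
      print(f"{period}: accuracy {acc:.2f} over {len(hits)} predictions [{status}]")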

Data Storage

Data should be stored in a manner that makes it readily available to ETL processes for model training and prediction:

  • database stores: if possible, data for Machine Learning should be stored in a database for convenient access by ETL processes

  • data normalization: data should be normalized when possible to reduce redundancy and keep values consistent across sources

  • data update: data should be updated in as close to real-time as possible for use by production models
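
A minimal sketch of a database store along these lines, using the standard-library sqlite3 module; the schema, rows, and feature query are illustrative, with an in-memory database standing in for a shared production store.

  # Sketch of a normalized database store that an ETL extract can query.
  import sqlite3

  conn = sqlite3.connect(":memory:")  # stands in for a production database
  conn.executescript("""
      CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
      CREATE TABLE events (
          customer_id INTEGER REFERENCES customers(id),
          ts TEXT,      -- event time, supporting near-real-time updates
          amount REAL
      );
  """)
  conn.execute("INSERT INTO customers VALUES (1, 'west'), (2, 'east')")
  conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                   [(1, "2024-03-01T10:00", 12.5),
                    (2, "2024-03-01T10:05", 7.0),
                    (1, "2024-03-01T10:09", 3.25)])

  # An ETL extract step can now pull model-ready features in one query.
  for row in conn.execute("""
          SELECT c.region, COUNT(*) AS n_events, SUM(e.amount) AS total
          FROM events e JOIN customers c ON c.id = e.customer_id
          GROUP BY c.region"""):
      print(row)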

Automation

Maintaining a rapid and efficient flow of data is not possible without significant levels of process automation due to the:

  • volume of data: model training and performance evaluation require very large volumes of data

  • velocity of change: rapid environmental change translates to rapid data change
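
A minimal sketch of this automation point: a scheduler loop that runs a pipeline job at a fixed interval. Real deployments would typically use cron or an orchestrator such as Airflow; run_pipeline is a hypothetical stand-in for the capture, pipeline, and ETL steps above.

  # Sketch of automated, repeated pipeline runs (bounded for the demo).
  import time
  from datetime import datetime

  def run_pipeline() -> None:
      """Hypothetical stand-in for capture -> pipeline -> ETL steps."""
      print(f"{datetime.now().isoformat(timespec='seconds')}: pipeline run complete")

  def schedule(job, interval_seconds: float, runs: int) -> None:
      """Run `job` every `interval_seconds` seconds, `runs` times."""
      for _ in range(runs):
          job()
          time.sleep(interval_seconds)

  schedule(run_pipeline, interval_seconds=1.0, runs=3)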
