Data ETL

Extract, transform, load (ETL) is the process of copying data from one or more sources into a destination file or system that represents the data differently from the sources.

ETL Process

The typical Machine Learning ETL process is a series of successive data operations: extract, transform, and load, each described below.

Extract Functions

Extract functions can include:

  • Database Queries

  • File Reading and/or Data Selection
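As a rough illustration of the two extract functions listed above, the following Python sketch uses only the standard library. The in-memory SQLite database, the `readings` table, the column names, and the helper function names are all hypothetical, chosen for illustration only.

```python
import csv
import io
import sqlite3


def extract_from_database(conn, min_value):
    """Extract rows with a parameterized database query."""
    cursor = conn.execute(
        "SELECT sensor, value FROM readings WHERE value >= ? ORDER BY sensor",
        (min_value,),
    )
    return [{"sensor": sensor, "value": value} for sensor, value in cursor]


def extract_from_file(file_obj, columns):
    """Read CSV data and select only the requested columns."""
    reader = csv.DictReader(file_obj)
    return [{name: row[name] for name in columns} for row in reader]


# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("b", 0.2), ("c", 3.0)],
)

# Database query: keep only readings at or above a threshold.
db_rows = extract_from_database(conn, min_value=1.0)

# File reading with data selection: drop the "unit" column.
file_rows = extract_from_file(
    io.StringIO("sensor,value,unit\na,1.5,C\nb,0.2,C\n"),
    columns=["sensor", "value"],
)
```

Note that values read from a CSV file arrive as strings, so any type conversion would be left to a later transform step.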

Transform Functions

Transform functions can include:

Load Functions

Load functions can include:

  • Database Loads

  • File Writes
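The two load functions listed above might look like the following Python sketch, again using only the standard library. The `features` table, the column names, and the helper function names are hypothetical and stand in for whatever destination a real pipeline targets.

```python
import csv
import io
import sqlite3


def load_to_database(conn, rows):
    """Load rows into a database table inside a single transaction."""
    with conn:  # commits on success, rolls back if an insert fails
        conn.executemany(
            "INSERT INTO features (sensor, value) VALUES (:sensor, :value)",
            rows,
        )


def load_to_file(file_obj, rows, columns):
    """Write rows as CSV with a header line."""
    writer = csv.DictWriter(file_obj, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical transformed records, for illustration only.
rows = [{"sensor": "a", "value": 1.5}, {"sensor": "c", "value": 3.0}]

# Database load into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (sensor TEXT, value REAL)")
load_to_database(conn, rows)
loaded_count = conn.execute("SELECT COUNT(*) FROM features").fetchone()[0]

# File write into an in-memory text buffer standing in for a file on disk.
buffer = io.StringIO()
load_to_file(buffer, rows, columns=["sensor", "value"])
csv_text = buffer.getvalue()
```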

Tools

ETL processes are often constructed and performed using a custom collection of tools.

Below are some references to tools oriented to ETL processing:

Key Factors

Key factors include:

Process Automation

Process automation is necessary to provide:

  • real-time prediction input data: performing real-time predictions requires real-time input data

  • timely model training input data: model training is an iterative process requiring a series of ETLs
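One way to picture process automation is a single function that runs a full ETL pass and can be invoked repeatedly, for example by a scheduler. The sketch below is only an assumption about how such a runner could be structured; `run_etl`, the three step functions, and the doubling transform are hypothetical and not taken from any particular tool.

```python
def run_etl(extract, transform, load):
    """Run one ETL pass: extract records, transform each one, then load them."""
    records = extract()
    transformed = [transform(record) for record in records]
    load(transformed)
    return len(transformed)


# Hypothetical destination: an in-memory list standing in for a database or file.
destination = []


def extract():
    # Hypothetical source data, for illustration only.
    return [{"sensor": "a", "value": 1.5}, {"sensor": "c", "value": 3.0}]


def transform(record):
    # Hypothetical transform: double the value, leaving other fields unchanged.
    return {**record, "value": record["value"] * 2}


def load(records):
    destination.extend(records)


# Model training is iterative, so the same pass is repeated as a series of ETLs.
counts = [run_etl(extract, transform, load) for _ in range(3)]
```

Keeping the three steps as separate callables lets each be swapped independently, so the same runner could serve both a real-time prediction feed and a training data refresh.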

Comprehensive Data Access

ETL processes should provide comprehensive data access for:

  • effective model training, predictions and evaluation: data is the lifeblood of Machine Learning

  • research for new applications: Machine Learning is evolving rapidly, requiring access to new data and data sources
