Extract-Transform-Load mechanisms, available with the LabKey dataintegration
module, let you encapsulate some of the most common database tasks, specifically:
- Extracting data from a source database
- Transforming it, and
- Loading it into a target database (the same or another one)
At a high level, ETLs provide a system for structuring a sequence of operations or actions on data. A wide variety of flexible options is available to support diverse scenarios, including, but certainly not limited to:
- Integrating data from multiple sources into a shared data warehouse.
- Migrating from one database schema to another, especially where the source schema is an external database.
- Coalescing many tables into one, or distributing (aka provisioning) one table into many.
- Normalizing data from different systems.
- Synchronizing data in scheduled increments.
- Logging and auditing the details of repeated migration or analysis processes.
ETL Definition Options
ETL functionality can be (1) included in a custom module or (2) defined directly in the user interface. An ETL can either stand alone or run a series of other ETLs or stored database procedures in sequence. Depending on your needs, you may decide to use a combination of these methods for defining ETLs.
- Module ETLs can more easily be placed under version control and need only be defined once to be deployed on multiple systems (such as staging and production).
- ETLs defined in the user interface are easier to start from scratch on a running server, but will be defined only on the server where they are created.
- A combination approach, where a 'wrapper' ETL defined in the user interface calls a set of module-defined component ETLs, may give you the right blend of features.
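As a sketch of the module approach, a module ETL is an XML file placed in the module's etls resources directory. The file path, module name, and schema/query names below are illustrative assumptions, not part of any shipped example:

```xml
<!-- Hypothetical file: myModule/resources/etls/demographicsCopy.xml -->
<etl xmlns="http://labkey.org/etl/xml">
  <name>Demographics Copy</name>
  <description>Copy demographics rows from the source schema to the target.</description>
  <transforms>
    <!-- A single step: extract from the source query, load into the destination query -->
    <transform id="copyDemographics" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="externalData" queryName="demographics"/>
      <destination schemaName="study" queryName="demographics" targetOption="truncate"/>
    </transform>
  </transforms>
</etl>
```

Once the module is deployed, this definition is available on every server running the module; an equivalent ETL created through the user interface would exist only on the server where it was created.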
ETL Development Process
When developing an ETL for your own custom scenario, it is good practice to follow a staged approach, beginning with the simplest process and adding layers of complexity only after the first steps are well understood and functioning as expected.
- Identify the source data. Understand structure, location, access requirements.
- Create the target table. Define the location and end requirements for data you will be loading.
- Create an ETL to perform your transformations, truncating the target each time (as the simplest pathway).
- If your ETL should run automatically, add a schedule to the definition.
- Add additional options, such as a timestamp-based filter strategy.
- Switch to a target data merge if desired.
- Add handling for rows deleted from the source.
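The staged steps above can be sketched as incremental additions to a single ETL definition. Assuming a source query with a Modified timestamp column, a rowId key, and a companion query of deleted rows (all names here are illustrative), the later stages add a merge target option, a timestamp-based incremental filter with deleted-row handling, and a schedule:

```xml
<etl xmlns="http://labkey.org/etl/xml">
  <name>Demographics Sync</name>
  <description>Incrementally merge changed source rows into the target.</description>
  <transforms>
    <transform id="mergeDemographics" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="externalData" queryName="demographics"/>
      <!-- Stage: switch from "truncate" to "merge" once the basic copy works -->
      <destination schemaName="study" queryName="demographics" targetOption="merge"/>
    </transform>
  </transforms>
  <!-- Stage: only pull rows changed since the last successful run, and propagate deletes -->
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified">
    <deletedRowsSource schemaName="externalData" queryName="demographicsDeleted"
                       deletedSourceKeyColumnName="rowId" targetKeyColumnName="rowId"/>
  </incrementalFilter>
  <!-- Stage: run automatically every hour -->
  <schedule>
    <poll interval="1h"/>
  </schedule>
</etl>
```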
More details about this procedure are available in the topic: ETL: Planning
The following topics will get you started developing ETL scripts and processes and packaging them as modules:
- Get Started with the Basics
- Filtering: Either on Source Data or to Run ETLs Incrementally