Extract-Transform-Load (ETL) mechanisms, available with the LabKey dataintegration module, let you encapsulate some of the most common database tasks, specifically:
- Extracting data from a source database
- Transforming it, and
- Loading it into a target database (the same or another one)
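On LabKey Server, an ETL is typically described in an XML definition that names a source query and a destination query for each step. The sketch below shows the general shape only; the schema and query names (externalpatients, patients, study, Patients) are placeholders, and the transform type and element details should be confirmed against the ETL XML reference for your server version.

```xml
<!-- Minimal ETL sketch: one transform step that extracts from a source
     query and loads into a target query. Schema and query names here
     (externalpatients, patients, study, Patients) are placeholders. -->
<etl xmlns="http://labkey.org/etl/xml">
    <name>PatientDataCopy</name>
    <description>Copy patient records from an external source into the study schema.</description>
    <transforms>
        <transform id="copyPatients" type="org.labkey.di.pipeline.TransformTask">
            <description>Extract from the source query and load into the target.</description>
            <source schemaName="externalpatients" queryName="patients"/>
            <destination schemaName="study" queryName="Patients"/>
        </transform>
    </transforms>
</etl>
```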
Example Scenarios
At a high level, ETLs provide a system for structuring a sequence of operations or actions on data. A wide variety of flexible options is available to support diverse scenarios, including, but certainly not limited to:
- Integrating data from multiple sources into a shared data warehouse.
- Migrating from one database schema to another, especially where the source schema is in an external database.
- Coalescing many tables into one, or distributing (also known as provisioning) one table into many.
- Normalizing data from different systems.
- Synchronizing data in scheduled increments.
- Logging and auditing the details of repeated migration or analysis processes.
ETL Definition Options
ETL functionality can be (1) included in a custom module or (2) defined directly in the user interface. An ETL can also be standalone, or it can run a series of other ETLs or stored database procedures in sequence. Depending on your needs, you may decide to use a combination of these methods for defining ETLs.
- Module ETLs can more easily be placed under version control and need only be defined once to be deployed on multiple systems (such as staging and production).
- ETLs defined in the user interface are easier to start from scratch on a running server, but will be defined only on the server where they are created.
- Module ETLs should only be edited via the module source. While there may be occasions where you can make a change to a module-defined ETL in the UI (such as changing the schedule on which it runs), this is not recommended. Any changes will be reverted when Tomcat restarts.
- A combination approach, where a 'wrapper' ETL defined in the user interface calls a set of module-defined component ETLs, may give you the right blend of features (see the sketch below).
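As one illustration of the combination approach, a wrapper ETL can consist of steps that each queue a module-defined component ETL as its own pipeline job. The sketch below is assumption-heavy: the module name (myModule), the component ETL names (LoadPatients, LoadVisits), and the wrapper name are all hypothetical, and the taskref/QueueJobTask mechanism should be verified against the ETL documentation for your LabKey version.

```xml
<!-- Hypothetical wrapper ETL that queues two module-defined component ETLs.
     Each queued ETL runs as its own pipeline job. -->
<etl xmlns="http://labkey.org/etl/xml">
    <name>NightlyWarehouseRefresh</name>
    <description>Queue the component ETLs that refresh the warehouse.</description>
    <transforms>
        <transform id="queueLoadPatients" type="TaskrefTransformStep">
            <taskref ref="org.labkey.di.steps.QueueJobTask">
                <settings>
                    <!-- "{moduleName}/etlName" identifies a module-defined ETL. -->
                    <setting name="transformId" value="{myModule}/LoadPatients"/>
                </settings>
            </taskref>
        </transform>
        <transform id="queueLoadVisits" type="TaskrefTransformStep">
            <taskref ref="org.labkey.di.steps.QueueJobTask">
                <settings>
                    <setting name="transformId" value="{myModule}/LoadVisits"/>
                </settings>
            </taskref>
        </transform>
    </transforms>
</etl>
```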
ETL Development Process
When developing an ETL for your own custom scenario, it is good practice to follow a staged approach, beginning with the simplest process and adding layers of complexity only after the first steps are well understood and functioning as expected.
- Identify the source data. Understand structure, location, access requirements.
- Create the target table. Define the location and end requirements for data you will be loading.
- Create an ETL to perform your transformations, truncating the target each time (as the simplest pathway).
- If the ETL should run on a schedule, add that schedule to its definition.
- Add additional options, such as a timestamp-based incremental filter strategy.
- Switch to merging data into the target, rather than truncating it, if desired.
- Add handling of rows deleted from the source. (Several of these later options are illustrated in the sketch after these steps.)
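To make the later steps concrete, here is a hedged sketch of how a definition like the earlier one might evolve: a polling schedule, a timestamp-based incremental filter, and a merge into the target in place of a truncate. The schema, query, and column names are placeholders, and details such as targetOption values and the ModifiedSinceFilterStrategy class name should be checked against the ETL XML reference for your server version.

```xml
<!-- Sketch of later-stage options: schedule, incremental filter, and merge.
     Placeholder schema/query/column names throughout. -->
<etl xmlns="http://labkey.org/etl/xml">
    <name>PatientDataSync</name>
    <description>Incrementally synchronize patient records on a schedule.</description>
    <transforms>
        <transform id="mergePatients" type="org.labkey.di.pipeline.TransformTask">
            <source schemaName="externalpatients" queryName="patients"/>
            <!-- "merge" updates existing rows and inserts new ones; the simpler
                 first version of the ETL would use targetOption="truncate". -->
            <destination schemaName="study" queryName="Patients" targetOption="merge"/>
        </transform>
    </transforms>
    <!-- Only pick up rows whose Modified timestamp is newer than the last successful run. -->
    <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified"/>
    <!-- Check for new work every hour. -->
    <schedule>
        <poll interval="1h"/>
    </schedule>
</etl>
```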
More details about this process are available in the topic:
ETL: Planning
Topics
The following topics will get you started developing ETL scripts and processes and packaging them as modules:
Get Started with the Basics
<transforms> Topics
Scheduling
Filtering: Either on Source Data or to Run ETLs Incrementally
More Topics
Related Topics