Premium Feature — Available with all Premium Editions of LabKey Server. Learn more or contact LabKey.

Extract-Transform-Load (ETL) mechanisms, available with the LabKey dataIntegration module, let you encapsulate some of the most common database tasks, specifically:

  1. Extracting data from a source database
  2. Transforming it, and
  3. Loading it into a target database (the same or another one)
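These three steps map directly onto the elements of an ETL definition, which is a small XML document naming a source query and a destination. The sketch below is a minimal, hypothetical example: the schema and query names (`externalSchema`, `demographics`, `study`, `Demographics`) are placeholders, and element details should be checked against the LabKey ETL XML reference.

```xml
<!-- Minimal ETL sketch; schema and query names are placeholders. -->
<etl xmlns="http://labkey.org/etl/xml">
  <name>DemographicsImport</name>
  <description>Copy patient demographics from an external source into a study.</description>
  <transforms>
    <!-- One step: extract from the source query, load into the destination. -->
    <transform id="copyDemographics" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="externalSchema" queryName="demographics"/>
      <destination schemaName="study" queryName="Demographics" targetOption="truncate"/>
    </transform>
  </transforms>
</etl>
```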

Example Scenarios

At a high level, ETLs provide a system for structuring a sequence of operations or actions on data, with a wide variety of flexible options to support diverse scenarios, including but not limited to:

  • Integrating data from multiple sources into a shared data warehouse.
  • Migrating from one database schema to another, especially where the source schema is an external database.
  • Coalescing many tables into one, or distributing (also known as provisioning) one table into many.
  • Normalizing data from different systems.
  • Synchronizing data in scheduled increments.
  • Logging and auditing the details of repeated migration or analysis processes.

ETL Definition Options

ETL functionality can be (1) included in a custom module or (2) written directly into the user interface. An ETL can also either be standalone or run a series of other ETLs or stored database procedures in sequence. Depending on your needs, you may decide to use a combination of these methods for defining ETLs.

  • Module ETLs can more easily be placed under version control and need only be defined once to be deployed on multiple systems (such as staging and production).
  • ETLs defined in the user interface are easier to start from scratch on a running server, but will be defined only on the server where they are created.
  • A combination approach, where a 'wrapper' ETL in the user interface calls a set of module-defined component ETLs may give you the right blend of features.
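For the module option, each ETL definition is a separate XML file placed under the module's resources directory. The layout below is a hypothetical sketch (the module and file names are invented); consult the LabKey module documentation for the exact conventions on your version.

```
myModule/
└── resources/
    └── etls/
        ├── demographicsImport.xml    (one XML file per ETL definition)
        └── labResultsImport.xml
```

Because these are plain files, they can be committed to version control and deployed unchanged to multiple servers, such as staging and production.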

ETL Development Process

When developing an ETL for your own custom scenario, it is good practice to follow a staged approach, beginning with the simplest process and adding layers of complexity only after the first steps are well understood and functioning as expected.

  1. Identify the source data. Understand structure, location, access requirements.
  2. Create the target table. Define the location and end requirements for data you will be loading.
  3. Create an ETL to perform your transformations, truncating the target each time (the simplest pathway).
  4. If your ETL should run on a schedule, add that schedule to its definition.
  5. Add additional options, such as a timestamp-based filter strategy.
  6. Switch to merging into the target, if desired.
  7. Add handling for rows deleted from the source.
More details about this procedure are available in the topic: ETL: Planning.
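Steps 4 through 6 above correspond to optional elements in the ETL definition. The hedged sketch below layers a schedule, an incremental filter, and a merge target onto a basic transform; all schema, query, and column names are placeholders, and the element names and attributes should be verified against the LabKey ETL XML reference for your server version.

```xml
<etl xmlns="http://labkey.org/etl/xml">
  <name>DemographicsIncremental</name>
  <description>Incrementally merge modified rows on a schedule.</description>
  <transforms>
    <transform id="mergeDemographics" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="externalSchema" queryName="demographics"/>
      <!-- Step 6: merge into the target instead of truncating it. -->
      <destination schemaName="study" queryName="Demographics" targetOption="merge"/>
    </transform>
  </transforms>
  <!-- Step 5: only pull rows whose timestamp changed since the last run
       ("Modified" is a placeholder timestamp column). -->
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified"/>
  <!-- Step 4: poll and run automatically every hour. -->
  <schedule>
    <poll interval="1h"/>
  </schedule>
</etl>
```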

Topics

The following topics will get you started developing ETL scripts and processes and packaging them as modules:

Get Started with the Basics

  • <transforms> Topics
  • Scheduling
  • Filtering: Either on Source Data or to Run ETLs Incrementally
  • More Topics
