This topic provides guidance for planning and developing your ETLs. ETLs are defined in and run on your LabKey instance, but may interact with and even run remote processes on other databases.
Procedure for Development
ETLs are a mechanism that a developer controls using structured XML definitions that direct the ETL module's actions. As you develop your own custom ETLs, a tiered approach is recommended: later steps will be easier to add and confirm when you are confident that the earlier steps are working as expected.
- Identify the source data. Understand structure, location, access requirements.
- Create (or identify) the destination or target table. Define the location and end requirements for data you will be loading.
- Create an ETL to perform your transformations, truncating the target each time (as the simplest pathway); a minimal sketch follows this list.
- If your ETL should run on a schedule, add that schedule to the definition.
- Add additional options, such as a timestamp-based filter strategy.
- Switch to a target data merge if desired.
- Add handling of deletes of data from the source.
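For orientation, here is a minimal sketch of what that first working version (steps 1 through 3) might look like. The names used here (a "Patients" list as the source and a "Demographics" query in the study schema as the destination) are placeholders for your own schema and query names:
<etl xmlns="http://labkey.org/etl/xml">
  <name>PatientsToDemographics</name>
  <description>Copy all rows from the source list to the target table, truncating the target first.</description>
  <transforms>
    <transform id="copyPatients" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="lists" queryName="Patients"/>
      <destination schemaName="study" queryName="Demographics" targetOption="truncate"/>
    </transform>
  </transforms>
</etl>
The later steps in the list above add a <schedule>, an incremental filter, and a targetOption of "merge" to this same skeleton.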
In addition, outlining the overall goal and scope of your ETL up front will help you get started with the right strategy:
Source Considerations
The source data may be:
- In the same LabKey folder or project where the ETL will be defined and run.
- In a different folder or project on the same LabKey server.
- Accessible via an external schema or a linked schema, such that it appears local though it in fact lives on another server or database.
- Accessible by remote connection to another server or database.
- Either a single table or multiple tables in any of the above locations.
In the <source> element in your transform, you specify the schemaName and the queryName within that schema. If your source data is not on the same server, you can use an external data source configuration or a remote connection to access the data. If the source data is on the same LabKey server but not in the same LabKey folder as your ETL and destination, you can use a linked schema or specify a sourceContainerPath. Check the query browser to be sure you can see your data source: (Admin) > Module > Query; check for the schema and make sure it includes the queryName you will use. Learn about cross-container ETL options below.
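For example, a <source> element for a local list named "Patients" (a placeholder name) would look like the following; a linked schema source looks the same, just using the linked schema's name as the schemaName:
<source schemaName="lists" queryName="Patients"/>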
Destination Considerations
The destination, or target, of the ETL describes where and how you want your data to be at the end. Options include:
- Tables or files.
- Single or multiple ETL destinations
- Within the same LabKey server or to an external server
- Particular order of import to multiple destinations
If your destination is a table, the <destination> element requires the schemaName and the queryName within that schema. If your destination is not on the same server as the ETL, you will need to use an external data source configuration to access the destination. If your destination is on the same server but in a different LabKey folder, the ETL should be created in the same folder as the destination and a linked schema used to access the source data. Check the query browser to be sure you can see your destination: (Admin) > Module > Query; check for the schema and make sure it includes the queryName you will use.
For table destination ETLs, you also specify how you want any existing data in the destination query to be treated. Options are explored in the topic ETL: Target Options:
- Append: Add new rows to the end of the existing data.
- Merge: Update existing rows when the load contains new data for them; leave other existing rows unchanged; add new rows.
- Truncate: Drop the existing contents and replace with the newly loaded contents.
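For example, a <destination> element that merges incoming rows into a (hypothetical) study dataset would look like this; targetOption accepts "append", "merge", or "truncate":
<destination schemaName="study" queryName="Demographics" targetOption="merge"/>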
The most common way to target multiple destinations is to write a multi-step ETL. Each step is one <transform>, representing one source and one destination. If you are simply replicating the same data across multiple destinations, the multi-step ETL has one step per destination, each with the same source. If you are splitting the source data across multiple destinations, the most common approach is to write queries (within LabKey) or views (on an external server) against the source data to shape it to match each destination; each query or view then becomes the source of one step. Often the destination tables have foreign key relationships between them, so the order of the steps (top to bottom) will be important.
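As a sketch of this pattern, the following two-step ETL loads a parent table before the child table that references it, using two hypothetical shaping queries ("PatientsForDemographics" and "VisitsForSchedule") as sources:
<etl xmlns="http://labkey.org/etl/xml">
  <name>SplitToTwoDestinations</name>
  <description>Load the parent table first, then the child table that references it.</description>
  <transforms>
    <transform id="loadDemographics" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="lists" queryName="PatientsForDemographics"/>
      <destination schemaName="study" queryName="Demographics" targetOption="merge"/>
    </transform>
    <transform id="loadVisits" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="lists" queryName="VisitsForSchedule"/>
      <destination schemaName="study" queryName="VisitSchedule" targetOption="merge"/>
    </transform>
  </transforms>
</etl>
Steps run in the order they appear, which is how the foreign key dependency is satisfied.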
Transformation Needs
- What kinds of actions need to be performed on your data between the source and target?
- Do you have existing scripting or procedures running on other databases that you want to leverage? If so, consider having an ETL call stored procedures on those other databases.
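If the transformation logic already exists as a stored procedure on the source or target database, a transform step can call it rather than copying rows directly. A sketch, assuming a procedure named "NormalizePatientData" exists in a schema exposed to LabKey as "external":
<transform id="normalize" type="StoredProcedure">
  <description>Run an existing stored procedure instead of a row copy.</description>
  <procedure schemaName="external" procedureName="NormalizePatientData"/>
</transform>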
Review the types of transformation tasks available in the related topics.
Decisions and Tradeoffs
There are often many ways to accomplish the same end result. This section covers some considerations that may help you decide how to structure your ETLs.
Multiple Steps in a Single ETL or Multiple ETLs?
- Do changes to the source affect multiple target datasets at once? If so, consider configuring multiple steps in one ETL definition.
- Do source changes impact a single target dataset? Consider using multiple ETL definitions, one for each dataset.
- Are the target queries relational? Consider multiple steps in one ETL definition.
- Do you need steps to always run in a particular order? Use multiple steps in a single ETL, particularly if one step is long running and needs to occur first. Separate ETLs may run in varying order depending on many performance-affecting factors.
- Should the entire series of steps run in a single transaction? If so, then use multiple steps in a single ETL.
ETLs Across LabKey Containers
ETLs are defined in the destination folder, i.e. the folder into which they will "pull" data from some original source. In the simplest case, the source is in the same local folder, but if the source is in a different container on the same server, there are several ways to reach it from your ETL:
- Create a linked schema for the source table in the destination folder. Then your ETL is created in the destination folder and simply provides this linked schema and query name as the source as if it were local. Linked schemas also let you be selective about the way that data is exposed "locally".
- Make your LabKey Server a remote connection to itself. Then you can access the source folder on the "remote connection" and provide the different container path there.
- Use "dot notation" to include a sort of path to the source schema in the schema definition. In this syntax, you use the word "Site" to mean the site level, "Project" to mean the current project when referencing a sibling folder at the same level under the project, and "folder" as a separator between layers of subfolder. A few examples:
schemaName="Project.<folderName>.lists"
schemaName="Site.<projectName>.folder.<folderName>.lists"
- Use a sourceContainerPath parameter in the <source> element, providing the full path to the location of the source data. For example, for a list in the "Example" folder under the "Home" project, you'd use:
<source queryName="listName" schemaName="lists" sourceContainerPath="/Home/Example"/>
Accessibility and Permissions
- How does the user who will run the ETL currently access the source data?
- Will the developer who writes the ETL also run (or schedule) it?
- To create an ETL, the user needs to either be a module developer (providing the module source for it) OR have administrative permissions to create one within the user interface.
ETL processes are run in the context of a folder, and with the permissions of the initiator.
- If run manually, they run with the permissions of the initiating user, i.e. the user who clicks Run.
- If scheduled, they will run as the user who enabled them, i.e. the user who checked the enabled box.
- To run or enable ETLs as a "service user", an administrator can use impersonation of that service user account, or sign in under it.
- There is no way for a non-admin to initiate an ETL that then runs as an admin or service account. To support a scenario like having a non-admin "kick off" an ETL after adding data, use a frequently-running scheduled ETL (enabled as the admin/service user) with a "ModifiedSince" filter strategy on the source so that it will only act when the data has changed (see the sketch below).
- Will the ETL only run in a single folder, or will you want to make it available site-wide? If the latter, your ETL must set the siteScope attribute to true:
<etl xmlns="http://labkey.org/etl/xml" siteScope="true">
Site-scoped ETLs can be viewed on, and run from, the admin console.
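A sketch of the "gated by data changes" pattern described above: the ETL is enabled once by the admin or service user, polls frequently, and only acts when rows in the source have changed since the last successful run. The timestamp column name "Modified" and the schema and query names are assumptions; substitute your own:
<etl xmlns="http://labkey.org/etl/xml">
  <name>FrequentGatedCopy</name>
  <description>Poll every 15 minutes; act only on rows modified since the last run.</description>
  <transforms>
    <transform id="copyChanges" type="org.labkey.di.pipeline.TransformTask">
      <source schemaName="lists" queryName="Patients"/>
      <destination schemaName="study" queryName="Demographics" targetOption="merge"/>
    </transform>
  </transforms>
  <incrementalFilter className="ModifiedSinceFilterStrategy" timestampColumnName="Modified"/>
  <schedule>
    <poll interval="15m"/>
  </schedule>
</etl>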
Repeatability
- Do you need your ETL to run on a regular schedule? If so, see the scheduling sketch after this list.
- Is it a one-time migration for a specific table or will you want to make a series of similar migrations?
- Do you expect the ETL to start from scratch each time (i.e. act on the entire source table) or do you need it to only act on modifications since the last time it was run? If the latter, use an incremental filter strategy.
- Will the ETL only run in a single folder, or will you want to make it available site-wide? If the latter, consider including it in a module that can be accessed site wide.
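If a fixed time of day fits better than a polling interval (as in the poll sketch above), the <schedule> element can instead hold a cron expression in Quartz syntax. A brief sketch that would run nightly at 2:00 AM:
<schedule>
  <cron expression="0 0 2 * * ?"/>
</schedule>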
Notifications
Your ETL will run as a pipeline job, so if you would like to enable email notifications for ETLs, you can configure
pipeline notifications.
Related Topics