Run NLP Pipeline


Premium Feature — Available in the Enterprise Edition of LabKey Server. Learn more or contact LabKey.

This topic outlines how to configure a workspace and run the NLP pipeline directly against source TSV files. First, an administrator must configure the pipeline as described here. Then, any number of users can process TSV files through one or more versions of the NLP engine. A user can also rerun a given TSV file later using a different version of the engine to compare results and test the NLP engine itself.


If you already have abstraction results obtained from another provider and formatted in a compatible manner, you can upload them directly, bypassing the NLP pipeline. Find details below under Directly Import Abstraction Results.

Set Up a Workspace

Each user should work in their own folder, particularly if they intend to use different NLP engines.

The default NLP folder contains web parts for the Data Pipeline, NLP Job Runs, and NLP Reports. To return to this main page at any time, click NLP Dashboard in the upper right.

Set Up the Data Pipeline

  • In the Data Pipeline web part, click Setup.
  • Select Set a pipeline override.
  • Enter the primary directory where the files you want to process are located.
  • Set searchability and permissions appropriately.
  • Click Save.
  • Click NLP Dashboard.

Configure Options

Set Module Properties

Multiple abstraction pipelines may be available on a given server. Using module properties, the administrator can select a specific abstraction pipeline and specific set of metadata to use in each container.

These module properties can all vary per container, so for each property you will see a list of folders, starting with "Site Default" and ending with your current project or folder. All parent containers will be listed, and any configured values will be displayed. Containers in which you do not have permission to edit these properties will be grayed out.

  • Navigate to the container (project or folder) you are setting up.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Pipeline, next to your folder name, select the appropriate value for the abstraction pipeline to use in that container.
  • Under Property: Metadata Location, enter an alternate metadata location to use for the folder if needed.
  • Check the box marked "Is Metadata Location Relative to Container?" if you've provided a relative path above instead of an absolute one.
  • Click Save Changes when finished.

Alternate Metadata Location

The metadata is specified in a .json file, named "metadata.json" in this example, though other names can be used.

  • Upload the "metadata.json" file to the Files web part.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Metadata Location, enter the file name in one of these ways:
    • "metadata.json": The file is located in the root of the Files web part in the current container.
    • "subfolder-path/metadata.json": The file is in a subfolder of the Files web part in the current container (relative path)
    • "full path to metadata.json": Use the full path if the file has not been uploaded to the current project.
  • If the metadata file is in the files web part of the current container (or a subfolder within it), check the box for the folder name under Property: Is Metadata Location Relative to Container.
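For example, assuming the file was uploaded to a subfolder named "nlp" within the Files web part (the subfolder name here is only an illustration), the two properties would be set as follows:

    Metadata Location: nlp/metadata.json
    Is Metadata Location Relative to Container?: checked

With the relative option checked, the location resolves against the file root of the current container, i.e. <file root>/nlp/metadata.json.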

Define Pipeline Protocol(s)

When you import a TSV file, you will select a Protocol, which may include one or more overrides of the NLP engine's default parameters. If multiple NLP engines are available, you can include the NLP version to use as a parameter. With version-specific protocols defined, you then simply select the desired protocol during file import. You may define a new protocol on the fly during any TSV file import, or you may find it simpler to predefine one or more. To quickly do so, you can import a small stub file, such as the one attached to this page.

  • Download this file: stub.nlp.tsv and place it in a location of your choice.
  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop the stub.nlp.tsv file into the upload window.

For each protocol you want to define:

  • In the Data Pipeline web part, click Process and Import Data.
  • Select the stub.nlp.tsv file and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • From the Analysis Protocol dropdown, select "<New Protocol>". If there are no other protocols defined, this will be the only option.
  • Enter a name and description for this protocol. A name is required if you plan to save the protocol for future use. Using the version number in the name can help you easily differentiate protocols later.
  • Add a new line to the Parameters section for any parameters required, such as the location of an alternate metadata file or the subdirectory containing the intended NLP engine version. In our setup documentation, for example, the subdirectories are named "engineVersion1" and "engineVersion2" (your naming may differ). To specify an engine version, uncomment the example line shown and use the following (a sketch of the full parameter block appears after this list):
<note label="version" type="input">engineVersion1</note>
  • Select an Abstraction Identifier from the dropdown if you want every document processed with this protocol to be assigned the same identifier. "[NONE]" is the default.
  • Confirm "Save protocol for future use" is checked.
  • Scroll to the next section before proceeding.
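For reference, pipeline protocol parameters are saved as an XML block of <note> elements. The sketch below assumes the standard LabKey pipeline parameter format with a <bioml> wrapper; the "version" parameter and "engineVersion1" value come from the example above, and any additional labels depend on your engine configuration:

<?xml version="1.0" encoding="UTF-8"?>
<bioml>
    <!-- Subdirectory containing the NLP engine version to run -->
    <note label="version" type="input">engineVersion1</note>
</bioml>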

Define Document Processing Configuration(s)

Document processing configurations control how abstraction and review tasks are automatically assigned. One or more configurations can be defined, enabling you to select among different settings or criteria by configuration name. In the lower portion of the import page, you will find the Document processing configuration section:

  • From the Name dropdown, select the name of the processing configuration you want to use.
    • The definition will be shown in the UI so you can confirm this is correct.
  • If you need to make changes, or want to define a new configuration, select "<New Configuration>".
    • Enter a unique name for the new configuration.
    • Select the type of document and disease groups to which this configuration applies. Note that if other types of reports are uploaded for assignment at the same time, they will not be assigned.
    • Enter the percentages for assignment to review and abstraction.
    • Select the group(s) of users to which to make assignments. Eligible project groups are shown with checkboxes.
  • Confirm "Save configuration for future use" is checked to make this configuration available for selection by name during future imports.
  • Complete the import by clicking Analyze.
  • In the Data Pipeline web part, click Process and Import Data.
  • Upload the "stub.nlp.tsv" file again and repeat the import. This time you will see the new protocol and configuration you defined available for selection from their respective dropdowns.
  • Repeat these two steps to define all the protocols and document processing configurations you need.
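As an illustration (all names and values here are hypothetical), a saved document processing configuration might look like:

    Name: pathology-standard
    Document type: Pathology Report
    Disease groups: All
    Assign to abstraction: 100%
    Assign to review: 20%
    Assignment groups: "Abstractors" project group

Documents of the matching type imported under this configuration are then distributed to members of the selected group according to those percentages.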

For more information, see Pipeline Protocols.

Run Data Through the NLP Pipeline

First, upload your TSV files to the pipeline.

  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop files or directories you want to process into the window to upload them.

Once the files are uploaded, you can iteratively run each through the NLP engine as follows:

  • In the Data Pipeline web part, click Process and Import Data.
  • Navigate uploaded directories if necessary to find the files of interest.
  • Check the box for a TSV file of interest and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • Choose an existing Analysis Protocol or define a new one.
  • Choose an existing Document Processing Configuration or define a new one.
  • Click Analyze.
  • While the engine is running, the pipeline web part will show a job in progress. When it completes, the pipeline job will disappear from the web part.
  • Refresh your browser window to show the new results in the NLP Job Runs web part.

View and Download Results

Once the NLP pipeline import is successful, the input and intermediate output files are both deleted from the filesystem.

The NLP Job Runs web part lists the completed run, along with information like the document type and any identifier assigned, for easy filtering and reporting. Hover over any run to reveal a (Details) link in the leftmost column. Click it to see both the input file and how the run was interpreted into tabular data.

During import, newline characters (CRLF, LFCR, CR, and LF) are all normalized to LF to simplify highlighting text when abstracting information.
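For example (illustrative), a report saved with Windows-style line endings is normalized on import:

    before: "Line one\r\nLine two\r\n"
    after:  "Line one\nLine two\n"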

Note: The results may be reviewed for accuracy. In particular, the disease group determination is used to guide other abstracted values. If a reviewer notices an incorrect designation, they can manually update it and send the document back through the NLP pipeline for reprocessing with the correct designation.

Download Results

To download the results, select (Export/Sign Data) above the grid and choose the desired format.

Rerun

To rerun the same file with a different version of the engine, simply repeat the original import process, but this time choose a different protocol (or define a new one) to point to a different engine version.

Error Reporting

During processing of files through the NLP pipeline, some errors require human reconciliation before processing can proceed. The pipeline log provides a report of any errors detected during processing, including:

  • Mismatches between field metadata and the field list. To ignore these mismatches during upload, set "validateResultFields" to false and rerun (see the parameter sketch after this list).
  • Errors or excessive delays while the transform phase checks whether work is available. These errors can indicate problems in the job queue that should be addressed. Add a Data Transform Jobs web part to see the latest error in the Transform Run Log column.
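The "validateResultFields" override uses the same <note> syntax as the protocol parameters above and is added to the protocol's Parameters section (a sketch based on that syntax):

<note label="validateResultFields" type="input">false</note>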

For more information about data transform error handling and logging, see ETL: Logs and Error Handling.

Directly Import Abstraction Results

If your abstraction results use the same JSON format that the NLP pipeline outputs, you can upload them directly as an alternative to running the pipeline.

Requirements:

  • A TXT report file and a JSON results file, each named for the report: one with a .txt extension and one with a .nlp.json extension. No TSV file is required.
  • This pair of files is uploaded to LabKey as a directory or zip file.
  • A single directory or zip file can contain multiple pairs of files representing multiple reports, each with a unique report name and corresponding JSON file.
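For example, a zip file containing two reports might look like this (file names are illustrative):

    reports.zip
        report_001.txt
        report_001.nlp.json
        report_002.txt
        report_002.nlp.json

To import the files: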
  • Upload the directory or zip file containing the report .txt and .nlp.json files.
  • Select the JSON file(s) and click Import Data.
  • Choose the option NLP Results Import.
  • Choose (or define) the pipeline protocol and document processing configuration.
  • Click Analyze.
  • Results are imported, with new documents appearing in the task list as if they had been processed using the NLP engine.
