This topic outlines how to configure a workspace and run the NLP pipeline directly against source TSV files. First, an administrator must configure the pipeline as described in this topic. Then, any number of users can process TSV files through one or more versions of the NLP engine. A user can also rerun a given TSV file later using a different version of the engine to compare results and test the NLP engine itself.

Set Up a Workspace

Each user should work in their own folder, particularly if they intend to use different NLP engines.

  • Log in to the server.
  • Create a new folder to work in (you must be a folder administrator to create a new folder).
    • Select Admin > Folder > Management.
    • Click Create Subfolder.
    • Enter a (unique) name for your folder and select the folder type NLP.
    • Click Next and then Finish.
  • This walkthrough uses the folder name "NLP Test Space".

The default NLP folder contains web parts for the Data Pipeline, NLP Job Runs, and NLP Reports. To return to this main page at any time, click NLP Dashboard in the upper right.

Set Up the Data Pipeline

  • In the Data Pipeline web part, click Setup.
  • Select Set a pipeline override.
  • Enter the primary directory where the files you want to process are located.
  • Set searchability and permissions appropriately.
  • Click Save.
  • Click NLP Dashboard.
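
For orientation, the following sketch shows one possible layout, assuming the engine version subdirectories referenced later in this topic sit beneath the same override directory (all names here are placeholders; your installation may locate the engine versions elsewhere):

  /data/nlp/                  <-- pipeline override: the primary directory entered above
    engineVersion1/           <-- subdirectory holding one version of the NLP engine
    engineVersion2/           <-- subdirectory holding another version
    stub.nlp.tsv              <-- TSV files to process are uploaded here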

Define Pipeline Protocol(s)

When you import a TSV file, you will select a protocol, which may include one or more overrides of default parameters to the NLP engine. If multiple NLP engines are available, you can include the engine version to use as a parameter. With version-specific protocols defined, you then simply select the desired protocol during file import. You may define a new protocol on the fly during any TSV file import, but you may find it simpler to predefine one or more. To do so quickly, you can import a small stub file, such as the one attached to this page.

  • Download this file: stub.nlp.tsv and place it in the location of your choice.
  • Click Process and Import Data on the NLP Dashboard.
  • Drag and drop the stub.nlp.tsv file into the upload window.
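
If you cannot use the attached stub.nlp.tsv, any small TSV file the NLP engine can parse should serve the same purpose. A hypothetical minimal sketch (tab-separated, with placeholder column names; the attached stub file is the reliable choice) might look like:

  recordId	text
  1	placeholder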

For each protocol you want to define:

  • Click Process and Import Data on the NLP Dashboard.
  • Select the stub.nlp.tsv file and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • From the Analysis Protocol dropdown, select "<New Protocol>". If there are no other protocols defined, this will be the only option.
  • Enter a name (required) and description for this protocol. Using the version number in the name will help you easily differentiate them later.
  • Add a new line to the Parameters section giving the subdirectory that contains the intended engine version. In the example in our setup documentation, the subdirectories are named "engineVersion1" and "engineVersion2", but your naming may differ. A complete parameter example appears after this list.
    <note label="version" type="input">engineVersion1</note>
  • Confirm "Save protocol for future use" is checked.
  • Click Analyze.
  • Return to the files panel by clicking NLP Dashboard, then Process and Import Data.
  • Select the "stub.nlp.tsv" file again and repeat the import. This time you will see the first protocol you defined as an option.
  • Select "<New Protocol>" and enter the name of the next engine subdirectory as the version parameter.
  • Repeat as needed.
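
As entered in the Parameters section, a protocol pointing at the second engine subdirectory might look like the following sketch. The <note> line matches the example above; the surrounding XML wrapper is an assumption based on the standard bioml-style parameter format and may already be supplied for you in the UI:

  <?xml version="1.0"?>
  <bioml>
    <!-- subdirectory containing the NLP engine version to run -->
    <note label="version" type="input">engineVersion2</note>
  </bioml>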

For more information, see Pipeline Protocols.

Run Data Through the NLP Pipeline

First upload your TSV files to the pipeline.

  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop files or directories you want to process into the window to upload them.

Once the files are uploaded, you can iteratively run each through the NLP engine as follows:

  • Click NLP Dashboard and then Process and Import Data.
  • Navigate uploaded directories if necessary to find the files of interest.
  • Check the box for a TSV file of interest and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • Choose an existing Analysis Protocol or define a new one.
  • Click Analyze.
  • While the engine is running, the pipeline web part will show a job in progress. When it completes, the pipeline job will disappear from the web part.
  • Refresh your browser window to show the new results in the NLP Job Runs web part.

View and Download Results

Once the NLP pipeline import is successful, the input and intermediate output files are both deleted from the filesystem.

The NLP Job Runs web part lists the completed run. Click Details on the right to see both the input and how it was interpreted into tabular data.

Note: Review the results for accuracy. In particular, the disease group determination is used to guide other abstracted values. If a reviewer notices an incorrect designation, they can manually edit and update it, then send the document back through the NLP pipeline for reprocessing with the correct designation.

Download Results

To download the results, select Export above the grid and choose the desired format.

Rerun

To rerun the same file with a different version of the engine, simply repeat the original import process, but this time choose a different protocol (or define a new one) to point to a different engine version.

Error Reporting

While files are being processed through the NLP pipeline, some errors require human reconciliation before processing can proceed. The pipeline log provides a report of any errors detected during processing, including:

  • Mismatches between field metadata and the field list. To ignore these mismatches during upload, set "validateResultFields" to false and rerun; see the parameter example after this list.
  • Errors or excessive delays while the transform phase is checking whether work is available. These errors can indicate problems in the job queue that should be addressed. Add a Data Transform Jobs web part to see the latest error in the Transform Run Log column.
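
For example, to suppress the field metadata mismatch errors described above, a protocol could include a parameter line mirroring the version parameter syntax shown earlier (the label here is taken from the error description; confirm the exact spelling for your deployment):

  <note label="validateResultFields" type="input">false</note>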

For more information about data transform error handling and logging, see ETL: Logs and Error Handling.

Related Topics

  • Pipeline Protocols
  • ETL: Logs and Error Handling