Table of Contents

     LabKey Natural Language Processing (NLP)
       Natural Language Processing (NLP) Pipeline
       Metadata Json Files
       Document Abstraction Workflow
       Automatic Assignment for Abstraction
       Manual Assignment for Abstraction
       Abstraction Task Lists
       Document Abstraction
       Review Document Abstraction
       Review Multiple Result Sets
       NLP Result Transfer
       Configure LabKey NLP
       Run NLP Pipeline

LabKey Natural Language Processing (NLP)


Premium Feature — Available in the Enterprise Edition. Learn more or contact LabKey.

Large amounts of clinical data are locked up in free-hand notes and document formats that were not originally designed for entry into computer systems. How can this data be extracted for the purposes of standardization, consolidation, and clinical research? LabKey's Natural Language Processing (NLP) and Document Abstraction, Curation, and Annotation workflow tools help to unlock this data and transform it into a tabular format that can better yield clinical insight.

LabKey Server's solution focuses on the overall workflow required to efficiently transform large amounts of data into formats usable by researchers. Teams can take an integrated, scalable approach to both manual data abstraction and automated natural language processing (NLP) engine use.

LabKey NLP is available for subscription purchase from LabKey. For further information, please contact LabKey.

Documentation

Deprecated NLP Engine

An integrated solution combining LabKey workflow tools with a natural language processing engine developed at Fred Hutch is no longer under active development. You can learn more about how this solution was implemented in the topics that follow.

Resources




Natural Language Processing (NLP) Pipeline



This topic outlines how to upload documents that are already formatted to be compatible with the Abstraction and Annotation workflow tools.

Integrated NLP Engine (Optional)

Another upload option is to configure and use a Natural Language Processing (NLP) pipeline with an integrated NLP engine. Different NLP engines and preprocessors have different requirements and produce output in different formats. This topic can help you correctly configure one to suit your application.

Set Up a Workspace

The default NLP folder contains web parts for the Data Pipeline, NLP Job Runs, and NLP Reports. To return to this main page at any time, click the Folder Name link near the top of the page.

Upload Documents Directly

Documents for abstraction, annotation, and curation can be directly uploaded. Typically the following formats are provided:

  • A TXT report file and a JSON results file, each named for the document: one with a .txt extension and one with an .nlp.json extension.
  • This pair of files is uploaded to LabKey as a directory or zip file.
  • A single directory or zip file can contain multiple pairs of files representing multiple documents, each with a unique report name and corresponding json file.
For more about the JSON format used, see Metadata Json Files.
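For example, a single zip file covering two documents might contain files like these (names are illustrative):

report_1001.txt
report_1001.nlp.json
report_1002.txt
report_1002.nlp.json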

  • Upload the directory or zip file containing the documents, including the .nlp.json files.
  • Select the JSON file(s) and click Import Data.
  • Choose the option NLP Results Import.

Review the uploaded results and proceed to task assignment.

Integrated NLP Engine

Configure a Natural Language Processing (NLP) Pipeline

An alternate upload method integrates directly with an NLP engine, which can provide additional functionality during the upload process via a pipeline configuration.

Set Up the Data Pipeline

  • Return to the main page of the folder.
  • In the Data Pipeline web part, click Setup.
  • Select Set a pipeline override.
  • Enter the primary directory where the files you want to process are located.
  • Set searchability and permissions appropriately.
  • Click Save.
  • Click NLP Dashboard.

Configure Options

Set Module Properties

Multiple abstraction pipelines may be available on a given server. Using module properties, the administrator can select a specific abstraction pipeline and specific set of metadata to use in each container.

These module properties can all vary per container, so for each you will see a list of folders, starting with "Site Default" and ending with your current project or folder. All parent containers will be listed, and if values are configured, they will be displayed. Any container in which you do not have permission to edit these properties will be grayed out.

  • Navigate to the container (project or folder) you are setting up.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Pipeline, next to your folder name, select the appropriate value for the abstraction pipeline to use in that container.
  • Under Property: Metadata Location enter an alternate metadata location to use for the folder if needed.
  • Check the box marked "Is Metadata Location Relative to Container?" if you've provided a relative path above instead of an absolute one.

Alternate Metadata Location

The metadata is specified in a .json file, named in this example "metadata.json" though other names can be used.

  • Upload the "metadata.json" file to the Files web part.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Metadata Location, enter the file name in one of these ways:
    • "metadata.json": The file is located in the root of the Files web part in the current container.
    • "subfolder-path/metadata.json": The file is in a subfolder of the Files web part in the current container (relative path)
    • "full path to metadata.json": Use the full path if the file has not been uploaded to the current project.
  • If the metadata file is in the files web part of the current container (or a subfolder within it), check the box for the folder name under Property: Is Metadata Location Relative to Container.

Define Pipeline Protocol(s)

When you import a TSV file, you will select a Protocol which may include one or more overrides of default parameters to the NLP engine. If there are multiple NLP engines available, you can include the NLP version to use as a parameter. With version-specific protocols defined, you then simply select the desired protocol during file import. You may define a new protocol on the fly during any tsv file import, or you may find it simpler to predefine one or more. To quickly do so, you can import a small stub file, such as the one attached to this page.

  • Download this file: stub.nlp.tsv and place in the location of your choice.
  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop the stub.nlp.tsv file into the upload window.

For each protocol you want to define:

  • In the Data Pipeline web part, click Process and Import Data.
  • Select the stub.nlp.tsv file and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • From the Analysis Protocol dropdown, select "<New Protocol>". If there are no other protocols defined, this will be the only option.
  • Enter a name and description for this protocol. A name is required if you plan to save the protocol for future use. Using the version number in the name can help you easily differentiate protocols later.
  • Add a new line to the Parameters section for any parameters required, such as giving the location of an alternate metadata file or specifying the subdirectory that contains the intended NLP engine version. For example, in our setup documentation the subdirectories are named "engineVersion1" and "engineVersion2" (your naming may differ), so to specify an engine version you would uncomment the example line shown and use the following (a sketch of the complete parameter entry appears after this list):
<note label="version" type="input">engineVersion1</note>
  • Select an Abstraction Identifier from the dropdown if you want every document processed with this protocol to be assigned the same identifier. "[NONE]" is the default.
  • Confirm "Save protocol for future use" is checked.
  • Scroll to the next section before proceeding.
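For reference, here is a minimal sketch of what the complete Parameters entry might look like once the version line is added. It assumes the standard LabKey pipeline parameter (bioml) format pre-populated on the import page; treat it as illustrative rather than definitive.

<?xml version="1.0"?>
<bioml>
  <!-- Selects the engine version subdirectory; "engineVersion1" is the example name from the setup documentation -->
  <note label="version" type="input">engineVersion1</note>
</bioml>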

Define Document Processing Configuration(s)

Document processing configurations control how abstraction and review tasks are automatically assigned. One or more configurations can be defined, enabling you to easily select among different settings or criteria by the name of the configuration. In the lower portion of the import page, you will find the Document processing configuration section:

  • From the Name dropdown, select the name of the processing configuration you want to use.
    • The definition will be shown in the UI so you can confirm this is correct.
  • If you need to make changes, or want to define a new configuration, select "<New Configuration>".
    • Enter a unique name for the new configuration.
    • Select the type of document and disease groups to apply this configuration to. Note that if other types of report are uploaded for assignment at the same time, they will not be assigned.
    • Enter the percentages for assignment to review and abstraction.
    • Select the group(s) of users to which to make assignments. Eligible project groups are shown with checkboxes.
  • Confirm "Save configuration for future use" is checked to make this configuration available for selection by name during future imports.
  • Complete the import by clicking Analyze.
  • In the Data Pipeline web part, click Process and Import Data.
  • Upload the "stub.nlp.tsv" file again and repeat the import. This time you will see the new protocol and configuration you defined available for selection from their respective dropdowns.
  • Repeat these two steps to define all the protocols and document processing configurations you need.

For more information, see Pipeline Protocols.

Run Data Through the NLP Pipeline

First upload your TSV files to the pipeline.

  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop files or directories you want to process into the window to upload them.

Once the files are uploaded, you can iteratively run each through the NLP engine as follows:

  • In the Data Pipeline web part, click Process and Import Data.
  • Navigate uploaded directories if necessary to find the files of interest.
  • Check the box for a tsv file of interest and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • Choose an existing Analysis Protocol or define a new one.
  • Choose an existing Document Processing Configuration or define a new one.
  • Click Analyze.
  • While the engine is running, the pipeline web part will show a job in progress. When it completes, the pipeline job will disappear from the web part.
  • Refresh your browser window to show the new results in the NLP Job Runs web part.

View and Download Results

Once the NLP pipeline import is successful, the input and intermediate output files are both deleted from the filesystem.

The NLP Job Runs web part lists the completed run, along with information like the document type and any identifier assigned for easy filtering or reporting. Hover over any run to reveal a (Details) link in the leftmost column. Click it to see both the input file and how the run was interpreted into tabular data.

During import, new line characters (CRLF, LFCR, CR, and LF) are all normalized to LF to simplify highlighting text when abstracting information.

Note: The results may be reviewed for accuracy. In particular, the disease group determination is used to guide other abstracted values. If a reviewer notices an incorrect designation, they can manually edit and update it, then send the document for reprocessing through the NLP engine with the correct designation.

Download Results

To download the results, select (Export/Sign Data) above the grid and choose the desired format.

Rerun

To rerun the same file with a different version of the engine, simply repeat the original import process, but this time choose a different protocol (or define a new one) to point to a different engine version.

Error Reporting

During processing of files through the NLP pipeline, some errors which occur require human reconciliation before processing can proceed. The pipeline log is available with a report of any errors that were detected during processing, including:

  • Mismatches between field metadata and the field list. To ignore these mismatches during upload, set "validateResultFields" to false and rerun (see the example parameter line after this list).
  • Errors or excessive delays while the transform phase is checking to see if work is available. These errors can indicate problems in the job queue that should be addressed.
Add a Data Transform Jobs web part to see the latest error in the Transform Run Log column.
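For example, assuming the same parameter format used for the engine version above, the override could be added to the protocol's Parameters section as follows (illustrative; verify against your protocol template):

<note label="validateResultFields" type="input">false</note>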

For more information about data transform error handling and logging, see ETL: Logs and Error Handling.

Related Topics




Metadata Json Files



A metadata file written in JSON is used to configure the fields and categories for abstraction. The file configures the field categories and lists available, as well as what type of values are permitted for each field.

Example

Download this sample file for a simple example:

Here, a Pathology category is defined with table- and field-level groupings, and two tables with simple fields: "ClassifiedDiseaseGroup", an open string field, and "Field1", a closed field limited to the values "Yes" and "No".

{
  "pathology": {
    "groupings": [
      {
        "level": "table",
        "order": "alpha",
        "orientation": "horizontal"
      },
      {
        "level": "field",
        "order": "alpha",
        "orientation": "horizontal"
      }
    ],
    "tables": [
      {
        "table": "EngineReportInfo",
        "fields": [
          {
            "field": "ClassifiedDiseaseGroup",
            "datatype": "string",
            "closedClass": "False",
            "diseaseProperties": [
              {
                "diseaseGroup": ["*"],
                "values": ["disease1", "disease2", "disease3"]
              }
            ]
          }
        ]
      },
      {
        "table": "Table1",
        "fields": [
          {
            "field": "Field1",
            "datatype": "string",
            "closedClass": "True",
            "diseaseProperties": [
              {
                "diseaseGroup": ["*"],
                "values": ["Yes", "No"]
              }
            ]
          }
        ]
      }
    ]
  }
}

Enable Multiple Values per Field

To allow an abstractor to select multiple values from a pulldown menu, which will be shown in the field value separated by || (double pipes), include the following in the field definition:

"multiValue" : "True"
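For example, a complete multi-select field definition might look like the following; the field name and value list are illustrative, and the rest of the structure follows the sample metadata file above.

{
  "field": "PathSite",
  "datatype": "string",
  "closedClass": "True",
  "multiValue": "True",
  "diseaseProperties": [
    {
      "diseaseGroup": ["*"],
      "values": ["Lung", "Brain", "Thoracic"]
    }
  ]
}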



Document Abstraction Workflow


The Document Abstraction (or Annotation) Workflow supports the movement and tracking of documents through the general process described below. All steps are optional for any given document, and each part of the workflow can be configured to suit your specific needs. Different types of documents (for example, Pathology Reports and Cytogenetics Reports) can be processed through the same workflow, task list, and assignment process, each using abstraction algorithms specific to the type of document. The assignment process itself can also be customized based on the type of disease discussed in the document.

Roles and Tasks

  • NLP/Abstraction Administrator:
    • Configure terminology and set module properties.
    • Review list of documents ready for abstraction.
    • Make assignments of roles and tasks to others.
    • Manage project groups corresponding to the expected disease groups and document types.
    • Create document processing configurations.
  • Abstractor:
    • Choose a document to abstract from individual list of assignments.
    • Abstract document. You can save and return to continue work in progress if needed.
    • Submit abstraction for review - or approval if no reviewer is assigned.
  • Reviewer:
    • Review list of documents ready for review.
    • Review abstraction results; compare results from multiple abstractors if provided.
    • Mark document as ready to progress to the next stage - either approve or reject.
    • Review and potentially edit previously approved abstraction results.

It is important to note that documents to be abstracted may well contain protected health information (PHI). Protection of PHI is strictly managed by LabKey Server, and with the addition of the nlp_premium, compliance, and complianceActivities modules, all access to documents, task lists, and other content containing PHI can be gated by permissions and also subject to approval of terms of use specific to the user's intended activity. Further, all access that is granted, including viewing, abstracting, and reviewing, can be logged for audit or other review.

All sample screenshots and information shown in this documentation are fictitious.

Configuring Terminology

The terminology used in the abstraction user interface can be customized to suit the specific words used in the implementation. Even the word "Abstraction" can be replaced with something your users are used to, such as "Annotation" or "Extraction." Changes to NLP terminology apply to your entire site.

An Abstraction Administrator can configure the terminology used as follows:

  • Select (Admin) > Site > Admin Console.
  • Click Admin Console Links.
  • Under Premium Features, click NLP Labels.
  • The default labels for concepts, fields, tooltips, and web part customization are listed.
    • There may be additional customizable terms shown, depending on your implementation.
  • Hover over any icon for more information about any term.
  • Enter new terminology as needed, then click Save.
  • To revert all terms to the original defaults, click Reset.

Terms you can customize:

  • The concept of tagging or extracting tabular data from free text documents (examples: abstraction, annotation) and name for user(s) who do this work.
  • Subject identifier (examples: MRN, Patient ID)
  • What is the item being abstracted (examples: document, case, report)
  • The document type, source, identifier, activity
The new terms you configure will be used in web parts, data regions, and UIs; in some cases admins will still see the "default" names, such as when adding new web parts. To help your users, you can also customize the tooltips shown for the fields to use your local terminology.

You can also customize the subtable name used for storing abstraction information using a module property. See Setting Module Properties below.

Abstraction Workflow

Documents are first uploaded, then assigned, then pass through the many options in the abstraction workflow until completion.

The document itself passes through a series of states within the process:

  • Ready for assignment: when automatic abstraction is complete, automatic assignment was not completed, or reviewer requests re-abstraction
  • Ready for manual abstraction: once an abstractor is assigned
  • Ready for review: when abstraction is complete, if a reviewer is assigned
  • (optional) Ready for reprocessing: if requested by the reviewer
  • Approved
  • If Additional Review is requested, an approved abstraction result is sent for secondary review to a new reviewer.
Passage of a document through these stages can be managed using a BPMN (Business Process Model and Notation) workflow engine. LabKey Server uses an Activiti workflow to automatically advance the document to the correct state upon completion of the prior state. Users assigned as abstractors and reviewers can see lists of tasks assigned to them and mark them as completed when done.

Assignment

Following the upload of the document and any provided metadata or automatic abstraction results, many documents will also be assigned for manual abstraction. The manual abstractor begins with any information garnered automatically and validates, corrects, and adds additional information to the abstracted results.

The assignment of documents to individual abstractors may be done automatically or manually by an administrator. An administrator can also choose to bypass the abstraction step by unassigning the manual abstractor, immediately forwarding the document to the review phase.

The Abstraction Task List web part is typically included in the dashboard for any NLP project, and shows each viewing user a tailored view of the particular tasks they are to complete. Typically a user will have only one type of task to perform, but if they play different roles, such as for different document types, they will see multiple lists.

Abstraction

The assigned user completes a manual document abstraction following the steps outlined in the Document Abstraction topic.

Review

Once abstraction is complete, the document is "ready for review" (if a reviewer is assigned) and the task moves to the assigned reviewer. If the administrator chooses to bypass the review step, they can leave the reviewer task unassigned for that document.

Reviewers select their tasks from their personalized task list, but can also see other cases on the All Tasks list. In addition to reviewing new abstractions, they can review and potentially reject previously approved abstraction results. Abstraction administrators may also perform this second level review. A rejected document is returned for additional steps as described in the reprocessing table in the Review Document Abstraction topic.

Setting Module Properties

Module properties are provided for customizing the behavior of the abstraction workflow process, particularly the experience for abstraction reviewers. Module properties, also used in configuring NLP engines, can have both a site default setting and an overriding project level setting as needed.

  • ShowAllResults: Check the box to show all sets of selected values to reviewers. When unchecked, only the first set of results are shown.
  • AnonymousMode: Check to show reviewers the column of abstracted information without the userID of the individual abstractor. When unchecked, the name of the abstractor is shown.
  • Default Record Key: The default grouping name to use in the UI for signaling which record (e.g. specimen or finding) the field value relates to. Provide the name your users will expect to give to subtables. Only applicable when the groupings level is at 'recordkey'. If no value is specified, the default is "SpecimenA" but other values like "1" might be used instead.
  • To set module properties, select (Admin) > Folder > Management and click the Module Properties tab.
  • Scroll down to see and set the properties relevant to the abstraction review process:
  • Click Save Changes.

Developer Note: Retrieving Approved Data via API

The client API can be used to retrieve information about imported documents and results. However, the task status is not stored directly; rather, it is calculated at render time when the task status is displayed. When querying to select the "status" of a document, such as "Ready For Review" or "Approved," the reportId must be provided in addition to the taskKey. For example, a query like the following will return the expected calculated status value:

SELECT reportId, taskKey FROM Report WHERE ReportId = [remainder of the query]
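For example, here is a minimal sketch using the LabKey Python client API (the labkey package). The server address, folder path, and the assumption that the Report table is exposed through the "nlp" schema are illustrative; adjust them for your deployment.

from labkey.api_wrapper import APIWrapper

# Connect to the container holding the NLP documents (server and folder path are placeholders).
api = APIWrapper("labkey.example.com", "MyProject/NLP", use_ssl=True)

# Include reportId alongside taskKey so the calculated status value is returned as expected.
result = api.query.execute_sql(
    schema_name="nlp",
    sql="SELECT reportId, taskKey FROM Report",
)

for row in result["rows"]:
    print(row["reportId"], row["taskKey"])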

Related Topics




Automatic Assignment for Abstraction



Automatic Task Assignment

When setting up automatic task assignment, the abstraction administrator defines named configurations for the different types of documents to be abstracted and different disease groups those documents cover. The administrator can also create specific project groups of area experts for these documents so that automatic assignment can draw from the appropriate pool of people.

Project Group Curation

The abstraction administrator uses project groups to identify the people who should be assigned to abstract the particular documents expected. It might be sufficient to simply create a general "Abstractors" group, or perhaps more specific groups might be appropriate, each with a unique set of members:

  • Lung Abstractors
  • Multiple Myeloma Abstractors
  • Brain Abstractors
  • Thoracic Abstractors
When creating document processing configurations, you can select one or more groups from which to pull assignees for abstraction and review.

  • Create the groups you expect to need via (Admin) > Folder > Permissions > Project Groups.
  • On the Permissions tab, add the groups to the relevant abstraction permission role:
    • Abstractor groups: add to Document Abstractor.
    • Reviewer groups: add to Abstraction Reviewer.
  • Neither of these abstraction-specific roles carries any other permission to read or edit information in the folder. All abstractors and reviewers will also require the Editor role in the project in order to record information. Unless you have already granted such access to your pool of users through other groups, also add each abstractor and reviewer group to the Editor role.
  • Next add the appropriate users to each of the groups.

While the same person may be eligible to both abstract some documents and review others, no document will be reviewed by the same person who did the abstraction.

NLP Document Processing Configurations

Named task assignment configurations are created by an administrator using an NLP Document Processing Configurations web part. Configurations include the following fields:

  • Name
  • DocumentType
    • Pathology Reports
    • Cytogenetics Reports
    • Clinical Reports
    • All Documents (including the above)
  • Disease Groups - check one or more of the disease groups listed. Available disease groups are configured via a metadata file. The disease group control for a document is generated during the initial processing during upload. Select "All" to define a configuration that will apply to any disease group not covered by a more specific configuration.
  • ManualAbstractPct - the percentage of documents to assign for manual abstraction (default is 5%).
  • ManualAbstractReviewPct - the percentage of manually abstracted documents to assign for review (default is 5%).
  • EngineAbstractReviewPct - the percentage of automatically abstracted documents to assign for review (default is 100%).
  • MinConfidenceLevelPct - the minimum confidence level required from an upstream NLP engine to skip review of those engine results (default is 75%).
  • Assignee - use checkboxes to choose the group(s) from which abstractors should be chosen for this document and disease type.
  • Status - can be "active" or "inactive"
Other fields are tracked internally and can provide additional information to assist in assigning abstractors:
  • DocumentsProcessed
  • LastAbstractor
  • LastReviewer
You can define different configurations for different document types and different disease groups. For instance, standard pathology reports might be less likely to need manual abstraction than cytogenetics reports, but more likely to need review of automated abstraction. Reports about brain diseases might be more likely to need manual abstraction than those about lung diseases. The document type "All Documents" and the disease group "All" are used for processing of any documents not covered by a more specific configuration. If there is a type-specific configuration defined and active for a given document type, it will take precedence over the "All Documents" configuration. When you are defining a new configuration, you will see a message if it will override an existing configuration for a given type.

You can also define multiple configurations for a given document type. For example, you could have a configuration requiring higher levels of review and only activate it during a training period for a new abstractor. By selecting which configuration is active at any given time for each document type, different types of documents can get different patterns of assignment for abstraction. If no configuration is active, all assignments must be done manually.

Outcomes of Automatic Document Assignment

The following table lists what the resulting status for a document will be for all the possible combinations of whether engine abstraction is performed and whether abstractors or reviewers are assigned.

Engine Abstraction? | Abstractor Auto-Assigned? | Reviewer Auto-Assigned? | Document Status Outcome
Y | Y | Y | Ready for initial abstraction; to reviewer when complete
Y | Y | N | Ready for initial abstraction; straight to approved when complete
Y | N | Y | Ready for review (a common case when testing engine algorithms)
Y | N | N | Ready for manual assignment
N | Y | Y | Ready for initial abstraction; to reviewer when complete
N | Y | N | Ready for initial abstraction; straight to approved when complete
N | N | Y | Not valid; there would be nothing to review
N | N | N | Ready for manual assignment

Related Topics




Manual Assignment for Abstraction



When documents need to be manually assigned to an abstractor, they appear as tasks for an abstraction administrator. Assigned tasks in progress can also be reassigned using the same process.

Manual Assignment

Task List View

The task list view allows manual assignment of abstractors and reviewers for a given document. To be able to make manual assignments, the user must have "Abstraction Administrator" permission; folder and project administrators also have this permission.

Users with the correct roles are eligible to be assignees:

  • Abstractors: must have both "Document Abstractor" and "Editor" roles.
  • Reviewers: must have both "Abstraction Reviewer" and "Editor" roles.
It is good practice to create project groups of eligible assignees and grant the appropriate roles to these groups, as described in Automatic Assignment for Abstraction.

Each user assigned to an abstraction role can see tasks assigned to them and work through a personalized task list.

Click Assign on the task list.

In the popup, the pulldowns will offer the list of users granted the permission necessary to be either abstractors or reviewers. Make selections to assign one or both tasks. Leaving either pulldown without a selection means that step will be skipped. Click Save and the document will disappear from your "to assign" list and move to the pending task list of the next user you assigned.

Reassignment and Unassignment

After assignment, the task is listed in the All Cases grid. Here the Assign link allows an administrator to change an abstraction or review assignment to another person.

If abstraction has not yet begun (i.e. the document is still in the "Ready for initial abstraction" state), the administrator can also unassign abstraction by selecting the null row on the assignment pulldown. Doing so will immediately send the document to the review step, or if no reviewer is assigned, the document will be approved and sent on.

Once abstraction has begun, the unassign option is no longer available.

Batch Assignment and Reassignment

Several document abstraction or review tasks can be assigned or reassigned simultaneously as a batch, regardless of whether they were uploaded as part of the same "batch". Only tasks which would be individually assignable can be included in a batch reassignment. If the task has an Assign link, it can be included in this process:

  • From the Abstraction Task List, check the checkboxes for the tasks (rows) you want to assign to the same abstractor or reviewer (or both).
  • Click Assign in the grid header bar.
  • Select an Initial Abstractor or Reviewer or both.
    • If you leave the default "No Changes" in either field, the selected tasks will retain prior settings for that field.
    • If you select the null/empty value, the task will be unassigned. Unassignment is not available once the initial abstraction is in progress. Attempting to unassign an ineligible document will raise a warning message that some of the selected documents cannot be modified, and you will have the option to update all other selected documents with the unassignment.
  • Click Save to apply your changes to all tasks.

Newly assigned abstractors and reviewers will receive any work in progress and see previously selected values when they next view the assigned task.

Related Topics




Abstraction Task Lists



Administrators, abstractors, and reviewers have a variety of options for displaying and organizing tasks in their workflows.

If your folder does not already have the Abstraction Task List or NLP Batch View web parts, an administrator can add them as follows.

  • Enter (Admin) > Page Admin Mode.
  • Using the selector in the lower left, choose the desired web part by name, then click Add.
  • Use the (triangle) menu for the web part to move it up or down as desired.
  • Click Exit Admin Mode to hide the selectors.

Abstraction Task List

The Abstraction Task List web part will be unique for each user, showing a tailored view of the particular tasks they are to complete, including assignment, abstraction, and review of documents.

Typically a user will have only one type of task to perform, but if they play different roles, such as for different document types, they will see multiple lists. Tasks may be grouped in batches, such as by identifier or priority, making it easier to work and communicate efficiently. Below the personalized task list(s), the All Cases list gives an overview of the latest status of all cases visible to the user in this container - both those in progress and those whose results have been approved. In this screenshot, a user has both abstraction and review tasks. Hover over any row to reveal a (Details) icon link for more information.

All task lists can be sorted to provide the most useful ordering to the individual user. Save the desired sorted grid as the "default" view to use it for automatically ordering your tasks. When an abstraction or review task is completed, the user will advance to the next task on their default view of the appropriate task list.

Task Group Identifiers

Abstraction cases can be assigned an identifier, allowing administrators to group documents for abstraction and filter based on these identifiers. Identifiers correspond to batches or groups of tasks, and are used to organize information in various ways. To define identifiers to use in a given folder, use the schema browser (or the client API, as sketched after the following steps).

  • Select (Admin) > Developer Links > Schema Browser.
  • Open the nlp schema.
  • Select Identifier from the built in queries and tables.
  • Click View Data.
  • To add a new identifier, click (Insert data) > Insert new row.
    • Enter the Name and Description
    • Enter the Target Count of documents for this identifier
    • Click Submit.
  • When cases are uploaded, the pipeline protocol determines which identifier they are assigned to. All documents imported in a batch are assigned to the same identifier.
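As an alternative to the steps above, an identifier row can also be inserted programmatically with the LabKey Python client API. This is only a sketch; the server, folder path, and column names (Name, Description, TargetCount) are assumptions to verify against the nlp.Identifier table in your schema browser.

from labkey.api_wrapper import APIWrapper

# Server and folder path are placeholders.
api = APIWrapper("labkey.example.com", "MyProject/NLP", use_ssl=True)

# Insert one new identifier row into the nlp.Identifier table (column names assumed).
api.query.insert_rows(
    schema_name="nlp",
    query_name="Identifier",
    rows=[{"Name": "Batch2019Q2", "Description": "Second quarter cases", "TargetCount": 100}],
)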

NLP Batch View

The Batch View web part can help abstractors and reviewers organize tasks by batch and identifier. It shows a summary of information about all cases in progress, organized by batch. Button options include the standard grid buttons as well as buttons for exporting and transferring results (see NLP Result Transfer).

Columns include:

  • Extra Review: If additional rounds of review are requested, documents with approved results will show an Assign link in this column, allowing administrators to assign for the additional review step.
  • Identifier Name: Using identifiers allows administrators to group related batches together. Learn to change identifiers below.
  • Document Type
  • Dataset: Links in this column let the user download structured abstraction results for batches. These downloads are audited as "logged select query events". The name of the link may be the name of the originally uploaded file, or the batch number. The exported zip contains a json file and a collection of .txt files. The json file provides batch level metadata such as "diseaseGroup", "documentType", "batchID" and contains all annotations for all reports in the batch.
  • Schema: The metadata for this batch, in the form of a .json file.
  • # documents: Total number of documents in this batch.
  • # abstracted: The number of documents that have gone through at least one cycle of abstraction.
  • # reviewed: The number of documents that have been reviewed at least once.
  • Batch Summary: A count of documents in each state.
  • Job Run ID
  • Input File Name

Customize Batch View

Administrators can customize the batch view to show either, both, or neither of the Dataset and Schema columns, depending on the needs of your users. By default, both are shown.

  • Open the (triangle) menu in the corner of the webpart.
  • Select Customize.
  • Check the box to hide either or both of the Dataset and Schema columns.

Change Identifiers

Administrators viewing the Batch View web part will see a button for changing identifiers associated with the batch. Select one or more rows using the checkboxes, then click Change Identifier to assign a new one to the batch.

Assign for Secondary Review

If additional rounds of review are requested as part of the workflow, the administrator can use the Batch View web part to assign documents from a completed batch for extra rounds of review.

  • If an NLP Batch View web part does not exist, create one.
  • In the Extra Review column, you will see one of the following values for each batch:
    • "Batch in progress": This batch is not eligible for additional review until all documents have been completed.
    • "Not assigned": This batch is completed and eligible for assignment for additional review, but has not yet been assigned.
    • "Assigned": This batch has been assigned and is awaiting secondary review to be completed.
  • Click the Assign link for any eligible batch to assign or reassign.
  • In the popup, use checkmarks to identify the groups eligible to perform the secondary review. You can create distinct project groups for this purpose in cases where specific teams need to perform secondary reviews.
  • Enter the percent of documents from this batch to assign for secondary review.
  • Click Submit.
  • The selected percentage of documents will be assigned to secondary reviewers from the selected groups.

The assigned secondary reviewers will now see the documents in their task list and perform review as in the first review round, with the exception that sending documents directly for reprocessing from the UI is no longer an option.

Related Topics




Document Abstraction



Abstraction of information from clinical documents into tabular data can unearth a wealth of previously untapped data for integration and analysis, provided it is done efficiently and accurately. An NLP engine can automatically abstract information based on the type of document, and further manual abstraction by one or more people using the process covered here can maximize information extraction.

Abstraction Task List

Tasks have both a Report ID and a Batch Number, as well as an optional Identifier Name, all shown in the task list. The assigned user must have "Abstractor" permissions and will initiate a manual abstraction by clicking Abstract on any row in their task list.

The task list grid can be sorted and filtered as desired, and grid views saved for future use. After completion of a manual abstraction, the user will advance to the next document in the user's default view of the task list.

Abstraction UI Basics

The document abstraction UI is shown in two panels. Above the panels, the report number is prominently shown.

The imported text is shown on the right and can be scrolled and reviewed for key information. The left hand panel shows field entry panels into which information found in the text will be abstracted.

One set of subtables for results, named "Specimen A" in the screenshot above, is made available. You can add more using the "Add another specimen" option described below. An administrator can also control the default name for the first subtable using the Default Record Key module property. For instance, it could be "1" instead of "Specimen A".

The abstracted fields are organized in categories that may vary based on the document type. For example, pathology documents might have categories as shown above: Pathology, PathologyStageGrade, EngineReportInfo, PathologyFinding, and NodePathFinding.

Expand and contract field category sections by clicking the title bars or the expand/collapse icons. By default, the first category is expanded when the abstractor first opens the UI. Fields in the "Pathology" category include:

  • PathHistology
  • PathSpecimenType
  • Behavior
  • PathSite
  • Pathologist
If an automated abstraction pass was done prior to manual abstraction, pulldowns may be prepopulated with some information gathered by the abstraction (NLP) engine. In addition, if the disease group has been automatically identified, this can focus the set of values for each field offered to a manual abstractor. The type of document also drives some decisions about how to interpret parts of the text.

Populating Fields

Select a field by clicking the label; the selected row will show in yellow, as will any associated text highlights previously added for that field. Some fields allow free text entry, other fields use pulldowns offering a set of possible values.

Start typing in either type of field to narrow the menu of options, or keep typing to enter free text as appropriate. There are two types of fields with pulldown menus.

  • Open-class fields allow you to either select a listed value or enter a new value of your own.
  • Closed-class fields (marked with an icon) require the selection of one of the values listed on the pulldown.

Multiple Value Fields

Fields supporting multiple entries allow you to select several values from the pulldown menu: hold the Shift or Ctrl key when you click to add a new value instead of replacing the prior choice. The values are shown separated by || (double pipes) in the field value.

Highlighting Text

The abstractor scans for relevant details in the text, selects or enters information in the field in the results section, and can highlight one or more relevant pieces of text on the right to accompany it.

If you highlight a string of text before entering a value for the active field, the selected text will be entered as the value if possible. For a free text field, the entry is automatic. For a field with a pulldown menu, if you highlight a string in the text that matches a value on the given menu, it will be selected. If you had previously entered a different value, however, that earlier selection takes precedence and is not superseded by later text highlighting. You may multi-select several regions of text for any given field result as needed.

In the following screenshot, several types of text highlighting are shown. When you click to select a field, the field and any associated highlights are colored yellow. If you double-click the field label, the text panel will be scrolled to place the first highlighted region within the visible window, typically three rows from the top. Shown selected here, the text "Positive for malignancy" was just linked to the active field Behavior with the value "Malignant". Also shown here, when you hover over the label or value for a field which is not active, in this case "PathHistology" the associated highlighted region(s) of text will be shown in green.

Text that has been highlighted for a field that is neither active (yellow) nor hovered-over (green) is shown in light blue. Click on any highlighting to activate the associated field and show both in yellow.

A given region of text can also be associated with multiple field results. The count of related fields is shown with the highlight region ("1 of 2", for example).

Unsaved changes are indicated by red corners on the entered fields. If you make a mistake or wish to remove highlighting on the right, click the 'x' attached to the highlight region.

Save Abstraction Work

Save work in progress any time by clicking Save Draft. If you leave the abstraction UI, you will still see the document as a task waiting to be completed, and see the message "Initial abstraction in progress". When you return to an abstraction in progress, you will see previous highlighting, selections, and can continue to review and abstract more of the document.

Once you have completed the abstraction of the entire document, you will click Submit to close your task and pass the document on for review, or if no review is selected, the document will be considered completed and approved.

When you submit the document, you will automatically advance to the next document assigned for you to abstract, according to the sort order established on your default view of your task list. There is no need to return to your task list explicitly to advance to the next task. The document you just completed will be shown as "Previously viewed" above the panels.

If you mistakenly submit abstraction results for a document too quickly, you can use the back button in your browser to return. Click Reopen to return it to an "abstraction in progress" status.

Abstraction Timer

If enabled, you can track metrics for how much time is spent actively abstracting the document, and separately time spent actively reviewing that abstraction. The abstraction timer is displayed above the document title and automatically starts when the assigned abstractor lands on the page. If the abstractor needs to step away or work on another task, they may pause the timer, then resume when they resume work on the document. As soon as the abstractor submits the document, the abstraction timer stops.

The reviewer's time on the same document is also tracked, beginning from zero. The total time spent in process for the document is the sum of these two times.

Note that the timer does not run when others, such as administrators, are viewing the document. It only applies to the edit-mode of active abstracting and reviewing.

Session Timeout Suspension

When abstracting a particularly lengthy or complicated document, or one requiring input from others, it is possible for a long period of time to elapse between interactions with the server. This could potentially result in the user’s session expiring, especially problematic as it can result in the loss of the values and highlights entered since the last save. To avoid this problem, the abstraction session in progress will keep itself alive by pinging the server with a lightweight "keepalive" function while the user continues interacting with the UI.

Multiple Specimens per Document

There may be information about multiple specimens in a single document. Each field results category can have multiple panels of fields, one for each specimen. You can also think of these groupings as subtables, and the default name for the first/default subtable can be configured by an administrator at the folder level. To add information for an additional specimen, open the relevant category in the field results panel, then click Add another specimen and select New Specimen from the menu.

Once you have defined multiple specimens for the document, you can use the same menu to select among them.

Specimen names can be changed and specimens deleted from the abstraction using the cog icon for each specimen panel.

Reopen an Abstraction Task

If you mistakenly submit abstraction results for a document too quickly, you can use the back button in your browser to return to the document. Click Reopen to return it to an unapproved status.

Related Topics




Review Document Abstraction



Once document abstraction is complete, if a reviewer is assigned to the document, the status becomes "ready for review" and the task moves to the assigned reviewer. If no reviewer is assigned, the document abstraction will bypass the review step and the status will be "approved."

This topic covers the review process when there is a single result set per document. When multiple result sets exist, a reviewer can compare and select among the abstractions of different reviewers following this topic: Review Multiple Result Sets.

Single Result Set Review

The review page shows the abstracted information and source text side by side. Only populated field results are displayed by default. Hover over any field to highlight the linked text in green. Click to scroll the document to show the highlighted element within the visible window, typically three rows from the top. A tooltip shows the position of the information in the document. To see all available fields, and enable editing of any entries or adding any additional abstraction information, the reviewer can click the pencil icon.

Once the pencil icon has opened the abstraction results for potential editing, the reviewer has the intermediate option to Save Draft in order to preserve work in progress and return later to complete their review.

The reviewer finishes with one of the following clicks:

    • Approve to accept the abstraction and submit the results as complete. If you mistakenly click approve, use your browser back button to return to the open document; there will be a Reopen button allowing you to undo the mistaken approval.
    • Reprocess which rejects the abstraction results and returns the document for another round of abstraction. Either the engine will reprocess the document, or an administrator will assign a new manual abstractor and reviewer.
If you select Reprocess, you will be prompted to enter the cause of rejection.

After completing the review, you will immediately be taken to the next document in your default view of your review task list.

Reprocessing

When a reviewer clicks Reprocess, the document will be given a new status and returned for reprocessing according to the following table:

Engine Abstracted? | Manually Abstracted? | Reviewed? | Action | Result
Yes | No | No | Reopen | Ready for assignment
Yes | No | Yes | Reopen | Ready for review; assign to same reviewer
Yes | No | Yes | Reprocess | Engine reprocess; ready for assignment
Yes | Yes | No | Reopen | Ready for assignment
Yes | Yes | Yes | Reopen | Ready for review; assign to same reviewer
Yes | Yes | Yes | Reprocess | Engine reprocess, then ready for assignment
No | Yes | No | Reopen | Ready for assignment
No | Yes | Yes | Reopen | Ready for review; assign to same reviewer
No | Yes | Yes | Reprocess | Ready for assignment

Reopen is an option available to administrators for all previously approved documents. Reviewers are only able to reopen the documents they reviewed and approved themselves.

Related Topics




Review Multiple Result Sets


Document abstraction by multiple abstractors may generate multiple result sets for the same document. This topic describes how reviewers can compare differing abstracted values and select the correct values.

Select the document for review from the task list.

View Multiple Result Sets

If there are multiple abstraction result sets, they will be shown in columns with colored highlighting to differentiate which abstractor chose which result. Each abstractor is assigned a color, shown in a swatch next to the user name. Colors are hard coded and used in this order:

  • Abstractor 1
  • Abstractor 2
  • Abstractor 3
  • Abstractor 4
  • Abstractor 5
In the following screenshot, you can see that two abstraction result sets were created for the same document. Values where the abstractors are in agreement are prepopulated, and you can still make adjustments if needed, as when reviewing a single abstraction set.

Only one abstractor chose a value for "PathQuality" and different selections were made for the "PathSite".

Anonymous Mode

An administrator can configure the folder to use anonymous mode. In this mode, no abstractor user names are displayed; instead, labels like "Abstractor 1", "Abstractor 2", etc. are used.

Compare Highlights

When you select a field, you will see highlights in the text panel, color coded by reviewer. When multiple abstractors agreed and chose the same text to highlight, the region will be shown striped with both abstractor colors.

When the abstractors did not agree, no value will be prepopulated in the left panel. Select the field to see the highlights; in this case, the highlights still agree, and there is no visible indication of why the first abstractor chose "mediastinum". As a reviewer, you will use the dropdown to select a value - values chosen by abstractors are listed first, but you could choose another value from the list if appropriate. In this case, you might elect to select the "Lung" value.

While working on your review, you have the option to Save Draft and return to finish later. Once you have completed review of the document, click Approve or Reprocess if necessary. The completion options for review are the same as when reviewing single result sets.

Configure Settings

In the lower left corner, click Settings to configure the following options.

  • Show highlight offsets for fields: When you hover over a value selected by an abstractor, you see the full text they entered. If you want to instead see the offsets for the highlights they made, check this box.
  • Enable comparison view: Uncheck to return to the single result set review display. Only results selected by the first abstractor are shown.
An administrator can configure module properties to control whether reviewers see all result sets by default, and whether the anonymous mode is used.

Related Topics




NLP Result Transfer



This topic covers how natural language processing (NLP) Result Sets, also referred to as abstraction or annotation results, can be transferred from one server to another. For example, a number of individual Registry Servers could submit approved results to a central Data Sharing Server for broader comparison and review.

Set Up Servers for NLP Result Transfer

The following steps are required on both the source (where the transferred data will come from) and destination servers.

Encryption

A system administrator must first enable AES_256 encryption for the server JRE.


API Keys

A user creates an API key on the destination server. This key is used to authorize data uploads sent from the source server, as if they came from the user who generated the API key. Any user with write access to the destination container can generate a key, once an administrator has enabled this feature.

On the destination server:

  • Select (Your_Username) > API Keys.
    • Note: Generation of API Keys must be enabled on your server to see this option. Learn more: API Keys.
  • Click Generate API Key.
  • Click Copy to Clipboard, then share the key with the system administrator of the source server in a secure manner.

On the source server, an administrator performs these steps:

  • Select (Admin) > Site > Admin Console.
  • Click Admin Console Links.
  • Under Premium Features, click NLP Transfer.
  • Enter the required information:
    • API Key: Enter the key provided by the destination server/recipient.
    • Encryption Passphrase: The passphrase to use for encrypting the results, set above.
    • Base URL: The base URL of the destination server. You will enter the path to the destination folder relative to this base in the module properties below.
  • Click Test Transfer to confirm a working connection.
  • Click Save.

Module Properties

In each folder on the source server containing results which will be exported, a folder administrator must configure these properties.

  • Select (Admin) > Folder > Management.
  • Click the Module Properties tab.
  • Scroll down to Property: Target Transfer Folder.
  • Enter the path to the target folder relative to the Base URL on the destination server, which was set in the NLP Transfer configuration above.
    • Note that you can have a site default for this destination, a project-wide override, or a destination specific to this individual folder.

Transfer Results to Destination

To transfer data to the destination, begin on the source server:

  • Navigate to the Batch View web part containing the data to transfer.
  • Three buttons related to transferring results are included.
    • Export Results: Click to download an unencrypted bundle of the selected batch results. Use this option to check the format and contents of what you will transfer.
    • Transfer Data: Click to bundle and encrypt the batches selected, then transfer to the destination. If this button appears inactive (grayed out), you have not configured all the necessary elements.
    • Config Transfer: click to edit the module property to point to the correct folder on the destination server if not done already.

  • Select the desired row(s) and click Transfer Data.
  • The data transfer will begin. See details below.

Note: The Batch View web part shows both Dataset and Schema columns by default. Either or both of those columns may have been hidden by an administrator from the displayed grid but both contain useful information that is transferred.

Zip File Contents and Process

  1. Result set files, with files in JSON, XML, and TSV format. The entire bundle is encrypted using the encryption passphrase entered on the source server. Results can only be decrypted using this passphrase. The encrypted file is identical to one that would be created with this command line:
    gpg --symmetric --cipher-algo AES256 test_file.txt.zip
  2. Report text.
  3. The metadata.json schema file.
  4. Batch info metadata file, which contains:
  • The source server name
  • The document type
  • Statistics: # of documents, # abstracted, # reviewed

Use Results at the Destination

The destination server will receive the transfer in the folder configured from the source server above.

Developers can access the results using the schema browser (schema = "nlp", query = "ResultsArchive", click "View Data"); a scripted example appears after the web part steps below. For convenience, an admin can add a Query web part to display this data in a folder as follows:

  • Navigate to the folder where results were delivered.
  • Enter > Page Admin Mode.
  • Select Query from the selector in the lower left, then click Add.
  • Give the web part a title, such as "Results Archive".
  • Select the Schema: nlp.
  • Click Show the contents of a specific query and view.
  • Select the Query: ResultsArchive.
  • Leave the remaining options at their defaults.
  • Click Submit.

Users viewing the results archive will see a new row for each batch that was transferred.

The columns show:

    • Source: The name of the source server (for example, a regional registry).
    • Schema: The name of the metadata.json file.
    • Dataset: A link to download the bundled and encrypted package of results transferred.
    • Document Type, # reports, # abstracted, and # reviewed: Information provided by the source server.
    • Created/Created By: The transfer time and username of the user who created the API key authorizing the upload.
To unpack the results:

  • Click the link in the Dataset column to download the transferred bundle.
  • Decrypt it using this command:
    gpg -o test.txt.zip -d test.txt.zip.gpg
  • Unzip the decrypted archive to find all three export types (JSON, TSV, XML) are included.

Folder editors and administrators have the ability to delete transferred archives. Deleting a row from the Results Archive grid also deletes the corresponding schema JSON and gpg (zip archive) files.

Related Topics




Configure LabKey NLP


Premium Feature — Available in the Enterprise Edition. Learn more or contact LabKey.

These instructions enable an administrator to configure the LabKey NLP pipeline so that tsv source files can be run through the NLP engine provided with LabKey Server.

Once the administrator has properly configured the pipeline and server, any number of users can process tsv files through one or more versions of the NLP engine using the instructions here.

Install Required Components

Install python (2.7.9)

The NLP engine will not run under python 3. If possible, there should be only one version of python installed. If you require multiple versions, it is possible to configure the LabKey NLP pipeline accordingly, but that is not covered in this topic.

  • Download python 2.7.9 from https://www.python.org/download/
  • Double-click the .msi file to begin the install. Accept the wizard defaults and confirm that pip will be installed. On the customization screen, choose to automatically add python.exe to the system path.
  • When the installation is complete, click Finish.
  • By default, python is installed on Windows in C:/Python27/
  • Confirm that python was correctly added to your path by opening a command shell and typing "python -V" using a capital V. The version will be displayed.

Install the NumPy package (1.8.x)

NumPy is a package for scientific computation with Python. Learn more here: http://www.numpy.org/

  • For Windows, download a pre-compiled whl file for NumPy from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
  • The whl you select must match the python version you downloaded (for 2.7.9 select "cp27") as well as the bit-width (32 vs 64) of your system.
    • To confirm your bit-width, open the Windows Control Panel, select System and Security, then select System. The system type is shown about mid page.
    • For instance, if running 64-bit Windows, you would download: numpy-1.9.2+mkl-cp27-none-win_amd64.whl
  • Move the downloaded package to the scripts directory under where python was installed. By default, C:/Python27/Scripts/
  • A bug in pip requires that you rename the downloaded package, replacing "win_amd64" with "any".
  • In a command shell, navigate to that same Scripts directory and run:
pip install numpy-1.9.2+mkl-cp27-none-any.whl

Install nltk package and "book" data collection

The Natural Language Toolkit (NLTK) provides support for natural language processing.

  • Update to the latest pip and setuptools by running:
    • On Windows:
      python -m pip install -U pip setuptools
    • On Linux: Instructions for using the package manager are available here
  • Install nltk by running:
    • On Windows:
      python -m pip install -U nltk
    • On Mac/Linux:
      sudo pip install -U nltk
  • Test the installation by running python, then typing:
    import nltk
  • If no errors are reported, your install was successful.

Next install the "book" data collection:

  • On Windows, in Python:
    • Type:
      nltk.download()
    • A GUI will open. Select the "book" identifier and download to "C:\nltk_data".
  • On Mac/Linux, run:
    sudo python -m nltk.downloader -d /usr/local/share/nltk_data book
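
A quick way to confirm the "book" collection is usable is to load one of its corpora (gutenberg is part of the collection):

    # Run in python after the download completes
    from nltk.corpus import gutenberg
    print(gutenberg.fileids()[:3])   # should list text files such as 'austen-emma.txt'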

Install python-crfsuite

Install python-crfsuite, version 0.8.4 or later, which is a python binding to CRFsuite.

  • On Windows, first install the Microsoft Visual C++ compiler for Python. Download and install instructions can be found here. Then run:
    python -m pip install -U python-crfsuite
  • On Mac or Linux, run:
    sudo pip install -U python-crfsuite
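
To confirm the installation, note that the python-crfsuite package is imported as "pycrfsuite":

    # Run in python; instantiating a Trainer exercises the native extension
    import pycrfsuite
    trainer = pycrfsuite.Trainer(verbose=False)
    print("python-crfsuite is available")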

Install the LabKey distribution

Install the LabKey distribution. Complete instructions can be found here. The location where you install your LabKey distribution is referred to in this topic as ${LABKEY_INSTALLDIR}.

Configure the NLP pipeline

The LabKey distribution already contains an NLP engine, located in:

${LABKEY_INSTALLDIR}\bin\nlp

If you want to be able to use one or more NLP engines installed elsewhere, an administrator may configure the server to use that alternate location. For example, if you want to use an engine located here:

C:\alternateLocation\nlp

Direct the pipeline to look first in that alternate location by adding it to the Pipeline tools path:

  • Select (Admin) > Site > Admin Console.
  • Click the Admin Console Links tab.
  • Under Configuration, click Site Settings.
  • The Pipeline tools field contains a semicolon-separated list of paths the server will use to locate tools, including the NLP engine. By default the path is "${LABKEY_INSTALLDIR}\bin" (for example, "C:\labkey\labkey\bin").
  • Add the location of the alternate NLP directory to the front of the Pipeline tools list of paths.
    • For example, to use an engine in "C:\alternateLocation\nlp", add "C:\alternateLocation;" to the front, so the field reads "C:\alternateLocation;C:\labkey\labkey\bin".
  • Click Save.
  • No server restart is required when adding a single alternate NLP engine location.

Configure to use Multiple Engine Versions

You may also make multiple versions of the NLP engine available on your LabKey Server simultaneously. Each user would then configure their workspace folder to use a different version of the engine. The process for doing so involves additional steps, including a server restart to enable the use of multiple engines. Once configured, no restarting will be needed to update or add additional engines.

  • Download the nlpConfig.xml file.
  • Select or create a location for config files, for example "C:\labkey\configs", and place nlpConfig.xml in it.
  • The LabKey Server configuration file, named labkey.xml by default, or ROOT.xml in production servers, is typically located in a directory like <CATALINA_HOME>/conf/Catalina/localhost. This file must be edited to point to the alternate config location.
  • Open it for editing, and locate the pipeline configuration line, which will look something like this:
<!-- Pipeline configuration -->
<!--@@pipeline@@ <Parameter name="org.labkey.api.pipeline.config" value="@@pipelineConfigPath@@"/> @@pipeline@@-->
  • Uncomment and edit to point to the location of nlpConfig.xml, in our example, "C:\labkey\configs". The edited line will look something like this:
<!-- Pipeline configuration -->
<Parameter name="org.labkey.api.pipeline.config" value="C:\labkey\configs"/>
    • Save.
  • Restart your LabKey Server.

Multiple alternate NLP engine versions should be placed in a directory structure one directory level down from the "nlp" directory where you would place a single engine. The person installing these engines must have write access to this location in the file system, but does not need to be the LabKey Server administrator. The directory names here will be used as 'versions' when you import, so it is good practice to include the version in the name, for example:

C:\alternateLocation\nlp\engineVersion1
C:\alternateLocation\nlp\engineVersion2

Related Topics




Run NLP Pipeline


Premium Feature — Available in the Enterprise Edition. Learn more or contact LabKey.

This topic outlines how to configure a workspace and run the NLP pipeline directly against source tsv files. First, an administrator must configure the pipeline as described here. Then, any number of users can process tsv files through one or more versions of the NLP engine. The user can also rerun a given tsv file later using a different version of the engine to compare results and test the NLP engine itself.


If you already have abstraction results obtained from another provider formatted in a compatible manner, you can upload them directly, bypassing the NLP pipeline. Details are below.

Set Up a Workspace

Each user should work in their own folder, particularly if they intend to use different NLP engines.

The default NLP folder contains web parts for the Data Pipeline, NLP Job Runs, and NLP Reports. To return to this main page at any time, click NLP Dashboard in the upper right.

Set Up the Data Pipeline

  • In the Data Pipeline web part, click Setup.
  • Select Set a pipeline override.
  • Enter the primary directory where the files you want to process are located.
  • Set searchability and permissions appropriately.
  • Click Save.
  • Click NLP Dashboard.

Configure Options

Set Module Properties

Multiple abstraction pipelines may be available on a given server. Using module properties, the administrator can select a specific abstraction pipeline and specific set of metadata to use in each container.

These module properties can all vary per container, so for each property you will see a list of folders, starting with "Site Default" and ending with your current project or folder. All parent containers are listed, and any configured values are displayed. Containers in which you do not have permission to edit these properties are grayed out.

  • Navigate to the container (project or folder) you are setting up.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Pipeline, next to your folder name, select the appropriate value for the abstraction pipeline to use in that container.
  • Under Property: Metadata Location enter an alternate metadata location to use for the folder if needed.
  • Check the box marked "Is Metadata Location Relative to Container?" if you've provided a relative path above instead of an absolute one.
  • Click Save Changes when finished.

Alternate Metadata Location

The metadata is specified in a .json file, named in this example "metadata.json" though other names can be used.

  • Upload the "metadata.json" file to the Files web part.
  • Select (Admin) > Folder > Management and click the Module Properties tab.
  • Under Property: Metadata Location, enter the file name in one of these ways:
    • "metadata.json": The file is located in the root of the Files web part in the current container.
    • "subfolder-path/metadata.json": The file is in a subfolder of the Files web part in the current container (relative path)
    • "full path to metadata.json": Use the full path if the file has not been uploaded to the current project.
  • If the metadata file is in the files web part of the current container (or a subfolder within it), check the box for the folder name under Property: Is Metadata Location Relative to Container.

Define Pipeline Protocol(s)

When you import a TSV file, you will select a Protocol which may include one or more overrides of default parameters to the NLP engine. If there are multiple NLP engines available, you can include the NLP version to use as a parameter. With version-specific protocols defined, you then simply select the desired protocol during file import. You may define a new protocol on the fly during any tsv file import, or you may find it simpler to predefine one or more. To quickly do so, you can import a small stub file, such as the one attached to this page.

  • Download this file: stub.nlp.tsv and place in the location of your choice.
  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop the stub.nlp.tsv file into the upload window.

For each protocol you want to define:

  • In the Data Pipeline web part, click Process and Import Data.
  • Select the stub.nlp.tsv file and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • From the Analysis Protocol dropdown, select "<New Protocol>". If there are no other protocols defined, this will be the only option.
  • Enter a name and description for this protocol. A name is required if you plan to save the protocol for future use. Using the version number in the name can help you easily differentiate protocols later.
  • Add a new line to the Parameters section for any parameters required, such as the location of an alternate metadata file or the subdirectory that contains the intended NLP engine version. For example, in the setup documentation the engine subdirectories are named "engineVersion1" and "engineVersion2" (your naming may differ). To specify an engine version, uncomment the example line shown and use:
<note label="version" type="input">engineVersion1</note>
  • Select an Abstraction Identifier from the dropdown if you want every document processed with this protocol to be assigned the same identifier. "[NONE]" is the default.
  • Confirm "Save protocol for future use" is checked.
  • Scroll to the next section before proceeding.

Define Document Processing Configuration(s)

Document processing configurations control how assignment of abstraction and review tasks is done automatically. One or more configurations can be defined enabling you to easily select among different settings or criteria using the name of the configuration. In the lower portion of the import page, you will find the Document processing configuration section:

  • From the Name dropdown, select the name of the processing configuration you want to use.
    • The definition will be shown in the UI so you can confirm this is correct.
  • If you need to make changes, or want to define a new configuration, select "<New Configuration>".
    • Enter a unique name for the new configuration.
    • Select the type of document and disease groups to apply this configuration to. Note that if other types of report are uploaded for assignment at the same time, they will not be assigned.
    • Enter the percentages for assignment to review and abstraction.
    • Select the group(s) of users to which to make assignments. Eligible project groups are shown with checkboxes.
  • Confirm "Save configuration for future use" is checked to make this configuration available for selection by name during future imports.
  • Complete the import by clicking Analyze.
  • In the Data Pipeline web part, click Process and Import Data.
  • Upload the "stub.nlp.tsv" file again and repeat the import. This time you will see the new protocol and configuration you defined available for selection from their respective dropdowns.
  • Repeat these two steps to define all the protocols and document processing configurations you need.

For more information, see Pipeline Protocols.

Run Data Through the NLP Pipeline

First upload your TSV files to the pipeline.

  • In the Data Pipeline web part, click Process and Import Data.
  • Drag and drop files or directories you want to process into the window to upload them.

Once the files are uploaded, you can iteratively run each through the NLP engine as follows:

  • In the Data Pipeline web part, click Process and Import Data.
  • Navigate uploaded directories if necessary to find the files of interest.
  • Check the box for a tsv file of interest and click Import Data.
  • Select "NLP engine invocation and results" and click Import.
  • Choose an existing Analysis Protocol or define a new one.
  • Choose an existing Document Processing Configuration or define a new one.
  • Click Analyze.
  • While the engine is running, the pipeline web part will show a job in progress. When it completes, the pipeline job will disappear from the web part.
  • Refresh your browser window to show the new results in the NLP Job Runs web part.

View and Download Results

Once the NLP pipeline import is successful, the input and intermediate output files are both deleted from the filesystem.

The NLP Job Runs lists the completed run, along with information like the document type and any identifier assigned for easy filtering or reporting. Hover over any run to reveal a (Details) link in the leftmost column. Click it to see both the input file and how the run was interpreted into tabular data.

During import, newline characters (CRLF, LFCR, CR, and LF) are all normalized to LF to simplify highlighting text when abstracting information.
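
This normalization is equivalent to a substitution like the following (a minimal sketch for illustration, not the server's actual implementation):

    import re

    raw_text = "line one\r\nline two\rline three\n\rline four"
    normalized_text = re.sub(r"\r\n|\n\r|\r", "\n", raw_text)   # CRLF, LFCR, and CR all become LF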

Note: The results may be reviewed for accuracy. In particular, the disease group determination is used to guide other abstracted values. If a reviewer notices an incorrect designation, they can manually update it and send the document back through the NLP engine for reprocessing with the correct designation.

Download Results

To download the results, select (Export/Sign Data) above the grid and choose the desired format.

Rerun

To rerun the same file with a different version of the engine, simply repeat the original import process, but this time choose a different protocol (or define a new one) to point to a different engine version.

Error Reporting

During processing of files through the NLP pipeline, some errors require human reconciliation before processing can proceed. The pipeline log provides a report of any errors detected during processing, including:

  • Mismatches between field metadata and the field list. To ignore these mismatches during upload, set "validateResultFields" to false and rerun (see the example below).
  • Errors or excessive delays while the transform phase is checking to see if work is available. These errors can indicate problems in the job queue that should be addressed.
Add a Data Transform Jobs web part to see the latest error in the Transform Run Log column.
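
Based on the parameter format shown in Define Pipeline Protocol(s) above, the validation flag could plausibly be supplied as a protocol parameter such as the following; this line is illustrative only, so verify the exact label expected by your engine version:

<note label="validateResultFields" type="input">false</note>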

For more information about data transform error handling and logging, see ETL: Logs and Error Handling.

Directly Import Abstraction Results

An alternative method for uploading abstraction results directly can be used if the results use the same JSON format as output by the NLP pipeline.

Requirements:

  • A TXT report file and a JSON results file, each named for the report, one with a .txt extension and one with a .nlp.json extension. No TSV file is required.
  • This pair of files is uploaded to LabKey as a directory or zip file.
  • A single directory or zip file can contain multiple pairs of files representing multiple reports, each with a unique report name and corresponding json file.
  • Upload the directory or zip file containing the report .txt and .nlp.json files.
  • Select the JSON file(s) and click Import Data.
  • Choose the option NLP Results Import.
  • Choose (or define) the pipeline protocol and document processing configuration.
  • Click Analyze.
  • Results are imported, with new documents appearing in the task list as if they had been processed using the NLP engine.

Related Topics