These instructions enable an administrator to configure the LabKey Natural Language Processing (NLP) pipeline so that source files can be run through the NLP engine provided with LabKey Server. Note that if you are using another method of natural language processing, you do not need to complete these steps to use the NLP Abstraction Workflow.
Once the administrator has properly configured the pipeline and server, any number of users can process TSV files through one or more versions of the NLP engine using the instructions here.
Install Required Components
Install python (2.7.9)
The NLP engine will not run under python 3. If possible, there should be only one version of python installed. If you require multiple versions, it is possible to configure the LabKey NLP pipeline accordingly, but that is not covered in this topic.
- Download python 2.7.9 from https://www.python.org/download/
- Double-click the .msi file to begin the install. Accept the wizard defaults and confirm that pip will be installed. On the same screen, choose to automatically add python.exe to the system path by selecting the install option from that feature's pulldown menu.
- When the installation is complete, click Finish.
- By default, python is installed on Windows in C:/Python27/
- Confirm that python was correctly added to your path by opening a command shell and typing "python -V" using a capital V. The version will be displayed.
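The version check above can also be done from inside the interpreter. The following is a minimal sketch, not part of the LabKey distribution; the function name is_supported_python is a hypothetical helper introduced here for illustration. It assumes only that the NLP engine needs a 2.7.x interpreter, as stated above.

```python
import sys


def is_supported_python(version_info):
    """Return True only when the major.minor version is 2.7."""
    return tuple(version_info[:2]) == (2, 7)


if __name__ == "__main__":
    # Shows the same version that "python -V" reports, plus a verdict.
    print("Running python %d.%d.%d" % tuple(sys.version_info[:3]))
    print("Usable for the NLP engine: %s" % is_supported_python(sys.version_info))
```

Running this with the interpreter that is first on the system path shows whether the path points at the 2.7 install or at some other version.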
Install the NumPy package (1.8.x)
NumPy is a package for scientific computation with Python. Learn more here:
http://www.numpy.org/
- For Windows, download a pre-compiled whl file for NumPy from: http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
- The whl you select must match the python version you downloaded (for 2.7.9 select "cp27") as well as the bit-width (32 vs 64) of your system.
- To confirm your bit-width, open the Windows Control Panel, select System and Security, then select System. The system type is shown about mid page.
- For instance, if running 64-bit Windows, you would download: numpy-1.9.2+mkl-cp27-none-win_amd64.whl
- Move the downloaded package to the Scripts directory under the location where python was installed. By default, this is C:/Python27/Scripts/
- A bug in pip requires that you rename the downloaded package, replacing "win_amd64" with "any".
- In a command shell, navigate to that same Scripts directory and run:
pip install numpy-1.9.2+mkl-cp27-none-any.whl
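The rename step is easy to get wrong by hand, so here is a small sketch of it. The helper name pip_friendly_wheel_name is hypothetical, and the code only computes the new filename; it assumes the platform tag is the last part of the name before ".whl", as in the example above.

```python
def pip_friendly_wheel_name(filename, platform_tag="win_amd64"):
    """Return the wheel filename with its trailing platform tag replaced by "any"."""
    suffix = "-" + platform_tag + ".whl"
    if not filename.endswith(suffix):
        raise ValueError("%r does not end with %r" % (filename, suffix))
    # Keep everything before the platform tag and append the generic tag.
    return filename[:-len(suffix)] + "-any.whl"


if __name__ == "__main__":
    original = "numpy-1.9.2+mkl-cp27-none-win_amd64.whl"
    print(pip_friendly_wheel_name(original))
```

To rename the file on disk, the result could be passed to os.rename together with the original name while the Scripts directory is the working directory.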
Install nltk package and "book" data collection
The Natural Language Toolkit (NLTK) provides support for natural language processing.
- Update to the latest pip and setuptools by running:
- On Windows:
python -m pip install -U pip setuptools
- On Linux: Instructions for using the package manager are available here
- Install nltk by running:
- On Windows:
python -m pip install -U nltk
- On Mac/Linux:
- Test the installation by running python, then typing:
- If no errors are reported, your install was successful.
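One way to script the installation test is to attempt the imports and report which ones fail. This is a sketch under the assumption that the packages installed so far are importable as numpy and nltk; module_available is a hypothetical helper, not something shipped with LabKey or NLTK.

```python
import importlib


def module_available(name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Check the packages installed in the steps above.
    for name in ("numpy", "nltk"):
        status = "ok" if module_available(name) else "MISSING"
        print("%-6s %s" % (name, status))
```

A "MISSING" line usually means the package was installed for a different python than the one on the system path.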
Next install the "book" data collection:
- On Windows, in Python:
- Type:
- A GUI will open. Select the "book" identifier and download to "C:\nltk_data"
- On Mac/Linux, run:
sudo python -m nltk.downloader -d /usr/local/share/nltk_data book
Install python-crfsuite
Install python-crfsuite, version 0.8.4 or later, which is a python binding to CRFsuite.
- On Windows, first install the Microsoft Visual C++ compiler for Python. Download and install instructions can be found here. Then run:
python -m pip install -U python-crfsuite
- On Mac or Linux, run:
sudo pip install -U python-crfsuite
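To confirm that the installed python-crfsuite meets the 0.8.4 minimum, the version string reported by pip can be compared numerically rather than as text. The sketch below uses hypothetical helper names (version_tuple, meets_minimum) and assumes a plain dotted version string such as "0.8.4".

```python
def version_tuple(version):
    """Turn a dotted version such as "0.8.4" into a tuple of ints, (0, 8, 4)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first component that has no leading digits
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.8.4"):
    """Return True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for candidate in ("0.8.3", "0.8.4", "0.10.0"):
        print("%s -> %s" % (candidate, meets_minimum(candidate)))
```

Comparing tuples of integers avoids the mistake a plain string comparison would make, where "0.10.0" sorts before "0.8.4".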
Install the LabKey distribution
Install the LabKey distribution. Complete instructions can be found here.
The location where you install your LabKey distribution is referred to in this topic as ${LABKEY_INSTALLDIR}.
Configure the NLP pipeline
The LabKey distribution already contains an NLP engine, located in:
${LABKEY_INSTALLDIR}\bin\nlp
If you want to be able to use one or more NLP engines installed elsewhere, an administrator may configure the server to use that alternate location. For example, if you want to use an engine located here:
Direct the pipeline to look first in that alternate location by adding it to the Pipeline tools path:
- Select (Admin) > Site > Admin Console.
- Under Configuration, click Site Settings.
- The Pipeline tools field contains a semicolon-separated list of paths the server will use to locate tools, including the NLP engine. By default the path is "${LABKEY_INSTALLDIR}\bin" (for example, "C:\labkey\labkey\bin").
- Add the location of the alternate NLP directory to the front of the Pipeline tools list of paths.
- For example, to use an engine in "C:\alternateLocation\nlp", add "C:\alternateLocation;" to the start of the field.
- Click Save.
- No server restart is required when adding a single alternate NLP engine location.
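The effect of the edit to the Pipeline tools field can be sketched as a string operation on the semicolon-separated list. This is only an illustration of the ordering described above; prepend_tools_path is a hypothetical helper, and the actual change is made in the Site Settings page, not by code.

```python
def prepend_tools_path(current, new_dir):
    """Return the semicolon-separated path list with new_dir moved to the front."""
    entries = [p.strip() for p in current.split(";") if p.strip()]
    # Drop any existing copy so the directory appears only once, at the front.
    entries = [p for p in entries if p != new_dir]
    return ";".join([new_dir] + entries)


if __name__ == "__main__":
    default_value = r"C:\labkey\labkey\bin"
    print(prepend_tools_path(default_value, r"C:\alternateLocation"))
```

Because the alternate location comes first, the server looks for an nlp directory there before falling back to the default tools directory.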
Configure to use Multiple Engine Versions
You may also make multiple versions of the NLP engine available on your LabKey Server simultaneously. Each user would then configure their workspace folder to use a different version of the engine. The process for doing so involves additional steps, including a server restart to enable the use of multiple engines. Once configured, no restarting will be needed to update or add additional engines.
- Download the nlpConfig.xml file.
- Select or create a location for config files, for example "C:\labkey\configs", and place nlpConfig.xml in it.
- The LabKey Server configuration file, named labkey.xml by default, or ROOT.xml on production servers, is typically located in a directory like <CATALINA_HOME>/conf/Catalina/localhost. This file must be edited to point to the alternate config location.
- Open it for editing, and locate the pipeline configuration line, which will look something like this:
<!-- Pipeline configuration -->
<!--@@pipeline@@ <Parameter name="org.labkey.api.pipeline.config" value="@@pipelineConfigPath@@"/> @@pipeline@@-->
- Uncomment the line and edit it to point to the location of nlpConfig.xml (in our example, "C:\labkey\configs"). The edited line will look something like this:
<!-- Pipeline configuration -->
<Parameter name="org.labkey.api.pipeline.config" value="C:\labkey\configs"/>
- Restart your LabKey Server.
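If the config location contains characters that need escaping in XML, it can help to generate the Parameter element rather than type it. The following sketch builds the element shown above with Python's standard xml.etree.ElementTree module; pipeline_config_parameter is a hypothetical helper, and the output still has to be pasted into the configuration file by hand.

```python
import xml.etree.ElementTree as ET


def pipeline_config_parameter(config_dir):
    """Return a Parameter element, as text, pointing at the config directory."""
    element = ET.Element(
        "Parameter",
        {"name": "org.labkey.api.pipeline.config", "value": config_dir},
    )
    # tostring() returns bytes on python 3, so decode to get plain text.
    return ET.tostring(element).decode("ascii")


if __name__ == "__main__":
    print(pipeline_config_parameter(r"C:\labkey\configs"))
```

The generated text is equivalent to the hand-edited line; only whitespace inside the tag may differ.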
Multiple alternate NLP engine versions should be placed in a directory structure one directory level down from the "nlp" directory where you would place a single engine. The person installing these engines must have write access to this location in the file system, but does not need to be the LabKey Server administrator. The directory names here will be used as 'versions' when you import, so it is good practice to include the version in the name, for example:
C:\alternateLocation\nlp\engineVersion1
C:\alternateLocation\nlp\engineVersion2
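Since the directory names become the version names, listing the subdirectories of the "nlp" directory is a quick way to preview the versions that this layout exposes. This is a minimal sketch; list_engine_versions is a hypothetical helper and the path in the usage example is the one used in this topic.

```python
import os


def list_engine_versions(nlp_dir):
    """Return the sorted names of the directories directly under nlp_dir."""
    return sorted(
        name
        for name in os.listdir(nlp_dir)
        if os.path.isdir(os.path.join(nlp_dir, name))
    )


if __name__ == "__main__":
    nlp_dir = r"C:\alternateLocation\nlp"
    if os.path.isdir(nlp_dir):
        for version in list_engine_versions(nlp_dir):
            print(version)
    else:
        print("No such directory: %s" % nlp_dir)
```

Plain files in the "nlp" directory are ignored, so only directories such as engineVersion1 and engineVersion2 are reported.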
Related Topics