Difference between revisions of "Infrastructure/software/nltk"

From Nordic Language Processing Laboratory
Latest revision as of 18:23, 24 October 2018

Background

The Natural Language Toolkit (NLTK) provides a large collection of core NLP utilities (e.g. sentence splitting and tokenization, part-of-speech tagging, various approaches to parsing, and many more) in an integrated Python environment. The NLTK distribution also bundles a broad range of common, freely available data sets, which are made accessible through a uniform API. Although it is often neither state of the art nor particularly efficient, NLTK is popular as a teaching environment and as a go-to toolkit for common ‘basic’ preprocessing tasks, e.g. sentence splitting, stop word removal, or lemmatization (for English, at least).

Usage

The module nlpl-nltk provides an NLTK installation in a Python 3.5 virtual environment.

module purge
module use -a /proj*/nlpl/software/modulefiles
module load nlpl-nltk

This installation (just as other NLPL-maintained Python virtual environments) can be combined with other Python-based modules, for example the NLPL installations of PyTorch or TensorFlow. To ‘stack’ multiple Python environments, they can simply be loaded together, e.g.

module load nlpl-nltk nlpl-gensim nlpl-tensorflow

Because PyTorch and TensorFlow are ‘special’ in their requirements for dynamic libraries and in their support for both CPU and GPU nodes, it is important for them to be activated last, i.e. on the ‘top’ of a multi-module stack. This can be verified by inspecting which python binary comes first in the search order of the $PATH environment variable:

type -all python

In late September 2018 on Abel, for example, the output of the above command looked roughly as follows:

python is /projects/nlpl/software/tensorflow/1.11/bin/python
python is /projects/nlpl/software/nltk/3.3/bin/python
python is /usr/bin/python
python is /opt/rocks/bin/python
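
The same check can be scripted, for instance at the start of a batch job. The following is a minimal sketch, not part of the NLPL installation: the helper name find_all_on_path is hypothetical, and unlike the shell built-in it only looks at executable files on $PATH (no aliases or shell functions) and skips empty $PATH entries.

```python
import os


def find_all_on_path(name, path=None):
    """Return every executable file called `name` on the search path, in order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # simplification: empty entries are skipped
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    # The first line printed is the interpreter that a bare `python` would start.
    for hit in find_all_on_path("python"):
        print("python is", hit)
```

If the first entry does not belong to the module that was meant to sit on top of the stack, the modules were probably loaded in a different order than intended.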

Versions

As of September 2018, version 3.3 of NLTK is installed on both Abel and Taito.

Installation

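# Load a Python 3.5 module; the two variants below appear to be site-specific
# alternatives (python3/3.5.0 on Abel, python-env/3.5.3 on Taito), so run only one.
module purge
module load python3/3.5.0
# or, alternatively:
module purge
module load python-env/3.5.3
# Create the NLTK virtual environment under the shared software tree.
mkdir ${NLPLROOT}/software/nltk
virtualenv ${NLPLROOT}/software/nltk/3.3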

Next, we need to create a module definition, in this case ${NLPLROOT}/software/modulefiles/nlpl-nltk/3.3 (on Abel) or ${NLPLROOT}/software/modulefiles/nlpl-nltk/3.3.lua (on Taito). Make sure that the module sets the environment variable $NLTK_DATA to the data sub-directory of the NLTK tree, which is populated by the command-line data download below.

module purge
module load nlpl-nltk/3.3
pip install --upgrade pip
pip install --upgrade $(pip list | tail -n +3 | gawk '{print $1}')
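
The last command upgrades every installed package by feeding the first column of the pip list output back into pip install. The sketch below mirrors what the tail and gawk stages do, assuming the column-style listing with two header lines; the function name package_names is hypothetical, and the sample listing is made up for illustration.

```python
def package_names(pip_list_output):
    """Extract package names from the column-style output of `pip list`."""
    names = []
    # Like `tail -n +3`: drop the two header lines ("Package Version", dashes).
    for line in pip_list_output.splitlines()[2:]:
        fields = line.split()
        if fields:  # ignore blank lines
            names.append(fields[0])  # like gawk '{print $1}'
    return names


# Illustrative input only; real package names and versions will differ.
sample = """\
Package    Version
---------- -------
pip        18.1
setuptools 40.4.3
"""

if __name__ == "__main__":
    print(" ".join(package_names(sample)))
```

If the listing had no header lines, skipping two lines would silently drop the first two packages, so the header assumption is worth checking before relying on the one-liner.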

Finally, install the NLTK code and all data packages.

pip install nltk
python -m nltk.downloader -d ${NLPLROOT}/software/nltk/3.3/data all
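
Once the download has finished and the module has been reloaded, it can be worth confirming that $NLTK_DATA really points at an existing directory. The following is a minimal sketch using only the standard library; the helper name check_nltk_data is hypothetical, and it treats the variable as a list of directories separated by the platform path separator.

```python
import os


def check_nltk_data(env=None):
    """Return (ok, message) describing whether NLTK_DATA looks usable."""
    env = os.environ if env is None else env
    value = env.get("NLTK_DATA", "")
    entries = [entry for entry in value.split(os.pathsep) if entry]
    if not entries:
        return False, "NLTK_DATA is not set"
    missing = [entry for entry in entries if not os.path.isdir(entry)]
    if missing:
        return False, "not a directory: " + ", ".join(missing)
    return True, "NLTK_DATA looks usable: " + value


if __name__ == "__main__":
    ok, message = check_nltk_data()
    print(message)
```

This only checks that the directories exist; it does not verify that any particular corpus or model was downloaded completely.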