Difference between revisions of "Infrastructure/software/nltk"
Latest revision as of 18:23, 24 October 2018
Background
The Natural Language Toolkit (NLTK; https://www.nltk.org/) provides a large collection of core NLP utilities (e.g. sentence splitting and tokenization, part-of-speech tagging, various approaches to parsing, and many more) in an integrated Python environment. The NLTK distribution also bundles a broad range of common, freely available data sets, which are made accessible through a uniform API. Although it is often neither quite state of the art nor especially efficient, NLTK is popular as a teaching environment and as a go-to toolkit for common ‘basic’ preprocessing tasks, e.g. sentence splitting, stop word removal, or lemmatization (for English, at least).
Usage
The module nlpl-nltk provides an NLTK installation in a Python 3.5 virtual environment.
module purge
module use -a /proj*/nlpl/software/modulefiles
module load nlpl-nltk
This installation (like other NLPL-maintained Python virtual environments) can be combined with other Python-based modules, for example the NLPL installations of PyTorch or TensorFlow. To ‘stack’ multiple Python environments, simply load them together, e.g.
module load nlpl-nltk nlpl-gensim nlpl-tensorflow
Because PyTorch and TensorFlow are ‘special’ in their requirements for dynamic libraries and in their support for both CPU and GPU nodes, they must be activated last, i.e. sit on ‘top’ of a multi-module stack. To verify this, inspect which python binary comes first in the search order of the $PATH environment variable:
type -all python
In late September 2018 on Abel, for example, the output of the above command looked roughly as follows:
python is /projects/nlpl/software/tensorflow/1.11/bin/python
python is /projects/nlpl/software/nltk/3.3/bin/python
python is /usr/bin/python
python is /opt/rocks/bin/python
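To run the same check non-interactively, for instance at the top of a job script, the lookup can be sketched in plain POSIX shell. The helper name list_pythons is hypothetical and not part of the NLPL modules. Unlike type -all, it only inspects files on the search path and ignores aliases and shell functions.

```shell
# List every `python` executable on a PATH-style string, in search order.
# The first line printed is the interpreter a bare `python` would resolve to.
# Hypothetical helper for illustration; defaults to the current $PATH.
list_pythons() (
    set -f                      # no glob expansion while splitting
    IFS=':'                     # split the search path on colons
    for dir in ${1:-$PATH}; do
        [ -n "$dir" ] || dir=.  # an empty PATH entry means the current directory
        if [ -f "${dir}/python" ] && [ -x "${dir}/python" ]; then
            printf '%s\n' "${dir}/python"
        fi
    done
)

# Show only the interpreter that is currently on top of the stack.
list_pythons | head -n 1
```

With the module stack from the example above, the first line should point into the TensorFlow tree; if it points into the NLTK tree instead, the modules were loaded in the wrong order.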
Versions
As of September 2018, version 3.3 of NLTK is installed on both Abel and Taito.
Installation
module purge
module load python3/3.5.0

module purge
module load python-env/3.5.3

mkdir ${NLPLROOT}/software/nltk
virtualenv ${NLPLROOT}/software/nltk/3.3
Next, we need to create a module definition, in this case ${NLPLROOT}/software/modulefiles/nlpl-nltk/3.3 (on Abel) or ${NLPLROOT}/software/modulefiles/nlpl-nltk/3.3.lua (on Taito). Make sure the module sets the environment variable $NLTK_DATA to point to the data sub-directory of the NLTK tree, which is populated by the command-line data download below.
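The module definition itself is written in the modulefile syntax of the respective system, not in shell. As a rough sketch of the effect it needs to have on the user's environment, the equivalent shell settings might look as follows. The default value for NLPLROOT is taken from the Abel paths shown earlier and is an assumption for other systems; the variable nltk_root is introduced here for illustration only, and the actual module may set additional variables.

```shell
# Assumption: NLPLROOT as on Abel; adjust for other systems.
NLPLROOT="${NLPLROOT:-/projects/nlpl}"

# Illustrative helper variable: root of the NLTK 3.3 virtual environment.
nltk_root="${NLPLROOT}/software/nltk/3.3"

# Put the interpreter of the virtual environment first on the search path ...
export PATH="${nltk_root}/bin:${PATH}"

# ... and tell NLTK where the shared data packages live.
export NLTK_DATA="${nltk_root}/data"
```

Prepending to $PATH rather than replacing it is what allows several such modules to be stacked, with the most recently loaded one taking precedence.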
module purge
module load nlpl-nltk/3.3
pip install --upgrade pip
pip install --upgrade $(pip list | tail -n +3 | gawk '{print $1}')
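The last command above upgrades every package already present in the fresh virtual environment. It relies on pip list printing two header lines followed by one row per package, which is assumed here to be the column layout of the pip version in use. The sketch below replays the pipeline on canned input; the package names and version numbers are made up for illustration, and awk stands in for gawk.

```shell
# Canned stand-in for `pip list` output (illustrative names and versions):
# two header lines, then one "name version" row per installed package.
sample_pip_list() {
    cat <<'EOF'
Package    Version
---------- -------
pip        18.0
setuptools 40.4.3
wheel      0.32.1
EOF
}

# Skip the two header lines, then keep only the first column (package name).
sample_pip_list | tail -n +3 | awk '{print $1}'
```

The resulting list of names is what the command substitution hands to pip install --upgrade.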
Finally, install the NLTK code and all data packages.
pip install nltk
python -m nltk.downloader -d ${NLPLROOT}/software/nltk/3.3/data all