Nordic Language Processing Laboratory - User contributions [en]

Infrastructure/software/catalogue

2022-10-21T11:59:12Z

Yvessche:

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.
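
As a minimal sketch of such a start-up fragment, the following picks the first of the four project directories listed above that exists on the current system, exports it as <tt>$NLPLROOT</tt>, and registers the module directory below it. The helper name <tt>nlpl_find_root</tt> is hypothetical (it is not part of any NLPL installation), and the sketch assumes a Bourne-compatible shell such as <tt>bash</tt>.

```shell
# Hypothetical helper: print the first argument that is an existing directory.
nlpl_find_root() {
  local dir
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # Drop a trailing slash so that paths can be joined uniformly.
      printf '%s\n' "${dir%/}"
      return 0
    fi
  done
  return 1
}

# Candidate roots as listed above: Abel, Taito, Puhti, Saga.
if NLPLROOT=$(nlpl_find_root /projects/nlpl /proj/nlpl /projappl/nlpl /cluster/shared/nlpl); then
  export NLPLROOT
  # Only call `module` where the modules infrastructure is available.
  if command -v module >/dev/null 2>&1; then
    module use -a "$NLPLROOT/software/modules/etc"
  fi
else
  # None of the known project directories exists on this machine.
  unset NLPLROOT
fi
```

On a machine without any of these directories the fragment does nothing, so it should be safe to keep in a start-up script that is shared across systems.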

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>
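
On many installations of the modules infrastructure, <tt>module avail</tt> writes its listing to standard error, which is why the command above redirects it with <tt>2>&1</tt> before filtering. To obtain one NLPL module name per line, a small filter along the following lines may help. The name <tt>nlpl_modules</tt> is a hypothetical helper, and the sketch assumes that module names in the listing are separated by whitespace.

```shell
# Hypothetical filter: read a module listing on standard input and
# print the distinct entries that start with "nlpl-", one per line.
nlpl_modules() {
  # Put every whitespace-separated token on its own line, keep the
  # NLPL modules, then sort and remove duplicates.
  tr -s ' \t' '\n' | grep '^nlpl-' | sort -u
}

# Intended use, once the NLPL module directory has been activated:
#   module avail 2>&1 | nlpl_modules
```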

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0.1-3990724] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-a89691f] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/1.0.1_20221015] || efmaral and eflomal word alignment tools || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_20191218] || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20221015] || A collection of preprocessing and evaluation scripts for machine translation || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20191218] || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/2.3.0 || OpenNMT Python Library || Puhti || October 2022 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM. Some minor fixes were added to the existing install in 2/2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.2.0] || Stanford NLP Neural Pipeline || Saga || ? || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser/2.3.1] || Uppsala Parser || Saga, Abel || December 2019 || Sara Stymne
|-
| [http://wiki.nlpl.eu/index.php/Parsing/turboparser nlpl-turboparser/2.3.0] || TurboParser || Saga || January 2020 || Sara Stymne
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Saga, Puhti, Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2022-10-21T11:58:29Z

Yvessche:

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0.1-3990724] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-a89691f] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/1.0.1_20221015] || efmaral and eflomal word alignment tools || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_20191218] || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20221015] || A collection of preprocessing and evaluation scripts for machine translation || Puhti || October 2022 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20191218] || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/2.3.0 || OpenNMT Python Library || Puhti || October 2022 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM. Some minor fixes were added to the existing install in 2/2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.2.0] || Stanford NLP Neural Pipeline || Saga || ? || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser/2.3.1] || Uppsala Parser || Saga, Abel || December 2019 || Sara Stymne
|-
| [http://wiki.nlpl.eu/index.php/Parsing/turboparser nlpl-turboparser/2.3.0] || TurboParser || Saga || January 2020 || Sara Stymne
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Saga, Puhti, Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Translation/mttools

2022-10-21T11:55:28Z

Yvessche:

== Using the mttools module ==

<ul>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
module load nlpl-mttools/</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''</li>
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: 3990724</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
<li>'''sacremoses'''</li>
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.35</li>
</ul>
<li>'''subword-nmt'''</li>
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.8</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
<li>'''sentencepiece'''</li>
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.97</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
<li>'''sacreBLEU'''</li>
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 2.2.1</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
<li>'''multeval'''</li>
<ul>
<li>Tool to evaluate machine translation with various scores (BLEU, TER, METEOR) and to perform statistical significance testing with bootstrap resampling</li>
<li>Source: https://github.com/jhclark/multeval</li>
<li>Installed version: 0.5.1 with METEOR 1.5</li>
<li>The <code>multeval.sh</code> script is in PATH</li>
</ul>
<li>'''compare-mt'''</li>
<ul>
<li>Compare the output of multiple systems for language generation, including machine translation, summarization, dialog response generation. Computes common evaluation scores and runs analyses to find salient differences between the systems.</li>
<li>To run METEOR, consult the module-specific help page for the exact path.</li>
<li>Source: https://github.com/neulab/compare-mt</li>
<li>Installed version: 0.2.10</li>
<li>The <code>compare-mt</code> executable is in PATH</li>
</ul>
</ul>
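
After loading the module, one way to confirm that the executables listed above are indeed on <code>PATH</code> is a small check like the following. This is only a sketch: <code>missing_tools</code> is a hypothetical helper name, and the tool names in the usage comment are the ones mentioned in the list above (<code>tokenizer.perl</code> is assumed to be found via the <code>tokenizer</code> subfolder of the Moses scripts).

```shell
# Hypothetical helper: print the names of the given executables that
# cannot be found on PATH, separated by spaces (empty line if none).
missing_tools() {
  local tool missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  # Strip the leading space introduced by the concatenation above.
  printf '%s\n' "${missing# }"
}

# Intended use, after `module load` of the mttools module:
#   missing_tools tokenizer.perl subword-nmt spm_train sacrebleu multeval.sh compare-mt
```

An empty result suggests that all listed tools are available; any name that is printed points to a tool that the current environment does not provide.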

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2022-10-21T11:54:18Z

Yvessche:

Translation/home

2022-10-21T11:53:17Z

Yvessche:

Translation/home

2019-12-18T14:31:39Z

Yvessche:

Infrastructure/software/catalogue

2019-12-18T14:31:20Z

Yvessche: /* Activity B: Statistical and Neural Machine Translation */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-a89691f] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_20191218] || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20191218] || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, and SALM. Some minor fixes were added to the existing install in February 2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T14:30:40Z

Yvessche: /* Activity B: Statistical and Neural Machine Translation */

= Background =

This page provides a high-level summary of NLPL-specific software installed on the systems used by the project (Abel, Taito, Puhti, and Saga).
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow arguably occupy a middle ground with respect to this
rule of thumb:
in principle they are not discipline-specific, but as of mid-2018 at least, demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

We recommend that NLPL users add the above <tt>module use</tt> command
to their shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.
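
As a minimal sketch of such a start-up snippet (not an official NLPL script), the following probes the four project directories listed above and activates the module directory of whichever one exists. The helper name <tt>first_existing_dir</tt> is hypothetical, and the snippet assumes that <tt>software/modules/etc</tt> sits at the same relative location on every system, as in the commands above.

```shell
# Print the first argument that is an existing directory; fail if none is.
# (Hypothetical helper, not part of the NLPL installation.)
first_existing_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate project directories for Abel, Taito, Puhti, and Saga.
if root=$(first_existing_dir /projects/nlpl /proj/nlpl /projappl/nlpl /cluster/shared/nlpl); then
    export NLPLROOT="$root"
    # Only call 'module' where the modules infrastructure is available.
    if command -v module >/dev/null 2>&1; then
        module use -a "$NLPLROOT/software/modules/etc"
    fi
fi
```

On a machine where none of the directories exists, the snippet does nothing, so the same start-up file can be shared across systems.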

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2019 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-a89691f] || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_20191218] || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/mttools nlpl-mttools/20191218] || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Marian-NMT_module nlpl-marian-nmt/1.8.0-eba7aed] || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, and SALM. Some minor fixes were added to the existing install in February 2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Translation/mttools

2019-12-18T14:22:25Z

Yvessche: /* Using the mttools module */

== Using the mttools module ==

<ul>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
module load nlpl-mttools/</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools/20191218</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''</li>
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: a89691f</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
<li>'''sacremoses'''</li>
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.35</li>
</ul>
<li>'''subword-nmt'''</li>
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.7</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
<li>'''sentencepiece'''</li>
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.85</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
<li>'''sacreBLEU'''</li>
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.4.3</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
<li>'''multeval'''</li>
<ul>
<li>Tool to evaluate machine translation with various scores (BLEU, TER, METEOR) and to perform statistical significance testing with bootstrap resampling</li>
<li>Source: https://github.com/jhclark/multeval</li>
<li>Installed version: 0.5.1 with METEOR 1.5</li>
<li>The multeval.sh script is in PATH</li>
</ul>
<li>'''compare-mt'''</li>
<ul>
<li>Compare the output of multiple systems for language generation, including machine translation, summarization, dialog response generation. Computes common evaluation scores and runs analyses to find salient differences between the systems.</li>
<li>To run METEOR, consult the module-specific help page for the exact path.</li>
<li>Source: https://github.com/neulab/compare-mt</li>
<li>Installed version: 0.2.7</li>
<li>The compare-mt executable is in PATH</li>
</ul>
</ul>
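
A typical way to chain several of these tools is sketched below. This is an illustration only, not a recommended NLPL recipe: the function names, the output file suffixes, and the default of 10000 merge operations are assumptions made for the example. It relies on <code>tokenizer.perl</code>, <code>subword-nmt</code> and <code>sacrebleu</code> being in PATH after loading the module, as listed above.

```shell
# Tokenize a training file, learn BPE codes on it, and apply them.
# $1: language code (e.g. en), $2: training file, $3: number of BPE merges.
# Output files get the hypothetical suffixes .tok, .codes and .bpe.
mt_preprocess() {
    local lang=$1 train=$2 merges=${3:-10000}
    tokenizer.perl -l "$lang" < "$train" > "$train.tok"
    subword-nmt learn-bpe -s "$merges" < "$train.tok" > "$train.codes"
    subword-nmt apply-bpe -c "$train.codes" < "$train.tok" > "$train.bpe"
}

# Score system output ($1) against a reference ($2) with sacreBLEU.
# Both files are expected to be detokenized, one sentence per line.
mt_score() {
    local hyp=$1 ref=$2
    sacrebleu "$ref" < "$hyp"
}
```

After defining the functions, a call such as <code>mt_preprocess en train.en</code> followed later by <code>mt_score output.txt reference.txt</code> would run the steps; the file names here are placeholders.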

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/mttools

2019-12-18T14:22:00Z

Yvessche: /* Using the mttools module */

== Using the mttools module ==

<ul>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
module load nlpl-mttools/</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools/20191218</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''</li>
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: a89691f</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
<li>'''sacremoses'''</li>
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.35</li>
</ul>
<li>'''subword-nmt'''</li>
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.7</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
<li>'''sentencepiece'''</li>
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.85</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
<li>'''sacreBLEU'''</li>
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.4.3</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
<li>'''multeval'''</li>
<ul>
<li>Tool to evaluate machine translation with various scores (BLEU, TER, METEOR) and to perform statistical significance testing with bootstrap resampling</li>
<li>Source: https://github.com/jhclark/multeval</li>
<li>Installed version: 0.5.1 with METEOR 1.5</li>
<li>The multeval.sh script is in PATH</li>
</ul>
<li>'''compare-mt'''</li>
<ul>
<li>Compare the output of multiple systems for language generation, including machine translation, summarization, dialog response generation. Computes common evaluation scores and runs analyses to find salient differences between the systems.</li>
<li>To run METEOR, consult the output of <code>module spider nlpl-mttools</code> for the exact path.</li>
<li>Source: https://github.com/neulab/compare-mt</li>
<li>Installed version: 0.2.7</li>
<li>The compare-mt executable is in PATH</li>
</ul>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:17:02Z

Yvessche: /* Neural machine translation */

Translation/home

2019-12-18T14:16:36Z

Yvessche:

Translation/home

2019-12-18T14:15:37Z

Yvessche: /* Using the Efmaral module */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* '''OpenNMT-py''' is installed on Saga using NLPL-internal PyTorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* '''OpenNMT-py''' is installed on Puhti using system-wide PyTorch: <code>nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
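
Because the two project directories above are written once without and once with a trailing slash, scripts that build dataset paths may want to normalize <code>$NLPL</code> first. The following is a small sketch under that assumption; the helper name <code>nlpl_dataset_dir</code> is hypothetical.

```shell
# Build the path of a translation dataset below the NLPL project directory.
# ${NLPL%/} strips one trailing slash, so both spellings of $NLPL work.
nlpl_dataset_dir() {
    printf '%s/data/translation/%s\n' "${NLPL%/}" "$1"
}

# Example for Saga; on Puhti, set NLPL=/projappl/nlpl instead.
NLPL=/cluster/shared/nlpl/
nlpl_dataset_dir iwslt17
```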

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Activate the NLPL module repository:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre>
</li>
<li>Load the Moses module:
<pre>module load nlpl-moses/4.0-a89691f</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses/4.0-a89691f</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Activate the NLPL module repository:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre>
</li>
<li>Load the Efmaral module:
<pre>
module load nlpl-efmaral/0.1_20191218
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>
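
fast_align, whose <code>atools</code> utility ships with this module, reads one sentence pair per line in the form <code>source ||| target</code>. The sketch below builds such a file from two sentence-aligned text files. The helper name is hypothetical, and whether and with which options <code>align.py</code> and <code>align_eflomal.py</code> accept this format should be checked with their <code>--help</code> output; this page does not state it.

```shell
# Join two sentence-aligned files line by line as "source ||| target".
# (Hypothetical helper; assumes neither file contains tab characters.)
make_fastalign_input() {
    paste "$1" "$2" | awk -F'\t' '{ print $1 " ||| " $2 }'
}

# Usage with placeholder file names:
#   make_fastalign_input corpus.src corpus.tgt > corpus.src-tgt
```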

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but any serious use requires CUDA, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained in the [https://github.com/robertostling/hnmt#log-files HNMT documentation on log files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
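
To compare several runs, the headline scores can be pulled out of the job output. The following sketch assumes the exact line format shown above (<code>BLEU = ...</code> and <code>chrF = ...</code>) and skips the lowercased <code>LC BLEU</code> line; the function name is hypothetical.

```shell
# Print the cased BLEU and chrF values from HNMT evaluation output on stdin.
# Assumes lines of the form "BLEU = <score> (...)" and "chrF = <score> (...)".
extract_scores() {
    awk '$1 == "BLEU" || $1 == "chrF" { print $1, $3 }'
}

# Example with the lines shown above; for real runs, pipe in the job output file.
extract_scores <<'EOF'
BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)
EOF
```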

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training; the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
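
When <code>THEANO_FLAGS</code> may be unset, the <code>export</code> line above leaves a stray leading comma in the value. A slightly more defensive variant, sketched here with a hypothetical helper name, only inserts the comma when there is existing content:

```shell
# Join an existing comma-separated flag string ($1, possibly empty)
# with one more setting ($2), without producing a leading comma.
append_flag() {
    printf '%s\n' "${1:+$1,}$2"
}

THEANO_FLAGS=$(append_flag "${THEANO_FLAGS:-}" "gpuarray.preallocate=0.8")
export THEANO_FLAGS
```

The value 0.8 is the one used in the example above; lower it further if the out-of-memory error persists.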

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to 24h of run time, which is not enough to complete the training process. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:14:56Z

Yvessche: /* Using the Moses module */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
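
Since the project directory differs between the two clusters, scripts that should run on both can set <code>$NLPL</code> by probing for the directory. The following is a minimal sketch under that assumption; the function name <code>nlpl_first_existing_dir</code> is hypothetical and not part of any NLPL module, while the two candidate paths are the ones given above.

```shell
# Hypothetical helper: print the first candidate directory that exists.
nlpl_first_existing_dir() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}

# Candidate locations as listed on this page (Puhti first, then Saga).
if NLPL=$(nlpl_first_existing_dir /projappl/nlpl /cluster/shared/nlpl); then
  export NLPL
  # List the translation datasets available on this cluster.
  ls "$NLPL/data/translation"
else
  echo "No NLPL project directory found on this machine" >&2
fi
```

On a machine that is neither Puhti nor Saga, the sketch only prints a message to standard error and leaves <code>$NLPL</code> empty.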

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Activate the NLPL module repository:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre>
</li>
<li>Load the Moses module:
<pre>module load nlpl-moses/4.0-a89691f</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses/4.0-a89691f</pre>
</li>
</ul>
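
After loading the module, it can be useful to check which tools actually ended up on the search path before writing longer scripts. The sketch below is one way to do that; the helper <code>report_tools</code> is hypothetical, and the executable names passed to it are assumptions (the authoritative paths are those shown by <code>module help nlpl-moses/4.0-a89691f</code>).

```shell
# Hypothetical helper: report which of the given command names are on PATH.
report_tools() {
  for tool in "$@"; do
    if path=$(command -v "$tool" 2>/dev/null); then
      printf 'found    %s -> %s\n' "$tool" "$path"
    else
      printf 'missing  %s\n' "$tool"
    fi
  done
}

# The tool names below are assumptions; adjust them to what the module help lists.
report_tools moses mgiza fast_align
```

A tool reported as missing is not necessarily absent from the installation; it may simply have to be called by its absolute path.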

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
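
The training and testing steps above can also be submitted in one go, so that testing starts only after training has finished successfully. This is a minimal sketch, not part of the provided example scripts: the function name <code>submit_train_then_test</code> is hypothetical, and it relies on the standard <code>sbatch</code> options <code>--parsable</code> and <code>--dependency</code>.

```shell
# Hypothetical helper: submit train.sh, then test.sh so that testing
# only starts if the training job completed successfully.
submit_train_then_test() {
  # --parsable makes sbatch print only the job ID (optionally followed by ";cluster").
  train_id=$(sbatch --parsable train.sh) || return 1
  # Strip an optional ";cluster" suffix from the job ID.
  train_id=${train_id%%;*}
  # afterok: start test.sh only after the training job exited with status 0.
  sbatch --dependency="afterok:${train_id}" test.sh
}

if command -v sbatch >/dev/null 2>&1; then
  submit_train_then_test
else
  echo "sbatch not found; run this on a cluster login node" >&2
fi
```

Run it from the directory that holds your copies of <code>train.sh</code> and <code>test.sh</code>, after <code>prepare.sh</code> has completed.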

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training; the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
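
For the last item, the flag can be appended without leaving a stray leading comma when <code>THEANO_FLAGS</code> was previously empty or unset. The following is a sketch only; the helper name <code>append_flag</code> is hypothetical, and the value 0.8 is the one suggested above.

```shell
# Hypothetical helper: append one flag to a comma-separated flag string,
# without producing a leading comma when the existing string is empty.
append_flag() {
  existing=$1
  flag=$2
  if [ -n "$existing" ]; then
    printf '%s,%s\n' "$existing" "$flag"
  else
    printf '%s\n' "$flag"
  fi
}

# Keep whatever was already set and add the lower pre-allocation value.
THEANO_FLAGS=$(append_flag "${THEANO_FLAGS:-}" "gpuarray.preallocate=0.8")
export THEANO_FLAGS
echo "$THEANO_FLAGS"
```

The same effect can be had in a single line with the shell expansion <code>${THEANO_FLAGS:+$THEANO_FLAGS,}</code> in front of the new flag; the function form is used here only because it is easier to read.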

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded as follows:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to 24h of run time, which is not enough to complete the training process. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:14:40Z

Yvessche: /* Using the Moses module */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Activate the NLPL module repository:
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre>
</li>
<li>Load the Moses module:
<pre>module load nlpl-moses/4.0-a89691f</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses/4.0-a89691f</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training; the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded as follows:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to 24h of run time, which is not enough to complete the training process. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:12:59Z

Yvessche: /* Available software and data */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
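The reported scores can be cross-checked from their components. The following is a minimal sketch, not part of HNMT: it assumes that chrF is the F-score of the logged precision and recall with β = 3 (an assumption that the figures above are consistent with), and that BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. The function name <code>recompute_scores</code> is hypothetical.

```shell
# Recompute chrF and BLEU from the component values copied from the log above.
recompute_scores() {
    awk 'BEGIN {
        # chrF as an F-beta score; beta = 3 is an assumption, not taken from HNMT.
        p = 0.355720; r = 0.306064; beta = 3
        chrf = (1 + beta^2) * p * r / (beta^2 * p + r)

        # BLEU as brevity penalty times the geometric mean of the 1- to 4-gram precisions.
        bp = 1.0
        bleu = bp * exp((log(0.303002) + log(0.086025) + log(0.032001) + log(0.013334)) / 4)

        printf "chrF = %.6f\nBLEU = %.6f\n", chrf, bleu
    }'
}

recompute_scores
```

Both values should agree with the logged scores up to rounding, which is a quick way to check that a <code>test.*.out</code> file has been read correctly.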

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
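The <code>export</code> line in the last item leaves a leading comma in <code>THEANO_FLAGS</code> when the variable was previously empty. A minimal sketch of a helper that avoids this is shown here; the function name <code>append_theano_flag</code> is hypothetical and not part of HNMT or Theano.

```shell
# Append one setting to THEANO_FLAGS while keeping whatever is already there.
append_theano_flag() {
    if [ -n "${THEANO_FLAGS:-}" ]; then
        # Existing flags: separate the new setting with a comma.
        THEANO_FLAGS="${THEANO_FLAGS},$1"
    else
        # No flags yet: avoid a leading comma.
        THEANO_FLAGS="$1"
    fi
    export THEANO_FLAGS
}

append_theano_flag "gpuarray.preallocate=0.8"
echo "THEANO_FLAGS=$THEANO_FLAGS"
```

The call would go into the SLURM script before the <code>srun hnmt.py ...</code> line.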

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script's 24h time limit is too short to complete training. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:12:23Z

Yvessche: /* Available software and data */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* The Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* The word alignment tools efmaral and eflomal are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* Marian-NMT is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* OpenNMT-py is installed on Saga using NLPL-internal PyTorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* OpenNMT-py is installed on Puhti using system-wide PyTorch: <code>nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
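The paths above are written relative to <code>$NLPL</code>. A minimal sketch for setting that variable on either cluster is given here, assuming it is not already defined in your environment; the function name <code>nlpl_root</code> is hypothetical.

```shell
# Print the first existing directory among the candidates.
# Without arguments, the two project locations named above are tried.
nlpl_root() {
    if [ "$#" -eq 0 ]; then
        set -- /projappl/nlpl /cluster/shared/nlpl
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            # Strip a trailing slash so that "$NLPL/data/..." stays tidy.
            printf '%s\n' "${dir%/}"
            return 0
        fi
    done
    return 1
}

if NLPL=$(nlpl_root); then
    export NLPL
    echo "NLPL=$NLPL"
else
    echo "No NLPL project directory found on this machine" >&2
fi
```

With the variable set, a dataset directory can be listed with, for example, <code>ls "$NLPL/data/translation/iwslt17"</code>.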

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend starting with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
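The points above can be combined into one batch script. What follows is only a sketch under stated assumptions: the partition name, the GPU request and the time limit are placeholders that have to be checked against the cluster documentation, and the function name <code>write_hnmt_job</code> and the file name <code>hnmt_job.sh</code> are hypothetical. The example scripts in <code>hnmt_examples</code> remain the reference.

```shell
# Write a minimal batch script to the file given as first argument.
write_hnmt_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
# Partition name and GPU request below are assumptions; adapt them to the cluster.
#SBATCH --job-name=hnmt-train
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00
#SBATCH --output=training.%j.out

module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt

# Lower the GPU pre-allocation while keeping any flags that are already set.
export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8

# srun is required even inside a batch script (see item 1 above).
# Options given after the script name on the sbatch command line are passed on.
srun hnmt.py "$@"
EOF
}

write_hnmt_job hnmt_job.sh
# Syntax check only; nothing is submitted here.
sh -n hnmt_job.sh && echo "hnmt_job.sh written"
```

The generated file would be submitted with <code>sbatch hnmt_job.sh</code> followed by the desired <code>hnmt.py</code> options.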

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script's 24h time limit is too short to complete training. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-18T14:10:46Z

Yvessche:

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM:
** Release 4.0, installed on Puhti and Saga as <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_20191218</code> installed on Puhti and Saga ([[#Using the Efmaral module|usage notes below]])

=== Neural machine translation ===

* Marian-NMT is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian module|Usage notes below.]]
* OpenNMT-py is installed on Saga using NLPL-internal PyTorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.
* OpenNMT-py is installed on Puhti using system-wide PyTorch: <code>nlpl-opennmt-py/1.0.0</code>.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend starting with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script's 24h time limit is too short to complete training. Also, multi-GPU jobs consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Infrastructure/software/catalogue

2019-12-18T13:58:37Z

Yvessche: /* On Abel and Taito */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.
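One way to do this without ending up with duplicate entries is sketched here. The helper name <code>add_line_once</code> is hypothetical, and the demonstration writes to a scratch file rather than to the real start-up script; the Saga path is used as the example.

```shell
# Append a line to a file unless an identical line is already present.
add_line_once() {
    # $1 = file, $2 = line to add
    touch "$1"
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Demonstrate on a scratch file; use "$HOME/.bashrc" instead once you are happy with it.
rc_file=$(mktemp)
add_line_once "$rc_file" "module use -a /cluster/shared/nlpl/software/modules/etc"
add_line_once "$rc_file" "module use -a /cluster/shared/nlpl/software/modules/etc"
cat "$rc_file"
rm -f "$rc_file"
```

Calling the helper twice leaves a single line in the file, so it is safe to rerun from a setup script.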

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-moses/4.0-a89691f || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-efmaral/0.1_20191218 || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-mttools/20191218 || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, and SALM. Minor fixes were added to the existing installation in February 2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T13:58:30Z

Yvessche: /* On Saga and Puhti */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.
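
Because the project directory differs between systems, a start-up snippet can probe for it rather than hard-code one path. The following is a minimal sketch, assuming a POSIX-compatible shell; the function name <tt>nlpl_find_root</tt> is hypothetical and not part of any NLPL installation, and the candidate paths are simply the ones listed above.

```shell
# Hypothetical helper (not provided by NLPL): print the first candidate
# directory that contains the NLPL module tree, or fail if none does.
nlpl_find_root() {
  for candidate in "$@"; do
    if [ -d "$candidate/software/modules/etc" ]; then
      # Strip a trailing slash, if any, before printing the path.
      printf '%s\n' "${candidate%/}"
      return 0
    fi
  done
  return 1
}

# Candidate paths as listed above for Abel, Taito, Puhti, and Saga.
if NLPLROOT=$(nlpl_find_root /projects/nlpl /proj/nlpl /projappl/nlpl /cluster/shared/nlpl); then
  export NLPLROOT
  # Only call `module` where the modules infrastructure is actually set up.
  if command -v module >/dev/null 2>&1; then
    module use -a "$NLPLROOT/software/modules/etc"
  fi
else
  unset NLPLROOT
fi
```

On a machine where none of the directories exist, the snippet leaves <tt>$NLPLROOT</tt> unset and does not touch the module search path.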

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>
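
The output of <tt>module avail</tt> is typically laid out in columns, so a plain <tt>grep</tt> returns whole lines that may mix NLPL and other modules. As a sketch, assuming that module names are separated by spaces or tabs in that output, a small filter (the name <tt>nlpl_module_names</tt> is hypothetical) can print one NLPL module per line:

```shell
# Hypothetical filter: read a `module avail` listing on standard input and
# print each distinct token that starts with "nlpl-", one per line, sorted.
nlpl_module_names() {
  tr -s ' \t' '\n\n' | grep '^nlpl-' | sort -u
}

# Usage on a system with the modules infrastructure (commented out here):
#   module avail 2>&1 | nlpl_module_names
```

The exact layout of the listing depends on the modules implementation installed on each system, so the filter may need adjusting.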

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-moses/4.0-a89691f || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-efmaral/0.1_20191218 || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-mttools/20191218 || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, and SALM. Minor fixes were added to the existing installation in February 2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T13:54:55Z

Yvessche: /* On Saga and Puhti */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>
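
Since not every module version is installed on every system, a job script may want to check for a specific name and version before trying to load it. The sketch below assumes that the listing separates module names by spaces or tabs; the function name <tt>nlpl_has_module</tt> is hypothetical, and <tt>nlpl-nltk/3.3</tt> is used only as an example taken from the catalogue.

```shell
# Hypothetical helper: succeed if the exact name/version given as $1
# appears as a token in the `module avail` listing read from standard input.
nlpl_has_module() {
  tr -s ' \t' '\n\n' | grep -Fx -- "$1" >/dev/null
}

# Usage on a system with the modules infrastructure (commented out here):
#   if module avail 2>&1 | nlpl_has_module nlpl-nltk/3.3; then
#     module load nlpl-nltk/3.3
#   fi
```

The match is on the whole token, so asking for <tt>nlpl-nltk/3</tt> does not succeed just because <tt>nlpl-nltk/3.3</tt> is installed.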

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-moses/4.0-a89691f || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-efmaral/0.1_20191218 || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-mttools/20191218 || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, and SALM. Minor fixes were added to the existing installation in February 2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2018 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T13:48:46Z

Yvessche: /* Activity B: Statistical and Neural Machine Translation */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2018 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-moses/4.0-a89691f || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-efmaral/0.1_20191218 || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-mttools/20191218 || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM. Some minor fixes were added to the existing install in 2/2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/taito_abel#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T13:47:00Z

Yvessche: /* Activity B: Statistical and Neural Machine Translation */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.
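
Because the project directory differs between systems, a start-up script that is shared across them could detect the right location instead of hard-coding it. The following is only a sketch under that assumption: the function name <tt>nlpl_find_root</tt> is hypothetical, and the candidate directories are the ones listed above.

```shell
# Hypothetical helper: print the first existing directory among the arguments.
nlpl_find_root() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      # Strip a trailing slash so paths can be joined consistently.
      printf '%s\n' "${dir%/}"
      return 0
    fi
  done
  return 1
}

# Candidate roots as listed on this page: Abel, Taito, Puhti, Saga.
if NLPLROOT=$(nlpl_find_root /projects/nlpl /proj/nlpl /projappl/nlpl /cluster/shared/nlpl); then
  export NLPLROOT
  # The module command is provided by the system; do nothing if it is absent.
  if command -v module >/dev/null 2>&1; then
    module use -a "$NLPLROOT/software/modules/etc"
  fi
fi
```

On a machine where none of the candidate directories exists, the snippet leaves the environment untouched.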

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2019 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-moses/4.0-a89691f || Moses SMT system, including GIZA++, MGIZA, fast_align || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-efmaral/0.1_20191218 || efmaral and eflomal word alignment tools || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-mttools/20191218 || A collection of preprocessing and evaluation scripts for machine translation || Puhti, Saga || December 2019 || Yves Scherrer
|-
| nlpl-opennmt-py/1.0.0rc2/3.7 || OpenNMT Python Library || Saga || October 2019 || Stephan Oepen
|-
| nlpl-opennmt-py/1.0.0 || OpenNMT Python Library || Puhti || December 2019 || Yves Scherrer
|-
| nlpl-marian-nmt/1.8.0-eba7aed || Marian neural machine translation system || Puhti, Saga || December 2019 || Jörg Tiedemann
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM. Some minor fixes were added to the existing install in 2/2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Infrastructure/software/catalogue

2019-12-18T13:41:49Z

Yvessche: /* Activity B: Statistical and Neural Machine Translation */

= Background =

This page provides a high-level summary of NLPL-specific software installed on either of our two systems.
As a rule of thumb, NLPL aims to build on generic software installations provided by the
system maintainers (e.g. development tools and libraries that are not discipline-specific),
using the [http://modules.sourceforge.net/ <tt>module</tt>s infrastructure].
For example, an environment like OpenNMT is unlikely to be used by other disciplines,
and NLPL stands to gain from in-house, shared expertise that comes with maintaining
a project-specific installation.
On the other hand, the CUDA libraries are general extensions to the operating system
that most users of deep learning frameworks on GPUs will want to use; hence, CUDA is
most appropriately installed by the core system maintainers.
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this
rule of thumb:
In principle, they are not discipline-specific, but in mid-2018 at least the demand for
installations of these frameworks is strong within NLPL, and the project will likely
benefit from growing its competencies in this area.

= Module Catalogue =

The discipline-specific modules maintained by NLPL are not activated by default.
To make available the NLPL community directory of software modules, on top of the
pre-configured, system-wide modules, one needs to execute the following
(on Abel, Puhti, or Taito):

<pre>
module use -a /proj*/nlpl/software/modules/etc
</pre>

For Saga, the NLPL community directory is in a different location:

<pre>
module use -a /cluster/shared/nlpl/software/modules/etc
</pre>

We will at times assume a shell variable <tt>$NLPLROOT</tt> that points to the
top-level project directory, i.e. <tt>/projects/nlpl/</tt> (on Abel),
<tt>/proj/nlpl/</tt> (on Taito),
<tt>/projappl/nlpl/</tt> (on Puhti), and
<tt>/cluster/shared/nlpl/</tt> (on Saga).

For NLPL users, we recommend that one adds the above <tt>module use</tt> command
to the shell start-up script, e.g. <tt>.bashrc</tt> in the user home directory.

To inspect what is available, one can use the <tt>avail</tt> sub-command
(on Abel), e.g.
<pre>
module avail 2>&1 | grep nlpl
</pre>

= User-Installed Software =

Even if NLPL strives to make available a comprehensive set of ready-to-run software modules,
users will at times want to install their own add-on components.
For Python add-on components, some
[http://wiki.nlpl.eu/index.php/Infrastructure/software/user emerging instructions] are available.

= Activity A: Basic Infrastructure =

Interoperability of NLPL installations with each other, as well as with system-wide
software that is maintained by the core operations teams for Abel and Taito, is no
small challenge; neither is parallelism across the two systems, for example in
available software (and versions) and techniques for ‘mixing and matching’.
These challenges are discussed in some more detail with regard to the
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]
and with regard to
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cupy/5.4.0 || Matrix Library Accelerated by CUDA || Abel (3.7) || May 2019 || Stephan Oepen
|-
| nlpl-cython/0.29.3 || C Extensions for Python || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-dynet/2.1 || DyNet Dynamic Neural Network Toolkit (CPU) || Abel (3.5, 3.7) || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.0.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/1.1.0] || PyTorch Deep Learning Framework (CPU and GPU) || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/python nlpl-scipy/201901] || SciPy Ecosystem of Python Add-Ons || Abel (3.5, 3.7) || January 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen
|}

= Activity B: Statistical and Neural Machine Translation =

=== On Saga and Puhti ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
|}

=== On Abel and Taito ===

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb] || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-65c75ff] || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM. Some minor fixes were added to the existing install in 2/2018; these should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools || Taito || July 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools || Taito, Abel || November 2017 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools || Taito, Abel || December 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system || Taito || March 2018 || Yves Scherrer
|-
| marian/1.5 || Marian neural machine translation system || Taito || June 2018 || CSC staff
|-
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation || Taito, Abel || December 2018 || Yves Scherrer
|}

= Activity C: Data-Driven Parsing =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-corenlp/3.9.2 || Stanford CoreNLP Suite (Including All Models) || Abel || May 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/stanfordnlp nlpl-stanfordnlp/0.1.1] || Stanford NLP Neural Pipeline || Abel || February 2019 || Stephan Oepen
|-
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 ||
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe_future/3.7] || UDPipe Future || Abel || June 2019 || Andrey Kutuzov
|-
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen
|}

= Activity E: Pre-Trained Word Embeddings =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-gensim/3.6.0 || Topic Modeling and Word Vectors Library || Taito, Abel || October 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.0 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || December 2018 || Stephan Oepen
|-
| nlpl-gensim/3.7.3 || Topic Modeling and Word Vectors Library || Abel (3.5, 3.7) || May 2019 || Stephan Oepen
|}

= Activity G: OPUS Parallel Corpus =

{| class="wikitable"
|-
! Module Name/Version !! Description !! System !! Install Date !! Maintainer
|-
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB) || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.1 || Various OPUS Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|-
| nlpl-opus/0.2 || Various OPUS Tools || Taito, Abel || 2018 || Jörg Tiedemann
|-
| nlpl-opus/201901 || Various OPUS Tools || Taito, Abel || January 2019 || Jörg Tiedemann
|-
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools || Taito, Abel || November 2017 || Jörg Tiedemann
|}

Translation/home

2019-12-16T13:05:34Z

Yvessche: /* Datasets */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at ???.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
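
To check which of these datasets are actually present on a given system, one could list the directories below the project root together with their sizes. This is a minimal sketch: the function name <code>nlpl_dataset_sizes</code> is hypothetical, and the root directory is passed explicitly because <code>$NLPL</code> may not be set in every shell.

```shell
# Hypothetical helper: print the disk usage of each dataset directory
# found below ROOT/data/translation.
nlpl_dataset_sizes() {
  root=${1:?usage: nlpl_dataset_sizes ROOT}
  for dir in "$root"/data/translation/*/; do
    # Skip the unexpanded pattern if no dataset directory exists.
    [ -d "$dir" ] || continue
    du -sh "$dir"
  done
}

# Example call, using the Puhti location given above as a fallback:
# nlpl_dataset_sizes "${NLPL:-/projappl/nlpl}"
```

Sizes reported by <code>du</code> may differ somewhat from the figures quoted in the list, depending on the file system.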

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run in the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
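
The <code>THEANO_FLAGS</code> advice in the last item can be sketched as a small shell helper. This is a minimal sketch only: the function name <code>append_theano_flag</code> is hypothetical and not part of HNMT or Theano. It appends a flag to whatever is already in the variable and avoids a stray leading comma when the variable starts out empty.

```shell
# Append one flag to THEANO_FLAGS without overwriting existing content.
# ${VAR:+...} expands to "$VAR," only when VAR is set and non-empty,
# so an empty THEANO_FLAGS does not produce a leading comma.
# (append_theano_flag is a hypothetical helper name.)
append_theano_flag() {
    THEANO_FLAGS="${THEANO_FLAGS:+$THEANO_FLAGS,}$1"
    export THEANO_FLAGS
}

# Lower the pre-allocation value, as suggested for the out-of-memory error.
append_theano_flag "gpuarray.preallocate=0.8"
echo "$THEANO_FLAGS"
```

Place the call in the SLURM script before the <code>srun hnmt.py ...</code> line so that the setting is visible to the job.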

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T13:05:14Z

Yvessche: /* Datasets */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at ???.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
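
The dataset paths above all hang off <code>$NLPL</code>. The following is a minimal sketch of resolving one of them in a script. It assumes the Puhti location given above as the default, and <code>dataset_dir</code> is a hypothetical helper name, not something provided by NLPL.

```shell
# Use the Puhti project directory unless $NLPL is already set.
NLPL="${NLPL:-/projappl/nlpl}"

# Build the path of a dataset from its directory name in the list above.
# (dataset_dir is a hypothetical helper name.)
dataset_dir() {
    echo "$NLPL/data/translation/$1"
}

dir=$(dataset_dir iwslt17)
if [ -d "$dir" ]; then
    ls "$dir"
else
    echo "not available on this system: $dir" >&2
fi
```

The existence check keeps a script from failing silently when it runs on a system where the data has not been installed.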

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>
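
The two <code>module use</code> lines above differ only in the project root, so a script shared between Taito and Abel can pick whichever directory exists. This is a sketch under that assumption; <code>first_existing_dir</code> is a hypothetical helper name.

```shell
# Print the first argument that is an existing directory; fail if none is.
# (first_existing_dir is a hypothetical helper name.)
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Usage on Taito or Abel:
#   modules=$(first_existing_dir /proj/nlpl/software/modulefiles \
#                                /projects/nlpl/software/modulefiles) &&
#       module use -a "$modules" &&
#       module load nlpl-moses
```

The function returns a non-zero status when neither directory is present, so the <code>module</code> commands are skipped on a system without an NLPL installation.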

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend starting with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>
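
Since <code>test.sh</code> depends on the model written by <code>train.sh</code>, the two submissions can be chained so that testing starts only after training has finished successfully. The sketch below is not part of the provided example scripts. It relies on the standard SLURM options <code>--parsable</code> and <code>--dependency=afterok</code>; the function name and the <code>SBATCH</code> override variable are hypothetical.

```shell
# Command used for submission; can be overridden for a dry run.
# (SBATCH and submit_train_then_test are hypothetical names.)
SBATCH="${SBATCH:-sbatch}"

submit_train_then_test() {
    # --parsable prints only the job ID, optionally followed by ";cluster".
    train_id=$("$SBATCH" --parsable train.sh) || return 1
    train_id="${train_id%%;*}"
    # afterok: start test.sh only if the training job exits successfully.
    "$SBATCH" --dependency=afterok:"$train_id" test.sh
}

# Usage, from the directory holding your copy of the scripts:
#   submit_train_then_test
```

Run <code>prepare.sh</code> first, as described in step 1, since both jobs expect the prepared data to be in place.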

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run in the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T13:04:08Z

Yvessche: /* Datasets */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at ???.

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend starting with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>
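
To compare your own run against the reference scores above, the headline value of each metric can be pulled out of the <code>test.*.out</code> file. The following is a minimal sketch that assumes the output lines have exactly the format shown above; <code>extract_scores</code> is a hypothetical helper name and not part of HNMT.

```shell
# Read HNMT test output on stdin and print "metric<TAB>score" for each
# BLEU, LC BLEU and chrF line; all other lines are ignored.
# (extract_scores is a hypothetical helper name.)
extract_scores() {
    awk -F ' = ' '/^(LC BLEU|BLEU|chrF) = / {
        split($2, parts, " ")      # first token after " = " is the score
        print $1 "\t" parts[1]
    }'
}

# Usage:
#   cat test.*.out | extract_scores
```

Only the overall score is kept; the n-gram precisions and the brevity penalty inside the parentheses are dropped.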

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run in the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T13:03:13Z

Yvessche: /* Datasets */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): 
<pre>$NLPL/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): 
<pre>$NLPL/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>
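
The activation step differs between the two systems only in the path, so a script shared between Taito and Abel could pick whichever directory exists. The following is a minimal sketch, assuming the two paths listed above; <code>first_existing_dir</code> is a hypothetical helper name and not part of the modules infrastructure.

```shell
# Print the first argument that is an existing directory.
# first_existing_dir is a hypothetical helper, not part of the module system.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Paths as listed above: Taito first, then Abel.
if moddir=$(first_existing_dir /proj/nlpl/software/modulefiles \
                               /projects/nlpl/software/modulefiles); then
    module use -a "$moddir"
    module load nlpl-moses
else
    echo "NLPL module directory not found on this host" >&2
fi
```

On a host where neither directory exists, the sketch only prints a message and does not call <code>module</code>.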

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
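
To compare your own run against the reference scores, it can help to pull the headline numbers out of the <code>test.*.out</code> file. The following is a minimal sketch that assumes the score lines have exactly the format quoted above; <code>score_of</code> is a hypothetical helper and not part of HNMT.

```shell
# Print the headline value of one metric from an HNMT-style score report.
# $1: metric name exactly as printed ("BLEU", "LC BLEU" or "chrF"); $2: file.
# score_of is a hypothetical helper, not part of HNMT.
score_of() {
    awk -F' = ' -v metric="$1" '
        $1 == metric { split($2, parts, " "); print parts[1]; exit }
    ' "$2"
}

# Demo on the reference lines quoted above; in practice, pass your test.*.out file.
report=$(mktemp)
cat > "$report" <<'EOF'
BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)
EOF
score_of "BLEU" "$report"
score_of "chrF" "$report"
rm -f "$report"
```

The metric name is matched against the whole text before the first <code> = </code>, so <code>BLEU</code> does not accidentally match the <code>LC BLEU</code> line.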

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
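
The advice on <code>THEANO_FLAGS</code> in the last item can be wrapped in a small shell function that appends a setting to whatever is already there and avoids a leading comma when the variable is unset. This is a sketch only; <code>add_theano_flag</code> is a hypothetical helper name, not something provided by Theano or the HNMT module.

```shell
# Append one setting to THEANO_FLAGS without overwriting existing flags.
# add_theano_flag is a hypothetical helper, not part of Theano or HNMT.
add_theano_flag() {
    if [ -n "${THEANO_FLAGS:-}" ]; then
        THEANO_FLAGS="${THEANO_FLAGS},$1"
    else
        THEANO_FLAGS="$1"
    fi
    export THEANO_FLAGS
}

# Lower the pre-allocation as suggested above, then show the result.
add_theano_flag "gpuarray.preallocate=0.8"
echo "$THEANO_FLAGS"
```

In a SLURM script, the function would be called once, after loading the module and before the <code>srun hnmt.py ...</code> line.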

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T13:01:06Z

Yvessche: /* Background */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

This page is currently being updated (YS 16.12.2019)

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T13:00:23Z

Yvessche: /* Background */

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Puhti supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the <code>THEANO_FLAGS</code> variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
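
When the pre-allocation setting is changed from a script, it can help to append to <code>THEANO_FLAGS</code> in a way that also behaves well when the variable is empty or unset. The following is a minimal sketch; the function name <code>append_theano_flag</code> is an assumption made for this illustration, and <code>device=cuda</code> in the comment is only an example of a pre-existing flag.

```shell
# Hypothetical helper: append one setting to THEANO_FLAGS without
# overwriting existing content and without producing a leading comma.
append_theano_flag() {
    if [ -n "${THEANO_FLAGS:-}" ]; then
        # Existing content (e.g. "device=cuda") is kept; the new flag is added.
        THEANO_FLAGS="${THEANO_FLAGS},$1"
    else
        THEANO_FLAGS="$1"
    fi
    export THEANO_FLAGS
}
```

In a SLURM script, one would call <code>append_theano_flag gpuarray.preallocate=0.8</code> before the <code>srun hnmt.py ...</code> line.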

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/taito abel

2019-12-16T13:00:00Z

Yvessche:

= Background =

'''This page is outdated and kept for documentation purposes only! It reflects the status of the translation activity mid-2019, before the launch of Puhti and Saga.'''

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
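
Note that <code>/proj[ects]/</code> in the paths above is shorthand for <code>/proj/</code> (Taito) and <code>/projects/</code> (Abel); typed literally in a shell, the bracket expression would match exactly one of the characters <code>e</code>, <code>c</code>, <code>t</code>, <code>s</code> and therefore neither directory. A script meant to run on both systems can probe the two roots instead. The following is a minimal sketch; the function name <code>nlpl_dataset_dir</code> and its optional extra arguments are assumptions made for this illustration.

```shell
# Hypothetical helper: print the directory of a translation dataset on
# whichever system the script runs on.
#   $1       dataset name, e.g. iwslt17
#   $2...    optional candidate project roots (defaults: Taito, then Abel)
nlpl_dataset_dir() {
    name=$1
    shift
    [ "$#" -gt 0 ] || set -- /proj/nlpl /projects/nlpl
    for root in "$@"; do
        if [ -d "$root/data/translation/$name" ]; then
            printf '%s\n' "$root/data/translation/$name"
            return 0
        fi
    done
    # No candidate root contains the dataset.
    return 1
}
```

For example, <code>data=$(nlpl_dataset_dir iwslt17) || exit 1</code> sets <code>$data</code> to the IWSLT17 directory on either system and aborts if it is not found.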

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the <code>THEANO_FLAGS</code> variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/taito abel

2019-12-16T12:59:23Z

Yvessche:

= Background =

'''This page is outdated and kept for documentation purposes only!'''

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the <code>THEANO_FLAGS</code> variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-12-16T12:58:25Z

Yvessche:

= Background =

[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>
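Because the module repository lives at a different path on Taito and on Abel, a shared login script can pick whichever directory exists. The following is only a minimal sketch: the helper name <code>first_existing_dir</code> is hypothetical, and the two candidate paths are the ones listed above.

```shell
# Print the first argument that is an existing directory; return 1 if none exists.
first_existing_dir() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Candidate locations taken from the instructions above (Taito first, then Abel).
if repo=$(first_existing_dir /proj/nlpl/software/modulefiles /projects/nlpl/software/modulefiles); then
    # `module` is only defined on the clusters, so guard the call.
    if command -v module >/dev/null 2>&1; then
        module use -a "$repo"
        module load nlpl-moses
    fi
else
    echo "NLPL module repository not found on this host" >&2
fi
```

On a machine where neither directory exists, the sketch only prints a message to standard error and loads nothing.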

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>
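To compare several test runs, it can help to pull just the headline scores out of the <code>test.*.out</code> files. The sketch below assumes the score lines have exactly the format shown above; the function name <code>extract_scores</code> is hypothetical and not part of HNMT.

```shell
# Read HNMT test output on stdin and print "<metric> <score>" for each
# BLEU, LC BLEU and chrF line; all other lines are ignored.
extract_scores() {
    sed -n -E 's/^(BLEU|LC BLEU|chrF) = ([0-9.]+).*/\1 \2/p'
}

# Example with the three score lines quoted above.
extract_scores <<'EOF'
BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)
EOF
```

In a working directory the same function can be fed with <code>cat test.*.out | extract_scores</code>.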

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
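The <code>THEANO_FLAGS</code> command above produces a leading comma when the variable was previously empty or unset. A minimal sketch of a variant that avoids this is given below; the helper name <code>append_flag</code> is hypothetical, and the value 0.8 is the one suggested above.

```shell
# Append one flag to a comma-separated flag string without a stray comma.
# $1: current flags (may be empty), $2: flag to append
append_flag() {
    printf '%s\n' "${1:+$1,}$2"
}

# Keep whatever is already in THEANO_FLAGS and add the pre-allocation setting.
THEANO_FLAGS=$(append_flag "${THEANO_FLAGS:-}" "gpuarray.preallocate=0.8")
export THEANO_FLAGS
```

The <code>${1:+$1,}</code> expansion inserts the existing flags followed by a comma only when they are non-empty.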

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/taito abel

2019-12-16T12:55:11Z

Yvessche: Created page with "Placeholder"

Placeholder

Translation/home

2019-12-16T12:51:05Z

Yvessche: /* Datasets */

= Background =

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-05-09T12:08:35Z

Yvessche: /* Neural machine translation */

= Background =

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]
* A more recent version of OpenNMT-py is installed on Taito-GPU and can be loaded with <code>module load nlpl-opennmt-py-gpu</code>. This version may solve some CUDA errors observed with the above version on Taito-GPU.

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; try running it through a SLURM script.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/models

2019-05-09T12:04:55Z

Yvessche: /* MT example scripts and pretrained models */

= MT example scripts and pretrained models =

The models and scripts are located at <tt>/proj*/nlpl/data/translation/pretrained-models/</tt>

== wmt18_helsinki-enfi-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>6_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files <tt>testdata.out.fi</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
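Instead of waiting for each script to finish before submitting the next one by hand, the steps can be chained with SLURM job dependencies. The following is a minimal sketch, not part of the provided scripts: <tt>submit_chain</tt> and the <tt>SBATCH</tt> override variable are hypothetical names, and the sketch assumes that every numbered script is a regular <tt>sbatch</tt> script.

```shell
# Submit numbered SLURM scripts as a chain: each job starts only after the
# previous one has finished successfully.
# SBATCH can be overridden for dry runs; it defaults to the real sbatch.
submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -n "$prev" ]; then
            prev=$("${SBATCH:-sbatch}" --parsable --dependency="afterok:$prev" "$script") || return 1
        else
            prev=$("${SBATCH:-sbatch}" --parsable "$script") || return 1
        fi
        # With --parsable the output may be "jobid;cluster"; keep the job id only
        prev=${prev%%;*}
        echo "submitted $script as job $prev"
    done
}
```

From the working directory it could be called as <tt>submit_chain [1-6]_*.sh</tt>, relying on the shell to expand the glob in numerical order. If one step fails, SLURM does not start the later jobs, and they have to be cancelled or resubmitted manually.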

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>6_test.sh</tt> script to your own working directory.
* Provide a tokenized and truecased test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.en</tt> to your working directory.
* Adapt the WORKDIR path in <tt>6_test.sh</tt> and run the script.
* The output of script 6 corresponds to the files <tt>testdata.out.fi</tt> and <tt>evaluation.txt</tt>. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
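The copy-and-adapt steps above can be sketched as a small helper. It rests on an assumption that should be checked against the actual script: that <tt>6_test.sh</tt> sets the variable on a line starting with <tt>WORKDIR=</tt>. The function name <tt>prepare_test_script</tt> is hypothetical.

```shell
# Copy a test script into a working directory and point WORKDIR at it.
# Assumption: the script sets the variable on a line starting with "WORKDIR=".
prepare_test_script() {
    src_script=$1   # path to the provided test script
    workdir=$2      # your own working directory
    mkdir -p "$workdir" || return 1
    target="$workdir/$(basename "$src_script")"
    # Rewrite the WORKDIR assignment while copying; '|' is the sed delimiter
    # because paths contain '/' (the path itself must not contain '|' or '&')
    sed "s|^WORKDIR=.*|WORKDIR=$workdir|" "$src_script" > "$target"
}
```

After running it, the test file (for example <tt>testdata.en</tt>) still has to be placed in the working directory before the script is submitted.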

Yves Scherrer, January 2019

== wmt18_helsinki-enfi-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>4_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with <tt>module load nlpl-opennmt-py</tt> (without the <tt>-gpu</tt> suffix).
* The output of script 4 should be similar to the provided files <tt>testdata.out.fi</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>4_test.sh</tt> script to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.en</tt> to your working directory.
* Adapt the WORKDIR path in <tt>4_test.sh</tt> and run the script.
* The output of script 4 corresponds to the files <tt>testdata.out.fi</tt> and <tt>evaluation.txt</tt>. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== wmt18-fien-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>6_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files <tt>testdata.out.en</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>6_test.sh</tt> script to your own working directory.
* Provide a tokenized and truecased test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.en</tt> to your working directory.
* Adapt the WORKDIR path in <tt>6_test.sh</tt> and run the script.
* The output of script 6 corresponds to the files <tt>testdata.out.en</tt> and <tt>evaluation.txt</tt>. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18-fien-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>4_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with <tt>module load nlpl-opennmt-py</tt> (without the <tt>-gpu</tt> suffix).
* The output of script 4 should be similar to the provided files <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>4_test.sh</tt> script to your own working directory.
* Provide a tokenized and truecased test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.en</tt> to your working directory.
* Adapt the WORKDIR path in <tt>4_test.sh</tt> and run the script.
* The output of script 4 corresponds to the files <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== opus-noen-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for a "low-resource language" from an MT point of view, Norwegian.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>6_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files <tt>testdata_out.tok.en</tt>, <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>6_test.sh</tt> script to your own working directory.
* Provide a tokenized and truecased test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.true.no</tt> to your working directory.
* Adapt the WORKDIR path in <tt>6_test.sh</tt> and run the script.
* The output of script 6 corresponds to the files <tt>testdata_out.tok.en</tt>, <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>.

Yves Scherrer, May 2019

== opus-noen-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the OpenNMT-py toolkit.

The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for a "low-resource language" from an MT point of view, Norwegian.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_prepare.sh</tt> to <tt>4_test.sh</tt> scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with <tt>module load nlpl-opennmt-py</tt> (without the <tt>-gpu</tt> suffix).
* The output of script 4 should be similar to the provided files <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>4_test.sh</tt> script to your own working directory.
* Provide a tokenized and truecased test file (<tt>1_prepare.sh</tt> shows how to do that) or copy <tt>testdata.en</tt> to your working directory.
* Adapt the WORKDIR path in <tt>4_test.sh</tt> and run the script.
* The output of script 4 corresponds to the files <tt>testdata_out.en</tt> and <tt>evaluation.txt</tt>. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== iwslt18_helsinki-euen-marian ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on data from the IWSLT18 low-resource translation task on Basque-to-English
* using the preprocessed and augmented datasets from the University of Helsinki submission
* with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission.
The goals of this example are twofold:
* to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the <tt>1_train.sh</tt>, <tt>2_test.sh</tt>, <tt>validate.sh</tt> and <tt>composeXML.py</tt> scripts to your own working directory.
* Adapt paths if necessary.
* Run the script <tt>1_train.sh</tt>, then <tt>2_test.sh</tt>. The <tt>validate.sh</tt> script is automatically called during training and does not have to be run separately. The <tt>composeXML.py</tt> script is automatically called during testing and does not have to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
* The output of script 2 should be similar to the provided <tt>test.out.en</tt> and <tt>test.out.en.xml</tt> files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the <tt>2_test.sh</tt> and <tt>composeXML.py</tt> scripts to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file or copy <tt>test.eu</tt> to your working directory.
* Adapt the WORKDIR path in <tt>2_test.sh</tt> and run the script.
* The output of script 2 corresponds to the files <tt>test.out.en</tt> and <tt>test.out.en.xml</tt>.
* The script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019

Translation/models

2019-05-09T11:57:57Z

Yvessche: /* opus-noen-onmt */

= MT example scripts and pretrained models =

The models and scripts are located at <tt>/proj*/nlpl/data/translation/pretrained-models/</tt>

== wmt18_helsinki-enfi-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `6_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18_helsinki-enfi-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `4_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== wmt18-fien-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `6_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18-fien-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `4_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== opus-noen-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for a "low-resource language" from an MT point of view, Norwegian.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `6_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.true.no` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`.

Yves Scherrer, May 2019

== opus-noen-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the OpenNMT-py toolkit.

The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for a "low-resource language" from an MT point of view, Norwegian.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `4_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== iwslt18_helsinki-euen-marian ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on data from the IWSLT18 low-resource translation task on Basque-to-English
* using the preprocessed and augmented datasets from the University of Helsinki submission
* with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission.
The goals of this example are twofold:
* to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
* Adapt paths if necessary.
* Run the script `1_train.sh`, then `2_test.sh`. The `validate.sh` script is automatically called during training and does not have to be run separately. The `composeXML.py` script is automatically called during testing and does not have to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
* The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
* Adapt the WORKDIR path in `2_test.sh` and run the script.
* The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
* The script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019

Translation/models

2019-05-09T11:57:15Z

Yvessche: /* opus-noen-moses */

= MT example scripts and pretrained models =

The models and scripts are located at <tt>/proj*/nlpl/data/translation/pretrained-models/</tt>

== wmt18_helsinki-enfi-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_prepare.sh` to `6_test.sh` scripts to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18_helsinki-enfi-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
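Instead of waiting for each step to finish before submitting the next one, the steps can be chained with SLURM job dependencies. The following is a minimal sketch, assuming that every step script can be submitted with `sbatch` as-is and that each step only needs the previous one to succeed; the function name `submit_chain` and the `SBATCH` override variable are introduced here for illustration.

```shell
#!/bin/sh
# Submit step scripts in order; each job starts only after the previous
# one has finished successfully (afterok dependency).
# SBATCH can be overridden, e.g. for a dry run; it defaults to sbatch.
SBATCH=${SBATCH:-sbatch}

submit_chain() {
    prev=""
    for script in "$@"; do
        if [ -z "$prev" ]; then
            prev=$($SBATCH --parsable "$script") || return 1
        else
            prev=$($SBATCH --parsable --dependency=afterok:"$prev" "$script") || return 1
        fi
        # --parsable may print "jobid;cluster"; keep only the job id.
        prev=${prev%%;*}
        echo "submitted $script as job $prev"
    done
}
```

Called as `submit_chain [0-9]_*.sh` in the working directory, this submits the scripts in numeric order, since the shell sorts the glob matches by name.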

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== wmt18-fien-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
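One simple way to judge how "similar" a new output is to the provided one is to count the lines that are identical in both files. The helper below is a rough sketch under that assumption (one translation per line, same line order); the name `count_identical_lines` is hypothetical, and a BLEU comparison via `evaluation.txt` remains the more meaningful check.

```shell
#!/bin/sh
# Count how many lines of NEW_OUTPUT are identical to the line with the
# same number in REFERENCE_OUTPUT.
# Usage: count_identical_lines REFERENCE_OUTPUT NEW_OUTPUT
count_identical_lines() {
    awk '
        NR == FNR { ref[FNR] = $0; next }          # first file: remember lines
        { total++                                  # second file: compare
          if ((FNR in ref) && ref[FNR] == $0) same++ }
        END { printf "%d of %d lines identical\n", same, total }
    ' "$1" "$2"
}
```

For instance, `count_identical_lines /path/to/provided/testdata.out.en testdata.out.en` compares the provided file with the one produced in the working directory; the first path is a placeholder.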

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18-fien-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== opus-noen-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.true.no` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`.

Yves Scherrer, May 2019

== opus-noen-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on parallel data from the OPUS repository
* translating from Norwegian to English
* using the OpenNMT-py toolkit.

''In progress.''

== iwslt18_helsinki-euen-marian ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on data from the IWSLT18 low-resource translation task on Basque-to-English
* using the preprocessed and augmented datasets from the University of Helsinki submission
* with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission.
The goals of this example are twofold:
* to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
* Adapt paths if necessary.
* Run the script `1_train.sh`, then `2_test.sh`. The `validate.sh` script is automatically called during training and does not have to be run separately. The `composeXML.py` script is automatically called during testing and does not have to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
* The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
* Adapt the WORKDIR path in `2_test.sh` and run the script.
* The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
* The script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019

Translation/models

2019-05-09T11:56:50Z

Yvessche: /* opus-noen-moses */

= MT example scripts and pretrained models =

The models and scripts are located at <tt>/proj*/nlpl/data/translation/pretrained-models/</tt>

== wmt18_helsinki-enfi-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18_helsinki-enfi-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== wmt18-fien-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18-fien-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== opus-noen-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on sentence-aligned data from the OPUS collection
* translating from Norwegian to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.true.no` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`.

Yves Scherrer, May 2019

== opus-noen-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on parallel data from the OPUS repository
* translating from Norwegian to English
* using the OpenNMT-py toolkit.

''In progress.''

== iwslt18_helsinki-euen-marian ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on data from the IWSLT18 low-resource translation task on Basque-to-English
* using the preprocessed and augmented datasets from the University of Helsinki submission
* with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission.
The goals of this example are twofold:
* to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
* Adapt paths if necessary.
* Run the script `1_train.sh`, then `2_test.sh`. The `validate.sh` script is automatically called during training and does not have to be run separately. The `composeXML.py` script is automatically called during testing and does not have to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
* The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
* Adapt the WORKDIR path in `2_test.sh` and run the script.
* The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
* The script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019

Translation/models

2019-05-09T11:55:23Z

Yvessche: /* wmt18-fien-onmt */

= MT example scripts and pretrained models =

The models and scripts are located at <tt>/proj*/nlpl/data/translation/pretrained-models/</tt>

== wmt18_helsinki-enfi-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18_helsinki-enfi-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on WMT18 news data preprocessed by the University of Helsinki
* translating from English to Finnish
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== wmt18-fien-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [http://www.statmt.org/moses/?n=Moses.Baseline Moses tutorial] which has additional information. The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
* The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

=== Use the pre-trained model to translate unseen text ===

* Copy the `6_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `6_test.sh` and run the script.
* The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

== wmt18-fien-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on the raw versions of the WMT18 news data
* translating from Finnish to English
* using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions.
The goals of this example are twofold:
* to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
* Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
* Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
* The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
* The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `4_test.sh` script to your own working directory.
* Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
* Adapt the WORKDIR path in `4_test.sh` and run the script.
* The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

== opus-noen-moses ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on parallel data from the OPUS repository
* translating from Norwegian to English
* using the Moses toolkit.

''In progress.''

== opus-noen-onmt ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on parallel data from the OPUS repository
* translating from Norwegian to English
* using the OpenNMT-py toolkit.

''In progress.''

== iwslt18_helsinki-euen-marian ==

This directory contains training scripts and the resulting model files for a translation system:
* trained on data from the IWSLT18 low-resource translation task on Basque-to-English
* using the preprocessed and augmented datasets from the University of Helsinki submission
* with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission.
The goals of this example are twofold:
* to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
* to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.

=== Retrain a new model using the provided scripts ===

* Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
* Adapt paths if necessary.
* Run the script `1_train.sh`, then `2_test.sh`. The `validate.sh` script is called automatically during training, and the `composeXML.py` script is called automatically during testing; neither has to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
* The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

=== Use the pre-trained model to translate unseen text ===

* Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
* Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
* Adapt the WORKDIR path in `2_test.sh` and run the script.
* The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
* By default, the script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019

Translation/models

2019-05-09T11:54:18Z

Yvessche: /* wmt18_helsinki-enfi-onmt */

Translation/home

2019-02-03T13:52:37Z

Yvessche:

= Background =

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
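
The <code>/proj[ects]/</code> notation above stands for <code>/proj/</code> on Taito and <code>/projects/</code> on Abel. A minimal sketch of how a script might pick whichever location exists on the current system; the function name <code>nlpl_data_dir</code> and the option of passing alternative roots as extra arguments are illustrative assumptions, not part of the NLPL installation.

```shell
# Print the path of a translation dataset under the first NLPL root
# that contains it.
# $1: dataset name (e.g. iwslt18); further arguments: candidate roots.
# Passing roots explicitly is a hypothetical convenience for testing.
nlpl_data_dir() {
    dataset=$1
    shift
    # Fall back to the Taito and Abel roots when none are given.
    [ "$#" -gt 0 ] || set -- /proj/nlpl /projects/nlpl
    for root in "$@"; do
        dir="$root/data/translation/$dataset"
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    echo "dataset '$dataset' not found" >&2
    return 1
}
```

A job script could then use <code>DATA=$(nlpl_data_dir iwslt18) || exit 1</code> and run unchanged on both systems.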

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
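
The training and testing steps above can be chained so that <code>test.sh</code> starts only after <code>train.sh</code> has finished successfully. A minimal sketch, assuming it is run from the directory holding the copied scripts; the function name <code>submit_train_then_test</code> and the <code>SBATCH</code> override variable are illustrative and not part of the example scripts.

```shell
# Submit train.sh, then submit test.sh with a dependency on it.
# SBATCH may be set to another command (e.g. for a dry run);
# by default the real sbatch is used.
submit_train_then_test() {
    sbatch_cmd=${SBATCH:-sbatch}
    # --parsable makes sbatch print the job ID, optionally followed
    # by ";" and the cluster name.
    train_id=$("$sbatch_cmd" --parsable train.sh) || return 1
    train_id=${train_id%%;*}
    # afterok: start the test job only if training exits with status 0.
    "$sbatch_cmd" --dependency=afterok:"$train_id" test.sh
}
```

If training fails or hits its time limit, the dependent test job never starts and may have to be cancelled with <code>scancel</code>.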

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
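
The <code>srun</code> prefix from item 1 and the <code>THEANO_FLAGS</code> adjustment from item 4 can be combined in one SLURM job script. The sketch below writes such a script to a file. The function name <code>write_hnmt_job</code>, the job name, the <code>gpu</code> partition, the single-GPU request and the one-hour limit are assumptions to be adapted; the options for <code>hnmt.py</code> are deliberately not filled in and are passed through from the command line.

```shell
# Write a SLURM job script that launches hnmt.py through srun with a
# reduced Theano pre-allocation.
# $1: path of the job script to create.
write_hnmt_job() {
    jobfile=$1
    cat > "$jobfile" <<'EOF'
#!/bin/bash
# Assumed resource settings: adjust partition, GPU count and time.
#SBATCH --job-name=hnmt-test
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00

module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt

# Append to THEANO_FLAGS rather than overwriting it.
export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8

# hnmt.py has to be prefixed by srun; its options are taken from the
# arguments given to this job script.
srun hnmt.py "$@"
EOF
}
```

After <code>write_hnmt_job hnmt_job.sh</code>, the job is submitted with <code>sbatch hnmt_job.sh</code> followed by the <code>hnmt.py</code> options, for instance those used in the example <code>train.sh</code>.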

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-02-03T13:52:12Z

Yvessche: /* General scripts for machine translation */

= Background =

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19/3/2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
** First installed on 23/12/2018 on Taito and Abel.
** See [[Translation/mttools|the mttools page]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but for anything serious CUDA is required, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats it, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.</li>
<li>Training: The second script is <code>train.sh</code> and calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except training time, which is kept low for testing purposes here (we tend to max out the Taito limits with 71h of training time...).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained on [https://github.com/robertostling/hnmt#log-files].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run on the login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training - the program continues running correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

= Using the mttools module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /proj*/nlpl/software/modulefiles/
module load nlpl-mttools</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''</li>
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: 413ba6b</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
<li>'''sacremoses'''</li>
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.5</li>
</ul>
<li>'''subword-nmt'''</li>
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.6</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
<li>'''sentencepiece'''</li>
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.6</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
<li>'''sacreBLEU'''</li>
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.2.12</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
<li>'''scoring'''</li>
<ul>
<li>Script that makes it easy to score machine translation output using NIST's BLEU and NIST, TER, and METEOR, by Ken Heafield</li>
<li>Source: https://kheafield.com/code/scoring.tar.gz</li>
<li>Installed version: Sept 19, 2012</li>
<li>The <code>score.rb</code> script is in PATH</li>
</ul>
</ul>

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/mttools

2019-02-03T13:51:43Z

Yvessche:

== Using the mttools module ==

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /proj*/nlpl/software/modulefiles/
module load nlpl-mttools</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''</li>
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: 413ba6b</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
<li>'''sacremoses'''</li>
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.5</li>
</ul>
<li>'''subword-nmt'''</li>
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.6</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
<li>'''sentencepiece'''</li>
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.6</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
<li>'''sacreBLEU'''</li>
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.2.12</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
<li>'''scoring'''</li>
<ul>
<li>Script that makes it easy to score machine translation output using NIST's BLEU and NIST, TER, and METEOR, by Ken Heafield</li>
<li>Source: https://kheafield.com/code/scoring.tar.gz</li>
<li>Installed version: Sept 19, 2012</li>
<li>The <code>score.rb</code> script is in PATH</li>
</ul>
</ul>
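
As an illustration of how the tools in PATH fit together, the following sketch learns a BPE model with <code>subword-nmt</code> and applies it to a second file. It assumes that <code>nlpl-mttools</code> is loaded and that both input files are already tokenized; the function name <code>bpe_encode</code> is hypothetical, and the number of merge operations is left to the caller.

```shell
# Learn BPE merge operations on a training file and apply them to
# another file, writing the segmented text to standard output.
# $1: training text, $2: number of merge operations, $3: text to encode
bpe_encode() {
    codes=$(mktemp) || return 1
    # learn-bpe reads text on stdin and writes the merge table to stdout.
    subword-nmt learn-bpe -s "$2" < "$1" > "$codes" || { rm -f "$codes"; return 1; }
    # apply-bpe segments stdin using the merge table given with -c.
    subword-nmt apply-bpe -c "$codes" < "$3"
    status=$?
    rm -f "$codes"
    return "$status"
}
```

In practice the merge table is usually kept and reused for the development and test data rather than stored in a temporary file, so that all files are segmented consistently.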

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/mttools

2019-02-03T13:51:19Z

Yvessche: Created page with "= Using the mttools module = <ul> <li>Log into Taito or Abel</li> <li>Activate the NLPL software repository and load the module: <pre>module use -a /proj*/nlpl/software/modul..."

= Using the mttools module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /proj*/nlpl/software/modulefiles/
module load nlpl-mttools</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: 413ba6b</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
</li>
<li>'''sacremoses'''
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.5</li>
</ul>
</li>
<li>'''subword-nmt'''
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.6</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
</li>
<li>'''sentencepiece'''
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.6</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
</li>
<li>'''sacreBLEU'''
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.2.12</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
</li>
<li>'''scoring'''
<ul>
<li>Script that makes it easy to score machine translation output using NIST's BLEU and NIST, TER, and METEOR, by Ken Heafield</li>
<li>Source: https://kheafield.com/code/scoring.tar.gz</li>
<li>Installed version: Sept 19, 2012</li>
<li>The <code>score.rb</code> script is in PATH</li>
</ul>
</li>
</ul>
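
As an illustration of how the <code>subword-nmt</code> executable listed above might be used, the following is a minimal sketch that learns a BPE model on a training file and applies it to several files. The function name <code>bpe_segment</code> and the <code>.bpe</code> output suffix are assumptions made for this example; only the <code>learn-bpe</code> and <code>apply-bpe</code> subcommands with their <code>-s</code> and <code>-c</code> options come from subword-nmt itself.

```shell
# Sketch (hypothetical helper, not part of the module): learn a BPE model on
# a training file and apply it to any number of files, writing FILE.bpe next
# to each input.
# Usage: bpe_segment NUM_MERGES CODES_FILE TRAIN_FILE [MORE_FILES...]
bpe_segment() (
    merges=$1
    codes=$2
    train=$3
    shift 2                      # "$@" is now TRAIN_FILE plus any extra files
    subword-nmt learn-bpe -s "$merges" < "$train" > "$codes" || exit 1
    for f in "$@"; do
        subword-nmt apply-bpe -c "$codes" < "$f" > "$f.bpe" || exit 1
    done
)
```

After <code>module load nlpl-mttools</code>, a call such as <code>bpe_segment 32000 codes.bpe train.en dev.en</code> would be expected to produce <code>train.en.bpe</code> and <code>dev.en.bpe</code>; the number of merge operations and the file names are placeholders.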

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/home

2019-02-03T13:50:45Z

Yvessche: /* General scripts for machine translation */

= Background =

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)
is maintained for NLPL under the coordination of the University of Helsinki (UoH).
Initially, the software and data are commissioned on the Finnish Taito supercluster.

= Available software and data =

=== Statistical machine translation and word alignment ===

* Moses SMT pipeline, including the word alignment tools GIZA++, MGIZA and fast_align, the IRSTLM language model toolkit, and SALM:
** Release 4.0, installed on Abel and Taito as <code>nlpl-moses/4.0-65c75ff</code> ([[#Using the Moses module|usage notes below]])
** Release mmt-mvp-v0.12.1, installed on Taito as <code>nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb</code> (not recommended)
* Additional word alignment tools efmaral and eflomal:
** Most recent version <code>nlpl-efmaral/0.1_2018_12_17</code> (Abel) or <code>nlpl-efmaral/0.1_2018_12_13</code> (Taito) ([[#Using the Efmaral module|usage notes below]])
** Previous version <code>nlpl-efmaral/0.1_2017_11_24</code>, installed on Abel and Taito
** Previous version <code>nlpl-efmaral/0.1_2017_07_20</code>, installed on Taito (not recommended)

=== Neural machine translation ===

* HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU. [[#Using the HNMT module|Usage notes below.]]
** Release 1.0.1 from https://github.com/robertostling/hnmt installed as <code>nlpl-hnmt/1.0.1</code>
** Installation updated on 19 March 2018
* Marian is installed on Taito-GPU. [[#Using the Marian module|Usage notes below.]]
** Release 1.2.0 from https://github.com/marian-nmt/marian installed as <code>nlpl-marian/1.2.0</code>
* OpenNMT-py is installed on Taito and Abel. [[Translation/opennmt-py|Details]]

=== General scripts for machine translation ===

* The ''nlpl-mttools'' module provides a collection of preprocessing and evaluation scripts that are useful for any kind of machine translation research, independent of the toolkit.
** First installed on 23 December 2018 on Taito and Abel.
** See [[#Using the mttools module|below]] and [[Translation/mttools|here]] for further details.

=== Datasets ===

<ul>
<li> IWSLT17 parallel data (0.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt17</pre>
</li>
<li> WMT17 news task parallel data (16G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news</pre>
</li>
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt17news_helsinki</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18</pre>
</li>
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/iwslt18_helsinki</pre>
</li>
<li> WMT18 news task parallel data (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news</pre>
</li>
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel): 
<pre>/proj[ects]/nlpl/data/translation/wmt18news_helsinki</pre>
</li>
</ul>
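
The <code>/proj[ects]/</code> notation above is shorthand for <code>/proj/</code> on Taito and <code>/projects/</code> on Abel; it is not meant as a shell pattern. A script that should run on both systems could resolve the dataset directory along the following lines. This is only a sketch: the function name <code>nlpl_data_dir</code> and the <code>NLPL_ROOTS</code> override variable are hypothetical and not part of the NLPL installation.

```shell
# Hypothetical helper: print the first existing dataset directory among the
# known project roots (/proj/nlpl on Taito, /projects/nlpl on Abel).
# NLPL_ROOTS may be set to a space-separated list to override the candidates.
nlpl_data_dir() (
    dataset=$1
    for root in ${NLPL_ROOTS:-/proj/nlpl /projects/nlpl}; do
        dir="$root/data/translation/$dataset"
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            exit 0
        fi
    done
    echo "dataset '$dataset' not found" >&2
    exit 1
)
```

For example, <code>WMT17=$(nlpl_data_dir wmt17news) || exit 1</code> stores the directory in a variable and aborts the calling script if the data cannot be found.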

=== Models ===

See [[Translation/models|this page]] for details.

= Using the Moses module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Moses module:
<pre>module load nlpl-moses</pre>
</li>
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li>
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
<ul>
<li>cmph, irstlm, xmlrpc</li>
<li>with-mm</li>
<li>max-kenlm-order 10</li>
<li>max-factors 7</li>
<li>SALM + filter-pt</li>
</ul></li>
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
<pre>module help nlpl-moses</pre>
</li>
</ul>

= Using the Efmaral module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL module repository:
<pre>module use -a /proj/nlpl/software/modulefiles/ # Taito
module use -a /projects/nlpl/software/modulefiles/ # Abel</pre>
</li>
<li>Load the most recent version of the Efmaral module:
<pre>
module load nlpl-efmaral
</pre>
</li>
<li>You can use the align.py script directly:
<pre>align.py ...</pre>
</li>
<li>You can use the efmaral module inside a Python3 script:
<pre>python3
>>> import efmaral</pre>
</li>
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
<pre>cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
3rdparty/data/test.eng.hin.wa \
3rdparty/data/test.eng 3rdparty/data/test.hin \
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre>
</li>
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
<pre>align_eflomal.py ...</pre>
</li>
<li>You can also use the eflomal executable:
<pre>eflomal ...</pre>
</li>
<li>You can also use the eflomal module in a Python3 script:
<pre>python3
>>> import eflomal</pre>
</li>
<li>The atools executable (from fast_align) is also made available.</li>
</ul>
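
In batch scripts it can be useful to fail early if the module has not been loaded. The following sketch checks that the executables named above are on PATH; the helper name <code>require_cmds</code> is an assumption made for this example.

```shell
# Hypothetical helper: report every listed command that is not on PATH.
# Returns 0 only if all commands were found.
require_cmds() (
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            missing=1
        fi
    done
    exit "$missing"
)

# Intended use after 'module load nlpl-efmaral' (names taken from the list above):
#   require_cmds align.py align_eflomal.py eflomal atools || exit 1
```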

= Using the HNMT module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The HNMT module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-hnmt</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-hnmt</pre>
</li>
<li>The main HNMT script can be called directly on the command line (<code>hnmt.py</code>), but any non-trivial run requires CUDA, which is only available from within SLURM scripts.</li>
<li>Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below.</li>
</ul>

== Example scripts ==

The directory <code>/proj/nlpl/data/translation/hnmt_examples</code> contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

<ol>
<li>Data preparation: The first script to launch is <code>prepare.sh</code>. It fetches the training, development and test data, extracts and reformats them, and calls the <code>make_encode.py</code> script to create vocabulary files for the source and target languages. This script runs quickly and can be executed directly in a (Taito-GPU) login shell.</li>
<li>Training: The second script, <code>train.sh</code>, calls <code>hnmt.py</code> to train a model. Launch it with <code>sbatch train.sh</code>. The parameters are fairly standard, except for the training time, which is kept low here for testing purposes (for real experiments, we tend to max out the Taito limits with 71h of training time).
<ul>
<li>The <code>training.*.out</code> file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process: 
<pre>SOURCE / TARGET / OUTPUT
at least for the time being , all of them will continue working at their current sites .
ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
ainakin kaikki ne tekevät työtä tällä hetkellä .</pre>
</li>
<li> The <code>training.log</code> and <code>training.log.eval</code> files report additional information, as explained in the [https://github.com/robertostling/hnmt#log-files HNMT documentation].</li>
<li> The training process creates a <code>train.model.final</code> file, which is then used for testing.</li>
</ul></li>
<li>Testing: The last script is <code>test.sh</code> and calls <code>hnmt.py</code> to test the previously created model on held-out data. Launch it with <code>sbatch test.sh</code>. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
<ul>
<li>The resulting translations are written to <code>test.trans</code>.</li>
<li>In the <code>test.*.out</code> file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
<pre>BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
chrF = 0.310397 (precision = 0.355720, recall = 0.306064)</pre>
</li>
</ul></li>
</ol>
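
To compare several runs, the scores can be pulled out of the job output files. The sketch below assumes the three-line score format shown above; the helper name <code>hnmt_scores</code> is hypothetical and not part of HNMT.

```shell
# Hypothetical helper: print "metric<TAB>score" for each score line written
# in the format shown above. Reads the files given as arguments, or standard
# input if none are given.
hnmt_scores() {
    awk '
        /^LC BLEU = / { printf "LC BLEU\t%s\n", $4; next }
        /^BLEU = /    { printf "BLEU\t%s\n", $3; next }
        /^chrF = /    { printf "chrF\t%s\n", $3; next }
    ' "$@"
}
```

It could be called as <code>hnmt_scores test.*.out</code> in the directory that contains the job output.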

== Troubleshooting ==

<ol>
<li>
<pre>Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(784).....:
MPID_Init(1326)...........: channel initialization failed
MPIDI_CH3_Init(120).......:
MPID_nem_init_ckpt(852)...:
MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1</pre>
⇒ Even when using a SLURM script, the HNMT command has to be prefixed by <code>srun</code>: <code>srun hnmt.py ...</code>
</li>
<li>
<pre>ERROR (theano.gpuarray): Could not initialize pygpu, support disabled</pre>
⇒ HNMT does not run in a login shell; run it through a SLURM script instead.
</li>
<li>
<pre>ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
...
theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node</pre>
⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training; the program continues to run correctly.</li>
<li>
<pre>pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'</pre>
⇒ This error can be prevented by lowering the fraction of GPU memory that is pre-allocated (default is 0.9). Take care not to overwrite the existing content of the THEANO_FLAGS variable: <code>export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8</code>
</li>
</ol>
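
The <code>export</code> line in the last item leaves a leading comma in the value when <code>THEANO_FLAGS</code> is empty or unset. A slightly more defensive variant is sketched below; the function name <code>add_theano_flag</code> is an assumption made for this example.

```shell
# Hypothetical helper: append one key=value setting to THEANO_FLAGS,
# inserting the comma separator only when the variable is already non-empty.
add_theano_flag() {
    THEANO_FLAGS="${THEANO_FLAGS:+$THEANO_FLAGS,}$1"
    export THEANO_FLAGS
}

add_theano_flag gpuarray.preallocate=0.8
```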

= Using the Marian module =

<ul>
<li>Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)</li>
<li>The Marian module can be loaded by activating the NLPL software repository:
<pre>module use -a /proj/nlpl/software/modulefiles/
module load nlpl-marian</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-marian</pre>
</li>
<li>Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
<pre>module load marian</pre>
</li>
<li>The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.</li>
<li>Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.</li>
</ul>

== Example scripts ==

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
<pre>cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples</pre>

<ul>
<li>Training-basics: Launch the script with <code>sbatch run-me.sh</code>.</li>
<li>Transformer: Launch the script with <code>sbatch run-me.sh</code>. Note that the script's run time is limited to 24h, which is not enough to complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!</li>
<li>Translating-amun: Launch the script with <code>sbatch run-me.sh</code>.</li>
</ul>

= Using the mttools module =

<ul>
<li>Log into Taito or Abel</li>
<li>Activate the NLPL software repository and load the module:
<pre>module use -a /proj*/nlpl/software/modulefiles/
module load nlpl-mttools</pre>
</li>
<li>Module-specific help is available by typing:
<pre>module help nlpl-mttools</pre>
</li>
</ul>

The following scripts are part of this module:
<ul>
<li>'''moses-scripts'''
<ul>
<li>Tokenization, casing, corpus cleaning and evaluation scripts from Moses</li>
<li>Source: https://github.com/moses-smt/mosesdecoder (scripts directory)</li>
<li>Installed revision: 413ba6b</li>
<li>The subfolders <code>generic</code>, <code>recaser</code>, <code>tokenizer</code>, <code>training</code> are in PATH</li>
</ul>
</li>
<li>'''sacremoses'''
<ul>
<li>Python port of Moses tokenizer and truecaser</li>
<li>Source: https://github.com/alvations/sacremoses</li>
<li>Installed version: 0.0.5</li>
</ul>
</li>
<li>'''subword-nmt'''
<ul>
<li>Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation</li>
<li>Source: https://github.com/rsennrich/subword-nmt</li>
<li>Installed version: 0.3.6</li>
<li>The <code>subword-nmt</code> executable is in PATH</li>
</ul>
</li>
<li>'''sentencepiece'''
<ul>
<li>Unsupervised text tokenizer for Neural Network-based text generation</li>
<li>Source: https://github.com/google/sentencepiece</li>
<li>Installed version: 0.1.6</li>
<li>The <code>spm_*</code> executables are in PATH</li>
</ul>
</li>
<li>'''sacreBLEU'''
<ul>
<li>Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons</li>
<li>Source: https://github.com/mjpost/sacreBLEU</li>
<li>Installed version: 1.2.12</li>
<li>The <code>sacrebleu</code> executable is in PATH</li>
</ul>
</li>
<li>'''scoring'''
<ul>
<li>Script that makes it easy to score machine translation output using NIST's BLEU and NIST, TER, and METEOR, by Ken Heafield</li>
<li>Source: https://kheafield.com/code/scoring.tar.gz</li>
<li>Installed version: Sept 19, 2012</li>
<li>The <code>score.rb</code> script is in PATH</li>
</ul>
</li>
</ul>
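
As an illustration of how two of the tools listed above might be combined for evaluation, the following sketch detokenizes system output with the Moses <code>detokenizer.perl</code> script (from the <code>tokenizer</code> subfolder) and scores it against a raw reference with <code>sacrebleu</code>. The function name <code>score_detok</code> and the argument order are assumptions made for this example.

```shell
# Sketch (hypothetical helper, not part of the module): detokenize system
# output with the Moses detokenizer, then score it against a raw
# (untokenized) reference with sacreBLEU.
# Usage: score_detok HYP_TOKENIZED REFERENCE LANG
score_detok() (
    hyp=$1
    ref=$2
    lang=$3
    detokenizer.perl -l "$lang" < "$hyp" | sacrebleu "$ref"
)
```

After <code>module load nlpl-mttools</code>, a call such as <code>score_detok test.trans.tok test.ref.fi fi</code> would print the sacreBLEU score line together with its version string; the file names are placeholders.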

'''Contact:'''
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi

Translation/models

2019-02-03T13:49:07Z

Yvessche:

Translation/models

2019-02-03T13:47:42Z

Yvessche:

Translation/models

2019-02-03T13:46:37Z

Yvessche: