Translation/home

Background

An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT) is maintained for NLPL under the coordination of the University of Helsinki (UoH). Initially, the software and data are commissioned on the Finnish Taito supercluster.


Available software and data

Statistical machine translation and word alignment

  • Moses SMT pipeline with the word alignment tools GIZA++, MGIZA and fast_align, the IRSTLM language model, and SALM:
    • Release 4.0, installed on Abel and Taito as nlpl-moses/4.0-65c75ff (usage notes below)
    • Release mmt-mvp-v0.12.1, installed on Taito as nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb (not recommended)
  • Additional word alignment tools efmaral and eflomal:
    • Most recent version nlpl-efmaral/0.1_2018_12_17 (Abel) or nlpl-efmaral/0.1_2018_12_13 (Taito) (usage notes below)
    • Previous version nlpl-efmaral/0.1_2017_11_24, installed on Abel and Taito
    • Previous version nlpl-efmaral/0.1_2017_07_20, installed on Taito (not recommended)

Neural machine translation

  • HNMT (Helsinki Neural Machine Translation System) is installed on Taito-GPU (usage notes below):
    • Release 1.0.1 from https://github.com/robertostling/hnmt installed as nlpl-hnmt/1.0.1
    • Installation updated on 19/3/2018
  • Marian is installed on Taito-GPU (usage notes below):
    • Release 1.2.0 from https://github.com/marian-nmt/marian installed as nlpl-marian/1.2.0

General scripts for machine translation

  • The nlpl-mttools module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit.
    • First installed on 23/12/2018 on Taito and Abel.
    • See below for further details.

Datasets

  • IWSLT17 parallel data (0.6G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/iwslt17
  • WMT17 news task parallel data (16G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/wmt17news
  • WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/wmt17news_helsinki
  • IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/iwslt18
  • IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/iwslt18_helsinki
  • WMT18 news task parallel data (17G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/wmt18news
  • WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Taito and Abel):
    /proj[ects]/nlpl/data/translation/wmt18news_helsinki
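
The bracketed prefix in these paths stands for the cluster-specific project root: /proj/... on Taito and /projects/... on Abel. For example, to inspect the IWSLT17 data (a minimal sketch; the exact file listing depends on the release):

  ls /proj/nlpl/data/translation/iwslt17        # Taito
  ls /projects/nlpl/data/translation/iwslt17    # Abel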

Models

  • Coming up (Helsinki WMT2017 models, pretrained Edinburgh SMT models, ...)

Using the Moses module

  • Log into Taito or Abel
  • Activate the NLPL module repository:
    module use -a /proj/nlpl/software/modulefiles/       # Taito
    module use -a /projects/nlpl/software/modulefiles/   # Abel
  • Load the most recent version of the Moses module:
    module load nlpl-moses
  • Start using Moses, e.g. by following the tutorial at http://statmt.org/moses/
  • The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
    • cmph, irstlm, xmlprc
    • with-mm
    • max-kenlm-order 10
    • max-factors 7
    • SALM + filter-pt
  • For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a separate module.)
    If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
    module help nlpl-moses
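
As a quick check that the module is set up correctly, the sketch below tokenizes a small corpus with the Moses tokenizer script. The file names are hypothetical, and whether tokenizer.perl can be called by name or needs the absolute path shown by the module help page depends on the installation:

  module load nlpl-moses
  module help nlpl-moses                              # shows the installation paths
  # assuming tokenizer.perl is reachable; otherwise prefix it with the Moses scripts path
  tokenizer.perl -l en < corpus.en > corpus.tok.en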

Using the Efmaral module

  • Log into Taito or Abel
  • Activate the NLPL module repository:
    module use -a /proj/nlpl/software/modulefiles/       # Taito
    module use -a /projects/nlpl/software/modulefiles/   # Abel
  • Load the most recent version of the Efmaral module:
    module load nlpl-efmaral
    
  • You can use the align.py script directly:
    align.py ...
  • You can use the efmaral module inside a Python3 script:
    python3
    >>> import efmaral
  • You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
    cd $EFMARALPATH
    python3 scripts/evaluate.py efmaral \
        3rdparty/data/test.eng.hin.wa \
        3rdparty/data/test.eng 3rdparty/data/test.hin \
        3rdparty/data/trial.eng 3rdparty/data/trial.hin
  • The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
    align_eflomal.py ...
  • You can also use the eflomal executable:
    eflomal ...
  • You can also use the eflomal module in a Python3 script:
    python3
    >>> import eflomal
  • The atools executable (from fast_align) is also made available.
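
For a first alignment run on your own data, the sketch below shows the general shape of an align.py call on a hypothetical parallel corpus (corpus.en / corpus.fi). The option names are illustrative and have not been checked against the installed version, so consult align.py --help first:

  module load nlpl-efmaral
  align.py --help                                   # confirm the actual options
  align.py -i corpus.en corpus.fi > corpus.align    # hypothetical invocation; '-i' and the
                                                    # output format are assumptions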

Using the HNMT module

  • Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)
  • The HNMT module can be loaded by activating the NLPL software repository:
    module use -a /proj/nlpl/software/modulefiles/
    module load nlpl-hnmt
  • Module-specific help is available by typing:
    module help nlpl-hnmt
  • The main HNMT script (hnmt.py) can be called directly on the command line, but anything serious requires CUDA, which is only available from within SLURM scripts.
  • Because model training and testing are rather resource-intensive, we recommend getting started with the example SLURM scripts, as explained below; a minimal SLURM wrapper is sketched right after this list.
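
For orientation, a minimal SLURM wrapper around hnmt.py might look like the sketch below. The partition name, GPU request and time limit are assumptions to be adapted to your Taito-GPU project, and the hnmt.py options are deliberately left out (the example scripts below contain complete invocations):

  #!/bin/bash
  #SBATCH --partition=gpu          # assumed Taito-GPU partition name
  #SBATCH --gres=gpu:1             # request one GPU
  #SBATCH --time=01:00:00
  module use -a /proj/nlpl/software/modulefiles/
  module load nlpl-hnmt
  srun hnmt.py ...                 # note the srun prefix (see Troubleshooting below)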


Example scripts

The directory /proj/nlpl/data/translation/hnmt_examples contains a set of SLURM scripts for training and testing a baseline English-to-Finnish HNMT system. Copy the scripts to your own working directory before trying them out.

  1. Data preparation: The first script to launch is prepare.sh. It fetches the training, development and test data, extracts and reformats it, and calls the make_encode.py script to create vocabulary files for the source and target languages. This script runs rather fast and can be executed directly on a (Taito-GPU) login shell.
  2. Training: The second script, train.sh, calls hnmt.py to train a model. Launch it with sbatch train.sh. The parameters are fairly standard, except for the training time, which is kept low here for testing purposes (we tend to max out the Taito limits with 71h of training time...).
    • The training.*.out file contains information about the training batches (training time and loss), and also shows translations of a small number of held-out sentences for examining the training process:
      SOURCE / TARGET / OUTPUT
      at least for the time being , all of them will continue working at their current sites .
      ainakin toistaiseksi he kaikki jatkavat töitään nykyisissä toimipaikoissaan .
      ainakin kaikki ne tekevät työtä tällä hetkellä .
    • The training.log and training.log.eval files report additional information, as explained at https://github.com/robertostling/hnmt#log-files.
    • The training process creates a train.model.final file, which is then used for testing.
  3. Testing: The last script, test.sh, calls hnmt.py to test the previously created model on held-out data. Launch it with sbatch test.sh. HNMT includes evaluation scripts for chrF and BLEU and will report these scores if a reference file is given.
    • The resulting translations are written to test.trans.
    • In the test.*.out file, you should obtain scores close to the following (depending on the neural network initialization and the GPU used, results may vary slightly):
      BLEU = 0.057750 (0.303002, 0.086025, 0.032001, 0.013334, BP = 1.000000)
      LC BLEU = 0.057913 (0.303527, 0.086283, 0.032093, 0.013383, BP = 1.000000)
      chrF = 0.310397 (precision = 0.355720, recall = 0.306064)
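
While these jobs run, the usual SLURM commands can be used to follow them (a small sketch; replace <jobid> with the actual job number printed by sbatch):

  squeue -u $USER                  # list your queued and running jobs
  tail -f training.*.out           # follow the training output described above
  scancel <jobid>                  # cancel a job if needed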


Troubleshooting

  1. Fatal error in PMPI_Init_thread: Other MPI error, error stack:
    MPIR_Init_thread(784).....:
    MPID_Init(1326)...........: channel initialization failed
    MPIDI_CH3_Init(120).......:
    MPID_nem_init_ckpt(852)...:
    MPIDI_CH3I_Seg_commit(307): PMI_Barrier returned -1

    ⇒ Even when using a SLURM script, the HNMT command has to be prefixed by srun: srun hnmt.py ...

  2. ERROR (theano.gpuarray): Could not initialize pygpu, support disabled

    ⇒ HNMT does not run on the login shell; try running it through a SLURM script instead.

  3. ERROR (theano.gof.opt): SeqOptimizer apply <theano.scan_module.scan_opt.PushOutScanOutput object at 0x7f7fa34fa7b8>
     ...
     theano.gof.fg.InconsistencyError: Trying to reintroduce a removed node
    ⇒ This message often occurs at the beginning of the training process and signals an optimization failure. It has no visible effect on training; the program continues running correctly.
  4. pygpu.gpuarray.GpuArrayException: b'cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory'

    ⇒ This error can be prevented by decreasing the amount of pre-allocation (default is 0.9). Make sure to avoid overwriting the existing content of the THEANO_FLAGS variable:
    export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8
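
Putting fixes 1 and 4 together, the relevant part of a job script would look like this sketch (0.8 is just the pre-allocation value from the example above):

  export THEANO_FLAGS="$THEANO_FLAGS",gpuarray.preallocate=0.8
  srun hnmt.py ...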

Using the Marian module

  • Log into Taito-GPU (Important: this module only runs on Taito-GPU, not on Taito!)
  • The Marian module can be loaded by activating the NLPL software repository:
    module use -a /proj/nlpl/software/modulefiles/
    module load nlpl-marian
  • Module-specific help is available by typing:
    module help nlpl-marian
  • Note: A more recent version of Marian has been installed system-wide and can be loaded in the following way:
    module load marian
  • The Marian executables can be called directly on the command line, but longer-running tasks should be run with SLURM scripts.
  • Marian comes with a couple of example scripts, which need to be adapted slightly for use on Taito. See below.


Example scripts

We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:

cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples
  • Training-basics: Launch the script with sbatch run-me.sh.
  • Transformer: Launch the script with sbatch run-me.sh. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!
  • Translating-amun: Launch the script with sbatch run-me.sh.
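
Outside the example scripts, Marian can also be invoked directly. The sketch below is a minimal training call with hypothetical file names; the options shown are standard Marian options, but check marian --help for the version installed here, and wrap the call in a SLURM script rather than running it on the login node:

  module load nlpl-marian
  marian --train-sets corpus.en corpus.fi \
         --vocabs vocab.en.yml vocab.fi.yml \
         --model model.npz --devices 0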


Using the mttools module

  • Log into Taito or Abel
  • Activate the NLPL software repository and load the module:
    module use -a /proj*/nlpl/software/modulefiles/
    module load nlpl-mttools
  • Module-specific help is available by typing:
    module help nlpl-mttools

The following scripts are part of this module:

  • moses-scripts
    • Tokenization, casing, corpus cleaning and evaluation scripts from Moses
    • Source: https://github.com/moses-smt/mosesdecoder (scripts directory)
    • Installed revision: 413ba6b
    • The subfolders generic, recaser, tokenizer, training are in PATH
  • sacremoses
    • Python port of Moses tokenizer and truecaser
    • Source: https://github.com/alvations/sacremoses
    • Installed version: 0.0.5
  • subword-nmt
    • Unsupervised Word Segmentation (a.k.a. Byte Pair Encoding) for Machine Translation and Text Generation
    • Source: https://github.com/rsennrich/subword-nmt
    • Installed version: 0.3.6
    • The subword-nmt executable is in PATH
  • sentencepiece
    • Unsupervised text tokenizer for Neural Network-based text generation
    • Source: https://github.com/google/sentencepiece
    • Installed version: 0.1.6
    • The spm_* executables are in PATH
  • sacreBLEU
    • Reference BLEU implementation that auto-downloads test sets and reports a version string to facilitate cross-lab comparisons
    • Source: https://github.com/mjpost/sacreBLEU
    • Installed version: 1.2.12
    • The sacrebleu executable is in PATH
  • scoring
    • Script by Ken Heafield that makes it easy to score machine translation output with NIST's BLEU and NIST scores, TER, and METEOR
    • Source: https://kheafield.com/code/scoring.tar.gz
    • Installed version: Sept 19, 2012
    • The score.rb script is in PATH
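
To illustrate how these tools fit together, here is a hedged sketch of a typical preprocessing and evaluation flow. File names and the number of BPE merge operations are hypothetical, and the commands assume the module has been loaded as described above:

  # tokenize with the Moses scripts
  tokenizer.perl -l en < train.en > train.tok.en
  # learn and apply a BPE segmentation with subword-nmt
  subword-nmt learn-bpe -s 10000 < train.tok.en > bpe.codes
  subword-nmt apply-bpe -c bpe.codes < train.tok.en > train.bpe.en
  # score detokenized system output against a reference with sacreBLEU
  sacrebleu reference.detok.fi < output.detok.fi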


Contact: Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi