Difference between revisions of "Translation/home"

From Nordic Language Processing Laboratory
(62 intermediate revisions by 2 users not shown)

Latest revision as of 11:54, 21 October 2022

Background

Translation activity on the Taito and Abel servers (outdated)

An experimentation environment for statistical and neural machine translation (SMT and NMT) is maintained for NLPL under the coordination of the University of Helsinki (UoH). The software and data are commissioned on two superclusters: Puhti in Finland and Saga in Norway.

Available software and data

Statistical machine translation and word alignment

  • The Moses SMT pipeline (release 4.0), with the word alignment tools GIZA++, MGIZA and fast_align and with SALM, is installed on Puhti and Saga as nlpl-moses/4.0-a89691f (usage notes below). Note: the most recent version on Puhti (as of Oct 2022) is nlpl-moses/4.0.1-3990724.
  • The word alignment tools efmaral and eflomal are installed on Puhti and Saga in the nlpl-efmaral module: nlpl-efmaral/0.1_20191218 (usage notes below). Note: the most recent version on Puhti (as of Oct 2022) is nlpl-efmaral/1.0.1_20221015.

Neural machine translation

  • Marian-NMT is installed on Puhti and Saga as nlpl-marian-nmt/1.8.0-eba7aed. Usage notes below.
  • OpenNMT-py is installed on Saga using NLPL-internal PyTorch: nlpl-opennmt-py/1.0.0rc2/3.7.
  • OpenNMT-py is installed on Puhti using system-wide PyTorch: nlpl-opennmt-py/2.3.0.

General scripts for machine translation

  • The nlpl-mttools module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: nlpl-mttools/20191218. See the mttools page for further details. Note: the most recent version on Puhti (as of Oct 2022) is nlpl-mttools/20221015.

Tools for processing parallel corpora (OPUS tools)

  • The bundle of OPUS tools is installed on Puhti and Saga in the nlpl-opus module. Usage notes below.
  • Uplug is installed in the nlpl-uplug module.
  • UDPipe is installed in the nlpl-udpipe module.
  • Corpus Workbench (CWB) is installed in the nlpl-cwb module.

Datasets

On Puhti, the $NLPL project directory is located at /projappl/nlpl. On Saga, the $NLPL project directory is located at /cluster/shared/nlpl/.

  • IWSLT17 parallel data (0.6G, on Puhti and Saga):
    $NLPL/data/translation/iwslt17
  • WMT17 news task parallel data (16G, on Puhti and Saga):
    $NLPL/data/translation/wmt17news
  • WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):
    $NLPL/data/translation/wmt17news_helsinki
  • IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):
    $NLPL/data/translation/iwslt18
  • IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):
    $NLPL/data/translation/iwslt18_helsinki
  • WMT18 news task parallel data (17G, on Puhti and Saga):
    $NLPL/data/translation/wmt18news
  • WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):
    $NLPL/data/translation/wmt18news_helsinki
  • WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):
    $NLPL/data/translation/wmt18news_helsinki
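
The dataset paths above are written relative to $NLPL. If that variable is not already set in your shell, it could be derived from whichever project directory exists on the current cluster. The following is a minimal sketch under that assumption; first_existing_dir is a hypothetical helper introduced here for illustration and is not part of the NLPL installation.

```shell
# Sketch only: first_existing_dir is a hypothetical helper, not part of NLPL.
# Print the first argument that is an existing directory; fail if none is.
first_existing_dir() {
    for candidate in "$@"; do
        if [ -d "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Project directories as given above: Puhti first, then Saga.
if NLPL=$(first_existing_dir /projappl/nlpl /cluster/shared/nlpl); then
    export NLPL
    ls "$NLPL/data/translation"
else
    echo "No NLPL project directory found (not on Puhti or Saga?)" >&2
fi
```

Checking for the directory rather than the host name keeps the snippet independent of how login and compute nodes are named.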

Models

See this page for details.

Using the Moses module

  • Activate the NLPL module repository:
    module use -a /projappl/nlpl/software/modules/etc         # Puhti
    module use -a /cluster/shared/nlpl/software/modules/etc   # Saga
  • Load the most recent version of the Moses module:
    module load nlpl-moses
  • Start using Moses, e.g. using the tutorial at http://statmt.org/moses/
  • The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
    • cmph, xmlrpc
    • with-mm
    • max-kenlm-order 10
    • max-factors 7
    • SALM + filter-pt
  • For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a separate module.)
    If you need absolute paths in your scripts, they are listed on the module's help page:
    module help nlpl-moses
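
For reproducible experiments it can help to load a specific version rather than the default. The sketch below puts the steps above into one script and pins the version listed earlier on this page as installed on both clusters. The variable names NLPL_MODULES and MOSES_MODULE are assumptions made for illustration, and the guard only checks that a module command exists in the current shell.

```shell
# Sketch: activate the NLPL repository and load a pinned Moses version.
# Repository path for Puhti; on Saga use /cluster/shared/nlpl/software/modules/etc
NLPL_MODULES=/projappl/nlpl/software/modules/etc
MOSES_MODULE=nlpl-moses/4.0-a89691f   # version listed above for Puhti and Saga

if command -v module >/dev/null 2>&1; then
    module use -a "$NLPL_MODULES"   # make the NLPL modules visible
    module load "$MOSES_MODULE"     # load the pinned version
    module list                     # confirm what is loaded
else
    echo "The 'module' command is not available in this shell." >&2
fi
```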

Using the Efmaral module

  • Activate the NLPL module repository:
    module use -a /projappl/nlpl/software/modules/etc         # Puhti
    module use -a /cluster/shared/nlpl/software/modules/etc   # Saga
  • Load the most recent version of the Efmaral module:
    module load nlpl-efmaral
    
  • You can use the align.py script directly:
    align.py ...
  • You can use the efmaral module inside a Python3 script:
    python3
    >>> import efmaral
  • You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
    cd $EFMARALPATH
    python3 scripts/evaluate.py efmaral \
        3rdparty/data/test.eng.hin.wa \
        3rdparty/data/test.eng 3rdparty/data/test.hin \
        3rdparty/data/trial.eng 3rdparty/data/trial.hin
  • The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
    align_eflomal.py ...
  • You can also use the eflomal executable:
    eflomal ...
  • You can also use the eflomal module in a Python3 script:
    python3
    >>> import eflomal
  • The atools executable (from fast_align) is also made available.
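
After loading the module, a quick way to confirm that the commands named above are reachable is to look them up on PATH. This is a minimal sketch; report_tools is a hypothetical helper written for this page, not something shipped with the module.

```shell
# Sketch only: report_tools is a hypothetical helper, not part of the module.
# Print where each named command resolves; return non-zero if any is missing.
report_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            printf '%-18s %s\n' "$tool" "$path"
        else
            printf '%-18s NOT FOUND\n' "$tool"
            missing=1
        fi
    done
    return "$missing"
}

# Command names taken from the list above.
report_tools align.py align_eflomal.py eflomal atools ||
    echo "Some tools are missing; has nlpl-efmaral been loaded?" >&2
```

The same helper can be reused for other modules by passing different command names.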

Using the OPUS Tools module

  • Activate the NLPL module repository:
    module use -a /projappl/nlpl/software/modules/etc         # Puhti
    module use -a /cluster/shared/nlpl/software/modules/etc   # Saga
  • Load the OPUS tools module:
    module load nlpl-opus
    
  • You can also load the CWB, Uplug and UDPipe modules:
    module load nlpl-cwb
    module load nlpl-uplug
    module load nlpl-udpipe


Contact: Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi