Background
Translation activity on the Taito and Abel servers (outdated)
This page is currently being updated (YS 16.12.2019)
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT) is maintained for NLPL under the coordination of the University of Helsinki (UoH). The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.
Available software and data
Statistical machine translation and word alignment
- The Moses SMT pipeline, with the word alignment tools GIZA++, MGIZA and fast_align, and with SALM (release 4.0), is installed on Puhti and Saga: nlpl-moses/4.0-a89691f (usage notes below).
- The word alignment tools efmaral and eflomal are installed on Puhti and Saga in the nlpl-efmaral module: nlpl-efmaral/0.1_20191218 (usage notes below).
Neural machine translation
- Marian-NMT is installed on Puhti and Saga as nlpl-marian-nmt/1.8.0-eba7aed. Usage notes below.
- OpenNMT-py is installed on Saga using NLPL-internal PyTorch: nlpl-opennmt-py/1.0.0rc2/3.7.
- OpenNMT-py is installed on Puhti using the system-wide PyTorch: nlpl-opennmt-py/1.0.0. A quick usage check is sketched below.
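As a quick check that OpenNMT-py is usable, here is a minimal sketch. It assumes the NLPL module repository has been activated (as described in the Moses section below) and that the module exposes OpenNMT-py's standard onmt_* console scripts; check with module help if in doubt.
# Minimal sketch on Saga; the onmt_* entry points are assumed to be on the PATH after loading the module.
module load nlpl-opennmt-py/1.0.0rc2/3.7
onmt_train -h        # show the available training options
onmt_translate -h    # show the available translation options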
General scripts for machine translation
- The nlpl-mttools module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: nlpl-mttools/20191218. See the mttools page for further details.
Datasets
On Puhti, the $NLPL project directory is located at /projappl/nlpl. On Saga, the $NLPL project directory is located at /cluster/shared/nlpl/.
- IWSLT17 parallel data (0.6G, on Puhti and Saga): $NLPL/data/translation/iwslt17
- WMT17 news task parallel data (16G, on Puhti and Saga): $NLPL/data/translation/wmt17news
- WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga): $NLPL/data/translation/wmt17news_helsinki
- IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga): $NLPL/data/translation/iwslt18
- IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga): $NLPL/data/translation/iwslt18_helsinki
- WMT18 news task parallel data (17G, on Puhti and Saga): $NLPL/data/translation/wmt18news
- WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga): $NLPL/data/translation/wmt18news_helsinki
- WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga): $NLPL/data/translation/wmt18news_helsinki
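In batch scripts it is convenient to refer to the paths above through a shell variable. A minimal sketch, assuming $NLPL is not already set in your environment, so it is set by hand to the project directory of your cluster:
# Minimal sketch: point $NLPL at the project directory and inspect a dataset.
NLPL=/projappl/nlpl                  # Puhti; on Saga use /cluster/shared/nlpl
ls $NLPL/data/translation/iwslt17    # list the IWSLT17 parallel data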
Models
See this page for details.
Using the Moses module
- Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc        # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc  # Saga
- Load the Moses module:
module load nlpl-moses/4.0-a89691f
- Start using Moses, e.g. following the tutorial at http://statmt.org/moses/ (a small tokenization example is sketched at the end of this section).
- The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted, with the following options:
  - cmph, xmlrpc
  - with-mm
  - max-kenlm-order 10
  - max-factors 7
  - SALM + filter-pt
- For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a separate module.)
- If you need to specify absolute paths in your scripts, you can find them on the help page of the module: module help nlpl-moses/4.0-a89691f
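As an illustration, here is a minimal sketch of a first preprocessing step from the Moses tutorial: tokenizing a parallel corpus. It assumes the module puts the Moses scripts on the PATH (otherwise use the absolute paths from module help); the corpus file names are placeholders.
# Minimal sketch: tokenize a (hypothetical) English-German corpus with the Moses tokenizer.
# If tokenizer.perl is not on the PATH, take the installation path from
# "module help nlpl-moses/4.0-a89691f".
module use -a /projappl/nlpl/software/modules/etc    # Puhti
module load nlpl-moses/4.0-a89691f
tokenizer.perl -l en < corpus.en > corpus.tok.en
tokenizer.perl -l de < corpus.de > corpus.tok.de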
Using the Efmaral module
- Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc        # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc  # Saga
- Load the Efmaral module:
module load nlpl-efmaral/0.1_20191218
- You can use the align.py script directly:
align.py ...
- You can use the efmaral module inside a Python3 script:
python3
>>> import efmaral
- You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
    3rdparty/data/test.eng.hin.wa \
    3rdparty/data/test.eng 3rdparty/data/test.hin \
    3rdparty/data/trial.eng 3rdparty/data/trial.hin
- The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
align_eflomal.py ...
- You can also use the eflomal executable:
eflomal ...
- You can also use the eflomal module in a Python3 script:
python3
>>> import eflomal
- The atools executable (from fast_align) is also made available.
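For longer alignment runs it is best to go through the batch system rather than the login node. Below is a minimal SLURM sketch that reruns the evaluation example above as a batch job; the account, time and memory values are placeholders to be adapted to your project.
#!/bin/bash
#SBATCH --account=<your_project>    # placeholder
#SBATCH --time=00:30:00             # placeholder
#SBATCH --mem=8G                    # placeholder
#SBATCH --cpus-per-task=4

module use -a /projappl/nlpl/software/modules/etc    # Puhti; on Saga use /cluster/shared/nlpl/software/modules/etc
module load nlpl-efmaral/0.1_20191218

# Rerun the efmaral evaluation example from the installation directory.
cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
    3rdparty/data/test.eng.hin.wa \
    3rdparty/data/test.eng 3rdparty/data/test.hin \
    3rdparty/data/trial.eng 3rdparty/data/trial.hin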
Using the Marian-NMT module
Example scripts
We provide adaptations of the Marian example scripts. These are best copied into your personal workspace before running them:
cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples
- Training-basics: Launch the script with sbatch run-me.sh.
- Transformer: Launch the script with sbatch run-me.sh. Note that the script is limited to run for 24h, which will not complete the training process. Also, multi-GPU processes consume a lot of billing units on CSC, so be careful with Transformer experiments!
- Translating-amun: Launch the script with sbatch run-me.sh.
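Putting the steps together, here is a minimal sketch of copying the examples and submitting the basic training example. The module repository path and module name are taken from the sections above; the examples path is the one given above, and the training-basics directory name is assumed to match the item in the list.
# Minimal sketch: load Marian-NMT, copy the examples and submit the basic training script.
module use -a /projappl/nlpl/software/modules/etc    # Puhti; on Saga use /cluster/shared/nlpl/software/modules/etc
module load nlpl-marian-nmt/1.8.0-eba7aed
cp -r /proj/nlpl/software/marian/1.2.0/examples ./marian_examples
cd ./marian_examples/training-basics    # directory name assumed from the list above
sbatch run-me.sh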
Contact:
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi