Translation/models

From Nordic Language Processing Laboratory
Revision as of 13:46, 3 February 2019 by Yvessche (talk | contribs)

MT example scripts and pretrained models

The models and scripts are located at /proj*/nlpl/data/translation/pretrained-models/

wmt18_helsinki-enfi-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [Moses tutorial](http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional information. The goals of this example are twofold:

  • to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts were tuned for Taito as of January 2019 and may need to be adapted. The scripts have not been tested on Abel, but should run there with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
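
The copy-and-run steps above could be scripted roughly as follows. This is a minimal sketch, not part of the provided scripts: the function names and directory arguments are invented for illustration, the `[1-6]_*.sh` pattern assumes the scripts are named as in the list above, and submitting with `sbatch` assumes they are ordinary SLURM batch scripts. Chaining the jobs is a convenience; running the scripts one by one and checking each output by hand remains the safer route.

```shell
# Hypothetical helpers, not part of the NLPL scripts.

# Copy the numbered step scripts (1_prepare.sh ... 6_test.sh) from the model
# directory into a working directory of your own.
copy_step_scripts() (
    model_dir="$1"
    work_dir="$2"
    mkdir -p "$work_dir" || exit 1
    cp "$model_dir"/[1-6]_*.sh "$work_dir"/
)

# Print the step scripts in the order in which they should be run.
list_step_scripts() (
    work_dir="$1"
    for script in "$work_dir"/[1-6]_*.sh; do
        [ -e "$script" ] || continue   # no match: the glob stays literal
        basename "$script"
    done
)

# Submit the steps as a chain so that each one starts only after the previous
# one has finished successfully. Assumption: the scripts are plain SLURM batch
# scripts. The submit command can be overridden for a dry run.
submit_chain() (
    work_dir="$1"
    submit="${2:-sbatch}"
    cd "$work_dir" || exit 1
    prev=""
    for script in [1-6]_*.sh; do
        [ -e "$script" ] || continue
        if [ -n "$prev" ]; then
            job=$("$submit" --parsable --dependency="afterok:$prev" "$script") || exit 1
        else
            job=$("$submit" --parsable "$script") || exit 1
        fi
        prev="${job%%;*}"   # --parsable may print "jobid;cluster"
        echo "$script submitted as job $prev"
    done
)
```

A call might look like `copy_step_scripts /proj*/nlpl/data/translation/pretrained-models/wmt18_helsinki-enfi-moses "$HOME/enfi-moses"`, with the `/proj*` prefix replaced by the project directory of your cluster, followed by `submit_chain "$HOME/enfi-moses"` once the paths in the copied scripts have been adapted.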

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
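
Adapting the WORKDIR path can be done in an editor or with a small helper like the one below. This is only a sketch: it assumes that `6_test.sh` sets the variable on a line starting with `WORKDIR=`, which should be checked in the script itself, and the function name `set_workdir` is invented for this example.

```shell
# Hypothetical helper: point the WORKDIR variable of a script to a new directory.
# Assumption: the script assigns the variable on a line that starts with
# "WORKDIR=". The directory name must not contain "|", "&" or backslashes,
# because it is pasted into a sed replacement.
set_workdir() (
    script="$1"
    new_dir="$2"
    tmp="$script.tmp.$$"
    sed "s|^WORKDIR=.*|WORKDIR=\"$new_dir\"|" "$script" > "$tmp" || exit 1
    # Write back in place so that the permissions of the script are kept.
    cat "$tmp" > "$script" && rm -f "$tmp"
)
```

After copying the script, `set_workdir 6_test.sh "$PWD"` would set the path to the current directory; `grep '^WORKDIR=' 6_test.sh` shows whether the substitution took effect.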

Yves Scherrer, January 2019

wmt18_helsinki-enfi-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the OpenNMT-py toolkit.

In progress.


wmt18-fien-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the [Moses tutorial](http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional information. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts were tuned for Taito as of January 2019 and may need to be adapted. The scripts have not been tested on Abel, but should run there with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.fi` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019


wmt18-fien-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the OpenNMT-py toolkit.

In progress.

opus-noen-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on parallel data from the OPUS repository
  • translating from Norwegian to English
  • using the Moses toolkit.

In progress.

opus-noen-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on parallel data from the OPUS repository
  • translating from Norwegian to English
  • using the OpenNMT-py toolkit.

In progress.


iwslt18_helsinki-euen-marian

This directory contains training scripts and the resulting model files for a translation system:

  • trained on data from the IWSLT18 low-resource translation task on Basque-to-English
  • using the preprocessed and augmented datasets from the University of Helsinki submission
  • with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission. The goals of this example are twofold:

  • to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
  • Adapt paths if necessary.
  • Run `1_train.sh`, then `2_test.sh`. The `validate.sh` script is called automatically during training and the `composeXML.py` script is called automatically during testing, so neither has to be run separately. The time and memory requirements in the SLURM scripts were tuned for Taito as of January 2019 and may need to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly with the earlier NLPL-installed Marian version available on Abel.
  • The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
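
To see how far the output of a retrained model departs from the provided one, the two files can be compared line by line. The helper below is a sketch written for this page (the name `count_differing_lines` is not part of the NLPL scripts); it counts the lines that differ and treats missing or extra lines as differences.

```shell
# Hypothetical helper: print the number of line positions at which two files differ.
count_differing_lines() (
    reference="$1"
    candidate="$2"
    awk '
        # First file: remember every line and the total line count.
        NR == FNR { ref[FNR] = $0; n = FNR; next }
        # Second file: a line differs if it is extra or its text does not match.
        { m = FNR; if (FNR > n || $0 != ref[FNR]) d++ }
        # Lines missing from the second file also count as differences.
        END { if (m < n) d += n - m; print d + 0 }
    ' "$reference" "$candidate"
)
```

For instance, `count_differing_lines provided/test.out.en test.out.en` (the `provided/` directory is a placeholder for wherever the reference file was copied to) prints `0` for identical files and a small number when only a few sentences were translated differently.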

Use the pre-trained model to translate unseen text

  • Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
  • Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
  • Adapt the WORKDIR path in `2_test.sh` and run the script.
  • The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
  • The resulting XML file is sent to the evaluation server. Comment this step out if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019