Translation/models

From Nordic Language Processing Laboratory

MT example scripts and pretrained models

The models and scripts are located at /proj*/nlpl/data/translation/pretrained-models/

wmt18_helsinki-enfi-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 6_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files testdata.out.fi and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
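
The first two steps above can be sketched in shell as follows. This is a minimal sketch and not part of the provided scripts: the helper names copy_step_scripts and list_step_scripts are made up for illustration, and the pattern [0-9]_*.sh assumes that every step script follows the 1_prepare.sh ... 6_test.sh naming scheme.

```shell
# Illustrative helpers (hypothetical names, not shipped with the model directory).

# Copy all numbered step scripts (1_*.sh, 2_*.sh, ...) from src to dst.
copy_step_scripts() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for f in "$src"/[0-9]_*.sh; do
        # Skip the unexpanded pattern if nothing matched.
        if [ -e "$f" ]; then
            cp "$f" "$dst"/ || return 1
        fi
    done
}

# Print the step scripts found in a directory, in the order they should be run.
list_step_scripts() {
    for f in "$1"/[0-9]_*.sh; do
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
}
```

With MODEL_DIR set to the model directory on your cluster and WORK_DIR set to your own working directory (both names are placeholders), copy_step_scripts "$MODEL_DIR" "$WORK_DIR" copies the scripts and list_step_scripts "$WORK_DIR" prints them in order, so that each one can be checked for paths and then run in turn.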

Use the pre-trained model to translate unseen text

  • Copy the 6_test.sh script to your own working directory.
  • Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.en to your working directory.
  • Adapt the WORKDIR path in 6_test.sh and run the script.
  • The output of script 6 corresponds to the files testdata.out.fi and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
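
Adapting the WORKDIR path can be done in any editor. The sketch below does it non-interactively, under the assumption (not confirmed by this page) that 6_test.sh sets the variable on a line of its own beginning with WORKDIR=. The function name set_workdir is made up for illustration.

```shell
# Hypothetical helper: rewrite the WORKDIR assignment in a copied script.
# Assumes a line starting with "WORKDIR="; check the script first if unsure.
set_workdir() {
    script=$1
    new_dir=$2
    tmp_file="$script.tmp"
    # "|" is the sed delimiter so that slashes in the path need no escaping;
    # the new path itself must not contain "|", "&" or backslashes.
    sed "s|^WORKDIR=.*|WORKDIR=$new_dir|" "$script" > "$tmp_file" || return 1
    # Overwrite the contents in place so that the script keeps its permissions.
    cat "$tmp_file" > "$script" && rm -f "$tmp_file"
}
```

For example, set_workdir 6_test.sh "$PWD" points the copied script at the current directory. If the script defines the variable differently, the line has to be edited by hand.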

Yves Scherrer, January 2019

wmt18_helsinki-enfi-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:

  • to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 4_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
  • The output of script 4 should be similar to the provided files testdata.out.fi and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
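
To judge whether the output of a retrained model is "similar" to the provided testdata.out.fi, it can help to count how many output lines differ. The function below is one rough way to do this. Its name is made up for illustration, and a line-level count is only a coarse indicator next to the score reported in evaluation.txt.

```shell
# Hypothetical helper: count the lines of "hypothesis" that differ from the
# line with the same number in "reference" (reference must not be empty).
count_differing_lines() {
    reference=$1
    hypothesis=$2
    awk '
        NR == FNR { ref[FNR] = $0; next }   # first file: remember every line
        $0 != ref[FNR] { differing++ }      # second file: compare line by line
        END { print differing + 0 }
    ' "$reference" "$hypothesis"
}
```

The first argument is the provided output file and the second is the file written by your own run of script 4. Both files should have the same number of lines (wc -l shows this); a mismatch usually points to a problem with the test input rather than to training noise.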

Use the pre-trained model to translate unseen text

  • Copy the 4_test.sh script to your own working directory.
  • Provide a tokenized, truecased and BPE-encoded test file (1_prepare.sh shows how to do that) or copy testdata.en to your working directory.
  • Adapt the WORKDIR path in 4_test.sh and run the script.
  • The output of script 4 corresponds to the files testdata.out.fi and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

wmt18-fien-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 6_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files testdata.out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

Use the pre-trained model to translate unseen text

  • Copy the 6_test.sh script to your own working directory.
  • Provide a tokenized and truecased Finnish test file (1_prepare.sh shows how to do that) or copy the provided Finnish test data file to your working directory.
  • Adapt the WORKDIR path in 6_test.sh and run the script.
  • The output of script 6 corresponds to the files testdata.out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

wmt18-fien-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 4_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
  • The output of script 4 should be similar to the provided files testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

Use the pre-trained model to translate unseen text

  • Copy the 4_test.sh script to your own working directory.
  • Provide a tokenized and truecased Finnish test file (1_prepare.sh shows how to do that) or copy the provided Finnish test data file to your working directory.
  • Adapt the WORKDIR path in 4_test.sh and run the script.
  • The output of script 4 corresponds to the files testdata_out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

opus-noen-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on sentence-aligned data from the OPUS collection
  • translating from Norwegian to English
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 6_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files testdata_out.tok.en, testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

Use the pre-trained model to translate unseen text

  • Copy the 6_test.sh script to your own working directory.
  • Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.true.no to your working directory.
  • Adapt the WORKDIR path in 6_test.sh and run the script.
  • The output of script 6 corresponds to the files testdata_out.tok.en, testdata_out.en and evaluation.txt.

Yves Scherrer, May 2019

opus-noen-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on sentence-aligned data from the OPUS collection
  • translating from Norwegian to English
  • using the OpenNMT-py toolkit.

The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and the OPUS corpus collection as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts 1_prepare.sh through 4_test.sh to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
  • The output of script 4 should be similar to the provided files testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

Use the pre-trained model to translate unseen text

  • Copy the 4_test.sh script to your own working directory.
  • Provide a tokenized and truecased Norwegian test file (1_prepare.sh shows how to do that) or copy the provided Norwegian test data file to your working directory.
  • Adapt the WORKDIR path in 4_test.sh and run the script.
  • The output of script 4 corresponds to the files testdata_out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

iwslt18_helsinki-euen-marian

This directory contains training scripts and the resulting model files for a translation system:

  • trained on data from the IWSLT18 low-resource translation task on Basque-to-English
  • using the preprocessed and augmented datasets from the University of Helsinki submission
  • with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission. The goals of this example are twofold:

  • to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the 1_train.sh, 2_test.sh, validate.sh and composeXML.py scripts to your own working directory.
  • Adapt paths if necessary.
  • Run 1_train.sh, then 2_test.sh. The validate.sh script is called automatically during training, and the composeXML.py script is called automatically during testing, so neither has to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
  • The output of script 2 should be similar to the provided test.out.en and test.out.en.xml files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
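
Since 2_test.sh needs the model produced by 1_train.sh, the two jobs can also be queued together so that testing starts only after training has finished successfully. The sketch below assumes that both scripts are submitted with sbatch (this page only says that they are SLURM scripts). The function submit_chain and the SUBMIT variable are names introduced here for illustration.

```shell
# Command used to submit a job; can be overridden, e.g. for a dry run.
SUBMIT=${SUBMIT:-sbatch}

# Submit the given scripts in order; each job waits for the previous one
# to finish successfully (Slurm dependency type "afterok").
submit_chain() {
    previous=""
    for script in "$@"; do
        if [ -z "$previous" ]; then
            job=$("$SUBMIT" --parsable "$script") || return 1
        else
            job=$("$SUBMIT" --parsable --dependency="afterok:$previous" "$script") || return 1
        fi
        # With --parsable, sbatch prints "jobid" or "jobid;cluster".
        previous=${job%%;*}
        echo "$script submitted as job $previous"
    done
}
```

Running submit_chain 1_train.sh 2_test.sh from the working directory queues both jobs. If training fails, the test job may remain in the queue with an unsatisfied dependency and can be removed with scancel.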

Use the pre-trained model to translate unseen text

  • Copy the 2_test.sh and composeXML.py scripts to your own working directory.
  • Provide a tokenized, truecased and BPE-encoded test file or copy test.eu to your working directory.
  • Adapt the WORKDIR path in 2_test.sh and run the script.
  • The output of script 2 corresponds to the files test.out.en and test.out.en.xml.
  • The script sends the resulting XML file to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019