Difference between revisions of "Translation/models"

From Nordic Language Processing Laboratory

Revision as of 11:57, 9 May 2019

MT example scripts and pretrained models

The models and scripts are located at /proj*/nlpl/data/translation/pretrained-models/

wmt18_helsinki-enfi-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
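
As a minimal sketch of the first step, the numbered scripts could be copied with a few lines of Python. The helper name `copy_numbered_scripts` and the assumption that every step script is named `<digit>_<name>.sh` are illustrative and not taken from the NLPL scripts; only `1_prepare.sh` and `6_test.sh` are named above.

```python
import shutil
from pathlib import Path


def copy_numbered_scripts(model_dir, workdir):
    """Copy the numbered step scripts from model_dir into workdir.

    Returns the copied file names, sorted so that they are in run order.
    """
    model_dir, workdir = Path(model_dir), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # Assumption: every step script is named "<digit>_<name>.sh".
    scripts = sorted(model_dir.glob("[0-9]_*.sh"))
    for script in scripts:
        # copy2 keeps the executable bit and timestamps of the original.
        shutil.copy2(script, workdir / script.name)
    return [script.name for script in scripts]
```

Calling `copy_numbered_scripts(model_dir, workdir)` with the model directory and your own working directory leaves model files and reference outputs where they are and returns the script names in the order in which they are meant to be run.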

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

wmt18_helsinki-enfi-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:

  • to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
  • The output of script 4 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
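
Because retraining is not deterministic, a byte-for-byte comparison with the provided output will usually fail. One rough way to quantify "similar" is the share of sentences that are translated identically. This is a sketch with a hypothetical helper name, and it is not the evaluation performed by the provided scripts.

```python
def line_agreement(own_lines, reference_lines):
    """Fraction of sentence positions at which both outputs are identical."""
    if len(own_lines) != len(reference_lines):
        raise ValueError("both outputs must contain the same number of sentences")
    if not own_lines:
        return 1.0
    # Surrounding whitespace and newline characters are ignored.
    same = sum(a.strip() == b.strip() for a, b in zip(own_lines, reference_lines))
    return same / len(own_lines)
```

The two arguments can be obtained by calling `readlines()` on your own `testdata.out.fi` and on the provided one. The score reported in `evaluation.txt` remains the more meaningful comparison, since two different translations can be equally good.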

Use the pre-trained model to translate unseen text

  • Copy the `4_test.sh` script to your own working directory.
  • Provide a tokenized, truecased and BPE-encoded test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `4_test.sh` and run the script.
  • The output of script 4 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
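
The note about registered test sets can be illustrated with the sacreBLEU command line, which looks up source and reference by test set name and language pair. The sketch below only builds such a command. The test set name `wmt18` in the usage note is an assumption about which test set this example uses, and the provided `4_test.sh` may call sacreBLEU differently.

```python
def sacrebleu_command(output_file, test_set, lang_pair):
    """Build a sacreBLEU command that scores output_file on a registered test set."""
    # -t: name of a test set known to sacreBLEU
    # -l: language pair in the form "source-target"
    # -i: file with the (detokenized) system output
    return ["sacrebleu", "-t", test_set, "-l", lang_pair, "-i", output_file]
```

For example, `sacrebleu_command("testdata.out.fi", "wmt18", "en-fi")` returns a list that can be passed to `subprocess.run`. If the test set is not registered, sacreBLEU cannot retrieve the reference translations, which is why evaluation fails for arbitrary input files.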

Yves Scherrer, May 2019

wmt18-fien-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata.out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
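
Running the scripts one by one means waiting for each job to finish before submitting the next. As a hedged alternative, SLURM job dependencies can enforce the same order automatically. The sketch assumes that every script is a SLURM batch script that can be submitted with `sbatch`; the function names are hypothetical and this is not part of the provided scripts.

```python
import subprocess


def sbatch_submit(script, after_job=None):
    """Submit script with sbatch and return the job ID as a string."""
    cmd = ["sbatch", "--parsable"]
    if after_job is not None:
        # Start only after the previous job has finished successfully.
        cmd.append(f"--dependency=afterok:{after_job}")
    cmd.append(script)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    # With --parsable, sbatch prints "jobid" or "jobid;cluster".
    return result.stdout.strip().split(";")[0]


def submit_chain(scripts, submit=sbatch_submit):
    """Submit scripts in order, each one depending on the previous job."""
    job_ids = []
    previous = None
    for script in scripts:
        previous = submit(script, previous)
        job_ids.append(previous)
    return job_ids
```

Submitting step by step by hand remains the safer choice for a first run, because it allows the intermediate files to be inspected before the next, possibly expensive, step starts.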

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata.out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, January 2019

wmt18-fien-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on the raw versions of the WMT18 news data
  • translating from Finnish to English
  • using the OpenNMT-py toolkit.

The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
  • The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

Use the pre-trained model to translate unseen text

  • Copy the `4_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `4_test.sh` and run the script.
  • The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

opus-noen-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on sentence-aligned data from the OPUS collection
  • translating from Norwegian to English
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses tutorial, which has additional information. The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.true.no` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata_out.tok.en`, `testdata_out.en` and `evaluation.txt`.

Yves Scherrer, May 2019

opus-noen-onmt

This directory contains training scripts and the resulting model files for a translation system:

  • trained on sentence-aligned data from the OPUS collection
  • translating from Norwegian to English
  • using the OpenNMT-py toolkit.

The goals of this example are twofold:

  • to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and the OPUS corpus collection as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource language" from an MT point of view.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `4_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
  • The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with `module load nlpl-opennmt-py` (without the `-gpu` suffix).
  • The output of script 4 should be similar to the provided files `testdata_out.en` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

Use the pre-trained model to translate unseen text

  • Copy the `4_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do that) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `4_test.sh` and run the script.
  • The output of script 4 corresponds to the files `testdata_out.en` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.

Yves Scherrer, May 2019

iwslt18_helsinki-euen-marian

This directory contains training scripts and the resulting model files for a translation system:

  • trained on data from the IWSLT18 low-resource translation task on Basque-to-English
  • using the preprocessed and augmented datasets from the University of Helsinki submission
  • with the Marian NMT toolkit.

The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission. The goals of this example are twofold:

  • to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain a new model using the provided scripts

  • Copy the `1_train.sh`, `2_test.sh`, `validate.sh` and `composeXML.py` scripts to your own working directory.
  • Adapt paths if necessary.
  • Run the script `1_train.sh`, then `2_test.sh`. The `validate.sh` script is called automatically during training and the `composeXML.py` script is called automatically during testing, so neither has to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
  • The output of script 2 should be similar to the provided `test.out.en` and `test.out.en.xml` files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.

Use the pre-trained model to translate unseen text

  • Copy the `2_test.sh` and `composeXML.py` scripts to your own working directory.
  • Provide a tokenized, truecased and BPE-encoded test file or copy `test.eu` to your working directory.
  • Adapt the WORKDIR path in `2_test.sh` and run the script.
  • The output of script 2 corresponds to the files `test.out.en` and `test.out.en.xml`.
  • The resulting XML file is sent to the evaluation server. Comment out this step if you are not translating the official IWSLT 2018 test set.

Yves Scherrer, January 2019