Latest revision as of 12:04, 9 May 2019
MT example scripts and pretrained models
The models and scripts are located at /proj*/nlpl/data/translation/pretrained-models/
wmt18_helsinki-enfi-moses
This directory contains training scripts and the resulting model files for a translation system:
- trained on WMT18 news data preprocessed by the University of Helsinki
- translating from English to Finnish
- using the Moses SMT toolkit.
The scripts (and the resulting model) are based on the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional information. The goals of this example are twofold:
- to illustrate the use of the Moses tools and MT data as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 6_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
- The output of script 6 should be similar to the provided files testdata.out.fi and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
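Instead of submitting the scripts one by one and waiting for each to finish, they can also be chained with SLURM job dependencies so that each step starts only after the previous one has succeeded. The helper below is a hypothetical sketch (it is not part of the provided scripts); `sbatch --parsable` and `--dependency=afterok` are standard SLURM options.

```python
import subprocess

def submit_chain(scripts, submit=None):
    """Submit a list of SLURM scripts as an afterok dependency chain.

    The `submit` callable is injectable for testing; the default runs
    `sbatch --parsable`, which prints only the job id of the new job.
    """
    if submit is None:
        def submit(args):
            out = subprocess.run(["sbatch", "--parsable"] + args,
                                 check=True, capture_output=True, text=True)
            return out.stdout.strip()
    job_ids = []
    prev = None
    for script in scripts:
        # make this job wait for successful completion of the previous one
        dep = ["--dependency=afterok:" + prev] if prev else []
        prev = submit(dep + [script])
        job_ids.append(prev)
    return job_ids
```

For example, `submit_chain(sorted(glob.glob("[1-6]_*.sh")))` would submit the whole pipeline; this assumes the intermediate scripts follow the same numbering convention as 1_prepare.sh and 6_test.sh.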
Use the pre-trained model to translate unseen text
- Copy the 6_test.sh script to your own working directory.
- Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.en to your working directory.
- Adapt the WORKDIR path in 6_test.sh and run the script.
- The output of script 6 corresponds to the files testdata.out.fi and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
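The score reported in evaluation.txt is computed by sacreBLEU. To make it clear what that number measures, here is a minimal pure-Python sketch of corpus-level BLEU (modified n-gram precision up to 4-grams combined with a brevity penalty); this is an illustration, not the sacreBLEU implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with a single reference, as a percentage."""
    matches = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)
```

sacreBLEU additionally applies its own tokenization and, crucially, looks up the reference translations for registered test sets such as the WMT and IWSLT ones, which is why the evaluation step only works for test sets in its database.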
Yves Scherrer, January 2019
wmt18_helsinki-enfi-onmt
This directory contains training scripts and the resulting model files for a translation system:
- trained on WMT18 news data preprocessed by the University of Helsinki
- translating from English to Finnish
- using the OpenNMT-py toolkit.
The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:
- to illustrate the use of the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 4_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
- The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
- The output of script 4 should be similar to the provided files testdata.out.fi and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
Use the pre-trained model to translate unseen text
- Copy the 4_test.sh script to your own working directory.
- Provide a tokenized, truecased and BPE-encoded test file (1_prepare.sh shows how to do that) or copy testdata.en to your working directory.
- Adapt the WORKDIR path in 4_test.sh and run the script.
- The output of script 4 corresponds to the files testdata.out.fi and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
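BPE encoding splits rare or unseen words into subword units so that the NMT model operates over a fixed vocabulary. The snippet below is an illustrative reimplementation of how a learned merge list is applied to a word; the actual preprocessing presumably uses a tool such as subword-nmt, and the example merges here are made up.

```python
def apply_bpe(word, merges):
    """Greedily apply a learned BPE merge list to one word.

    `merges` is an ordered list of symbol pairs, most frequent first,
    as produced by BPE learning (e.g. a subword-nmt codes file).
    """
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        # merge the highest-priority (lowest-rank) adjacent pair first
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break
        i = pairs.index(best)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return "@@ ".join(symbols)  # subword-nmt-style continuation markers
```

With the toy merge list `[("u", "n"), ("s", "e"), ("e", "n"), ("se", "en")]`, the word "unseen" comes out as "un@@ seen", i.e. split into two subword units with "@@" marking the non-final piece.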
Yves Scherrer, May 2019
wmt18-fien-moses
This directory contains training scripts and the resulting model files for a translation system:
- trained on the raw versions of the WMT18 news data
- translating from Finnish to English
- using the Moses SMT toolkit.
The scripts (and the resulting model) are based on the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional information. The goals of this example are twofold:
- to illustrate the use of the preprocessing pipeline, the Moses tools and MT data as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 6_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
- The output of script 6 should be similar to the provided files testdata.out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
Use the pre-trained model to translate unseen text
- Copy the 6_test.sh script to your own working directory.
- Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.fi to your working directory.
- Adapt the WORKDIR path in 6_test.sh and run the script.
- The output of script 6 corresponds to the files testdata.out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
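Truecasing restores the "natural" case of sentence-initial words so that, for example, "The" and "the" are not treated as distinct tokens. The following is a minimal sketch of the idea behind the Moses truecaser (train-truecaser.perl / truecase.perl); it is an illustration of the heuristic, not the actual implementation.

```python
from collections import Counter

def train_truecaser(corpus_lines):
    """Learn the most frequent surface form of each word, counting only
    non-sentence-initial positions (where casing is informative)."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(line.split()[1:])
    best = {}
    for form, n in counts.items():
        key = form.lower()
        if key not in best or n > counts[best[key]]:
            best[key] = form
    return best

def truecase(line, best):
    """Replace the sentence-initial token by its preferred surface form."""
    toks = line.split()
    if toks:
        toks[0] = best.get(toks[0].lower(), toks[0])
    return " ".join(toks)
```

Ordinary words are lowercased at sentence start, while names and other words that the training data usually capitalizes keep their capital letter.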
Yves Scherrer, January 2019
wmt18-fien-onmt
This directory contains training scripts and the resulting model files for a translation system:
- trained on the raw versions of the WMT18 news data
- translating from Finnish to English
- using the OpenNMT-py toolkit.
The scripts (and the resulting model) are a slightly simplified version of the original Helsinki submissions. The goals of this example are twofold:
- to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and MT data as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 4_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
- The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
- The output of script 4 should be similar to the provided files testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
Use the pre-trained model to translate unseen text
- Copy the 4_test.sh script to your own working directory.
- Provide a tokenized, truecased and BPE-encoded test file (1_prepare.sh shows how to do that) or copy testdata.fi to your working directory.
- Adapt the WORKDIR path in 4_test.sh and run the script.
- The output of script 4 corresponds to the files testdata_out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
Yves Scherrer, May 2019
opus-noen-moses
This directory contains training scripts and the resulting model files for a translation system:
- trained on sentence-aligned data from the OPUS collection
- translating from Norwegian to English
- using the Moses SMT toolkit.
The scripts (and the resulting model) are based on the Moses tutorial (http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional information. The goals of this example are twofold:
- to illustrate the use of the preprocessing pipeline, the Moses tools and the OPUS corpus collection as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource" language from an MT point of view.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 6_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run with the necessary adaptations.
- The output of script 6 should be similar to the provided files testdata_out.tok.en, testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
Use the pre-trained model to translate unseen text
- Copy the 6_test.sh script to your own working directory.
- Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.true.no to your working directory.
- Adapt the WORKDIR path in 6_test.sh and run the script.
- The output of script 6 corresponds to the files testdata_out.tok.en, testdata_out.en and evaluation.txt.
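Here testdata_out.tok.en is the raw tokenized decoder output and testdata_out.en its detokenized version. Detokenization undoes the spacing that tokenization introduced around punctuation; the sketch below is a deliberately naive version of what the Moses detokenizer.perl does (the real script handles many more language-specific rules).

```python
import re

def detokenize(tokenized_line):
    """Rejoin a tokenized line into plain text (very simplified)."""
    text = tokenized_line
    # remove the space tokenization inserted before closing punctuation
    text = re.sub(r" +([.,!?;:%)\]])", r"\1", text)
    # remove the space after opening brackets
    text = re.sub(r"([(\[]) +", r"\1", text)
    # naive handling of apostrophes, e.g. "it ' s" -> "it's"
    text = re.sub(r" ' ", "'", text)
    return text
```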
Yves Scherrer, May 2019
opus-noen-onmt
This directory contains training scripts and the resulting model files for a translation system:
- trained on sentence-aligned data from the OPUS collection
- translating from Norwegian to English
- using the OpenNMT-py toolkit.
The goals of this example are twofold:
- to illustrate the use of the preprocessing pipeline, the OpenNMT-py library and the OPUS corpus collection as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model for Norwegian, a "low-resource" language from an MT point of view.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_prepare.sh to 4_test.sh scripts to your own working directory.
- Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
- Run the scripts one by one. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in May 2019 and may have to be adapted.
- The scripts will not run out-of-the-box on Abel due to different installed versions of OpenNMT-py. The relevant module can be loaded on Abel with module load nlpl-opennmt-py (without the -gpu suffix).
- The output of script 4 should be similar to the provided files testdata_out.en and evaluation.txt. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
Use the pre-trained model to translate unseen text
- Copy the 4_test.sh script to your own working directory.
- Provide a tokenized and truecased test file (1_prepare.sh shows how to do that) or copy testdata.no to your working directory.
- Adapt the WORKDIR path in 4_test.sh and run the script.
- The output of script 4 corresponds to the files testdata_out.en and evaluation.txt. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
Yves Scherrer, May 2019
iwslt18_helsinki-euen-marian
This directory contains training scripts and the resulting model files for a translation system:
- trained on data from the IWSLT18 low-resource translation task on Basque-to-English
- using the preprocessed and augmented datasets from the University of Helsinki submission
- with the Marian NMT toolkit.
The scripts (and the resulting model) correspond to a slightly simplified version of the original Helsinki submission. The goals of this example are twofold:
- to illustrate the use of the Marian library and the MT data sets as provided by the NLPL project for training new models,
- to provide a pre-trained, ready-to-use translation model.
The two use cases are described below.
Retrain a new model using the provided scripts
- Copy the 1_train.sh, 2_test.sh, validate.sh and composeXML.py scripts to your own working directory.
- Adapt paths if necessary.
- Run the script 1_train.sh, then 2_test.sh. The validate.sh script is automatically called during training and does not have to be run separately. The composeXML.py script is automatically called during testing and does not have to be run separately. The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. Note that these scripts use the Marian version installed system-wide on Taito and may not run correctly on the earlier NLPL-installed Marian version available on Abel.
- The output of script 2 should be similar to the provided test.out.en and test.out.en.xml files. Minor differences can be expected due to the non-deterministic nature of neural network training on GPU.
Use the pre-trained model to translate unseen text
- Copy the 2_test.sh and composeXML.py scripts to your own working directory.
- Provide a tokenized, truecased and BPE-encoded test file or copy test.eu to your working directory.
- Adapt the WORKDIR path in 2_test.sh and run the script.
- The output of script 2 corresponds to the files test.out.en and test.out.en.xml.
- By default, the resulting XML file is sent to the evaluation server. Comment this step out if you are not translating the official IWSLT 2018 test set.
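The XML file expected by the IWSLT evaluation server wraps the plain-text translations in segment markup. The function below is a hypothetical sketch of what a script like composeXML.py might produce; the element and attribute names follow the common NIST/IWSLT mteval format, and the default setid and sysid values are assumptions, not taken from the actual submission.

```python
import xml.etree.ElementTree as ET

def compose_xml(segments, set_id="iwslt18-eu-en", sysid="marian"):
    """Wrap plain-text output lines in an mteval-style XML document."""
    root = ET.Element("mteval")
    tstset = ET.SubElement(root, "tstset", setid=set_id,
                           srclang="eu", trglang="en", sysid=sysid)
    doc = ET.SubElement(tstset, "doc", docid="doc1")
    for i, seg in enumerate(segments, 1):
        # one <seg> element per translated line, numbered from 1
        ET.SubElement(doc, "seg", id=str(i)).text = seg
    return ET.tostring(root, encoding="unicode")
```

Calling `compose_xml(open("test.out.en").read().splitlines())` would turn the decoder output into a single XML string ready to be written to test.out.en.xml.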
Yves Scherrer, January 2019