Translation/models

MT example scripts and pretrained models

wmt18_helsinki-enfi-moses

This directory contains training scripts and the resulting model files for a translation system:

  • trained on WMT18 news data preprocessed by the University of Helsinki
  • translating from English to Finnish
  • using the Moses SMT toolkit.

The scripts (and the resulting model) are based on the Moses baseline tutorial (http://www.statmt.org/moses/?n=Moses.Baseline), which provides additional background information. The goals of this example are twofold:

  • to illustrate how the Moses tools and the MT data provided by the NLPL project can be used to train new models,
  • to provide a pre-trained, ready-to-use translation model.

The two use cases are described below.

Retrain the model using the provided scripts

  • Copy the scripts `1_prepare.sh` through `6_test.sh` to your own working directory.
  • Adapt paths if necessary, e.g. if you want to use data for a different language pair or a different translation direction.
  • Run the scripts one by one (a possible submission sequence is sketched after this list). The time and memory requirements in the SLURM scripts are tuned to usage on Taito in January 2019 and may have to be adapted. The scripts have not been tested on Abel, but should run there with the necessary adaptations.
  • The output of script 6 should be similar to the provided files `testdata.out.fi` and `evaluation.txt`. Minor differences can be expected due to the non-deterministic nature of MERT tuning.
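
For reference, here is a minimal sketch of what copying and submitting the scripts on a SLURM cluster such as Taito could look like. The source path and the working directory below are placeholders, and the job dependency chain simply ensures that each step starts only after the previous one has finished successfully:

    # Set up a working directory and copy the example scripts into it
    # (the source path is a placeholder for the actual model directory).
    mkdir -p "$HOME/enfi-moses-run" && cd "$HOME/enfi-moses-run"
    cp /path/to/wmt18_helsinki-enfi-moses/[1-6]_*.sh .

    # Submit the scripts in order; --dependency=afterok makes each job wait
    # for the previous one to finish successfully before it starts.
    jid=$(sbatch --parsable 1_prepare.sh)
    for script in 2_*.sh 3_*.sh 4_*.sh 5_*.sh 6_test.sh; do
        jid=$(sbatch --parsable --dependency=afterok:"$jid" "$script")
    done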

Use the pre-trained model to translate unseen text

  • Copy the `6_test.sh` script to your own working directory.
  • Provide a tokenized and truecased test file (`1_prepare.sh` shows how to do this; see also the sketch after this list) or copy `testdata.en` to your working directory.
  • Adapt the WORKDIR path in `6_test.sh` and run the script.
  • The output of script 6 corresponds to the files `testdata.out.fi` and `evaluation.txt`. Note that evaluation will only work correctly if the test set is registered in the sacreBLEU database. This is typically the case for WMT and IWSLT test sets.
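
As an illustration, the preparation and evaluation steps might look as follows on the command line. This is only a sketch: it assumes that a Moses installation is available under $MOSES and that the truecasing model produced by `1_prepare.sh` is stored as `truecase-model.en`; these paths and file names are placeholders and should be taken from the actual scripts.

    # Tokenize and truecase an English input file with the standard Moses scripts.
    MOSES=/path/to/mosesdecoder
    $MOSES/scripts/tokenizer/tokenizer.perl -l en < input.en > input.tok.en
    $MOSES/scripts/recaser/truecase.perl --model truecase-model.en \
        < input.tok.en > testdata.en

    # After running 6_test.sh, score the translation with sacreBLEU, provided
    # the test set is known to sacreBLEU (here, as an example, the WMT18
    # English-Finnish news test set).
    sacrebleu -t wmt18 -l en-fi < testdata.out.fi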

Yves Scherrer, January 2019