Difference between revisions of "Translation/home"

From Nordic Language Processing Laboratory

Revision as of 09:20, 20 February 2018

Background

An experimentation environment for statistical and neural machine translation (SMT and NMT) is maintained for NLPL under the coordination of the University of Helsinki (UoH). The software and data were initially commissioned on the Finnish Taito supercluster.

Available software and data

Statistical machine translation and word alignment

  • Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with IRSTLM language model, with SALM:
    • Release 4.0, installed on Abel and Taito as moses/4.0-65c75ff (usage notes below)
    • Release mmt-mvp-v0.12.1, installed on Taito as moses/mmt-mvp-v0.12.1-2739-gdc42bcb (not recommended)
  • Additional word alignment tools efmaral and eflomal:
    • Most recent version efmaral/0.1_2017_11_24, installed on Abel and Taito (usage notes below)
    • Previous version efmaral/0.1_2017_07_20, installed on Taito (not recommended)

Neural machine translation

  • Coming up (HNMT)

Datasets

  • IWSLT17 parallel data (0.6G, on Taito and Abel):
    • /proj[ects]/nlpl/data/translation/iwslt17
  • WMT17 news task parallel data (16G, on Taito and Abel):
    • /proj[ects]/nlpl/data/translation/wmt17news
  • WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Taito and Abel):
    • /proj[ects]/nlpl/data/translation/wmt17news_helsinki
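The `/proj[ects]` notation in the paths above presumably stands for `/proj` on Taito and `/projects` on Abel, mirroring the module paths given in the usage notes. A minimal sketch of resolving the data directory on either cluster follows. The helper `first_existing_dir` and the variable `NLPL_DATA` are hypothetical names chosen for illustration; they are not defined by the NLPL installation.

```shell
#!/usr/bin/env bash
# Print the first directory among the arguments that exists; return 1 if none does.
first_existing_dir() {
    local dir
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Assumption: the data tree uses /proj on Taito and /projects on Abel,
# like the software tree. NLPL_DATA is a hypothetical variable name.
if NLPL_DATA=$(first_existing_dir /proj/nlpl/data/translation /projects/nlpl/data/translation); then
    ls "$NLPL_DATA"    # list the datasets available on this cluster
else
    echo "No NLPL translation data directory found on this machine" >&2
fi
```

Scripts that refer to `$NLPL_DATA/iwslt17` and similar paths can then run unchanged on both clusters.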

Models

  • Coming up (Helsinki WMT2017 models, pretrained Edinburgh SMT models, ...)


Using the Moses module

  • Log into Taito or Abel
  • Activate the NLPL module repository:
module use -a /proj/nlpl/software/modulefiles/       # Taito
module use -a /projects/nlpl/software/modulefiles/   # Abel
  • Load the most recent version of the Moses module:
module load moses
  • Start using Moses, e.g. by following the tutorial at http://statmt.org/moses/
  • The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted, built with the following options:
    • cmph, irstlm, xmlrpc
    • with-mm
    • max-kenlm-order 10
    • max-factors 7
    • SALM + filter-pt
  • For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are provided in a separate module.) If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
module help moses
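The steps above can be combined into a small shell helper that works on both clusters. This is only a sketch: the function name `nlpl_module_use` is hypothetical, and it assumes the `module` command is available in the shell, as it is on the cluster login nodes.

```shell
#!/usr/bin/env bash
# Register the NLPL module repository, trying the Taito path first and then
# the Abel path. Arguments, if given, replace the default candidate directories.
nlpl_module_use() {
    local dir
    if [ "$#" -eq 0 ]; then
        set -- /proj/nlpl/software/modulefiles/ /projects/nlpl/software/modulefiles/
    fi
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            module use -a "$dir"
            return    # propagate the exit status of `module use`
        fi
    done
    echo "NLPL module repository not found" >&2
    return 1
}

# Typical use on a login node (commented out so the file can be sourced safely):
# nlpl_module_use && module load moses/4.0-65c75ff && module help moses
```

Loading `moses/4.0-65c75ff` by its full name instead of plain `moses` is a design choice: it keeps a script on the same release even if a newer version later becomes the default.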

Using the Efmaral module

  • Log into Taito or Abel
  • Activate the NLPL module repository:
module use -a /proj/nlpl/software/modulefiles/       # Taito
module use -a /projects/nlpl/software/modulefiles/   # Abel
  • Load the most recent version of the Efmaral module:
module load efmaral
  • You can use the align.py script directly:
align.py ...
  • You can use the efmaral module inside a Python3 script:
python3
>>> import efmaral
  • You can run the evaluation script on the example data found under $EFMARALPATH:
cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
   3rdparty/data/test.eng.hin.wa \
   3rdparty/data/test.eng 3rdparty/data/test.hin \
   3rdparty/data/trial.eng 3rdparty/data/trial.hin
  • The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
align_eflomal.py ...
  • You can also use the eflomal executable:
eflomal ...
  • You can also use the eflomal module in a Python3 script:
python3
>>> import eflomal
  • The atools executable (from fast_align) is also made available.
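To confirm that the executables named above are actually on the `PATH` after loading the module, a quick check along the following lines may help. The function `check_commands` is a hypothetical helper, not part of the Efmaral module.

```shell
#!/usr/bin/env bash
# Report which of the given commands cannot be found on PATH.
# Prints one line per missing command and returns non-zero if any is missing.
check_commands() {
    local cmd missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}

# After `module load efmaral`, the tools listed in this section should all resolve:
# check_commands align.py align_eflomal.py eflomal atools
```

If a command is reported as missing, `module list` shows whether the efmaral module is loaded in the current shell.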


Contact: Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi