Difference between revisions of "Translation/home"
= Background = | = Background = | ||
+ | |||
+ | [[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]] | ||
An experimentation environment for Statistical and Neural Machine Translations (SMT and NMT) | An experimentation environment for Statistical and Neural Machine Translations (SMT and NMT) | ||
is maintained for NLPL under the coordination of the University of Helsinki (UoH). | is maintained for NLPL under the coordination of the University of Helsinki (UoH). | ||
− | + | The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters. | |
− | |||
= Available software and data = | = Available software and data = | ||
Line 10: | Line 11: | ||
=== Statistical machine translation and word alignment === | === Statistical machine translation and word alignment === | ||
− | * Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align | + | * The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]]). Note: the most recent version on Puhti (as of Oct 2022) is <code>nlpl-moses/4.0.1-3990724</code>. |
− | + | * The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]]). Note: the most recent version on Puhti (as of Oct 2022) is <code>nlpl-efmaral/1.0.1_20221015</code>. | |
=== Neural machine translation === | === Neural machine translation === | ||
− | * | + | * '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]] |
− | * | + | * '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>. |
− | * | + | * '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/2.3.0</code>. |
=== General scripts for machine translation === | === General scripts for machine translation === | ||
− | * The ''nlpl-mttools'' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. | + | * The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details. Note: the most recent version on Puhti (as of Oct 2022) is <code>nlpl-mttools/20221015</code>. |
− | + | ||
− | * | + | === Tools for processing parallel corpora (OPUS tools) === |
+ | * The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module. [[#Using the OPUS Tools module|Usage notes below.]] | ||
+ | * '''Uplug''' is installed in the <code>nlpl-uplug</code> module. | ||
+ | * '''Udpipe''' is installed in the <code>nlpl-udpipe</code> module. | ||
+ | * '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module. | ||
=== Datasets === | === Datasets === | ||
+ | |||
+ | On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>. | ||
<ul> | <ul> | ||
− | <li> IWSLT17 parallel data (0.6G, on | + | <li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/> |
− | <pre>/ | + | <pre>$NLPL/data/translation/iwslt17</pre> |
+ | </li> | ||
+ | <li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/> | ||
+ | <pre>$NLPL/data/translation/wmt17news</pre> | ||
</li> | </li> | ||
− | <li> WMT17 news task | + | <li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/wmt17news_helsinki</pre> |
</li> | </li> | ||
− | <li> | + | <li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/iwslt18</pre> |
</li> | </li> | ||
− | <li> IWSLT18 (low-resource Basque-to-English task) | + | <li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/iwslt18_helsinki</pre> |
</li> | </li> | ||
− | <li> | + | <li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/wmt18news</pre> |
</li> | </li> | ||
− | <li> WMT18 news task | + | <li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/wmt18news_helsinki</pre> |
</li> | </li> | ||
− | <li> | + | <li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/> |
− | <pre> | + | <pre>$NLPL/data/translation/wmt18news_helsinki</pre> |
</li> | </li> | ||
</ul> | </ul> | ||
Line 60: | Line 63: | ||
=== Models === | === Models === | ||
− | + | See [[Translation/models|this page]] for details. | |
= Using the Moses module = | = Using the Moses module = | ||
<ul> | <ul> | ||
− | |||
<li>Activate the NLPL module repository: | <li>Activate the NLPL module repository: | ||
− | <pre>module use -a / | + | <pre>module use -a /projappl/nlpl/software/modules/etc # Puhti |
− | module use -a / | + | module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre> |
</li> | </li> | ||
<li>Load the most recent version of the Moses module: | <li>Load the most recent version of the Moses module: | ||
Line 76: | Line 78: | ||
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted: | <li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted: | ||
<ul> | <ul> | ||
− | <li>cmph | + | <li>cmph, xmlrpc</li> | |
<li>with-mm</li> | <li>with-mm</li> | ||
<li>max-kenlm-order 10</li> | <li>max-kenlm-order 10</li> | ||
Line 90: | Line 92: | ||
<ul> | <ul> | ||
− | |||
<li>Activate the NLPL module repository: | <li>Activate the NLPL module repository: | ||
− | <pre>module use -a / | + | <pre>module use -a /projappl/nlpl/software/modules/etc # Puhti |
− | module use -a / | + | module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre> |
</li> | </li> | ||
<li>Load the most recent version of the Efmaral module: | <li>Load the most recent version of the Efmaral module: | ||
Line 127: | Line 128: | ||
</ul> | </ul> | ||
− | = Using the | + | = Using the OPUS Tools module = |
<ul> | <ul> | ||
− | <li> | + | <li>Activate the NLPL module repository: |
− | + | <pre>module use -a /projappl/nlpl/software/modules/etc # Puhti | |
− | <pre>module use -a / | + | module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre> |
− | module | ||
</li> | </li> | ||
− | <li> | + | <li>Load the OPUS tools module: |
− | + | <pre> | |
− | + | module load nlpl-opus | |
− | + | </pre> | |
− | <pre> | ||
− | module load nlpl- | ||
</li> | </li> | ||
− | <li> | + | <li>You can also load CWB, Uplug and Udpipe modules: |
− | <pre>module | + | <pre>module load nlpl-cwb</pre> |
+ | <pre>module load nlpl-uplug</pre> | ||
+ | <pre>module load nlpl-udpipe</pre> | ||
</li> | </li> | ||
</ul> | </ul> | ||
'''Contact:''' | '''Contact:''' | ||
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi | Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi |
Latest revision as of 11:54, 21 October 2022
Background
Translation activity on the Taito and Abel servers (outdated)
An experimentation environment for statistical and neural machine translation (SMT and NMT) is maintained for NLPL under the coordination of the University of Helsinki (UoH). The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.
Available software and data
Statistical machine translation and word alignment
- The Moses SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: nlpl-moses/4.0-a89691f (usage notes below). Note: the most recent version on Puhti (as of Oct 2022) is nlpl-moses/4.0.1-3990724.
- The word alignment tools efmaral and eflomal are installed on Puhti and Saga in the nlpl-efmaral module: nlpl-efmaral/0.1_20191218 (usage notes below). Note: the most recent version on Puhti (as of Oct 2022) is nlpl-efmaral/1.0.1_20221015.
Neural machine translation
- Marian-NMT is installed on Puhti and Saga as nlpl-marian-nmt/1.8.0-eba7aed. Usage notes below.
- OpenNMT-py is installed on Saga using NLPL-internal PyTorch: nlpl-opennmt-py/1.0.0rc2/3.7.
- OpenNMT-py is installed on Puhti using system-wide PyTorch: nlpl-opennmt-py/2.3.0.
General scripts for machine translation
- The nlpl-mttools module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: nlpl-mttools/20191218. See the mttools page for further details. Note: the most recent version on Puhti (as of Oct 2022) is nlpl-mttools/20221015.
Tools for processing parallel corpora (OPUS tools)
- The bundle of OPUS tools is installed on Puhti and Saga in the nlpl-opus module. Usage notes below.
- Uplug is installed in the nlpl-uplug module.
- Udpipe is installed in the nlpl-udpipe module.
- Corpus Work Bench is installed in the nlpl-cwb module.
Datasets
On Puhti, the $NLPL project directory is located at /projappl/nlpl. On Saga, the $NLPL project directory is located at /cluster/shared/nlpl/.
- IWSLT17 parallel data (0.6G, on Puhti and Saga):
$NLPL/data/translation/iwslt17
- WMT17 news task parallel data (16G, on Puhti and Saga):
$NLPL/data/translation/wmt17news
- WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):
$NLPL/data/translation/wmt17news_helsinki
- IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):
$NLPL/data/translation/iwslt18
- IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):
$NLPL/data/translation/iwslt18_helsinki
- WMT18 news task parallel data (17G, on Puhti and Saga):
$NLPL/data/translation/wmt18news
- WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):
$NLPL/data/translation/wmt18news_helsinki
- WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):
$NLPL/data/translation/wmt18news_helsinki
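The dataset paths above all hang off $NLPL, which points to a different location on each cluster. As a minimal sketch, assuming a POSIX shell and a session in which $NLPL may not be set yet, the following picks whichever of the two documented project directories exists. The helper name first_existing_dir is hypothetical and is not provided by any NLPL module.

```shell
# first_existing_dir DIR...: print the first argument that is an existing
# directory; return non-zero if none of them exists.
# (Hypothetical helper, not provided by NLPL.)
first_existing_dir() {
    for candidate_dir in "$@"; do
        if [ -d "$candidate_dir" ]; then
            printf '%s\n' "$candidate_dir"
            return 0
        fi
    done
    return 1
}

# The two candidate paths are the ones documented above for Puhti and Saga.
if NLPL=$(first_existing_dir /projappl/nlpl /cluster/shared/nlpl); then
    export NLPL
    ls "$NLPL/data/translation"
else
    echo "No NLPL project directory found; are you on Puhti or Saga?" >&2
fi
```

On a machine that is neither Puhti nor Saga, the snippet only prints the message on standard error and leaves the environment unchanged.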
Models
See this page for details.
Using the Moses module
- Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
- Load the most recent version of the Moses module:
module load nlpl-moses
- Start using Moses, e.g. using the tutorial at http://statmt.org/moses/
- The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:
- cmph, xmlrpc
- with-mm
- max-kenlm-order 10
- max-factors 7
- SALM + filter-pt
- For word alignment, you can use GIZA++, Mgiza and fast_align. (The word alignment tools efmaral and eflomal are part of a separate module.)
If you need to specify absolute paths in your scripts, you can find them on the help page of the module:
module help nlpl-moses
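After loading the module, it can be useful to confirm which executables actually ended up on your PATH before writing scripts around them. The following is a minimal sketch in POSIX shell. The helper name report_tools is hypothetical, and the binary names moses, mgiza and fast_align are assumptions about what the module exposes; the help page of the module remains the authoritative list.

```shell
# report_tools NAME...: for each command name, print whether it is on PATH.
# Returns non-zero if at least one name is missing.
# (Hypothetical helper, not part of the nlpl-moses module.)
report_tools() {
    missing_count=0
    for tool_name in "$@"; do
        if tool_path=$(command -v "$tool_name"); then
            printf 'found    %s (%s)\n' "$tool_name" "$tool_path"
        else
            printf 'missing  %s\n' "$tool_name"
            missing_count=$((missing_count + 1))
        fi
    done
    [ "$missing_count" -eq 0 ]
}

# Run this after `module load nlpl-moses`. The binary names below are
# assumptions; `module help nlpl-moses` lists what is actually installed.
report_tools moses mgiza fast_align ||
    echo "Some tools are not on PATH; see 'module help nlpl-moses'." >&2
```

The same function works for any other module on this page, since it only asks the shell where a command name resolves.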
Using the Efmaral module
- Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
- Load the most recent version of the Efmaral module:
module load nlpl-efmaral
- You can use the align.py script directly:
align.py ...
- You can use the efmaral module inside a Python3 script:
python3
>>> import efmaral
- You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:
cd $EFMARALPATH
python3 scripts/evaluate.py efmaral \
    3rdparty/data/test.eng.hin.wa \
    3rdparty/data/test.eng 3rdparty/data/test.hin \
    3rdparty/data/trial.eng 3rdparty/data/trial.hin
- The Efmaral module also contains eflomal. You can use the alignment scripts as follows:
align_eflomal.py ...
- You can also use the eflomal executable:
eflomal ...
- You can also use the eflomal module in a Python3 script:
python3
>>> import eflomal
- The atools executable (from fast_align) is also made available.
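fast_align, whose atools executable is included here, reads parallel text as one sentence pair per line, with source and target separated by " ||| ". The following is a minimal sketch for producing that format from two line-aligned plain-text files. The helper name to_fast_align and the sample file names are hypothetical, and the files are assumed to contain no tab characters. Whether the installed align.py and align_eflomal.py expect this joined format or separate source and target files is best checked in their --help output.

```shell
# to_fast_align SRC TGT: join two line-aligned files into the
# "source ||| target" format, one sentence pair per line.
# (Hypothetical helper; assumes the files contain no tab characters.)
to_fast_align() {
    paste "$1" "$2" | awk -F '\t' '{ print $1 " ||| " $2 }'
}

# Example with two tiny files created on the fly.
work_dir=$(mktemp -d)
printf 'the house\na small book\n' > "$work_dir/sample.eng"
printf 'das Haus\nein kleines Buch\n' > "$work_dir/sample.deu"
to_fast_align "$work_dir/sample.eng" "$work_dir/sample.deu" > "$work_dir/sample.eng-deu"
cat "$work_dir/sample.eng-deu"
```

Joining with paste keeps the two files in lockstep, so a mismatch in line counts shows up as pairs with an empty side rather than as silently shifted sentences.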
Using the OPUS Tools module
- Activate the NLPL module repository:
module use -a /projappl/nlpl/software/modules/etc # Puhti
module use -a /cluster/shared/nlpl/software/modules/etc # Saga
- Load the OPUS tools module:
module load nlpl-opus
- You can also load CWB, Uplug and Udpipe modules:
module load nlpl-cwb
module load nlpl-uplug
module load nlpl-udpipe
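If you routinely need all four modules, the individual load commands can be wrapped in a small loop that stops at the first failure. This is only a sketch in POSIX shell; the helper name load_modules is hypothetical, and it assumes the NLPL module repository has already been activated in the current session.

```shell
# load_modules NAME...: load each named module in turn and stop at the
# first one that fails. (Hypothetical helper, not provided by NLPL.)
load_modules() {
    for module_name in "$@"; do
        module load "$module_name" || {
            echo "could not load $module_name" >&2
            return 1
        }
    done
}

# Only attempt the load where the `module` command exists, e.g. on Puhti
# or Saga after the NLPL module repository has been activated.
if command -v module >/dev/null 2>&1; then
    load_modules nlpl-opus nlpl-cwb nlpl-uplug nlpl-udpipe ||
        echo "Not all modules could be loaded." >&2
fi
```

Stopping at the first failure makes it obvious which module name is unavailable on the cluster you are using.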
Contact:
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi