http://wiki.nlpl.eu/api.php?action=feedcontributions&user=Drobac&feedformat=atomNordic Language Processing Laboratory - User contributions [en]2024-03-29T09:20:21ZUser contributionsMediaWiki 1.31.10http://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1009Infrastructure/software/eosc2020-03-09T12:55:51Z<p>Drobac: /* Package producers, maintenance */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) comprises many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary widely, ranging from a handful of single-CPU jobs to<br />
thousands of (mildly) parallel or multi-GPU tasks; NLP research is, quite generally, both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)GPUs<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours and, increasingly, GPU hours)<br />
<br />
<br />
= Requirements from different perspectives =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
* Easily use NLP software (without having to install it themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers need a lot of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the Nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share it and make it possible to replicate research results (see Research group E below).<br />
* If possible, submit jobs to different clusters.<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production and maintenance of packages, we would like to have:<br />
* (Semi-)automatic documentation updates for the installed software (which package contains which tools, with version numbers).<br />
* All packages installed from the same installation template (or recipe), to make it easy to create new software packages and to simplify maintenance.<br />
* A highly modular setup: it should be possible to easily change the version of a specific piece of software in an existing environment package (see Researcher B below).<br />
* Packages that are easily portable to different clusters.<br />
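As one illustration of the template idea, a shared recipe could be parameterized by little more than a package name and a version; everything below (paths, variable names, the script itself) is a hypothetical sketch, not an existing NLPL script.<br />

```shell
#!/bin/sh
# Hypothetical installation recipe: the same script would install any
# pip-installable package as a module, varying only name and version.
PACKAGE=numpy
VERSION=1.16.3
PREFIX="modules/nlpl-$PACKAGE/$VERSION"

mkdir -p "$PREFIX"
# On a real cluster this would then run, for example:
#   pip install --prefix="$PREFIX" "$PACKAGE==$VERSION"
# followed by generating a matching modulefile.
echo "installed $PACKAGE $VERSION into $PREFIX"
```

With a recipe of this shape, the documentation bullet above also becomes easier: the package name and version are machine-readable at installation time.<br />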
<br />
== HPC ==<br />
<br />
CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating system's own RPM packages). Using Spack also for the machine learning frameworks and libraries would be a lot of work, as each version would need to be packaged for Spack manually.<br />
<br />
Machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (Miniconda3, to be exact). Most Python libraries are published on PyPI (the Python Package Index) and the newest versions can easily be installed via pip, which makes keeping the Conda environments up to date quite easy. It is also possible to ''freeze all the package versions to make an environment exactly reproducible''.<br />
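The freezing step can be sketched as follows; the package pins are illustrative, and the pip commands themselves are shown commented out since they need a live Python environment.<br />

```shell
# A frozen requirements file pins every package to an exact version
# (contents illustrative):
printf '%s\n' 'numpy==1.16.3' 'torch==1.0.0' 'tensorflow==1.13.1' \
    > requirements-frozen.txt

# On a real system the file is produced and consumed like this:
#   pip freeze > requirements-frozen.txt      # record the environment
#   pip install -r requirements-frozen.txt    # recreate it elsewhere

cat requirements-frozen.txt
```

Checking such a file into version control alongside the experiment code is what makes the environment reproducible years later.<br />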
<br />
The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily contain 50,000 files, making it quite slow to load on shared file systems such as Lustre; this is why the first import statement on Puhti always takes quite a long time.<br />
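The file-count problem is easy to inspect. The snippet below builds a tiny stand-in directory tree for the demo, but pointing the same <code>find</code> one-liner at a real Conda environment is how counts in the tens of thousands show up.<br />

```shell
# Build a small stand-in for a Conda environment tree (paths illustrative).
ENV_DIR=demo-env
mkdir -p "$ENV_DIR/bin" "$ENV_DIR/lib/python3.7/site-packages"
touch "$ENV_DIR/bin/python" \
      "$ENV_DIR/lib/python3.7/site-packages/numpy.py" \
      "$ENV_DIR/lib/python3.7/site-packages/torch.py"

# Count the files; each one costs metadata operations on Lustre at load time.
find "$ENV_DIR" -type f | wc -l
```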
<br />
CSC has been experimenting with Singularity containers as an alternative to, and possibly a future replacement for, the Conda environments backing CSC's installations of PyTorch, TensorFlow, and similar tools.<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model for neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub repository.<br />
The new extension requires a cutting-edge version of<br />
PyTorch plus some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets, using GPU<br />
jobs that run for about three days each.<br />
To compare baselines across various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL, the same development environment (modules or otherwise) is<br />
in place in Norway and Finland, and the same data is accessible from those servers.<br />
The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem when running PyTorch 1.0.0 in combination with NumPy 1.16.1<br />
(using the default OpenBLAS back-end),<br />
which renders convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
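In terms of <code>module load</code> incantations, the fix amounts to something like the following. The module names are hypothetical, and the <code>module</code> command itself is mocked here so that the snippet runs outside a cluster; on a real system it is provided by Lmod or Environment Modules.<br />

```shell
# Mock 'module' for illustration only; a real cluster supplies this command.
module() { echo "module $*"; }

# Before the fix: NumPy 1.16.1 alongside PyTorch 1.0.0.
module load nlpl-pytorch/1.0.0 nlpl-numpy/1.16.1 nlpl-gensim/3.7.1

# After the fix: only the NumPy version number changes; everything else
# in the environment stays exactly as it was.
module load nlpl-pytorch/1.0.0 nlpl-numpy/1.16.3 nlpl-gensim/3.7.1
```

This is the pay-off of the unbundled, per-package module design: one version number changes, nothing is reinstalled.<br />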
<br />
'''Student C''' is assigned to train a few models with a well-known NLP package, comparing<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the Nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published at the latest NLP conference.<br />
There is some experimental code on GitHub, but it requires a specific combination of<br />
Python packages, and the whole thing is implemented in Julia.<br />
Fortunately, NLPL already has most of the packages in place that would otherwise be difficult<br />
to compile on the ancient CentOS setup.<br />
After testing the code, the PhD fellow has some ideas for modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis, described in an accompanying paper.<br />
They want to ensure that the results are replicable and, therefore, want to publish the code,<br />
the data, and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of the lack of<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
about reduced transparency from the user's point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution, at least for current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-cased plain ASCII characters only, but upper-case letters may appear if necessary.<br />
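The layout convention can be made concrete with a small sketch; the component names below are illustrative, not actual NLPL data-set names.<br />

```shell
# Compose a standardized data-set path from its components.
activity=corpora
dataset=common-crawl-en
release=2019-11
path="$activity/$dataset/$release"

# Guard: warn when the path strays from lower-cased plain ASCII
# (plus digits and the usual separator characters).
case "$path" in
  *[!a-z0-9./_-]*) echo "warning: unexpected characters in $path" ;;
  *)               echo "ok: $path" ;;
esac
```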
<br />
Mirroring the data is currently done via cron jobs; the master copy of each data set lives on one specific server, depending on who is primarily responsible for maintaining the resource.<br />
<br />
Some data sets are available to external users without HPC access, for example the OPUS parallel data. This is currently done via Object Storage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability if results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1008Infrastructure/software/eosc2020-03-09T10:23:27Z<p>Drobac: /* Users */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements from different perspectives =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
* If possible, to submit jobs to different clusters.<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template (or recipe) - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
== HPC ==<br />
<br />
CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating systems' own rpm packages). Using Spack also for the machine learning frameworks and libraries would be quite a lot of work as each version would need to be packaged for Spack by us manually.<br />
<br />
The machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (miniconda3 to be exact). Most Python libraries are published on PyPI (Python package index) and the newest versions can be easily installed via pip, which makes keeping the Conda environments up-to-date quite easy. It is possible to ''freeze all the package versions to make it exactly reproducible''.<br />
<br />
The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily be 50,000 files, making it quite slow to load on shared file systems such as Lustre, that is why the first import statement in Puhti always takes quite a long time.<br />
<br />
CSC has been experimenting with Singularity containers as an alternative or possibly even replacing Conda-environments in the future for CSC's installations of PyTorch and TensorFlow etc.<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL has already most of the packages in place that would be difficult<br />
to compile on the ancient CentOS setup otherwise.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments, essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution at least at the moment with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly like this: 'nlpl-activity/dataset-name/optional-subfolder/release'. The release refers to the version or the date of the release. The path is preferably composed of lower-cased plain ASCII characters only but upper-case letters may appear if necessary.<br />
<br />
Mirroring the data is currently done via cron-jobs and the master copy of each data set is on one specific server depending on the main responsible person who maintains the resource.<br />
<br />
Some datasets are available for external users without HPC access, for example, OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structures as the mounted releases and is also in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability if results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1007Infrastructure/software/eosc2020-03-09T10:14:01Z<p>Drobac: /* Requirements */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements from different perspectives =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template (or recipe) - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
== HPC ==<br />
<br />
CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating system's own rpm packages). Using Spack also for the machine learning frameworks and libraries would be quite a lot of work, as we would need to package each version for Spack manually.<br />
<br />
The machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (Miniconda3, to be exact). Most Python libraries are published on PyPI (the Python Package Index) and the newest versions can easily be installed via pip, which makes keeping the Conda environments up to date quite easy. It is possible to ''freeze all the package versions to make it exactly reproducible''.<br />
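The freezing step mentioned above can be sketched as a small helper that pins the versions currently installed; this is a minimal illustration using the standard-library <code>importlib.metadata</code> API, not CSC's actual tooling, and the package names are only examples.<br />

```python
from importlib import metadata


def freeze(packages):
    """Return pip-style 'name==version' pins for the given packages."""
    pins = []
    for name in packages:
        try:
            pins.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            pins.append(f"# {name}: not installed")
    return pins


# Writing the pins to a requirements file makes the environment
# reconstructible later with 'pip install -r requirements.txt'.
print("\n".join(freeze(["pip", "setuptools"])))
```

In practice one would run <code>pip freeze</code> or <code>conda env export</code> directly; the point is that an exact, replayable record of every package version is cheap to produce.<br />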
<br />
The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily comprise 50,000 files, making it quite slow to load on shared file systems such as Lustre; this is why the first import statement on Puhti always takes quite a long time.<br />
<br />
CSC has been experimenting with Singularity containers as an alternative to, or possibly even a future replacement for, the Conda environments used for CSC's installations of PyTorch, TensorFlow, etc.<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires a cutting-edge version of PyTorch as well as some external<br />
libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL, the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running PyTorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
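In miniature, the modular setup Researcher B benefits from can be sketched as follows; the <code>nlpl-*</code> module names below are hypothetical stand-ins for whatever the actual module tree provides, and a real user would simply edit their <code>module load</code> line rather than run Python.<br />

```python
def swap_version(modules, name, new_version):
    """Replace the version of one module in a list of 'name/version'
    module specifications, leaving all other modules untouched.

    This mirrors the modular NLPL setup: upgrading NumPy means
    changing one version number; everything else stays as it was.
    """
    return [f"{name}/{new_version}" if m.split("/")[0] == name else m
            for m in modules]


# Hypothetical module names; actual NLPL module names may differ.
loaded = ["nlpl-gensim/3.7.1", "nlpl-numpy/1.16.1", "nlpl-pytorch/1.0.0"]
print(swap_version(loaded, "nlpl-numpy", "1.16.3"))
# → ['nlpl-gensim/3.7.1', 'nlpl-numpy/1.16.3', 'nlpl-pytorch/1.0.0']
```

The design choice this illustrates is that versions are addressed per add-on, not per monolithic bundle, so one component can be pinned or upgraded in isolation.<br />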
<br />
'''Student C''' is assigned to train a few models with a well-known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the Nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published at the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL already has most of the packages in place that would otherwise be<br />
difficult to compile on the ancient CentOS setup.<br />
After testing the code, the PhD Fellow has some ideas on modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis, described in a paper.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
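The kind of divergence described above is easy to surface in practice; the following sketch collects the properties along which two 'identical' pre-installed Pythons on different clusters typically differ (the function name is illustrative, not part of any NLPL tooling).<br />

```python
import platform


def python_fingerprint():
    """Collect interpreter properties along which 'the same' Python
    can differ across clusters: minor version, compiler, build."""
    return {
        "version": platform.python_version(),
        "compiler": platform.python_compiler(),
        "build": platform.python_build(),
        "implementation": platform.python_implementation(),
    }


# Comparing fingerprints collected on two clusters quickly reveals
# whether a pre-installed Python is actually identical on both.
for key, value in python_fingerprint().items():
    print(f"{key}: {value}")
```

Any mismatch in such a fingerprint is an argument for shipping the interpreter itself as part of the NLPL stack rather than relying on whatever each site pre-installs.<br />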
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution, at least with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-case plain ASCII characters only, but upper-case letters may appear if necessary.<br />
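The naming convention above can be checked mechanically; the sketch below is a hypothetical validator (the function name and example paths are invented for illustration and are not actual NLPL data sets).<br />

```python
import re

# One path component: ASCII letters, digits, '-', '_', '.'.
# Upper-case letters are tolerated but flagged, since per the
# convention they should appear only if necessary.
_COMPONENT = re.compile(r"^[A-Za-z0-9._-]+$")


def check_dataset_path(path):
    """Check an 'activity/dataset[/subfolder]/release' path against
    the NLPL naming convention; returns a list of warnings."""
    parts = path.strip("/").split("/")
    warnings = []
    if not 3 <= len(parts) <= 4:
        warnings.append("expected 3-4 components (activity/dataset[/subfolder]/release)")
    for part in parts:
        if not _COMPONENT.match(part):
            warnings.append(f"non-ASCII or illegal characters in '{part}'")
        elif part != part.lower():
            warnings.append(f"upper-case letters in '{part}' (use only if necessary)")
    return warnings


print(check_dataset_path("parallel/opus/subs/v2018"))  # → []
```

Running such a check as part of the mirroring jobs would keep the standardised view consistent across clusters without manual review.<br />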
<br />
Mirroring the data is currently done via cron jobs, and the master copy of each data set lives on one specific server, depending on which responsible person maintains the resource.<br />
<br />
Some data sets are available for external users without HPC access, for example, the OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is also kept in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make working with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template (or recipe) - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
== HPC ==<br />
<br />
CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating systems' own rpm packages). Using Spack also for the machine learning frameworks and libraries would be quite a lot of work as each version would need to be packaged for Spack by us manually.<br />
<br />
The machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (miniconda3 to be exact). Most Python libraries are published on PyPI (Python package index) and the newest versions can be easily installed via pip, which makes keeping the Conda environments up-to-date quite easy. It is possible to ''freeze all the package versions to make it exactly reproducible''.<br />
<br />
The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily be 50,000 files, making it quite slow to load on shared file systems such as Lustre, that is why the first import statement in Puhti always takes quite a long time.<br />
<br />
CSC has been experimenting with Singularity containers as an alternative or possibly even replacing Conda-environments in the future for CSC's installations of PyTorch and TensorFlow etc.<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL has already most of the packages in place that would be difficult<br />
to compile on the ancient CentOS setup otherwise.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments, essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution at least at the moment with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly like this: 'nlpl-activity/dataset-name/optional-subfolder/release'. The release refers to the version or the date of the release. The path is preferably composed of lower-cased plain ASCII characters only but upper-case letters may appear if necessary.<br />
<br />
Mirroring the data is currently done via cron-jobs and the master copy of each data set is on one specific server depending on the main responsible person who maintains the resource.<br />
<br />
Some datasets are available for external users without HPC access, for example, OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structures as the mounted releases and is also in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability if results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1005Infrastructure/software/eosc2020-03-09T10:12:12Z<p>Drobac: /* Package producers, maintenance */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template (or recipe) - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
== HPC ==<br />
<br />
CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating systems' own rpm packages). Using Spack also for the machine learning frameworks and libraries would be quite a lot of work as each version would need to be packaged for Spack by us manually.<br />
<br />
The machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (miniconda3 to be exact). Most Python libraries are published on PyPI (Python package index) and the newest versions can be easily installed via pip, which makes keeping the Conda environments up-to-date quite easy. It is possible to ''freeze all the package versions to make it exactly reproducible''.<br />
<br />
The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily be 50,000 files, making it quite slow to load on shared file systems such as Lustre, that is why the first import statement in Puhti always takes quite a long time.<br />
<br />
CSC has been experimenting with Singularity containers.<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub, but it requires a specific combination of<br />
Python packages, and the whole thing is implemented in Julia.<br />
Fortunately, NLPL already has most of the packages in place that would otherwise be difficult<br />
to compile on the ancient CentOS setup.<br />
After testing the code, the PhD fellow has some ideas for modifying the algorithm to test possible improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis, described in an accompanying paper.<br />
To ensure that their results are replicable, they want to publish the code,<br />
the data, and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components and then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and, as far as possible, provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that exactly the same<br />
versions of tools (and bundles) are available and, of course, that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
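One way to check such uniformity in practice is to fingerprint each system's stack and compare hashes across clusters; the probes below are illustrative, not an agreed NLPL convention:

```shell
# Sketch: build a per-cluster fingerprint of the toolchain; identical hashes
# on two clusters mean identical stacks (as far as the probes reach).
fingerprint () {
    {
        uname -sr                          # kernel / OS level
        python3 --version 2>&1 || true     # base interpreter, if present
        # extend with compiler, BLAS, and key package versions as needed
    } | sha256sum | cut -d' ' -f1
}
fingerprint
```

Running the same function on, say, Puhti and Saga and diffing the output would surface exactly the kind of minor-version drift described above.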
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part because of lacking<br />
container support on some of the target systems (notably Taito), in part out of concern<br />
about reduced transparency from the user's point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line.<br />
A mounted file system is, therefore, the preferred solution, at least for current workflows.<br />
NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view on the data with proper versioning for replication purposes.<br />
The structure within the root data folder is roughly <code>nlpl-activity/dataset-name/optional-subfolder/release</code>, where the release is the version or the date of the release.<br />
Paths are preferably composed of lower-case plain-ASCII characters only, although upper-case letters may appear where necessary.<br />
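A layout convention like this is easy to check mechanically; the sketch below (with made-up example paths) accepts only lower-case ASCII paths of the documented depth:

```shell
# Sketch: validate that a data-set path follows the documented layout
# nlpl-activity/dataset-name/optional-subfolder/release, i.e. three or four
# lower-case ASCII segments. The example paths are invented.
valid_path () {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9._-]+(/[a-z0-9._-]+){2,3}$'
}

valid_path "corpora/wikipedia/dump-of-2019" && echo ok
valid_path "Corpora/Wikipedia Dump"         || echo "needs renaming"
```

A check of this sort could run as part of the synchronisation jobs, catching naming violations before they propagate to the mirrors.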
<br />
Mirroring the data is currently done via cron jobs; the master copy of each data set lives on one specific server, depending on who is chiefly responsible for maintaining the resource.<br />
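Such a mirroring job might look roughly like the following crontab fragment; the host name and paths are invented for illustration:

```shell
# m h dom mon dow  command  (crontab on the mirroring cluster)
# Nightly one-way sync from the master copy; -a preserves permissions and
# timestamps, --delete keeps the mirror an exact replica of the master.
30 2 * * * rsync -a --delete nlpl@master.example.org:/data/nlpl/corpora/ /cluster/shared/nlpl/corpora/
```

Keeping the sync strictly one-way (master to mirrors) avoids conflicting edits and matches the single-master-per-data-set policy described above.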
<br />
Some data sets are available to external users without HPC access, for example the OPUS parallel data.<br />
This is currently done via Object Storage and cPouta at CSC; that collection follows the same release structure as the mounted releases and is kept in sync with the mounted data.<br />
<br />
Goals for NLPL in EOSC: better streamline the data maintenance and mirroring procedures; improve data access libraries and tools to make working with data sets more transparent. Replicability of results is important, unnecessary data copying and duplication should be avoided, and documentation is essential.</div>Drobac
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
<br />
== HPC ==<br />
<br />
From the HPC point of view, we need:<br />
...<br />
<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL has already most of the packages in place that would be difficult<br />
to compile on the ancient CentOS setup otherwise.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments, essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution at least at the moment with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly like this: 'nlpl-activity/dataset-name/optional-subfolder/release'. The release refers to the version or the date of the release. The path is preferably composed of lower-cased plain ASCII characters only but upper-case letters may appear if necessary.<br />
<br />
Mirroring the data is currently done via cron-jobs and the master copy of each data set is on one specific server depending on the main responsible person who maintains the resource.<br />
<br />
Some datasets are available for external users without HPC access, for example, OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structures as the mounted releases and is also in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability if results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1001Infrastructure/software/eosc2020-03-09T10:01:43Z<p>Drobac: /* Requirements */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
<br />
= Requirements =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
<br />
== HPC ==<br />
<br />
From the HPC point of view, we need:<br />
...<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL has already most of the packages in place that would be difficult<br />
to compile on the ancient CentOS setup otherwise.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments, essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is, therefore, the preferred solution at least at the moment with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly like this: 'nlpl-activity/dataset-name/optional-subfolder/release'. The release refers to the version or the date of the release. The path is preferably composed of lower-cased plain ASCII characters only but upper-case letters may appear if necessary.<br />
<br />
Mirroring the data is currently done via cron-jobs and the master copy of each data set is on one specific server depending on the main responsible person who maintains the resource.<br />
<br />
Some datasets are available for external users without HPC access, for example, OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structures as the mounted releases and is also in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability if results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=1000Infrastructure/software/eosc2020-03-09T10:01:11Z<p>Drobac: /* Example Use Cases */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
= Requirements =<br />
<br />
== Users ==<br />
<br />
Users would like to:<br />
<br />
* Easily use NLP software (don't need to install themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers have a need for lots of resources in a short time (see Researcher A below).<br />
* Possibly have shared courses among research labs in the nordic countries (see Student C below).<br />
* Use a specific combination of Python packages (see PhD Fellow D below).<br />
* Easily package the current environment setup, in order to share and make possible to easily replicate the research results (see Research group E below).<br />
<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
<br />
* (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).<br />
* All the packages installed with the same installation template - to easily create new software packages and for easier maintenance.<br />
* Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).<br />
<br />
<br />
== HPC ==<br />
<br />
From the HPC point of view, we need:<br />
...<br />
<br />
= Example Use Cases =<br />
<br />
'''Researcher A''' develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires the latest version with cutting-edge libraries of<br />
PyTorch and some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all facilities around and gets the results in-time and can submit the paper …<br />
<br />
'''Researcher B''' has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem in running Pytorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
'''Student C''' has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the nordic countries …<br />
<br />
'''PhD Fellow D''' wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL already has most of the packages in place that would otherwise be difficult<br />
to compile on the ancient CentOS setup.<br />
After testing the code, the PhD fellow has some ideas for modifying the algorithm to test some improvements …<br />
<br />
'''Research group E''' publishes a new model for sentiment analysis, described in an accompanying paper.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that the exact same<br />
versions of tools (and bundles) are available, and of course that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
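One quick way to see such differences is to compare the exact interpreter build on each system; identical version numbers can still hide different compilers or libraries:<br />

```shell
# Print the Python version together with its full build string (build date,
# compiler); running this on each cluster and diffing the output exposes
# minor-version and toolchain differences between 'the same' Python.
python3 -VV
```
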
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of a lack of<br />
support on some of the target systems (notably Taito), in part because of concerns<br />
about reduced transparency from the user's point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
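As a sketch only, such a full-environment container could be described by a minimal Singularity/Apptainer definition file; the base image, paths, and module tree below are assumptions, not an existing NLPL artifact:<br />

```
# nlpl-env.def -- hypothetical definition for a full NLPL environment
# container; all names and paths are illustrative.
Bootstrap: docker
From: centos:7

%post
    # Environment Modules inside the container, so the usual
    # 'module load' workflow keeps working for users.
    yum -y install environment-modules

%files
    # Hypothetical location of the NLPL software and module tree.
    /cluster/shared/nlpl /nlpl

%environment
    export MODULEPATH=/nlpl/modules:$MODULEPATH
```

Users would then run <code>module load</code> inside the container exactly as on the host clusters.<br />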
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least for current workflows. NLPL currently synchronises data sets across the Nordic HPC clusters, providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to a version number or a release date. Paths are preferably composed of lower-case plain ASCII characters only, although upper-case letters may appear where necessary.<br />
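A small, self-contained illustration of this path convention; the activity and dataset names are invented for the example:<br />

```shell
#!/bin/sh
# Build a dataset path following the NLPL layout convention
# nlpl-activity/dataset-name/optional-subfolder/release
# (the concrete names below are made up for this sketch).
activity=corpora
dataset=wikipedia
subfolder=en
release=2019-12   # a release is a version number or a release date
path="$activity/$dataset/$subfolder/$release"
echo "$path"
# The convention prefers lower-case plain ASCII in paths; flag anything else.
case "$path" in
  *[!a-z0-9./_-]*) echo "warning: unexpected characters in path" ;;
  *)               echo "path follows the lower-case ASCII convention" ;;
esac
```
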
<br />
Mirroring the data is currently done via cron jobs; the master copy of each data set lives on one specific server, depending on which maintainer is responsible for that resource.<br />
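As a sketch (host names, paths, and schedule are invented for illustration), one such mirroring job could be a crontab entry along these lines:<br />

```
# Hypothetical crontab entry on a mirror host: every night at 03:15, pull
# the master copy of one data set; all names and paths are invented.
15 3 * * * rsync -a --delete master.example.org:/nlpl/data/corpora/ /nlpl/data/corpora/
```
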
<br />
Some data sets are available to external users without HPC access, for example the OPUS parallel data. This is currently done via Object Storage on cPouta at CSC. That collection follows the same release structure as the mounted data and is kept in sync with it.<br />
<br />
Goals for NLPL in EOSC: better streamline the data maintenance and mirroring procedures, and improve the data access libraries and tools to make working with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>
<hr />
<div>Users would like to:<br />
<br />
* Easily use NLP software (without needing to install it themselves).<br />
* Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers need lots of resources in a short time (see Researcher A).<br />
* Possibly share courses among research labs in the Nordic countries (see Student C).<br />
* Use a specific combination of Python packages (see PhD Fellow D).<br />
* Easily package the current environment setup, in order to share it and make it easy to replicate research results (see Research group E).<br />
<br />
<br />
== Package producers, maintenance ==<br />
<br />
For easier production of packages and maintenance, we would like to have:<br />
<br />
* (Semi-)automatic documentation updates for installed software (which package contains which tools, with version numbers).<br />
* All packages installed from the same installation template - to make it easy to create new software packages and to ease maintenance.<br />
* Highly modular setup - it should be possible to easily swap the version of a specific piece of software in an existing environment package (see Researcher B).<br />
<br />
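To illustrate what such an installation template might produce, here is a hedged sketch of a generated Lmod-style modulefile; the module name, version, and installation prefix are invented for illustration, not actual NLPL paths.<br />

```shell
# Write a minimal Lmod-style modulefile, as an installation template
# might generate it; name, version, and prefix are example values.
mkdir -p modulefiles/nlpl-numpy
cat > modulefiles/nlpl-numpy/1.16.3.lua <<'EOF'
help([[NumPy 1.16.3 for Python 3.7 (example NLPL module)]])
whatis("Name: nlpl-numpy")
whatis("Version: 1.16.3")
prepend_path("PYTHONPATH", "/cluster/shared/nlpl/numpy/1.16.3/lib")
EOF
cat modulefiles/nlpl-numpy/1.16.3.lua
```

Generating every modulefile from one template would keep the catalogue uniform and make (semi-)automatic documentation a matter of harvesting the <code>whatis</code> fields.<br />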
<br />
== HPC ==<br />
<br />
From the HPC point of view, we need:<br />
...<br />
<br />
<br />
<br />
<br />
= Example Use Cases =<br />
<br />
[[Researcher A]] develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires a cutting-edge version of PyTorch plus some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …<br />
<br />
Researcher B has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem when running PyTorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backward-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
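For instance, the fix could amount to changing a single version number in the job script; the module names and versions below are illustrative, not the exact NLPL module names.<br />

```shell
# Before the fix, the job script loaded the problematic release:
#   module load nlpl-numpy/1.16.1
# After the fix, only the NumPy version number changes; every other
# module in the stack stays exactly as it was.
cat > job-modules.sh <<'EOF'
module load nlpl-python/3.7
module load nlpl-numpy/1.16.3
module load nlpl-pytorch/1.0.0
module load nlpl-gensim/3.7.1
EOF
cat job-modules.sh
```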
<br />
Student C has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the Nordic countries …<br />
<br />
PhD Fellow D wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL already has most of the packages in place that would otherwise be difficult<br />
to compile on the ancient CentOS setup.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
Research group E publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that the exact same<br />
versions of tools (and bundles) are available and, of course, that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), in part because of a concern<br />
for reduced transparency from the user point of view.<br />
Also, containerizing individual software modules severely challenges modularization:<br />
There is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
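A hedged sketch of what such a full-environment container recipe could look like, as a Singularity definition file; the base image and the paths are assumptions for illustration only.<br />

```shell
# Write an example Singularity definition that would bake the whole
# NLPL module tree into one image; base image and paths are invented.
cat > nlpl-env.def <<'EOF'
Bootstrap: docker
From: centos:7

%files
    /cluster/shared/nlpl /nlpl

%environment
    export MODULEPATH=/nlpl/software/modulefiles:$MODULEPATH
EOF
cat nlpl-env.def
```

Inside such an image, users would keep the familiar <code>module load</code> workflow, while the container pins the operating system and all dependencies for long-term reproducibility.<br />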
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-cased plain ASCII characters only, but upper-case letters may appear where necessary.<br />
<br />
Mirroring the data is currently done via cron jobs; the master copy of each data set lives on one specific server, depending on who is primarily responsible for maintaining the resource.<br />
<br />
Some datasets are available for external users without HPC access, for example the OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.<br />
<br />
Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&diff=995Infrastructure/software/eosc2020-03-09T09:22:50Z<p>Drobac: /* Data */</p>
<hr />
<div>= Background =<br />
<br />
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.<br />
<br />
The NLPL research community (in late 2019) is comprised of many dozens of active users, ranging from<br />
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.<br />
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to<br />
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.<br />
<br />
Typical types of data include potentially large<br />
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of<br />
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),<br />
pre-computed representations of<br />
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]<br />
(so-called word embeddings), or more specialized training and evaluation sets<br />
for supervised machine learning tasks like parsing or machine translation.<br />
<br />
After some two years of activity in the NLPL project, its community has collectively<br />
installed some 80 shared software modules and around eight terabytes of primary source data.<br />
In May 2019, <code>module load</code> operations for NLPL-maintained software accounted<br />
for close to five percent of the total on the Norwegian Abel supercluster.<br />
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no<br />
small task; duplication of data, software, and effort should be minimized.<br />
Further, reproducibility and replicability play an increasingly important role in NLP research.<br />
Other researchers must be enabled to re-run the same experiment (and obtain the same results),<br />
ideally also several years after the original publication.<br />
<br />
= What is an NLPL User? =<br />
<br />
* Developer of NLP resources and tools (not an end-user of such tools)<br />
* Student (MSc or PhD) who learns to develop NLP tools and algorithms<br />
* Mostly runs on superclusters, increasingly wants (multi-)gpus<br />
<br />
= What does an NLPL User Need? =<br />
<br />
* Development environment (essential libraries and software packages)<br />
* Data for training (heavy machine learning), development, tuning, testing<br />
* Computing resources (CPU hours, more and more GPU hours)<br />
<br />
= Example Use Cases =<br />
<br />
Researcher A develops a new model of neural machine translation by<br />
implementing an extension to OpenNMT-py<br />
(a library for neural sequence-to-sequence models under heavy development).<br />
The implementation happens in a branch of the official OpenNMT GitHub package.<br />
The new extension requires a cutting-edge version of PyTorch plus some external libraries from Facebook research and a lesser-known NLP lab in China.<br />
The new code needs to be tested by training on standard data sets using GPU<br />
jobs that run for about 3 days per job.<br />
To compare baselines with various versions of the code and different training parameters,<br />
the researcher needs to run 20 parallel training jobs.<br />
Evaluation is done using standard benchmark test sets.<br />
The deadline for the next paper is in 10 days.<br />
Thanks to NLPL the same development environment (modules or otherwise) is<br />
in place in Norway and Finland and the same data is also accessible from those servers.<br />
The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …<br />
<br />
Researcher B has been working on developing and fine-tuning their document classification<br />
system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim,<br />
NumPy, SciPy, Keras, and TensorFlow).<br />
As they augment their architecture with a character-level convolutional layer, they<br />
stumble into a known problem when running PyTorch 1.0.0 in combination with NumPy 1.16.1<br />
(when using the default OpenBLAS back-end),<br />
rendering convolutions about twenty times slower than they should be.<br />
They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because<br />
it introduces some changes that are not backwards-compatible with the current<br />
Gensim release.<br />
StackOverflow suggests upgrading NumPy to release 1.16.3, while keeping everything<br />
else unchanged.<br />
NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows<br />
Researcher B to just change one version number in their <code>module load</code><br />
incantation.<br />
<br />
Student C has the assignment to train a few models with a known NLP package to compare<br />
different settings of training parameters and approaches to data processing.<br />
For data processing, the student needs to modify some existing code.<br />
The course could be shared among research labs in the Nordic countries …<br />
<br />
PhD Fellow D wants to test a cutting-edge method that was published in the latest NLP conference.<br />
There is some experimental code on GitHub but it requires some specific combination of<br />
Python packages and the whole thing is implemented in Julia.<br />
Fortunately, NLPL has already most of the packages in place that would be difficult<br />
to compile on the ancient CentOS setup otherwise.<br />
After testing the code, the PhD has some ideas on modifying the algorithm to test some improvements …<br />
<br />
Research group E publishes a new model for sentiment analysis and a paper describes it.<br />
They want to ensure that the results are replicable and, therefore, they want to publish the code,<br />
the data and the exact setup.<br />
Maybe they could create a containerized distribution?<br />
The NLPL environment tools make it relatively straightforward to package this up …<br />
<br />
= Software =<br />
<br />
Relevant software modules comprise general-purpose run-time environments like Java and Python,<br />
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of<br />
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.<br />
NLPL users typically ‘mix and match’ several of these components, to then build their own<br />
code on top.<br />
They will often require specific versions of individual modules, sometimes for good reasons.<br />
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests<br />
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with<br />
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or<br />
TensorFlow.<br />
<br />
For compatibility with third-party code and for reproducibility, users should largely<br />
be free (within reason) to pick the module versions they (believe they) require, modules must<br />
not change once installed (and announced), and historic or older module versions should<br />
remain functional over time, ideally many years into the future.<br />
The NLPL approach to meeting these demands has been to<br />
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]<br />
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)<br />
as individual modules and inasmuch as possible provide each module for multiple base language<br />
versions.<br />
Abstractly, this design appears adequate and scalable, but module installation needs to be<br />
automated further, uniformity across different computing environments improved, and users<br />
better guided in navigating the resulting (large) space of only partially interoperable<br />
modules.<br />
<br />
Uniformity across different computing environments essentially means that the exact same<br />
versions of tools (and bundles) are available and, of course, that they behave the same on<br />
all systems.<br />
To accomplish this goal, it may ultimately be necessary to build the complete software<br />
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)<br />
in the NLPL modules collection.<br />
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely<br />
that Python installations will differ (in minor versions, compiler and library versions,<br />
optional add-on components, and such) across different systems.<br />
<br />
= Containerization =<br />
<br />
So far, NLPL has shied away from using containers, in part simply because of lacking<br />
support on some of the target systems (notably Taito), and in part because of concerns<br />
about reduced transparency from the user's point of view.<br />
Containerizing individual software modules also severely challenges modularization:<br />
there is no straightforward way to ‘mix and match’ multiple containers into a uniform<br />
process environment.<br />
<br />
However, provisioning the ''full NLPL'' software (and possibly data) environment inside<br />
a container may offer some benefits, for example compatibility with cloud environments,<br />
increased uniformity across different systems, and potentially longer-term reproducibility.<br />
On this view, modularization would obtain within the container, just as it does in the<br />
current environments on, for example, Abel, Puhti, Saga, and Taito.<br />
<br />
= Data =<br />
<br />
For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view on data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. The path is preferably composed of lower-cased plain ASCII characters only, but upper-case letters may appear if necessary.<br />
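The path convention can be made concrete with a small shell sketch (the activity, dataset, and release names below are illustrative placeholders, not actual NLPL data sets):

```shell
# Compose a dataset path following the convention
#   <root>/<nlpl-activity>/<dataset-name>[/<optional-subfolder>]/<release>
# where <release> is a version number or a release date.
root=/projects/nlpl/data        # root folder; differs per cluster
activity=translation            # illustrative activity name
dataset=example-corpus          # illustrative dataset name
release=2020-03-09              # release date used as version
path="$root/$activity/$dataset/$release"
echo "$path"                    # -> /projects/nlpl/data/translation/example-corpus/2020-03-09
```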
<br />
Mirroring the data is currently done via cron jobs; the master copy of each data set resides on one specific server, depending on which person is mainly responsible for maintaining the resource.<br />
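A mirroring job of this kind could look roughly like the following crontab entry (the host name and paths are assumptions for illustration; the actual cron set-up is not documented here):

```shell
# m h dom mon dow  command  -- illustrative crontab entry on the master server:
# nightly at 03:15, push the data tree to a mirror cluster, preserving
# attributes and deleting files that were removed from the master copy.
15 3 * * *  rsync -a --delete /projects/nlpl/data/ mirror.example.org:/cluster/shared/nlpl/data/
```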
<br />
Some data sets are available to external users without HPC access, for example the OPUS parallel data. This is currently done via Object Storage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.<br />
<br />
Goals for NLPL in EOSC: better streamline the data maintenance and mirroring procedures, and improve data access libraries and tools to make working with the data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.<br />
<br />
<br />
= Summary =</div>Drobachttp://wiki.nlpl.eu/index.php?title=Corpora/OPUS&diff=991Corpora/OPUS2020-03-02T12:52:15Z<p>Drobac: /* http://opus.nlpl.eu */</p>
<hr />
<div><br />
== http://opus.nlpl.eu ==<br />
<br />
OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland and the core of the data is also available from sigma2.<br />
<br />
For instructions on how to access the data and use the tools, check:<br />
* Information for [[#NLPL Users|NLPL Users]]<br />
<br />
More detailed information can be found on the [http://opus.nlpl.eu/trac OPUS Wiki]:<br />
* Information about the [http://opus.nlpl.eu/trac#WebAPI OPUS API] for finding resources<br />
* Information about [http://opus.nlpl.eu/trac/wiki/DataFormats data formats]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/Tools tools]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/QueryInterfaces on-line interfaces]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/WordAlign word alignment] and the [http://opus.nlpl.eu/trac/wiki/WordAlignDB alignment lexicon]<br />
<br />
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php<br />
<br />
Contact: [http://blogs.helsinki.fi/tiedeman/ Jörg Tiedemann] via e-mail - firstname.lastname at helsinki.fi (first name without dots)<br />
<br />
<br />
=== NLPL Users ===<br />
<br />
The OPUS corpus is now hosted at [https://www.csc.fi/ CSC], the national scientific infrastructure provider of Finland, and the resources are directly available to users of its services. The OPUS server runs in that environment, but the data sets and tools are also directly available from the '''puhti''' shell. The core data is also available on the Norwegian cluster '''saga''' provided by [https://www.sigma2.no/ sigma2].<br />
<br />
If you have access to these systems, you can access the data directly from the file system:<br />
<br />
<pre>on puhti: /projappl/nlpl/data/OPUS/<br />
on saga: /projects/nlpl/data/OPUS/ (only raw XML data)</pre><br />
<br />
On both systems, you can also use tools that are packaged for working with the data (and other NLPL-related activities). The basic tools for working with OPUS data can be loaded with the module <code>nlpl-opus</code>:<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the OPUS module:<br />
<pre>module load nlpl-opus</pre><br />
</li><br />
</ul><br />
<br />
With this, you will have access to essential tools that make it easier to read and process the data sets.</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=984Translation/home2020-03-02T10:50:06Z<p>Drobac: /* Tools for processing parallel corpora (OPUS tools) */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module. [[#Using the OPUS Tools module|Usage notes below.]]<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Udpipe''' is installed in the <code>nlpl-udpipe</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
</ul><br />
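Since the project root differs between the two clusters, scripts intended to run on both can resolve <code>$NLPL</code> with a small helper (the function is an illustrative sketch, not part of any NLPL module; the paths are the ones given above):

```shell
# Map a cluster name to its NLPL project root.
nlpl_root () {
  case "$1" in
    puhti) echo /projappl/nlpl ;;
    saga)  echo /cluster/shared/nlpl ;;
  esac
}
NLPL=$(nlpl_root puhti)
echo "$NLPL/data/translation/iwslt17"   # IWSLT17 parallel data on Puhti
```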
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlprc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
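Aligner input is plain text with one sentence pair per line; a commonly used form is the fast_align-style joint file with source and target separated by <code>|||</code> (verify the accepted input formats with <code>align.py --help</code> on the cluster). Such a file can be built from two sentence-aligned plain-text files like this (the sentences are toy placeholders):

```shell
# Build a fast_align-style joint file ("source ||| target") from two
# sentence-aligned plain-text files (toy data for illustration):
printf 'a small house\nthe dog\n' > corpus.src
printf 'ein kleines Haus\nder Hund\n' > corpus.trg
paste corpus.src corpus.trg | awk -F'\t' '{print $1" ||| "$2}' > corpus.joint
head -1 corpus.joint    # -> a small house ||| ein kleines Haus
```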
<br />
= Using the OPUS Tools module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the OPUS tools module:<br />
<pre><br />
module load nlpl-opus<br />
</pre><br />
</li><br />
<li>You can also load CWB, Uplug and Udpipe modules:<br />
<pre>module load nlpl-cwb</pre><br />
<pre>module load nlpl-uplug</pre><br />
<pre>module load nlpl-udpipe</pre><br />
</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=983Translation/home2020-03-02T10:19:40Z<p>Drobac: /* Available software and data */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are commissioned on the Finnish Puhti and on the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline with word alignment tools GIZA++, MGIZA, fast_align, with SALM (release 4.0) is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module. [[#Using the OPUS Tools module|Usage notes below.]]<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
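Since the project root differs between the two clusters, a small helper can resolve <code>$NLPL</code> before building dataset paths. This is a hedged sketch: the two roots are the ones documented above, but the <code>puhti</code>/<code>saga</code> name patterns are assumptions, not something this page specifies.

```shell
# Map a cluster name to its NLPL project root (roots taken from this page;
# the "puhti"/"saga" name patterns are assumptions).
nlpl_root() {
    case "$1" in
        puhti*) echo /projappl/nlpl ;;
        saga*)  echo /cluster/shared/nlpl ;;
        *)      echo "unknown cluster: $1" >&2; return 1 ;;
    esac
}

nlpl_root puhti   # -> /projappl/nlpl
nlpl_root saga    # -> /cluster/shared/nlpl
```

A dataset path is then e.g. <code>"$(nlpl_root puhti)/data/translation/iwslt17"</code>.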
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
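The steps above can be collected into one job script. The sketch below only syntax-checks the result, since the module tree exists only on Puhti/Saga; the decoding call (<code>moses -f moses.ini</code>) assumes a <code>moses.ini</code> and input file prepared by following the tutorial, so the file names are placeholders.

```shell
# Write a minimal decoding script combining the module setup from this section
# with a Moses decoder call; moses.ini and input.src are assumed to exist on
# the cluster, so we only check the script's shell syntax here.
cat > moses_decode.sh <<'EOF'
#!/bin/sh
module use -a /projappl/nlpl/software/modules/etc   # Puhti
module load nlpl-moses/4.0-a89691f
moses -f moses.ini < input.src > output.trg
EOF
sh -n moses_decode.sh && echo "syntax OK"
```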
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
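As a worked example of combining the pieces above, the sketch below writes (and syntax-checks) a script that aligns a corpus in both directions with eflomal and symmetrizes the links with <code>atools</code>. The option names follow the upstream eflomal and fast_align documentation and should be treated as assumptions; verify them with <code>align_eflomal.py --help</code> on the cluster. All file names are placeholders.

```shell
# Hypothetical align-and-symmetrize pipeline; flag names are assumptions
# based on the upstream eflomal/fast_align docs, and the corpus files are
# made up, so we only syntax-check the script instead of running it.
cat > align_job.sh <<'EOF'
#!/bin/sh
module use -a /projappl/nlpl/software/modules/etc   # Puhti
module load nlpl-efmaral/0.1_20191218
# Forward and reverse alignments:
align_eflomal.py -s corpus.src -t corpus.trg -f fwd.align -r rev.align
# Symmetrize with atools (from fast_align):
atools -i fwd.align -j rev.align -c grow-diag-final-and > sym.align
EOF
sh -n align_job.sh && echo "syntax OK"
```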
<br />
= Using the OPUS Tools module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the OPUS tools module:<br />
<pre><br />
module load nlpl-opus<br />
</pre><br />
</li><br />
<li>You can also load CWB, Uplug and Udpipe modules:<br />
<pre>module load nlpl-cwb</pre><br />
<pre>module load nlpl-uplug</pre><br />
<pre>module load nlpl-udpipe</pre><br />
</li><br />
</ul><br />
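The module loads above can likewise be bundled into a reusable environment script. As with the other sketches, the modules only resolve on Puhti/Saga, so the example just checks the script's shell syntax.

```shell
# Collect the OPUS-related module loads into one script; we syntax-check it
# here because the module tree is only available on the clusters.
cat > opus_env.sh <<'EOF'
#!/bin/sh
module use -a /projappl/nlpl/software/modules/etc   # Puhti
module load nlpl-opus
module load nlpl-cwb
module load nlpl-uplug
module load nlpl-udpipe
EOF
sh -n opus_env.sh && echo "syntax OK"
```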
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=982Translation/home2020-03-02T10:12:51Z<p>Drobac: /* Using the OPUS Tools module */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), including the word alignment tools GIZA++, MGIZA and fast_align as well as SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
= Using the OPUS Tools module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the OPUS tools module:<br />
<pre><br />
module load nlpl-opus<br />
</pre><br />
</li><br />
<li>You can also load CWB, Uplug and Udpipe modules:<br />
<pre>module load nlpl-cwb</pre><br />
<pre>module load nlpl-uplug</pre><br />
<pre>module load nlpl-udpipe</pre><br />
</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=981Translation/home2020-03-02T10:12:26Z<p>Drobac: /* Using the OPUS Tools module */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), including the word alignment tools GIZA++, MGIZA and fast_align as well as SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
= Using the OPUS Tools module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the OPUS tools module:<br />
<pre><br />
module load nlpl-opus<br />
</pre><br />
</li><br />
<li>You can also load Uplug, Udpipe and CWB modules:<br />
<pre>module load nlpl-cwb</pre><br />
<pre>module load nlpl-uplug</pre><br />
<pre>module load nlpl-udpipe</pre><br />
</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=980Translation/home2020-03-02T10:09:19Z<p>Drobac: /* Using the Efmaral module */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), including the word alignment tools GIZA++, MGIZA and fast_align as well as SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. using the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
= Using the OPUS Tools module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=979Translation/home2020-03-02T10:07:42Z<p>Drobac: /* Tools for processing parallel corpora (OPUS tools) */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are installed on the Finnish Puhti and the Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), including the word alignment tools GIZA++, MGIZA and fast_align as well as SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal Pytorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using system-wide Pytorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independently of the toolkit. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
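The cluster-specific <code>$NLPL</code> roots above can be captured in a small shell helper; a minimal sketch (the <code>nlpl_root</code> function name is hypothetical, the paths are the ones listed above):

```shell
# Resolve the $NLPL dataset root for a given cluster.
# Paths taken from the text above; the helper name is made up.
nlpl_root() {
  case "$1" in
    puhti) echo /projappl/nlpl ;;
    saga)  echo /cluster/shared/nlpl ;;
    *)     return 1 ;;
  esac
}

# Build the full path to one of the datasets listed above:
echo "$(nlpl_root puhti)/data/translation/iwslt17"
```

Run on Puhti, this prints the IWSLT17 directory under <code>/projappl/nlpl</code>; on Saga, substitute <code>saga</code> to get the <code>/cluster/shared/nlpl</code> prefix.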
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. by following the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
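The steps above can be combined into a single batch script; a minimal sketch of a decoding job with the loaded module, assuming an already trained model (the resource flags, <code>moses.ini</code> and the input/output file names are hypothetical placeholders):

```shell
#!/bin/bash
#SBATCH --time=00:15:00 --mem=4G --cpus-per-task=4
# Hypothetical Slurm job script; add your --account/--partition as required.

# Activate the NLPL module repository (pick the line for your cluster):
module use -a /projappl/nlpl/software/modules/etc          # Puhti
# module use -a /cluster/shared/nlpl/software/modules/etc  # Saga

module load nlpl-moses/4.0-a89691f

# Decode a tokenized input file with a previously trained model.
moses -f moses.ini -threads 4 < input.tok.src > output.tok.tgt
```

This is a job-script fragment that only runs on the clusters themselves; submit it with <code>sbatch</code> after adjusting paths and resources.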
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
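The aligners above consume parallel text; a minimal sketch of preparing a toy corpus in the joint <code>source ||| target</code> format used by fast_align-style tools (the file name and sentences are made up; consult <code>align.py --help</code> for the exact flags to feed such a file to the aligner):

```shell
# Build a tiny parallel corpus in the joint "source ||| target" format.
cat > corpus.fa <<'EOF'
das haus ist klein ||| the house is small
das haus ist groß ||| the house is big
EOF

# Sanity check: count the aligned sentence pairs (one per line).
wc -l < corpus.fa
```

Each line holds one sentence pair, with tokens separated by spaces and the two sides separated by <code>&nbsp;|||&nbsp;</code>; the resulting alignments can then be post-processed with the <code>atools</code> executable mentioned above.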
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=978Translation/home2020-03-02T10:07:18Z<p>Drobac: /* Tools for processing parallel corpora (OPUS tools) */</p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are provisioned on the Finnish Puhti and Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), with the word alignment tools GIZA++, MGIZA and fast_align and with SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal PyTorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using the system-wide PyTorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independent of the toolkit used. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of OPUS tools is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. by following the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Translation/home&diff=977Translation/home2020-03-02T10:06:50Z<p>Drobac: </p>
<hr />
<div>= Background =<br />
<br />
[[Translation/taito_abel|Translation activity on the Taito and Abel servers (outdated)]]<br />
<br />
This page is currently being updated (YS 16.12.2019)<br />
<br />
An experimentation environment for Statistical and Neural Machine Translation (SMT and NMT)<br />
is maintained for NLPL under the coordination of the University of Helsinki (UoH).<br />
The software and data are provisioned on the Finnish Puhti and Norwegian Saga superclusters.<br />
<br />
= Available software and data =<br />
<br />
=== Statistical machine translation and word alignment ===<br />
<br />
* The '''Moses''' SMT pipeline (release 4.0), with the word alignment tools GIZA++, MGIZA and fast_align and with SALM, is installed on Puhti and Saga: <code>nlpl-moses/4.0-a89691f</code> ([[#Using the Moses module|usage notes below]])<br />
* The word alignment tools '''efmaral and eflomal''' are installed on Puhti and Saga in the nlpl-efmaral module: <code>nlpl-efmaral/0.1_20191218</code> ([[#Using the Efmaral module|usage notes below]])<br />
<br />
=== Neural machine translation ===<br />
<br />
* '''Marian-NMT''' is installed on Puhti and Saga as <code>nlpl-marian-nmt/1.8.0-eba7aed</code>. [[#Using the Marian-NMT module|Usage notes below.]]<br />
* '''OpenNMT-py''' is installed on Saga using NLPL-internal PyTorch: <code>nlpl-opennmt-py/1.0.0rc2/3.7</code>.<br />
* '''OpenNMT-py''' is installed on Puhti using the system-wide PyTorch: <code>nlpl-opennmt-py/1.0.0</code>.<br />
<br />
=== General scripts for machine translation ===<br />
<br />
* The '''nlpl-mttools''' module provides a series of preprocessing and evaluation scripts useful for any kind of machine translation research, independent of the toolkit used. It is installed on Puhti and Saga: <code>nlpl-mttools/20191218</code>. See [[Translation/mttools|the mttools page]] for further details.<br />
<br />
=== Tools for processing parallel corpora (OPUS tools) ===<br />
* The bundle of '''OPUS tools''' is installed on Puhti and Saga in the <code>nlpl-opus</code> module.<br />
* '''Uplug''' is installed in the <code>nlpl-uplug</code> module.<br />
* '''Corpus Work Bench''' is installed in the <code>nlpl-cwb</code> module.<br />
<br />
=== Datasets ===<br />
<br />
On Puhti, the <code>$NLPL</code> project directory is located at <code>/projappl/nlpl</code>. On Saga, the <code>$NLPL</code> project directory is located at <code>/cluster/shared/nlpl/</code>.<br />
<br />
<ul><br />
<li> IWSLT17 parallel data (0.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt17</pre><br />
</li><br />
<li> WMT17 news task parallel data (16G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news</pre><br />
</li><br />
<li> WMT17 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (5G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt17news_helsinki</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) parallel data (0.9G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18</pre><br />
</li><br />
<li> IWSLT18 (low-resource Basque-to-English task) preprocessed data from the Helsinki submission, with additional synthetic training data (2.6G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/iwslt18_helsinki</pre><br />
</li><br />
<li> WMT18 news task parallel data (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news</pre><br />
</li><br />
<li> WMT18 news task data preprocessed (tokenized, truecased and BPE-encoded) for the Helsinki submissions (17G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt18news_helsinki</pre><br />
</li><br />
<li> WMT19 news task data (German-English and Finnish-English), consisting of cleaned parallel data and backtranslations used in the Helsinki submissions (28G, on Puhti and Saga):<br/><br />
<pre>$NLPL/data/translation/wmt19news_helsinki</pre><br />
</li><br />
</ul><br />
<br />
=== Models ===<br />
<br />
See [[Translation/models|this page]] for details.<br />
<br />
= Using the Moses module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Moses module:<br />
<pre>module load nlpl-moses/4.0-a89691f</pre><br />
</li><br />
<li>Start using Moses, e.g. by following the tutorial at http://statmt.org/moses/</li><br />
<li>The module contains the standard installation as described at http://www.statmt.org/moses/?n=Development.GetStarted:<br />
<ul><br />
<li>cmph, xmlrpc</li><br />
<li>with-mm</li><br />
<li>max-kenlm-order 10</li><br />
<li>max-factors 7</li><br />
<li>SALM + filter-pt</li><br />
</ul></li><br />
<li>For word alignment, you can use GIZA++, MGIZA and fast_align. (The word alignment tools efmaral and eflomal are part of a [[#Using the Efmaral module|separate module]].)<br/>If you need to specify absolute paths in your scripts, you can find them on the help page of the module:<br />
<pre>module help nlpl-moses/4.0-a89691f</pre><br />
</li><br />
</ul><br />
<br />
= Using the Efmaral module =<br />
<br />
<ul><br />
<li>Activate the NLPL module repository:<br />
<pre>module use -a /projappl/nlpl/software/modules/etc # Puhti<br />
module use -a /cluster/shared/nlpl/software/modules/etc # Saga</pre><br />
</li><br />
<li>Load the Efmaral module:<br />
<pre><br />
module load nlpl-efmaral/0.1_20191218<br />
</pre><br />
</li><br />
<li>You can use the align.py script directly:<br />
<pre>align.py ...</pre><br />
</li><br />
<li>You can use the efmaral module inside a Python3 script:<br />
<pre>python3<br />
>>> import efmaral</pre><br />
</li><br />
<li>You can test the example given at https://github.com/robertostling/efmaral by changing to the installation directory:<br />
<pre>cd $EFMARALPATH<br />
python3 scripts/evaluate.py efmaral \<br />
3rdparty/data/test.eng.hin.wa \<br />
3rdparty/data/test.eng 3rdparty/data/test.hin \<br />
3rdparty/data/trial.eng 3rdparty/data/trial.hin</pre><br />
</li><br />
<li>The Efmaral module also contains eflomal. You can use the alignment scripts as follows:<br />
<pre>align_eflomal.py ...</pre><br />
</li><br />
<li>You can also use the eflomal executable:<br />
<pre>eflomal ...</pre><br />
</li><br />
<li>You can also use the eflomal module in a Python3 script:<br />
<pre>python3<br />
>>> import eflomal</pre><br />
</li><br />
<li>The atools executable (from fast_align) is also made available.</li><br />
</ul><br />
<br />
<br />
'''Contact:'''<br />
Yves Scherrer, University of Helsinki, firstname.lastname@helsinki.fi</div>Drobachttp://wiki.nlpl.eu/index.php?title=Corpora/OPUS&diff=976Corpora/OPUS2020-02-28T13:14:21Z<p>Drobac: /* http://opus.nlpl.eu */</p>
<hr />
<div><br />
== http://opus.nlpl.eu ==<br />
<br />
<div>OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland, and the core of the data is also available from Sigma2 on Abel. Tools for processing the data are accessible from Puhti; more detailed information can be found on the [http://opus.nlpl.eu/trac OPUS Wiki]:<br />
<br />
* Information for [http://opus.nlpl.eu/trac/wiki/NLPL NLPL Users]<br />
* Information about the [http://opus.nlpl.eu/trac#WebAPI OPUS API] for finding resources<br />
* Information about [http://opus.nlpl.eu/trac/wiki/DataFormats data formats]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/Tools tools]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/QueryInterfaces on-line interfaces]<br />
* Information about [http://opus.nlpl.eu/trac/wiki/WordAlign word alignment] and the [http://opus.nlpl.eu/trac/wiki/WordAlignDB alignment lexicon]<br />
<br />
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php<br />
<br />
Contact: [http://blogs.helsinki.fi/tiedeman/ Jörg Tiedemann] via e-mail - firstname.lastname at helsinki.fi (first name without dots)</div>Drobac