<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Joerg</id>
	<title>Nordic Language Processing Laboratory - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Joerg"/>
	<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/Special:Contributions/Joerg"/>
	<updated>2026-04-25T17:38:39Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.10</generator>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&amp;diff=761</id>
		<title>Infrastructure/software/eosc</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Infrastructure/software/eosc&amp;diff=761"/>
		<updated>2019-10-18T07:26:46Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* Data */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.&lt;br /&gt;
&lt;br /&gt;
The NLPL research community (in late 2019) comprises many dozens of active users, ranging from&lt;br /&gt;
MSc students to professors; there is much variation in computational experience and ‘Un*x foo’.&lt;br /&gt;
Likewise, computing tasks vary a lot, ranging from maybe a handful of single-cpu jobs to&lt;br /&gt;
thousands of (mildly) parallel or multi-gpu tasks; NLP research quite generally is both data- and compute-intensive.&lt;br /&gt;
&lt;br /&gt;
Typical types of data include potentially large&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of&lt;br /&gt;
English extracted from the Common Crawl or vast collections of translated texts in multiple languages),&lt;br /&gt;
pre-computed representations of&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning]&lt;br /&gt;
(so-called word embeddings), or more specialized training and evaluation sets&lt;br /&gt;
for supervised machine learning tasks like parsing or machine translation.&lt;br /&gt;
&lt;br /&gt;
After some two years of activity in the NLPL project, its community has collectively&lt;br /&gt;
installed some 80 shared software modules and around six terabytes of primary source data.&lt;br /&gt;
In May 2019, &amp;lt;code&amp;gt;module load&amp;lt;/code&amp;gt; operations for NLPL-maintained software accounted&lt;br /&gt;
for close to five percent of the total on the Norwegian Abel supercluster.&lt;br /&gt;
In sum, preparing the software and data environment for the ‘average’ NLP experiment is no&lt;br /&gt;
small task; duplication of data, software, and effort should be minimized.&lt;br /&gt;
Further, reproducibility and replicability play an increasingly important role in NLP research.&lt;br /&gt;
Other researchers must be enabled to re-run the same experiment (and obtain the same results),&lt;br /&gt;
ideally also several years after the original publication.&lt;br /&gt;
&lt;br /&gt;
= Software =&lt;br /&gt;
&lt;br /&gt;
Relevant software modules comprise general-purpose run-time environments like Java and Python,&lt;br /&gt;
machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of&lt;br /&gt;
discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, Open NMT, spaCy, and others.&lt;br /&gt;
NLPL users typically ‘mix and match’ several of these components, to then build their own&lt;br /&gt;
code on top.&lt;br /&gt;
They will often require specific versions of individual modules, sometimes for good reasons.&lt;br /&gt;
Between 2017 and 2019, the NLPL infrastructure task force has received installation requests&lt;br /&gt;
for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with&lt;br /&gt;
additional constraints regarding supported versions of, for example, NumPy, PyTorch, or&lt;br /&gt;
TensorFlow.&lt;br /&gt;
&lt;br /&gt;
For compatibility with third-party code and for reproducibility, users should largely&lt;br /&gt;
be free (within reason) to pick the module versions they (believe they) require, modules must&lt;br /&gt;
not change once installed (and announced), and historic or older module versions should&lt;br /&gt;
remain functional over time, ideally many years into the future.&lt;br /&gt;
The NLPL approach to meeting these demands has been to&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Infrastructure/software/catalogue ‘unbundle’]&lt;br /&gt;
to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.)&lt;br /&gt;
as individual modules and inasmuch as possible provide each module for multiple base language&lt;br /&gt;
versions.&lt;br /&gt;
Abstractly, this design appears adequate and scalable, but module installation needs to be&lt;br /&gt;
automated further, uniformity across different computing environments improved, and users&lt;br /&gt;
better guided in navigating the resulting (large) space of only partially interoperable&lt;br /&gt;
modules.&lt;br /&gt;
&lt;br /&gt;
Uniformity across different computing environments essentially means that exactly the same&lt;br /&gt;
versions of tools (and bundles) are available and, of course, that they behave the same on&lt;br /&gt;
all systems.&lt;br /&gt;
To accomplish this goal, it may ultimately be necessary to build the complete software&lt;br /&gt;
stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system)&lt;br /&gt;
in the NLPL modules collection.&lt;br /&gt;
Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely&lt;br /&gt;
that Python installations will differ (in minor versions, compiler and library versions,&lt;br /&gt;
optional add-on components, and such) across different systems.&lt;br /&gt;
&lt;br /&gt;
= Containerization =&lt;br /&gt;
&lt;br /&gt;
So far, NLPL has shied away from using containers, in part simply because of lacking&lt;br /&gt;
support on some of the target systems (notably Taito), in part because of a concern&lt;br /&gt;
about reduced transparency from the user's point of view.&lt;br /&gt;
Also, containerizing individual software modules severely challenges modularization:&lt;br /&gt;
There is no straightforward way to ‘mix and match’ multiple containers into a uniform&lt;br /&gt;
process environment.&lt;br /&gt;
&lt;br /&gt;
However, provisioning the ''full NLPL'' software (and possibly data) environment inside&lt;br /&gt;
a container may offer some benefits, for example compatibility with cloud environments,&lt;br /&gt;
increased uniformity across different systems, and potentially longer-term reproducibility.&lt;br /&gt;
On this view, modularization would obtain within the container, just as it does in the&lt;br /&gt;
current environments on, for example, Abel, Puhti, Saga, and Taito.&lt;br /&gt;
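A minimal sketch of what such a full-environment container definition could look like (a hypothetical Singularity recipe; no such NLPL container currently exists, and all paths below are invented):&lt;br /&gt;

```
Bootstrap: docker
From: ubuntu:18.04

%post
    # Hypothetical: install the environment-modules tool, then populate
    # a complete NLPL software tree, so that modularization obtains
    # inside the container just as on the HPC systems.
    apt-get update
    apt-get install -y environment-modules
    mkdir -p /nlpl/software/modulefiles

%environment
    # Make the (hypothetical) in-container module catalogue visible.
    export MODULEPATH=/nlpl/software/modulefiles:$MODULEPATH
```

Users inside the container would then run the familiar module commands, unchanged, against the bundled catalogue.&lt;br /&gt;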
&lt;br /&gt;
= Data =&lt;br /&gt;
&lt;br /&gt;
For the NLPL community, it is important to have direct access to essential data sets from the command line; a mounted file system is therefore the preferred solution, at least for current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of the data with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-case plain ASCII characters only, but upper-case letters may appear where necessary.&lt;br /&gt;
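As a hypothetical illustration of this layout convention (all names below are invented, not actual NLPL data sets or paths):&lt;br /&gt;

```shell
# Illustrates the convention
# 'nlpl-activity/dataset-name/optional-subfolder/release';
# every name here is an invented stand-in.
DATA_ROOT=/tmp/nlpl-data        # stand-in for the mounted data root
mkdir -p "$DATA_ROOT/opus/books/de-en/v1"
mkdir -p "$DATA_ROOT/opus/books/de-en/v2"
# Listing one data set directory shows its available releases:
ls "$DATA_ROOT/opus/books/de-en"
```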
&lt;br /&gt;
Mirroring the data is currently done via cron jobs; the master copy of each data set resides on one specific server, depending on which person is mainly responsible for maintaining the resource.&lt;br /&gt;
&lt;br /&gt;
Some data sets are available to external users without HPC access, for example the OPUS parallel data. This is currently done via Object Storage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.&lt;br /&gt;
&lt;br /&gt;
Goals for NLPL in EOSC:&lt;br /&gt;
* Streamline the data maintenance and mirroring procedures further.&lt;br /&gt;
* Improve data access libraries and tools to make working with data sets more transparent.&lt;br /&gt;
* Support replicability of results.&lt;br /&gt;
* Avoid unnecessary data copying and duplication.&lt;br /&gt;
* Documentation is essential.&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Infrastructure/software/catalogue&amp;diff=596</id>
		<title>Infrastructure/software/catalogue</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Infrastructure/software/catalogue&amp;diff=596"/>
		<updated>2019-02-01T14:07:30Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* Activity G: OPUS Parallel Corpus */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
This page provides a high-level summary of NLPL-specific software installed on either of our two systems.&lt;br /&gt;
As a rule of thumb, NLPL aims to build on generic software installations provided by the&lt;br /&gt;
system maintainers (e.g. development tools and libraries that are not discipline-specific),&lt;br /&gt;
using the [http://modules.sourceforge.net/ &amp;lt;tt&amp;gt;module&amp;lt;/tt&amp;gt;s infrastructure].&lt;br /&gt;
For example, an environment like OpenNMT is unlikely to be used by other disciplines,&lt;br /&gt;
and NLPL stands to gain from in-house, shared expertise that comes with maintaining&lt;br /&gt;
a project-specific installation.&lt;br /&gt;
On the other hand, the CUDA libraries are general extensions to the operating system&lt;br /&gt;
that most users of deep learning frameworks on gpus will want to use; hence, CUDA is&lt;br /&gt;
most appropriately installed by the core system maintainers.&lt;br /&gt;
Frameworks like PyTorch and TensorFlow, arguably, present a middle ground to this&lt;br /&gt;
rule of thumb:&lt;br /&gt;
In principle, they are not discipline-specific, but in mid-2018 at least the demand for&lt;br /&gt;
installations of these frameworks is strong within NLPL, and the project will likely&lt;br /&gt;
benefit from growing its competencies in this area.&lt;br /&gt;
&lt;br /&gt;
= Module Catalogue =&lt;br /&gt;
&lt;br /&gt;
The discipline-specific modules maintained by NLPL are not activated by default.&lt;br /&gt;
To make the NLPL directory of module configurations available on top of the&lt;br /&gt;
pre-configured, system-wide modules, one needs to run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
module use -a /proj*/nlpl/software/modulefiles/&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
We will at times assume a shell variable &amp;lt;tt&amp;gt;$NLPLROOT&amp;lt;/tt&amp;gt; that points to the&lt;br /&gt;
top-level project directory, i.e. &amp;lt;tt&amp;gt;/projects/nlpl/&amp;lt;/tt&amp;gt; (on Abel) or&lt;br /&gt;
&amp;lt;tt&amp;gt;/proj/nlpl/&amp;lt;/tt&amp;gt; (on Taito).&lt;br /&gt;
For NLPL users, we recommend that one adds the above &amp;lt;tt&amp;gt;module use&amp;lt;/tt&amp;gt; command&lt;br /&gt;
to the shell start-up script, e.g. &amp;lt;tt&amp;gt;.bashrc&amp;lt;/tt&amp;gt; in the user home directory.&lt;br /&gt;
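For example, the following sketch shows what such start-up additions could look like (the NLPLROOT value is an example and must match the system at hand):&lt;br /&gt;

```shell
# Sketch of additions to the shell start-up script (~/.bashrc).
# The NLPLROOT value is an example; it differs per system
# (e.g. /proj/nlpl on Taito).
export NLPLROOT=/projects/nlpl
# Activate the NLPL catalogue on top of the system-wide modules;
# guarded so the line is harmless where no module command exists.
if command -v module >/dev/null; then
    module use -a "$NLPLROOT/software/modulefiles/"
fi
```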
&lt;br /&gt;
= Activity A: Basic Infrastructure =&lt;br /&gt;
&lt;br /&gt;
Interoperability of NLPL installations with each other, as well as with system-wide&lt;br /&gt;
software that is maintained by the core operations teams for Abel and Taito, is no&lt;br /&gt;
small challenge; neither is parallelism across the two systems, for example in&lt;br /&gt;
available software (and versions) and techniques for ‘mixing and matching’.&lt;br /&gt;
These challenges are discussed in some more detail with regard to the&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Infrastructure/software/python Python programming environment]&lt;br /&gt;
and with regard to&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Infrastructure/software/frameworks common Deep Learning frameworks].&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Module Name/Version !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-cython/0.29.1 || C Extensions for Python || Abel || December 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/nltk nlpl-nltk/3.3] || Natural Language Toolkit (NLTK) || Abel, Taito || September 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/pytorch nlpl-pytorch/0.4.1] || PyTorch Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/spacy nlpl-spacy/2.0.12] || spaCy: Natural Language Processing in Python || Abel, Taito || October 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Infrastructure/software/tensorflow nlpl-tensorflow/1.11] || TensorFlow Deep Learning Framework (CPU and GPU) || Abel, Taito || September 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Activity B: Statistical and Neural Machine Translation =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Module Name/Version !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/mmt-mvp-v0.12.1-2739-gdc42bcb]      || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || July 2017 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Moses_module nlpl-moses/4.0-65c75ff]      || Moses SMT System Release 4.0, including GIZA++, MGIZA, fast_align, SALM&amp;lt;br/&amp;gt;Some minor fixes added to existing install 2/2018.&amp;lt;br/&amp;gt; Should not break compatibility except when using tokenizer.perl for Finnish or Swedish. || Taito, Abel || November 2017 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_07_20] || efmaral and eflomal word alignment tools      || Taito || July 2017 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2017_11_24] || efmaral and eflomal word alignment tools      || Taito, Abel || November 2017 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Efmaral_module nlpl-efmaral/0.1_2018_12_13/17] || efmaral and eflomal word alignment tools      || Taito, Abel || December 2018 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_HNMT_module nlpl-hnmt/1.0.1] || HNMT neural machine translation system      || Taito || March 2018 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/opennmt-py nlpl-opennmt-py/0.2.1] || OpenNMT Python Library || Abel, Taito || September 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_Marian_module nlpl-marian/1.2.0] || Marian neural machine translation system      || Taito || March 2018 || Yves Scherrer&lt;br /&gt;
|-&lt;br /&gt;
| marian/1.5 || Marian neural machine translation system      || Taito || June 2018 || CSC staff&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Translation/home#Using_the_mttools_module nlpl-mttools/2018_12_23] || A collection of preprocessing and evaluation scripts for machine translation      || Taito, Abel || December 2018 || Yves Scherrer&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Activity C: Data-Driven Parsing =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Module Name/Version !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Parsing/uuparser nlpl-uuparser] || Uppsala Parser || Abel || December 2018 || &lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Parsing/udpipe nlpl-udpipe/1.2.1-devel] || UDPipe 1.2 with Pre-Trained Models      || Taito, Abel || November 2017 || Jörg Tiedemann&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Parsing/dozat nlpl-dozat/201812] || Stanford Graph-Based Parser by Tim Dozat (v3) || Abel || December 2018 || Stephan Oepen&lt;br /&gt;
|-&lt;br /&gt;
| [http://wiki.nlpl.eu/index.php/Parsing/repp nlpl-repp/201812] || REPP Tokenizer (and Sentence Splitter) || Abel || December 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Activity E: Pre-Trained Word Embeddings =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Module Name/Version !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-gensim/3.6.0 || GenSim: Topic Modeling for Humans || Taito, Abel || October 2018 || Stephan Oepen&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Activity G: OPUS Parallel Corpus =&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Module Name/Version !! Description !! System !! Install Date !! Maintainer&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-cwb/3.4.12 || Corpus Work Bench (CWB)      || Taito, Abel || November 2017 || Jörg Tiedemann&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-opus/0.1 || Various OPUS Tools      || Taito, Abel || November 2017 || Jörg Tiedemann&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-opus/0.2 || Various OPUS Tools      || Taito, Abel || 2018 || Jörg Tiedemann&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-opus/201901 || Various OPUS Tools      || Taito, Abel || January 2019 || Jörg Tiedemann&lt;br /&gt;
|-&lt;br /&gt;
| nlpl-uplug/0.3.8dev || UPlug Parallel Corpus Tools      || Taito, Abel || November 2017 || Jörg Tiedemann&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=483</id>
		<title>Corpora/OPUS</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=483"/>
		<updated>2018-12-19T18:50:56Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* http://opus.nlpl.eu */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== http://opus.nlpl.eu ==&lt;br /&gt;
&lt;br /&gt;
OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland, and the core of the data is also available from Sigma2 on Abel. Tools for processing the data are accessible from Taito, and more detailed information can be found on the [http://opus.nlpl.eu/trac OPUS Wiki]:&lt;br /&gt;
&lt;br /&gt;
* Information for [http://opus.nlpl.eu/trac/wiki/NLPL NLPL Users]&lt;br /&gt;
* Information about the [http://opus.nlpl.eu/trac#WebAPI OPUS API] for finding resources&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/DataFormats data formats]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/Tools tools]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/QueryInterfaces on-line interfaces]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/WordAlign word alignment] and the [http://opus.nlpl.eu/trac/wiki/WordAlignDB alignment lexicon]&lt;br /&gt;
&lt;br /&gt;
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php&lt;br /&gt;
&lt;br /&gt;
Contact: [http://blogs.helsinki.fi/tiedeman/ Jörg Tiedemann] via e-mail - firstname.lastname at helsinki.fi (first name without dots)&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=99</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=99"/>
		<updated>2017-11-15T22:30:03Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* Large Corpora */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;NLPL provides various large data sets. They are available from the connected infrastructure. Please check the individual pages of each resource.&lt;br /&gt;
&lt;br /&gt;
* [[Corpora/OPUS|OPUS - the collection of open parallel corpora]]&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=98</id>
		<title>Corpora/OPUS</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=98"/>
		<updated>2017-11-15T22:29:46Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* OPUS */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;&lt;br /&gt;
== http://opus.nlpl.eu ==&lt;br /&gt;
&lt;br /&gt;
OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland, and the core of the data is also available from Sigma2 on Abel. Tools for processing the data are accessible from Taito, and more detailed information can be found on the [http://opus.nlpl.eu/trac OPUS Wiki]:&lt;br /&gt;
&lt;br /&gt;
* Information for [http://opus.nlpl.eu/trac/wiki/NLPL NLPL Users]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/DataFormats data formats]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/Tools tools]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/QueryInterfaces on-line interfaces]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/WordAlign word alignment] and the [http://opus.nlpl.eu/trac/wiki/WordAlignDB alignment lexicon]&lt;br /&gt;
&lt;br /&gt;
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php&lt;br /&gt;
&lt;br /&gt;
Contact: [http://blogs.helsinki.fi/tiedeman/ Jörg Tiedemann] via e-mail - firstname.lastname at helsinki.fi (first name without dots)&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=97</id>
		<title>Corpora/OPUS</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/OPUS&amp;diff=97"/>
		<updated>2017-11-15T22:26:51Z</updated>

		<summary type="html">&lt;p&gt;Joerg: Created page with &amp;quot;= OPUS =  URL: http://opus.nlpl.eu  OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= OPUS =&lt;br /&gt;
&lt;br /&gt;
URL: http://opus.nlpl.eu&lt;br /&gt;
&lt;br /&gt;
OPUS is a collection of open parallel corpora in many languages. It provides bilingually aligned data sets, interfaces, tools and more. The data sets are available in various common formats and are provided for download and for use within the NLPL infrastructure. The service is hosted at CSC in Finland, and the core of the data is also available from Sigma2 on Abel. Tools for processing the data are accessible from Taito, and more detailed information can be found on the [http://opus.nlpl.eu/trac OPUS Wiki]:&lt;br /&gt;
&lt;br /&gt;
* Information for [http://opus.nlpl.eu/trac/wiki/NLPL NLPL Users]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/DataFormats data formats]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/Tools tools]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/QueryInterfaces on-line interfaces]&lt;br /&gt;
* Information about [http://opus.nlpl.eu/trac/wiki/WordAlign word alignment] and the [http://opus.nlpl.eu/trac/wiki/WordAlignDB alignment lexicon]&lt;br /&gt;
&lt;br /&gt;
The on-line search interface is available from http://opus.nlpl.eu/bin/opuscqp.pl and the word-alignment-based lexicon is accessible from http://opus.nlpl.eu/lex.php&lt;br /&gt;
&lt;br /&gt;
Contact: [http://blogs.helsinki.fi/tiedeman/ Jörg Tiedemann] via e-mail - firstname.lastname at helsinki.fi (first name without dots)&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=96</id>
		<title>Corpora/home</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Corpora/home&amp;diff=96"/>
		<updated>2017-11-15T22:07:03Z</updated>

		<summary type="html">&lt;p&gt;Joerg: Created page with &amp;quot;= Large Corpora =  NLPL provides various large data sets. They are available from the connected infrastructure. Please, check the individual pages of each resource.  * [[Corpo...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Large Corpora =&lt;br /&gt;
&lt;br /&gt;
NLPL provides various large data sets. They are available from the connected infrastructure. Please check the individual pages of each resource.&lt;br /&gt;
&lt;br /&gt;
* [[Corpora/OPUS|OPUS - the collection of open parallel corpora]]&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Infrastructure/software/catalogue&amp;diff=64</id>
		<title>Infrastructure/software/catalogue</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Infrastructure/software/catalogue&amp;diff=64"/>
		<updated>2017-11-06T21:22:34Z</updated>

		<summary type="html">&lt;p&gt;Joerg: /* Module catalogue */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
High-level summary of NLPL-specific software installed on either of our two systems.&lt;br /&gt;
&lt;br /&gt;
= Module catalogue =&lt;br /&gt;
&lt;br /&gt;
Add the NLPL directory of module configurations:&lt;br /&gt;
* on Taito:&lt;br /&gt;
 module use -a /proj/nlpl/software/modulefiles/&lt;br /&gt;
* on Abel:&lt;br /&gt;
 module use -a /projects/nlpl/software/modulefiles/&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Activity !! Module name/version !! Included software !! Installed on !! Install date&lt;br /&gt;
|-&lt;br /&gt;
| B || moses/mmt-mvp-v0.12.1-2739-gdc42bcb      || Moses SMT system, including GIZA++, MGIZA, fast_align || Taito || 7/2017&lt;br /&gt;
|-&lt;br /&gt;
| B || efmaral/0.1_2017_07_20      || efmaral and eflomal word alignment tools      || Taito || 7/2017&lt;br /&gt;
|-&lt;br /&gt;
| C, G || nlpl-udpipe/1.2.1-devel      || UDPipe with pre-trained models      || Taito, Abel || 11/2017&lt;br /&gt;
|-&lt;br /&gt;
| G || nlpl-cwb/3.4.12      || Corpus Work Bench (CWB)      || Taito, Abel || 11/2017&lt;br /&gt;
|-&lt;br /&gt;
| G || nlpl-opus/0.1      || various OPUS tools      || Taito, Abel || 11/2017&lt;br /&gt;
|-&lt;br /&gt;
| G || nlpl-uplug/0.3.8dev      || Uplug parallel corpus tools      || Taito, Abel || 11/2017&lt;br /&gt;
|}&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Infrastructure/wiki&amp;diff=40</id>
		<title>Infrastructure/wiki</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Infrastructure/wiki&amp;diff=40"/>
		<updated>2017-10-16T08:18:56Z</updated>

		<summary type="html">&lt;p&gt;Joerg: Created page with &amp;quot;= Background =  NLPL maintains an installation of the [http://www.mediawiki.org MediaWiki] software to collaboratively maintain information that is relevant to project members...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
NLPL maintains an installation of the [http://www.mediawiki.org MediaWiki] software&lt;br /&gt;
to collaboratively maintain information that is relevant to project members and other&lt;br /&gt;
users of the infrastructure, including technical documentation.&lt;br /&gt;
Bjørn Lindi and Stephan Oepen serve as the wiki administrators, i.e. they will try to assist&lt;br /&gt;
with all technical questions regarding access and structure.&lt;br /&gt;
&lt;br /&gt;
= Gaining Access =&lt;br /&gt;
&lt;br /&gt;
To prevent wiki spam, the NLPL wiki is currently not set up for self-help account creation.&lt;br /&gt;
To gain editing rights for the wiki, please email your Abel account name to&lt;br /&gt;
Bjørn Lindi or Stephan Oepen.&lt;br /&gt;
&lt;br /&gt;
= Editing Conventions =&lt;br /&gt;
&lt;br /&gt;
For the collaborative authoring process to maintain some semblance of structure,&lt;br /&gt;
please apply the following conventions in editing existing pages and, in particular,&lt;br /&gt;
when creating new pages:&lt;br /&gt;
&lt;br /&gt;
Page names should consist of all lower-case English words, organized in a page hierarchy&lt;br /&gt;
that follows the project structure and other organizational principles, e.g. with the names&lt;br /&gt;
of individual software packages closer to the leaves of the hierarchy.&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Translation/documentation&amp;diff=39</id>
		<title>Translation/documentation</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Translation/documentation&amp;diff=39"/>
		<updated>2017-10-16T08:03:03Z</updated>

		<summary type="html">&lt;p&gt;Joerg: Created page with &amp;quot;= Background =  This page provides the user documentation for the software installation of Machine Translation (MT) tools on Taito, particularly the open-source [http://www.st...&amp;quot;&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Background =&lt;br /&gt;
&lt;br /&gt;
This page provides the user documentation for the software installation of Machine Translation (MT) tools on Taito,&lt;br /&gt;
particularly the open-source [http://www.statmt.org/moses/ Moses] environment for experimentation with statistical MT (SMT).&lt;/div&gt;</summary>
		<author><name>Joerg</name></author>
		
	</entry>
</feed>