Difference between revisions of "Infrastructure/software/eosc"

From Nordic Language Processing Laboratory

Revision as of 07:26, 18 October 2019

Background

This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.

The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors; there is much variation in computational experience and ‘Un*x foo’. Likewise, computing tasks vary a lot, ranging from maybe a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research quite generally is both data- and compute-intensive.

Typical types of data include potentially large document collections (for example 130 billion words of English extracted from the Common Crawl or vast collections of translated texts in multiple languages), pre-computed representations of word or sentence meaning (so-called word embeddings), or more specialized training and evaluation sets for supervised machine learning tasks like parsing or machine translation.

After some two years of activity in the NLPL project, its community has collectively installed some 80 shared software modules and around six terabytes of primary source data. In May 2019, module load operations for NLPL-maintained software accounted for close to five percent of the total on the Norwegian Abel supercluster. In sum, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized. Further, reproducibility and replicability play an increasingly important role in NLP research. Other researchers must be enabled to re-run the same experiment (and obtain the same results), ideally also several years after the original publication.

Software

Relevant software modules comprise general-purpose run-time environments like Java and Python, machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others. NLPL users typically ‘mix and match’ several of these components and then build their own code on top. They will often require specific versions of individual modules, sometimes for good reasons. Between 2017 and 2019, the NLPL infrastructure task force received installation requests for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with additional constraints regarding supported versions of, for example, NumPy, PyTorch, or TensorFlow.

For compatibility with third-party code and for reproducibility, three requirements follow: users should largely be free (within reason) to pick the module versions they (believe they) require; modules must not change once installed (and announced); and historic or older module versions should remain functional over time, ideally many years into the future. The NLPL approach to meeting these demands has been to ‘unbundle’ to a high degree, i.e. to provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.) as individual modules and, as far as possible, to provide each module for multiple base language versions. Abstractly, this design appears adequate and scalable, but module installation needs to be automated further, uniformity across different computing environments needs to be improved, and users need better guidance in navigating the resulting (large) space of only partially interoperable modules.

Uniformity across different computing environments essentially means that the exact same versions of tools (and bundles) are available, and of course that they behave the same on all systems. To accomplish this goal, it may ultimately be necessary to build the complete software stack ‘from the ground up’, i.e. to include all dependencies (beyond the core operating system) in the NLPL modules collection. Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely that Python installations will differ (in minor versions, compiler and library versions, optional add-on components, and such) across different systems.
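To make the notion of differing Python installations concrete, the following is a minimal sketch of how one might record the properties named above (minor version, compiler, C library) on each system and compare them. It is not part of the NLPL infrastructure; the function names environment_fingerprint and fingerprint_digest are hypothetical, and the choice of fields is an assumption about what is worth comparing.

```python
import hashlib
import json
import platform


def environment_fingerprint():
    """Collect interpreter details that commonly differ between systems.

    The selection of fields is an assumption made for illustration only.
    """
    return {
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        "compiler": platform.python_compiler(),
        # libc_ver() returns a (name, version) pair; both may be empty strings
        "libc": "-".join(platform.libc_ver()),
        "machine": platform.machine(),
    }


def fingerprint_digest(fingerprint):
    """Hash a fingerprint so that two systems can be compared via one string."""
    canonical = json.dumps(fingerprint, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    fp = environment_fingerprint()
    print(json.dumps(fp, indent=2, sort_keys=True))
    print(fingerprint_digest(fp))
```

Running the script on two clusters and comparing the final line gives a quick indication of whether the underlying Python differs; the JSON output shows in which respect.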

Containerization

So far, NLPL has shied away from using containers, in part simply because of a lack of support on some of the target systems (notably Taito), in part because of a concern for reduced transparency from the user point of view. Also, containerizing individual software modules severely challenges modularization: there is no straightforward way to ‘mix and match’ multiple containers into a uniform process environment.

However, provisioning the full NLPL software (and possibly data) environment inside a container may offer some benefits, for example compatibility with cloud environments, increased uniformity across different systems, and potentially longer-term reproducibility. On this view, modularization would obtain within the container, just as it does in the current environments on, for example, Abel, Puhti, Saga, and Taito.

Data

For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least for now and with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of the data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-case plain ASCII characters only, but upper-case letters may appear if necessary.
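The path layout described above can be sketched in a few lines of Python. This is an illustration only: the root folder /data/nlpl, the activity and data set names, and the helper names release_path and non_preferred_components are all hypothetical, and the exact set of ‘preferred’ characters (lower-case letters, digits, dot, underscore, hyphen) is an assumption.

```python
import re
from pathlib import PurePosixPath

# Assumed reading of "lower-case plain ASCII": letters, digits, '.', '_', '-'
PREFERRED = re.compile(r"^[a-z0-9._-]+$")


def release_path(root, activity, dataset, release, subfolder=None):
    """Compose root/activity/dataset[/subfolder]/release."""
    parts = [activity, dataset]
    if subfolder:
        parts.append(subfolder)
    parts.append(release)
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return PurePosixPath(root, *parts)


def non_preferred_components(path, root):
    """List components below root that deviate from the preferred character set.

    Deviations are allowed (upper-case letters may appear if necessary),
    so this reports them rather than rejecting the path.
    """
    relative = PurePosixPath(path).relative_to(root)
    return [part for part in relative.parts if not PREFERRED.match(part)]


if __name__ == "__main__":
    # Hypothetical names, chosen only to show the layout
    path = release_path("/data/nlpl", "example-activity", "example-dataset",
                        "2019-10-18", subfolder="raw")
    print(path)
    print(non_preferred_components(path, "/data/nlpl"))
```

Reporting non-preferred components instead of raising an error mirrors the wording above: lower-case ASCII is a preference, not a hard rule.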

Mirroring the data is currently done via cron jobs. The master copy of each data set resides on one specific server, which depends on the person mainly responsible for maintaining the resource.
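One way to check that a mirrored release matches its master copy is to compare per-file checksums. The sketch below is not the procedure NLPL uses (the page only states that mirroring runs via cron jobs); it assumes both copies are reachable as local directories, and the function names file_digest, manifest, and compare are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(release_dir):
    """Map each file's path (relative to release_dir) to its digest."""
    root = Path(release_dir)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare(master, mirror):
    """Report files missing from, extra in, or changed in the mirror."""
    return {
        "missing": sorted(set(master) - set(mirror)),
        "extra": sorted(set(mirror) - set(master)),
        "changed": sorted(
            name for name in master.keys() & mirror.keys()
            if master[name] != mirror[name]
        ),
    }
```

Because a release is meant to stay fixed once published, its manifest could be computed once on the master copy and stored alongside the release, so that each mirror only needs to compute its own side.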

Some data sets, for example the OPUS parallel data, are available to external users without HPC access. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.

Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make working with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.