Infrastructure/software/eosc

From Nordic Language Processing Laboratory

Background

This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.

The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors; there is much variation in computational experience and ‘Un*x foo’. Likewise, computing tasks vary a lot, ranging from maybe a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research quite generally is both data- and compute-intensive.

Typical types of data include potentially large document collections (for example 130 billion words of English extracted from the Common Crawl or vast collections of translated texts in multiple languages), pre-computed representations of word or sentence meaning (so-called word embeddings), or more specialized training and evaluation sets for supervised machine learning tasks like parsing or machine translation.

After some two years of activity in the NLPL project, its community has collectively installed some 80 shared software modules and around eight terabytes of primary source data. In May 2019, module load operations for NLPL-maintained software accounted for close to five percent of the total on the Norwegian Abel supercluster. In sum, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized. Further, reproducibility and replicability play an increasingly important role in NLP research. Other researchers must be able to re-run the same experiment (and obtain the same results), ideally even several years after the original publication.

What is an NLPL User?

  • Developer of NLP resources and tools (not an end-user of such tools)
  • Student (MSc or PhD) who learns to develop NLP tools and algorithms
  • Mostly runs on superclusters, increasingly wants (multi-)GPUs

What does an NLPL User Need?

  • Development environment (essential libraries and software packages)
  • Data for training (heavy machine learning), development, tuning, testing
  • Computing resources (CPU hours, more and more GPU hours)


Requirements from different perspectives

Users

Users would like to:

  • Easily use NLP software (without having to install it themselves).
  • Have the same software environment on different Nordic clusters, in order to use resources from all the available clusters. This is important when researchers need a lot of resources within a short time (see Researcher A below).
  • Possibly have shared courses among research labs in the Nordic countries (see Student C below).
  • Use a specific combination of Python packages (see PhD Fellow D below).
  • Easily package the current environment setup, in order to share it and make the research results easy to replicate (see Research group E below).
  • If possible, submit jobs to different clusters.

Package producers, maintenance

For easier production of packages and maintenance, we would like to have:

  • (Semi-)automatic documentation update about the installed software (which package contains which tools, with version numbers).
  • All the packages installed with the same installation template (or recipe) - to easily create new software packages and for easier maintenance.
  • Highly modular setup - it should be possible to easily change the version of specific software in the existing environment package (see Researcher B below).
  • Packages should be easily portable to different clusters.
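
The first item above (documentation generated from what is actually installed) could be approached roughly as follows. This is only a sketch: it assumes a modulefile tree laid out as <root>/<module-name>/<version>, which is a common Environment Modules convention but is not confirmed for the NLPL installations, and the function names list_modules and render_inventory are hypothetical.

```python
from pathlib import Path


def list_modules(modulefiles_root):
    """Map each module name to its versions, assuming <root>/<name>/<version> files."""
    inventory = {}
    for name_dir in sorted(Path(modulefiles_root).iterdir()):
        if not name_dir.is_dir():
            continue
        # Skip hidden helper files such as '.version' or '.modulerc'.
        versions = sorted(
            p.name
            for p in name_dir.iterdir()
            if p.is_file() and not p.name.startswith(".")
        )
        if versions:
            # Note: versions are sorted as plain strings, not semantically.
            inventory[name_dir.name] = versions
    return inventory


def render_inventory(inventory):
    """Render one plain-text line per module: 'name: version, version, ...'."""
    return "\n".join(
        f"{name}: {', '.join(versions)}" for name, versions in inventory.items()
    )
```

Running render_inventory(list_modules(root)) from a cron job or an installation recipe would keep a documentation page in step with the installed modules; which tools each package contains would still have to come from the recipe itself.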

HPC

CSC uses Spack, but only for the "middle level" of the software stack (the base level being the operating system's own RPM packages). Using Spack for the machine learning frameworks and libraries as well would be quite a lot of work, as each version would need to be packaged for Spack manually by us.

The machine learning tools such as PyTorch and TensorFlow are currently installed with Conda (miniconda3 to be exact). Most Python libraries are published on PyPI (the Python Package Index) and the newest versions can be easily installed via pip, which makes keeping the Conda environments up-to-date quite easy. It is possible to freeze all the package versions to make an environment exactly reproducible.
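
As a rough illustration of what freezing means in practice, the following sketch lists every distribution installed in the running Python environment as a name==version pin, similar in spirit to the output of pip freeze. It uses only the standard-library importlib.metadata module; the function name freeze_environment is hypothetical and not part of any CSC or NLPL tooling.

```python
from importlib import metadata


def freeze_environment():
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip installs whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)
```

Writing the returned lines to a file and keeping that file alongside the experiment records which versions were in use; it covers Python packages only, not the compilers or system libraries underneath.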

The big drawback of using Conda in an HPC environment is that Conda creates a lot of files. Even a small Conda environment can easily comprise 50,000 files, which makes it quite slow to load on shared file systems such as Lustre. This is why the first import statement on Puhti always takes quite a long time.
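
To check how many files a given environment actually contains, a minimal sketch such as the following can be pointed at the environment's directory. The function name count_files is hypothetical, and the path passed to it is whatever location the environment was installed to.

```python
import os


def count_files(root):
    """Count non-directory entries below root (symlinked directories are not followed)."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total
```

On a shared file system the walk itself touches the metadata of every entry, so it is best run sparingly.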

CSC has been experimenting with Singularity containers as an alternative to, or possibly even a future replacement for, the Conda environments used in CSC's installations of PyTorch, TensorFlow, etc.

Example Use Cases

Researcher A develops a new model of neural machine translation by implementing an extension to OpenNMT-py (a library for neural sequence-to-sequence models under heavy development). The implementation happens in a branch of the official OpenNMT GitHub repository. The new extension requires the latest, cutting-edge version of PyTorch and some external libraries from Facebook Research and a lesser-known NLP lab in China. The new code needs to be tested by training on standard data sets using GPU jobs that run for about 3 days per job. To compare baselines with various versions of the code and different training parameters, the researcher needs to run 20 parallel training jobs. Evaluation is done using standard benchmark test sets. The deadline for the next paper is in 10 days. Thanks to NLPL, the same development environment (modules or otherwise) is in place in Norway and Finland, and the same data is also accessible from those servers. The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …

Researcher B has been working on developing and fine-tuning their document classification system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim, NumPy, SciPy, Keras, and TensorFlow). As they augment their architecture with a character-level convolutional layer, they stumble into a known problem in running PyTorch 1.0.0 in combination with NumPy 1.16.1 (when using the default OpenBLAS back-end), rendering convolutions about twenty times slower than they should be. They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because it introduces some changes that are not backward-compatible with the current Gensim release. Stack Overflow suggests upgrading NumPy to release 1.16.3, while keeping everything else unchanged. NLPL quickly installs a fresh NumPy environment module, and its highly modular setup allows Researcher B to just change one version number in their module load incantation.
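
The one-number change in Researcher B's scenario can be sketched as follows. The module names nlpl-pytorch and nlpl-numpy are hypothetical placeholders (the actual NLPL module names may differ); only the name/version form of the arguments follows the usual module load convention, and the version numbers are the ones from the scenario.

```python
def module_load_command(pins):
    """Build a 'module load' line from a {module_name: version} mapping."""
    return "module load " + " ".join(
        f"{name}/{version}" for name, version in pins.items()
    )


# Hypothetical module names; versions taken from the scenario above.
pins = {"nlpl-pytorch": "1.0.0", "nlpl-numpy": "1.16.1"}

# Only the NumPy pin changes; everything else stays as it was.
upgraded = {**pins, "nlpl-numpy": "1.16.3"}

print(module_load_command(upgraded))
# module load nlpl-pytorch/1.0.0 nlpl-numpy/1.16.3
```

Keeping the pins in one mapping (or one file) per experiment also documents which combination of modules a result was obtained with.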

Student C has an assignment to train a few models with a known NLP package, in order to compare different settings of training parameters and approaches to data processing. For data processing, the student needs to modify some existing code. The course could be shared among research labs in the Nordic countries …

PhD Fellow D wants to test a cutting-edge method that was published at the latest NLP conference. There is some experimental code on GitHub but it requires some specific combination of Python packages and the whole thing is implemented in Julia. Fortunately, NLPL already has most of the packages in place that would otherwise be difficult to compile on the ancient CentOS setup. After testing the code, the PhD fellow has some ideas on modifying the algorithm to test some improvements …

Research group E publishes a new model for sentiment analysis, together with a paper describing it. They want to ensure that the results are replicable and, therefore, they want to publish the code, the data, and the exact setup. Maybe they could create a containerized distribution? The NLPL environment tools make it relatively straightforward to package this up …

Software

Relevant software modules comprise general-purpose run-time environments like Java and Python, machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others. NLPL users typically ‘mix and match’ several of these components, to then build their own code on top. They will often require specific versions of individual modules, sometimes for good reasons. Between 2017 and 2019, the NLPL infrastructure task force has received installation requests for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with additional constraints regarding supported versions of, for example, NumPy, PyTorch, or TensorFlow.

For compatibility with third-party code and for reproducibility, users should largely be free (within reason) to pick the module versions they (believe they) require, modules must not change once installed (and announced), and historic or older module versions should remain functional over time, ideally many years into the future. The NLPL approach to meeting these demands has been to ‘unbundle’ to a high degree, i.e. provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.) as individual modules and inasmuch as possible provide each module for multiple base language versions. Abstractly, this design appears adequate and scalable, but module installation needs to be automated further, uniformity across different computing environments improved, and users better guided in navigating the resulting (large) space of only partially interoperable modules.

Uniformity across different computing environments essentially means that the exact same versions of tools (and bundles) are available, and of course that they behave the same on all systems. To accomplish this goal, it may ultimately be necessary to build the complete software stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system) in the NLPL modules collection. Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely that Python installations will differ (in minor versions, compiler and library versions, optional add-on components, and such) across different systems.
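
One simple way to check the "same versions everywhere" part of this requirement is to compare version listings collected on two systems. The sketch below assumes each listing is already available as a {package: version} mapping; the function name diff_environments is hypothetical and the two example mappings are made-up illustrative values, not a record of any actual cluster.

```python
def diff_environments(a, b):
    """Return {package: (version_in_a, version_in_b)} for every mismatch.

    None marks a package that is missing on one side.
    """
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Illustrative values only.
cluster_a = {"numpy": "1.16.1", "torch": "1.0.0"}
cluster_b = {"numpy": "1.16.3"}

print(diff_environments(cluster_a, cluster_b))
# {'numpy': ('1.16.1', '1.16.3'), 'torch': ('1.0.0', None)}
```

An empty result only shows that the version numbers agree; differences in compilers or linked libraries, as described above, would not be visible at this level.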

Containerization

So far, NLPL has shied away from using containers, in part simply because of a lack of support on some of the target systems (notably Taito), in part because of a concern for reduced transparency from the user point of view. Also, containerizing individual software modules severely challenges modularization: there is no straightforward way to ‘mix and match’ multiple containers into a uniform process environment.

However, provisioning the full NLPL software (and possibly data) environment inside a container may offer some benefits, for example compatibility with cloud environments, increased uniformity across different systems, and potentially longer-term reproducibility. On this view, modularization would obtain within the container, just as it does in the current environments on, for example, Abel, Puhti, Saga, and Taito.

Data

For the NLPL community it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least for now with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of data sets with proper versioning for replication purposes. The structure within the root data folder is roughly like this: 'nlpl-activity/dataset-name/optional-subfolder/release'. The release refers to the version or the date of the release. The path is preferably composed of lower-case plain ASCII characters only, but upper-case letters may appear if necessary.
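
The path convention can be captured in a small helper, sketched here under some assumptions: the function name dataset_path is hypothetical, the path is returned relative to the (unspecified) root data folder, and the set of permitted punctuation characters (dot, underscore, hyphen) is a guess, since the description above only speaks of plain ASCII characters.

```python
import re
from pathlib import PurePosixPath

# Assumption: letters, digits, '.', '_' and '-' are allowed; a component must
# start with a letter or digit, which also rules out '.' and '..'.
_COMPONENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def dataset_path(activity, dataset, release, subfolder=None):
    """Build 'activity/dataset[/subfolder]/release' relative to the data root."""
    parts = [activity, dataset]
    if subfolder:
        parts.append(subfolder)
    parts.append(release)
    for part in parts:
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"not a plain ASCII path component: {part!r}")
    return PurePosixPath(*parts)


# Placeholder names, not actual NLPL data sets.
print(dataset_path("some-activity", "some-dataset", "v1"))
# some-activity/some-dataset/v1
```

Building paths through one function rather than by hand makes it harder for scripts on different clusters to drift apart in how they address the same release.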

Mirroring the data is currently done via cron jobs. The master copy of each data set resides on one specific server, depending on who is the main person responsible for maintaining the resource.

Some data sets are available to external users without HPC access, for example the OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structure as the mounted releases and is kept in sync with that data.

Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make working with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.