Infrastructure/software/eosc

From Nordic Language Processing Laboratory
Revision as of 15:38, 23 October 2019 by Oe (talk | contribs) (Example Use Cases)
Background

This page provides a working document for requirements in the NLP(L) use case in the EOSC Nordic project.

The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors; there is much variation in computational experience and ‘Un*x foo’. Likewise, computing tasks vary a lot, ranging from maybe a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research quite generally is both data- and compute-intensive.

Typical types of data include potentially large document collections (for example 130 billion words of English extracted from the Common Crawl or vast collections of translated texts in multiple languages), pre-computed representations of word or sentence meaning (so-called word embeddings), or more specialized training and evaluation sets for supervised machine learning tasks like parsing or machine translation.

After some two years of activity in the NLPL project, its community has collectively installed some 80 shared software modules and around six terabytes of primary source data. In May 2019, module load operations for NLPL-maintained software accounted for close to five percent of the total on the Norwegian Abel supercluster. In sum, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized. Further, reproducibility and replicability play an increasingly important role in NLP research. Other researchers must be enabled to re-run the same experiment (and obtain the same results), ideally also several years after the original publication.

What is an NLPL User?

  • Developer of NLP resources and tools (not an end-user of such tools)
  • Student (MSc or PhD) who learns to develop NLP tools and algorithms

What does an NLPL User Need?

  • Development environment (essential libraries and software packages)
  • Data for training (heavy machine learning), development, tuning, testing
  • Computing resources (CPU hours, more and more GPU hours)

Example Use Cases

Researcher A develops a new model of neural machine translation by implementing an extension to OpenNMT-py (a library for neural sequence-to-sequence models under heavy development). The implementation happens in a branch of the official OpenNMT GitHub repository. The new extension requires the latest, cutting-edge version of PyTorch and some external libraries from Facebook Research and a lesser-known NLP lab in China. The new code needs to be tested by training on standard data sets using GPU jobs that run for about 3 days per job. To compare baselines with various versions of the code and different training parameters, the researcher needs to run 20 parallel training jobs. Evaluation is done using standard benchmark test sets. The deadline for the next paper is in 10 days. Thanks to NLPL, the same development environment (modules or otherwise) is in place in Norway and Finland, and the same data is also accessible from those servers. The researcher can run the experiments using all available facilities, gets the results in time, and can submit the paper …

Researcher B has been working on developing and fine-tuning their document classification system for a while, using a combination of six or so Python add-on modules (NLTK, Gensim, NumPy, SciPy, Keras, and TensorFlow). As they augment their architecture with a character-level convolutional layer, they stumble into a known problem in running PyTorch 1.0.0 in combination with NumPy 1.16.1, rendering convolutions about twenty times slower than they should be. They cannot afford to upgrade to the most recent PyTorch 1.1.0 right now, because it introduces some changes that are not backwards-compatible with Gensim. StackOverflow suggests replacing NumPy with release 1.16.3, while keeping everything else unchanged. NLPL quickly installs a fresh NumPy module, and its highly modular setup allows Researcher B to just change one version number in their module load incantation.

Student C has the assignment to train a few models with a known NLP package to compare different settings of training parameters and approaches to data processing. For data processing, the student needs to modify some existing code. The course could be shared among research labs in the Nordic countries …

PhD Fellow D wants to test a cutting-edge method that was published at the latest NLP conference. There is some experimental code on GitHub, but it requires some specific combination of Python packages and the whole thing is implemented in Julia. Fortunately, NLPL already has most of the packages in place that would otherwise be difficult to compile on the ancient CentOS setup. After testing the code, the PhD fellow has some ideas on modifying the algorithm to test some improvements …

Research group E publishes a new model for sentiment analysis and a paper describing it. They want to ensure that the results are replicable and, therefore, they want to publish the code, the data, and the exact setup. Maybe they could create a containerized distribution? The NLPL environment tools make it relatively straightforward to package this up …

Software

Relevant software modules comprise general-purpose run-time environments like Java and Python, machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others. NLPL users typically ‘mix and match’ several of these components, to then build their own code on top. They will often require specific versions of individual modules, sometimes for good reasons. Between 2017 and 2019, the NLPL infrastructure task force has received installation requests for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with additional constraints regarding supported versions of, for example, NumPy, PyTorch, or TensorFlow.

For compatibility with third-party code and for reproducibility, users should largely be free (within reason) to pick the module versions they (believe they) require, modules must not change once installed (and announced), and historic or older module versions should remain functional over time, ideally many years into the future. The NLPL approach to meeting these demands has been to ‘unbundle’ to a high degree, i.e. to provision separate add-ons (like Gensim, NumPy, SciPy, TensorFlow, etc.) as individual modules and, as far as possible, to provide each module for multiple base language versions. Abstractly, this design appears adequate and scalable, but module installation needs to be automated further, uniformity across different computing environments improved, and users better guided in navigating the resulting (large) space of only partially interoperable modules.
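
To illustrate the ‘unbundled’ design, here is a minimal Python sketch of how a per-experiment module load line could be composed from individually versioned add-ons. The naming scheme `nlpl-<package>/<version>/<python version>`, the helper functions, and the Gensim version are assumptions made for this example only; they are not taken from the NLPL installation itself.

```python
def module_name(package, version, python_version, prefix="nlpl-"):
    """Compose a module identifier, e.g. 'nlpl-numpy/1.16.3/3.7'.

    The '<prefix><package>/<version>/<python version>' layout is an
    assumed convention for this sketch.
    """
    return f"{prefix}{package}/{version}/{python_version}"


def module_load_command(packages, python_version):
    """Build a single 'module load' line from a {package: version} mapping."""
    names = [
        module_name(package, version, python_version)
        for package, version in sorted(packages.items())
    ]
    return "module load " + " ".join(names)


# Hypothetical experiment environment; version numbers are illustrative.
environment = {"gensim": "3.7.0", "numpy": "1.16.1", "pytorch": "1.0.0"}
before = module_load_command(environment, "3.7")

# Replacing one add-on only changes one version number in the load line.
after = module_load_command({**environment, "numpy": "1.16.3"}, "3.7")

print(before)
print(after)
```

Because each add-on is its own module, swapping one version leaves every other component of the load line untouched, which is the property the design aims for.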

Uniformity across different computing environments essentially means that the exact same versions of tools (and bundles) are available, and of course that they behave the same on all systems. To accomplish this goal, it may ultimately be necessary to build the complete software stack ‘from the ground up’, i.e. include all dependencies (beyond the core operating system) in the NLPL modules collection. Otherwise, if one were to build on top of a pre-installed Python, for example, it is likely that Python installations will differ (in minor versions, compiler and library versions, optional add-on components, and such) across different systems.
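
One way to make this notion of uniformity checkable is to compare module inventories across systems. The following Python sketch assumes that each system can report its installed modules as a simple name-to-version mapping; the function, the system names, and the version numbers are hypothetical.

```python
def inventory_differences(inventories):
    """Report modules whose versions are not identical on all systems.

    `inventories` maps a system name to a {module: version} dictionary.
    The result maps each differing module to {system: version or None},
    where None means the module is missing on that system.
    """
    all_modules = set()
    for modules in inventories.values():
        all_modules.update(modules)

    differences = {}
    for module in sorted(all_modules):
        versions = {
            system: modules.get(module)
            for system, modules in inventories.items()
        }
        if len(set(versions.values())) > 1:
            differences[module] = versions
    return differences


# Hypothetical inventories for two systems; values are illustrative only.
inventories = {
    "system-a": {"python": "3.7.4", "numpy": "1.16.3"},
    "system-b": {"python": "3.7.3", "numpy": "1.16.3"},
}
print(inventory_differences(inventories))
```

In this example only the base Python differs between the two systems, which is the kind of minor-version drift that building the stack ‘from the ground up’ is meant to rule out.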

Containerization

So far, NLPL has shied away from using containers, in part simply because of a lack of support on some of the target systems (notably Taito), in part because of a concern for reduced transparency from the user point of view. Also, containerizing individual software modules severely challenges modularization: there is no straightforward way to ‘mix and match’ multiple containers into a uniform process environment.

However, provisioning the full NLPL software (and possibly data) environment inside a container may offer some benefits, for example compatibility with cloud environments, increased uniformity across different systems, and potentially longer-term reproducibility. On this view, modularization would obtain within the container, just as it does in the current environments on, for example, Abel, Puhti, Saga, and Taito.

Data

For the NLPL community, it is important to have direct access to essential data sets from the command line. A mounted file system is therefore the preferred solution, at least with current workflows. NLPL currently tries to synchronise data sets between the Nordic HPC clusters, providing a standardised view of data sets with proper versioning for replication purposes. The structure within the root data folder is roughly 'nlpl-activity/dataset-name/optional-subfolder/release', where the release refers to the version or the date of the release. Paths are preferably composed of lower-case plain ASCII characters only, but upper-case letters may appear if necessary.
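
The path convention can be sketched in a few lines of Python. The root folder, the activity and data set names, and both helper functions below are hypothetical; the check simply encodes the stated preference for plain ASCII path components, tolerating upper-case letters.

```python
import re
from pathlib import PurePosixPath

# Plain ASCII letters, digits, '.', '_' and '-'; must not start with a dot.
_COMPONENT = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def release_path(root, activity, dataset, release, subfolder=None):
    """Build 'root/activity/dataset/[subfolder/]release' as a POSIX path."""
    parts = [activity, dataset]
    if subfolder:
        parts.append(subfolder)
    parts.append(release)
    for part in parts:
        if not _COMPONENT.fullmatch(part):
            raise ValueError(f"not a plain ASCII path component: {part!r}")
    return PurePosixPath(root, *parts)


def uses_upper_case(path):
    """True if the path deviates from the preferred lower-case form."""
    return any(character.isupper() for character in str(path))


# Hypothetical root folder and names, for illustration only.
path = release_path("/nlpl-data", "example-activity", "example-dataset", "2019-10")
print(path)
print(uses_upper_case(path))
```

Keeping the release as the final path component means that a new version of a data set appears as a sibling directory, so existing releases can stay untouched for replication purposes.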

Mirroring the data is currently done via cron jobs, and the master copy of each data set resides on one specific server, depending on the main responsible person who maintains the resource.
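
Whether a mirror actually matches its master copy could be checked by comparing checksum manifests. This is a minimal Python sketch under the assumption that both trees are reachable as local directories; it is not a description of the existing cron jobs, and the function names are invented for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each file's path relative to `root` to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def out_of_sync(master, mirror):
    """Relative paths that are missing on one side or differ in content."""
    paths = set(master) | set(mirror)
    return sorted(p for p in paths if master.get(p) != mirror.get(p))
```

A manifest would be computed once per release on the master server and once on each mirror; an empty result from `out_of_sync` then indicates that the two copies agree file by file.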

Some datasets are available for external users without HPC access, for example, OPUS parallel data. This is currently done via ObjectStorage and cPouta at CSC. That collection follows the same release structures as the mounted releases and is also in sync with that data.

Goals for NLPL in EOSC: Better streamline the data maintenance and mirroring procedures. Improve data access libraries and tools to make the work with data sets more transparent. Replicability of results is important. Unnecessary data copying and duplication should be avoided. Documentation is essential.