Infrastructure/software/eosc
Background
This page serves as a working document for the requirements of the NLP(L) use case in the EOSC Nordic project.
The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors, with wide variation in computational experience and ‘Un*x foo’. Computing tasks vary just as widely, from a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research in general is both data- and compute-intensive.
Typical data include potentially large document collections (for example, 130 billion words of English extracted from the Common Crawl, or vast collections of translated texts in multiple languages), pre-computed representations of word or sentence meaning (so-called word embeddings), and more specialized training and evaluation sets for supervised machine learning tasks such as parsing or machine translation.
After some two years of activity in the NLPL project, its community has collectively installed some 80 shared software modules and curated around six terabytes of primary source data. In May 2019, module load operations for NLPL-maintained software accounted for close to five percent of the total on the Norwegian Abel supercluster. In sum, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized.
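To make this concrete, a ‘module load operation’ here refers to activating one of the shared software modules in a login or batch session on the cluster. The sketch below uses the standard Environment Modules commands available on such systems; the module name and version shown are hypothetical placeholders, not actual NLPL modules.

  # Illustrative only: the module name below is a hypothetical placeholder.
  # List software modules matching a prefix in the shared project area.
  module avail nlpl

  # Activate a shared module in the current session (hypothetical name/version).
  module load nlpl-example/1.0

  # Show which modules are currently loaded in the environment.
  module list

Counting invocations of the second command across all users is what yields usage statistics such as the five-percent figure cited above.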
Further, reproducibility and replicability play an increasingly important role in NLP research. Other researchers must be able to re-run the same experiment (and obtain the same results), ideally even several years after the original publication.
Software
Relevant software modules comprise general-purpose frameworks like Java and Python, machine learning