Infrastructure/software/eosc

From Nordic Language Processing Laboratory

Revision as of 21:29, 8 September 2019

Background

This page is a working document on requirements for the NLP(L) use case in the EOSC Nordic project.

The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors; there is much variation in computational experience and ‘Un*x foo’. Computing tasks vary just as much, ranging from a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research quite generally is both data- and compute-intensive.

Typical types of data include potentially large document collections (for example, 130 billion words of English extracted from the Common Crawl, or vast collections of translated texts in multiple languages), pre-computed representations of word or sentence meaning (so-called word embeddings), and more specialized training and evaluation sets for supervised machine learning tasks like parsing or machine translation.

After some two years of activity in the NLPL project, its community has collectively installed some 80 shared software modules and around six terabytes of primary source data. In May 2019, module load operations for NLPL-maintained software accounted for close to five percent of the total on the Norwegian Abel supercluster. In sum, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized. Further, reproducibility and replicability play an increasingly important role in NLP research: other researchers must be able to re-run the same experiment (and obtain the same results), ideally also several years after the original publication.

Software

Relevant software modules comprise general-purpose run-time environments like Java and Python, machine learning frameworks like DyNet, PyTorch, SciPy, or TensorFlow, and a myriad of discipline-specific tools like CoreNLP, Gensim, Marian, NLTK, OpenNMT, spaCy, and others. NLPL users typically ‘mix and match’ several of these components and then build their own code on top. They will often require specific versions of individual modules, sometimes for good reasons. Between 2017 and 2019, the NLPL infrastructure task force received installation requests for individual Python add-ons against language versions 2.7, 3.5, and 3.7, sometimes with additional constraints regarding supported versions of, for example, NumPy, PyTorch, or TensorFlow.
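
As a concrete (if simplified) illustration of why such version constraints matter, the sketch below records the Python interpreter and add-on versions visible to a running job, so that an experiment can cite the exact environment it was produced in. The set of packages checked is illustrative only and not an NLPL standard.

<syntaxhighlight lang="python">
# Minimal sketch: record the Python interpreter and add-on versions visible
# to a job, so that the exact environment can be cited and reproduced later.
# The package list below is illustrative, not an NLPL standard.
import importlib
import json
import platform

PACKAGES = ["numpy", "scipy", "gensim", "torch", "tensorflow"]

def environment_report():
    report = {"python": platform.python_version()}
    for name in PACKAGES:
        try:
            module = importlib.import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None  # add-on not available in this environment
    return report

if __name__ == "__main__":
    print(json.dumps(environment_report(), indent=2))
</syntaxhighlight>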

For compatibility with third-party code and for reproducibility, users should largely be free (within reason) to pick the module versions they (believe they) require; modules must not change once installed (and announced); and historic or older module versions should remain functional over time, ideally many years into the future. The NLPL approach to meeting these demands has been to ‘unbundle’ to a high degree, i.e. to provision each add-on (Gensim, NumPy, SciPy, TensorFlow, etc.) as an individual module and, as far as possible, to provide each module for multiple base language versions. Abstractly, this design appears adequate and scalable, but module installation needs to be automated further, uniformity across different computing environments improved, and users better guided in navigating the resulting (large) space of only partially interoperable modules.
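
The sketch below is a rough illustration of how the unbundled design scales: when every add-on version is built separately against every supported base Python version, the number of module installations grows with the product of the two lists. All module names and version numbers in the example are hypothetical.

<syntaxhighlight lang="python">
# Illustrative sketch of the 'unbundled' module space: every add-on version
# is built once per base Python version, so the number of separate module
# installations grows with the product of both lists.  All names and version
# numbers below are hypothetical examples, not actual NLPL modules.
from itertools import product

python_versions = ["2.7", "3.5", "3.7"]
addons = {
    "numpy": ["1.15.4", "1.16.4"],
    "pytorch": ["1.0.0", "1.1.0"],
    "tensorflow": ["1.11.0", "1.13.1"],
}

builds = [
    (f"nlpl-{name}/{version}", f"python/{py}")
    for name, versions in addons.items()
    for version, py in product(versions, python_versions)
]

print(f"{len(builds)} separate module builds, for example:")
for module_name, base in builds[:3]:
    print(f"  {module_name} built against {base}")
</syntaxhighlight>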

Data