= Background =

This page is a working document for the requirements of the NLP(L) use case in the EOSC Nordic project.

The NLPL research community (in late 2019) comprises many dozens of active users, ranging from MSc students to professors; there is much variation in computational experience and ‘Un*x foo’. Likewise, computing tasks vary a lot, ranging from maybe a handful of single-CPU jobs to thousands of (mildly) parallel or multi-GPU tasks; NLP research quite generally is both data- and compute-intensive.

Typical types of data include potentially large [http://wiki.nlpl.eu/index.php/Corpora/home document collections] (for example 130 billion words of English extracted from the Common Crawl or vast collections of translated texts in multiple languages), pre-computed representations of [http://wiki.nlpl.eu/index.php/Vectors/home word or sentence meaning] (so-called word embeddings), or more specialized training and evaluation sets for supervised machine learning tasks like parsing or machine translation.
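
To make this concrete, below is a minimal sketch of how such a pre-computed word embedding model could be loaded and queried in Python. It assumes the third-party gensim library and a model file in the plain-text word2vec format; the file name <code>model.txt</code> is a placeholder for an actual model, for example one obtained from the [http://wiki.nlpl.eu/index.php/Vectors/home vectors repository]. This is an illustration only, not a prescribed NLPL workflow.

<syntaxhighlight lang="python">
# Minimal sketch: loading and querying a pre-computed word embedding
# model with gensim. Assumes a plain-text word2vec-format file;
# "model.txt" is a placeholder for a real model file.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("model.txt", binary=False)

# Each word maps to a dense vector; semantically related words tend
# to receive nearby vectors.
print(vectors["language"][:5])                    # first dimensions of one vector
print(vectors.most_similar("language", topn=3))   # nearest neighbours
</syntaxhighlight>

Even this small example presupposes a downloaded model file and an installed software environment, which hints at why shared, centrally maintained data and software matter for the use case.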

In a nutshell, preparing the software and data environment for the ‘average’ NLP experiment is no small task; duplication of data, software, and effort should be minimized. Further, reproducibility and replicability play an increasingly central role in NLP research.

= Software =

= Data =