Lumi/pilot

From Nordic Language Processing Laboratory
= LUMI-G Pilot =

[[File:norbert.png|thumb|right|150px]]

In late 2021, the shared [https://www.lumi-supercomputer.eu/ LUMI supercomputer] will (likely) open for trial usage of its vast GPU partition.
 
NLPL partners in Finland (Turku and Helsinki) and Norway (Oslo) are coordinating their efforts towards the creation of very large-scale (neural) language models for multiple Nordic languages. This work is part of the Nordic Language Modeling (NorLM) initiative.
  
 
= Model Architectures =

* [[Eosc/pretraining#BERT|BERT]] (separate Bokmål and Nynorsk models)
* [[Eosc/pretraining#RoBERTa|RoBERTa]]
* [[Eosc/pretraining#ELECTRA|ELECTRA]]
* [https://openai.com/blog/gpt-2-1-5b-release/ GPT]
* [https://arxiv.org/pdf/1910.10683.pdf T5]
* Large language models with linguistically motivated inductive biases (linked to the dScience PhD position); one example is Google's [https://www.aclweb.org/anthology/2020.emnlp-main.19/ ETC].
  
= Software Support =

See the links above for each model's particular requirements.

In general, we rely on Python (>= 3.7) and its [https://www.scipy.org/ SciPy] stack.

We will require fully functional, GPU-enabled installations of PyTorch (1.8.1) and TensorFlow (preferably both 1.15.5 and 2.4.1).
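As a sketch, the presence of this stack could be sanity-checked with a small stdlib-only script; the package list and minimum versions below just mirror this page and are an assumption, not an official requirement set:

```python
import sys
import importlib

# Minimal environment check for the software stack described above.
# Package names and versions are assumptions taken from this page,
# not an authoritative LUMI requirement list.
REQUIRED_PYTHON = (3, 7)
PACKAGES = [
    "numpy",       # part of the SciPy stack
    "scipy",
    "torch",       # PyTorch; a GPU-enabled build is expected on LUMI-G
    "tensorflow",  # 1.15.5 and/or 2.4.1
]

def check_environment():
    """Return a dict mapping each requirement to a status string."""
    report = {}
    ok = sys.version_info[:2] >= REQUIRED_PYTHON
    report["python"] = "ok" if ok else "too old"
    for name in PACKAGES:
        try:
            mod = importlib.import_module(name)
            report[name] = getattr(mod, "__version__", "unknown")
        except ImportError:
            report[name] = "missing"
    return report

if __name__ == "__main__":
    for key, status in check_environment().items():
        print(f"{key}: {status}")
```

On a correctly provisioned node, every entry should report a version string rather than "missing"; checking that the reported PyTorch build actually sees accelerators would additionally require a framework-specific call.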
  
Multi-GPU and multi-node training must be possible. In the NVIDIA world, [https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html NCCL] and [https://github.com/horovod/horovod Horovod] are used for this; what the equivalent tooling looks like in the AMD world is still an open question for us.
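The core primitive behind NCCL/Horovod-style data-parallel training is an all-reduce that averages gradients across workers. A toy pure-Python illustration of that averaging step (no GPUs and no real communication library involved):

```python
# Toy illustration of the all-reduce averaging that NCCL/Horovod
# perform across GPUs/nodes during data-parallel training.
# Each worker computes gradients on its own data shard; all workers
# then apply the same averaged gradient.

def allreduce_mean(grads_per_worker):
    """Average per-parameter gradients across workers.

    grads_per_worker: list (one entry per worker) of equal-length
    gradient lists.
    """
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    return [
        sum(worker[i] for worker in grads_per_worker) / n_workers
        for i in range(n_params)
    ]

# Example: 3 workers, 2 parameters each.
workers = [[0.1, 0.4], [0.3, 0.2], [0.2, 0.6]]
print(allreduce_mean(workers))
```

In practice this reduction is what the communication library implements efficiently over NVLink/InfiniBand; the question for LUMI-G is which library plays that role on AMD hardware.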
  
 
= Data: Norwegian =

* Collaboration with the National Library ([https://github.com/NBAiLab/notram/blob/master/guides/corpus_description.md Colossal Norwegian Corpus])?
* Extracting the Norwegian part from the [https://github.com/allenai/allennlp/discussions/5056 C4 dataset]?
* Additional news collections (Lilja?)
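Extracting the Norwegian portion of a multilingual corpus such as C4 amounts to running language identification over documents and keeping the Norwegian ones. A hedged sketch, where <code>detect_language</code> is a deliberately naive stand-in for a real language identifier (a production pipeline would use a proper LID model instead):

```python
# Sketch of language-filtering a multilingual corpus down to Norwegian.
# detect_language() is a toy stand-in for a real language identifier;
# the keyword heuristic below is illustrative only.

NORWEGIAN_HINTS = {"ikke", "jeg", "det", "og", "er", "en"}

def detect_language(text):
    """Toy language ID: guess 'no' if common Norwegian tokens appear."""
    tokens = set(text.lower().split())
    return "no" if tokens & NORWEGIAN_HINTS else "other"

def filter_norwegian(documents):
    """Keep only documents identified as Norwegian."""
    return [doc for doc in documents if detect_language(doc) == "no"]

docs = [
    "Jeg vet ikke hva det er.",
    "This is an English sentence.",
]
print(filter_norwegian(docs))  # keeps only the Norwegian document
```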

Latest revision as of 23:54, 25 March 2021
