Difference between revisions of "Lumi/pilot"
Latest revision as of 14:13, 12 October 2022
Very Large Language Models in the Nordics (VLLMN)
In the summer of 2022, the shared LUMI supercomputer will (likely) open for trial usage of its vast GPU partition. NLPL partners in Finland (Turku and Helsinki) and Norway (Oslo) are coordinating their efforts towards the creation of very large-scale (neural) language models for multiple Nordic languages. This work is part of the Nordic Language Modeling (NorLM) initiative.
Model Architectures
Prioritized:
- NorBERT 3 on the concatenation of NorBERT1/NorBERT2 corpora (base and large versions)
- Separate BERT-base models for Bokmål and Nynorsk
- T5 on the NorBERT3 corpus: at least the unsupervised denoising objective stage
Less prioritized:
- GPT-2/3
- Ablations with BERT
- ELECTRA
- (separate Bokmål and Nynorsk models)
- RoBERTa
- Large language models with linguistically motivated inductive biases (linked to David Samuel's PhD topic); one example is Google's ETC.
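For reference, the unsupervised denoising objective mentioned for T5 in the prioritized list can be illustrated with a small sketch. It is not the NorLM training code: the function name span_corrupt is hypothetical, and the spans are passed in explicitly so the result is reproducible, whereas the T5 paper samples them at random (by default corrupting about 15% of the tokens with a mean span length of 3). The sentinel names follow the <extra_id_N> convention of the released T5 vocabularies.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Build a T5-style (inputs, targets) pair from explicit token spans.

    Each (start, end) span (end exclusive) is replaced in the inputs by a
    sentinel token; the targets list each sentinel followed by the tokens
    it replaced, and end with one final closing sentinel.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(sorted(spans)):
        # Spans must be non-empty, in range, and non-overlapping.
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the uncorrupted prefix
        inputs.append(sentinel)              # mask the span in the inputs
        targets.append(sentinel)             # announce the span in the targets
        targets.extend(tokens[start:end])    # followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep the uncorrupted tail
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets
```

With the example sentence from the T5 paper, "Thank you for inviting me to your party last week", and the spans (2, 4) and (8, 9), the inputs become "Thank you <extra_id_0> me to your party <extra_id_1> week" and the targets "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>".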
Software Support
See the links above for each model's specific requirements.
In general, we rely on Python (>=3.9) and its SciPy stack.
We will definitely need fully functional GPU-enabled installations of PyTorch (1.11) and TensorFlow (preferably both 1.15.5 and 2.8.2).
Multi-GPU and multi-node training must be possible. In the NVIDIA world, NCCL and Horovod are used for this. The AMD counterparts still need to be confirmed; MIOpen and RCCL are the likely candidates.
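The requirements above could be checked on a given system with a small script along the following lines. This is only a sketch under stated assumptions: the helper names (python_ok, module_version, gpu_report) are hypothetical, and the script merely reports what it finds without enforcing the exact PyTorch or TensorFlow versions listed. On ROCm builds of PyTorch, AMD GPUs are exposed through the same torch.cuda interface, and torch.version.hip is set instead of being None.

```python
import importlib
import sys
from typing import Dict, Optional, Sequence

MIN_PYTHON = (3, 9)  # taken from the requirement stated above


def python_ok(version: Optional[Sequence[int]] = None) -> bool:
    """Return True if the (major, minor) version meets MIN_PYTHON."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) >= MIN_PYTHON


def module_version(name: str) -> Optional[str]:
    """Return the module's version string, or None if it is not installed."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def gpu_report() -> Dict[str, object]:
    """Collect framework versions and, if PyTorch is present, GPU visibility."""
    report: Dict[str, object] = {
        "python_ok": python_ok(),
        "torch": module_version("torch"),
        "tensorflow": module_version("tensorflow"),
    }
    if report["torch"] is not None:
        import torch

        report["gpu_available"] = torch.cuda.is_available()
        report["gpu_count"] = torch.cuda.device_count()
        # Not None only on ROCm (AMD) builds of PyTorch.
        report["hip"] = getattr(torch.version, "hip", None)
    return report
```

Calling print(gpu_report()) on a login or compute node gives a quick overview; whether multi-node communication actually works has to be tested separately with a distributed job.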
Data: Norwegian
- Collaboration with the National Library (Colossal Norwegian Corpus): we now have the public part of it (/cluster/projects/nn9851k/corpora/NCC on Saga)
- Extracting the Norwegian part from the C4 dataset: /cluster/projects/nn9851k/corpora/c4 on Saga
- Additional news collections from MediaFutures SFI (Lilja?)