Very Large Language Models in the Nordics (VLLMN)
In the summer of 2022, the shared LUMI supercomputer will (likely) open for trial usage of its vast GPU partition. NLPL partners in Finland (Turku and Helsinki) and Norway (Oslo) are coordinating their efforts towards the creation of very large-scale (neural) language models for multiple Nordic languages. This work is part of the Nordic Language Modeling (NorLM) initiative.
- Ablations with BERT
- BERT (separate Bokmål and Nynorsk models)
- Large language models with linguistically motivated inductive biases (linked to the dScience PhD position); one example is Google's ETC.
See the links above for the requirements of each particular model.
In general, we rely on Python (>=3.8) and its SciPy stack.
We will definitely require fully functional, GPU-enabled installations of PyTorch (1.11) and TensorFlow (preferably both 1.15.5 and 2.8.2).
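As a minimal sketch of how such version pins could be checked on a cluster node, the helper below compares dotted version strings against the minima listed above. The function names and the exact pins are illustrative, not part of any official NorLM tooling; version strings with local suffixes (e.g. `+cu113`) simply have the non-numeric parts ignored.

```python
import sys


def version_tuple(version: str) -> tuple:
    """Parse a dotted version string like '1.11.0' into a comparable tuple.

    Non-numeric components (e.g. the '+cu113' suffix in some PyTorch builds)
    are dropped, so '1.11.0+cu113' compares as (1, 11).
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def meets_minimum(installed: str, required: str) -> bool:
    """Return True if the installed version is at least the required one."""
    return version_tuple(installed) >= version_tuple(required)


# Check the interpreter itself against the Python >= 3.8 requirement.
assert sys.version_info[:2] >= (3, 8), "Python >= 3.8 is required"

# Hypothetical checks against the pins above (replace the first argument
# with torch.__version__ / tf.__version__ on an actual install).
print(meets_minimum("1.11.0", "1.11"))   # → True
print(meets_minimum("2.8.2", "2.8.2"))   # → True
```

On an actual node one would pass `torch.__version__` and `tf.__version__` instead of literal strings, and additionally check `torch.cuda.is_available()` for GPU access.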
- Collaboration with the National Library (Colossal Norwegian Corpus): we now have the public part of it (/cluster/projects/nn9851k/corpora/NCC on Saga)
- Extracting the Norwegian part from the C4 dataset: /cluster/projects/nn9851k/corpora/c4 on Saga
- Additional news collections from MediaFutures SFI (Lilja?)
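For working with the corpus directories above, a generator along the following lines could stream documents without loading a whole collection into memory. The `*.txt.gz` layout (one document per gzipped text file) is an assumption for illustration; the actual on-disk format of NCC and C4 on Saga may differ and the pattern should be adjusted accordingly.

```python
import gzip
from pathlib import Path
from typing import Iterator


def iter_documents(corpus_dir: str, pattern: str = "*.txt.gz") -> Iterator[str]:
    """Yield one document (the decompressed file contents) per matching file.

    Assumes a layout of gzipped UTF-8 text files under corpus_dir; adapt the
    glob pattern and the reading logic to the actual corpus format.
    """
    for path in sorted(Path(corpus_dir).rglob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            yield handle.read()


# Usage on Saga (paths from the list above):
# for doc in iter_documents("/cluster/projects/nn9851k/corpora/NCC"):
#     process(doc)
```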