Latest revision as of 23:14, 15 February 2024
Norwegian Large Language Models
Welcome to the emerging collection of large-scale contextualized and generative language models for the Norwegian language. NorLM (or, more recently, NORA.LLM) originated as a joint initiative of the projects EOSC-Nordic (European Open Science Cloud), SANT (Sentiment Analysis for Norwegian), and HPLT (High-Performance Language Technologies), in collaboration with the AI Laboratory of the National Library of Norway and the National e-Infrastructure Services, coordinated by the Language Technology Group (LTG) at the University of Oslo.
We are working to provide these models and supporting tools for researchers and developers in Natural Language Processing (NLP) for the Norwegian language. We do so in the hope of facilitating scientific experimentation with, and practical applications of, state-of-the-art NLP architectures, and of enabling others to develop their own large-scale models, for example for domain- or application-specific tasks, language variants, or even languages other than Norwegian.
Under the auspices of the NLPL use case in EOSC-Nordic, we are coordinating with colleagues in Denmark, Finland, and Sweden on a collection of large contextualized language models for the Nordic languages, including language variants or related groups of languages, as linguistically or technologically appropriate.
Available Models
Norwegian models are currently available for the following architecture variants:
- NorELMo: LSTM-Based Architectures
- NorBERT: Transformer-Based Architectures
- NorT5: Combined Encoder–Decoder Architecture
- NorMistral & NorBLOOM: Generative Language Models
We emphatically welcome all kinds of user feedback, including of course suggestions for improvement or suggestions for additional types of Norwegian contextualized language models or associated tools. Please contact us via the Nor(aL)LM technical coordinator, Andrey Kutuzov.
License and Access
All Norwegian language models from the NorLM initiative are publicly available for download under open-source licenses, either from the NLPL Vectors Repository or through the Huggingface Hub. NorBERT, NorT5, and the newer generative models are also directly supported in the Huggingface Transformers Library.
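As a rough illustration of what loading through the Transformers library might look like, the sketch below maps each model family named on this page to a Transformers "Auto" class. The mapping itself is an assumption made for this example (it follows the architecture labels in the model list, not any statement by the NorLM maintainers), and `load_model`, `loader_class_name`, and the `model_id` argument are hypothetical names. The repository identifier must be taken from the model card on the Hub (for example under https://huggingface.co/norallm), and individual repositories may need extra loading arguments that are not shown here.

```python
from typing import Optional

# Architecture labels follow the model list on this page. The choice of
# Transformers "Auto" class per family is an assumption of this sketch.
AUTO_CLASS_BY_FAMILY = {
    "NorELMo": None,  # LSTM-based; not listed as supported in Transformers
    "NorBERT": "AutoModelForMaskedLM",
    "NorT5": "AutoModelForSeq2SeqLM",
    "NorMistral": "AutoModelForCausalLM",
    "NorBLOOM": "AutoModelForCausalLM",
}


def loader_class_name(family: str) -> Optional[str]:
    """Return the assumed Auto class name for a family, or None."""
    return AUTO_CLASS_BY_FAMILY[family]


def load_model(family: str, model_id: str):
    """Load tokenizer and model for `model_id` (a Hub repository id).

    `model_id` is supplied by the caller; no repository names are
    hard-coded here because they should come from the model card.
    """
    class_name = loader_class_name(family)
    if class_name is None:
        raise ValueError(
            f"{family} is not loaded through Transformers in this sketch; "
            "see the NLPL Vectors Repository instead."
        )
    # Imported lazily so the mapping above can be used without the library.
    import transformers

    tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
    model = getattr(transformers, class_name).from_pretrained(model_id)
    return tokenizer, model
```

A call such as `load_model("NorBERT", model_id)` downloads the weights on first use, so it needs network access and enough disk space for the chosen model.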
Related Work
Our paper Large-Scale Contextualised Language Modelling for Norwegian was presented at the 2021 Nordic Conference on Computational Linguistics (NoDaLiDa).
Acknowledgements
The NorLM resources are being developed on the Norwegian national supercomputing services operated by UNINETT Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway. Software provisioning was financially supported through the European EOSC-Nordic project; data preparation and evaluation were supported by the Norwegian SANT and the Horizon Europe HPLT projects. We are indebted to all funding agencies involved, the University of Oslo, and the Norwegian taxpayer.