Difference between revisions of "Vectors/norlm"

From Nordic Language Processing Laboratory
(Models)
(Related Work)
(19 intermediate revisions by 3 users not shown)
Line 1:
- = Background =
+ = Norwegian Large-scale Language Models =

- Welcome to the emerging collection of very large contextualized
+ [[File:norbert.png|thumb|right|150px]]
+ Welcome to the emerging collection of large-scale contextualized
  language models for the Norwegian language.
+ NorLM is a joint initiative of the projects
+ [https://www.eosc-nordic.eu/ EOSC-Nordic] (European Open Science Cloud) and
+ [https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT]
+ (Sentiment Analysis for Norwegian),
+ coordinated by the
+ [https://www.mn.uio.no/ifi/english/research/groups/ltg/ Language Technology Group] (LTG)
+ at the University of Oslo.

- = Models =
+ We are working to provide these models and supporting tools for researchers and developers in Natural
+ Language Processing (NLP) for the Norwegian language.
+ We do so in the hope of facilitating scientific experimentation with and practical applications of state-of-the-art
+ NLP architectures, as well as to enable others to develop their own large-scale models, for example for
+ domain- or application-specific tasks, language variants, or even other languages than Norwegian.

- * [[Vectors/norlm/elmo|ELMo: LSTM-Based Architectures]]
- * [[Vectors/norlm/norbert|BERT: Transformer-Based Architectures]]
+ Under the auspices of the
+ [http://wiki.nlpl.eu NLPL] use case in EOSC-Nordic, we
+ are coordinating with colleagues in Denmark, Finland, and Sweden
+ on a collection of large contextualized language models for the
+ Nordic languages, including language variants or related groups
+ of languages, as linguistically or technologically appropriate.
+
+ = Available Models =
+
+ At this initial stage of development, Norwegian models for two common architecture variants are available:
+
+ * [[Vectors/norlm/norelmo|NorELMo: LSTM-Based Architectures]]
+ * [[Vectors/norlm/norbert|NorBERT: Transformer-Based Architectures]]
+
+ We emphatically welcome all kinds of user feedback, including of course suggestions for improvement
+ or suggestions for additional types of Norwegian contextualized language models or associated tools.
+ Please contact us via the NorLM technical coordinator,
+ [https://www.mn.uio.no/ifi/english/people/aca/andreku/ Andrey Kutuzov].
+
+ = License and Access =
+
+ All Norwegian language models from the NorLM initiative are
+ publicly available for download from the
+ [http://vectors.nlpl.eu/repository NLPL Vectors Repository], with a [https://creativecommons.org/licenses/by/4.0/ CC BY 4.0 license].
+ The NorBERT model is also included with the
+ [https://huggingface.co/transformers/ Huggingface Transformers Library].
+
+ To receive announcements of updates and availability of additional
+ models, please self-subscribe to our very low-traffic NorLM
+ [http://lists.nlpl.eu/mailman/listinfo/norlm mailing list].
+
+ = Related Work =
+
+ Our paper "Large-Scale Contextualised Language Modelling for Norwegian" is accepted to [https://nodalida2021.github.io/ NoDaliDa'2021 conference].
+
+ [https://arxiv.org/abs/2104.06546 Full text is available on arXiv].
+
+ = Acknowledgements =
+
+ The NorLM resources are being developed on the Norwegian national supercomputing services operated by
+ [https://www.sigma2.no/ UNINETT Sigma2], the National Infrastructure for High Performance Computing and Data Storage in Norway.
+ Software provisioning was financially supported through the European
+ [https://www.eosc-nordic.eu/ EOSC-Nordic] project; data preparation and evaluation
+ were supported by the Norwegian
+ [https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT] project.
+ We are indebted to all funding agencies involved, the University of Oslo, and the
+ Norwegian tax payer.
Revision as of 09:36, 15 April 2021

Norwegian Large-scale Language Models


Welcome to the emerging collection of large-scale contextualized language models for the Norwegian language. NorLM is a joint initiative of the projects EOSC-Nordic (European Open Science Cloud) and SANT (Sentiment Analysis for Norwegian), coordinated by the Language Technology Group (LTG) at the University of Oslo.

We are working to provide these models and supporting tools for researchers and developers in Natural Language Processing (NLP) for the Norwegian language. We do so in the hope of facilitating scientific experimentation with and practical applications of state-of-the-art NLP architectures, as well as to enable others to develop their own large-scale models, for example for domain- or application-specific tasks, language variants, or even other languages than Norwegian.

Under the auspices of the NLPL use case in EOSC-Nordic, we are coordinating with colleagues in Denmark, Finland, and Sweden on a collection of large contextualized language models for the Nordic languages, including language variants or related groups of languages, as linguistically or technologically appropriate.

Available Models

At this initial stage of development, Norwegian models for two common architecture variants are available:

* NorELMo: LSTM-Based Architectures
* NorBERT: Transformer-Based Architectures

We emphatically welcome all kinds of user feedback, including of course suggestions for improvement or suggestions for additional types of Norwegian contextualized language models or associated tools. Please contact us via the NorLM technical coordinator, Andrey Kutuzov.

License and Access

All Norwegian language models from the NorLM initiative are publicly available for download from the NLPL Vectors Repository, with a CC BY 4.0 license. The NorBERT model is also included with the Huggingface Transformers Library.

To receive announcements of updates and availability of additional models, please self-subscribe to our very low-traffic NorLM mailing list.

Related Work

Our paper "Large-Scale Contextualised Language Modelling for Norwegian" has been accepted to the NoDaLiDa 2021 conference.

Full text is available on arXiv.

Acknowledgements

The NorLM resources are being developed on the Norwegian national supercomputing services operated by UNINETT Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway. Software provisioning was financially supported through the European EOSC-Nordic project; data preparation and evaluation were supported by the Norwegian SANT project. We are indebted to all funding agencies involved, the University of Oslo, and the Norwegian taxpayer.