Difference between revisions of "Vectors/norlm/norelmo"

From Nordic Language Processing Laboratory
Latest revision as of 13:16, 12 March 2021


NorELMo: Embeddings from Language Models for Norwegian

NorELMo is a set of bidirectional recurrent ELMo language models trained from scratch on Norwegian text (Norwegian Wikipedia and, for the NorELMo30 and NorELMo100 models, also Norsk Aviskorpus) as part of the ongoing NorLM initiative. ELMo was the first contextualized embedding architecture to become well known in the NLP community. The paper describing it (Peters et al., 2018) received the Best Paper award at the NAACL 2018 conference.

NorELMo models can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks. In many cases, they may be a viable alternative to NorBERT, especially if computational resources are scarce.

Download from the NLPL Vector repository (http://vectors.nlpl.eu/repository/):

- ID 210: trained on lemmatized Norwegian Wikipedia Dump of September 2020, about 160 million words (download)

- ID 211: trained on tokenized Norwegian Wikipedia Dump of September 2020, about 160 million words (download)

- ID 217 (NorELMo30): trained on the NorBERT training corpus (Wikipedia and Norsk Aviskorpus) with 30 000 words in the target vocabulary (download)

- ID 218 (NorELMo100): trained on the NorBERT training corpus (Wikipedia and Norsk Aviskorpus) with 100 000 words in the target vocabulary (download)

Preprocessing

All the models were trained on a corpus tokenized using UDPipe. The lemmatized model was trained on a version of the corpus in which raw word forms were replaced with their lemmas ('kontorer' → 'kontor'). Which model works better depends on the task.
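The difference between the tokenized and the lemmatized training text can be illustrated with a minimal sketch. It assumes the tagger output is in CoNLL-U format (which UDPipe produces), where the second column holds the word form and the third the lemma. The function name `conllu_to_text` and the hand-written sample are illustrative only and are not part of the NorELMo pipeline.

```python
from typing import List


def conllu_to_text(conllu: str, use_lemmas: bool = False) -> List[str]:
    """Convert CoNLL-U text into space-joined sentences of forms or lemmas."""
    sentences: List[str] = []
    current: List[str] = []
    for line in conllu.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence
            if current:
                sentences.append(" ".join(current))
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments
        columns = line.split("\t")
        token_id = columns[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges and empty nodes
        # Column 2 is FORM, column 3 is LEMMA (1-based, as in the CoNLL-U spec)
        current.append(columns[2] if use_lemmas else columns[1])
    if current:
        sentences.append(" ".join(current))
    return sentences


# Hand-written sample for illustration, not actual UDPipe output
SAMPLE = (
    "# text = Vi har kontorer\n"
    "1\tVi\tvi\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\thar\tha\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tkontorer\tkontor\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

tokenized = conllu_to_text(SAMPLE)
lemmatized = conllu_to_text(SAMPLE, use_lemmas=True)
```

With this sample, `tokenized` keeps the surface forms while `lemmatized` contains 'kontor' in place of 'kontorer'.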

Vocabulary

Independent of the vocabulary size, an ELMo model can process arbitrary word tokens, due to its architecture (the first CNN layer converts input strings to non-contextual word embeddings). Thus, the size of the vocabulary controls only the number of words used as targets for the language modelling task during training. Presumably, a model with a larger vocabulary handles less frequent words better, at the cost of being less effective with more frequent words.

The vocabulary sizes for NorELMo30 and NorELMo100 (30 000 and 100 000 words, respectively) are stated in the download list above. The vocabularies are published together with the models in the archives linked above.
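As a rough illustration of how a target vocabulary of the N most frequent words can be derived from a tokenized corpus, consider the sketch below. The function name `build_vocabulary`, the tie-breaking rule, and the decision not to count the special tokens towards N are assumptions made for this example. The three special tokens follow the convention of the original ELMo training code as far as we know, so check the training code before relying on them.

```python
from collections import Counter
from typing import Iterable, List

# Assumed special tokens placed at the top of the vocabulary file
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]


def build_vocabulary(sentences: Iterable[str], size: int) -> List[str]:
    """Return the special tokens followed by the `size` most frequent words."""
    counts: Counter = Counter()
    for sentence in sentences:
        # The corpus is assumed to be tokenized already, one sentence per string
        counts.update(sentence.split())
    # Descending frequency; ties are broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return SPECIAL_TOKENS + [word for word, _ in ranked[:size]]


toy_corpus = ["et kontor", "et stort kontor", "et hus"]
toy_vocabulary = build_vocabulary(toy_corpus, size=2)
```

Words outside this list can still be given as input to the model; they are simply never used as prediction targets during training.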

Training workflow

Each model was trained for 3 epochs with batch size 192. We employed a version of the original ELMo training code updated to work better with recent TensorFlow versions. All the hyperparameters were left at their default values, except the LSTM dimensionality, which was reduced from the default 4096 to 2048.
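The settings above can be summarized as a small set of overrides applied on top of the defaults. The sketch below is only an illustration: the key names (`lstm`, `dim`, `n_epochs`, `batch_size`) and the helper `with_overrides` are hypothetical and may not match the option names in the actual training code, and the defaults dictionary contains only the one default value stated in the text.

```python
import copy

# Only the default mentioned in the text; other defaults are deliberately omitted
DEFAULT_OPTIONS = {"lstm": {"dim": 4096}}


def with_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with `overrides` merged in recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested dictionaries instead of replacing them wholesale
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


# Values taken from the text: 3 epochs, batch size 192, LSTM dimensionality 2048
NORELMO_OPTIONS = with_overrides(
    DEFAULT_OPTIONS,
    {"lstm": {"dim": 2048}, "n_epochs": 3, "batch_size": 192},
)
```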

Evaluation

The evaluation setup is described in the Evaluation section of the NorBERT page.

The time required to adapt each model to a particular task (fine-tuning or training a classifier over a frozen model) is given in parentheses, in minutes.

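{| class="wikitable"
|-
! Task/Model !! mBERT !! NorBERT !! NB-BERT-Base !! NorELMo30 !! NorELMo100
|-
| Part-of-Speech tagging (accuracy) || 98.0 ''(245)'' || 98.5 ''(238)'' || '''98.7''' ''(244)'' || 98.1 ''(8)'' || 98.0 ''(8)''
|-
| Fine-grained sentiment analysis (Sentiment Graph F1) || 31.7 ''(444)'' || '''34.8''' ''(438)'' || '''34.8''' ''(404)'' || 34.5 ''(446)'' || 34.2 ''(434)''
|-
| Sentence-level binary sentiment classification (F1 score) || 67.7 ''(37)'' || 77.1 ''(35)'' || '''83.9''' ''(37)'' || 75.0 ''(5)'' || 75.0 ''(5)''
|}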

Usage

The NorELMo models are published in two formats:

1. TensorFlow checkpoints

2. HDF5 model files

We recommend using our simple-elmo Python library to work with NorELMo models.
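A minimal sketch of producing contextualized vectors with simple-elmo might look as follows. It assumes the `ElmoModel` class with its `load` and `get_elmo_vectors` methods as described in the library's README; please check the current simple-elmo documentation, since the interface may have changed. The helper names `tokenize` and `embed_sentences` and the local archive path are hypothetical, and whitespace splitting is only a stand-in for proper UDPipe tokenization.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Whitespace tokenization as a placeholder for UDPipe tokenization."""
    return text.split()


def embed_sentences(model_path: str, texts: List[str]):
    """Load a NorELMo model and return contextualized vectors for `texts`."""
    # Imported lazily so the helpers above work without simple_elmo installed
    from simple_elmo import ElmoModel

    model = ElmoModel()
    # `model_path` is assumed to be a model archive or directory from the repository
    model.load(model_path)
    sentences = [tokenize(text) for text in texts]
    # Expected to return one vector per token for every sentence
    return model.get_elmo_vectors(sentences)


example_tokens = tokenize("Vi har kontorer i Oslo .")
```

A call such as `embed_sentences("211.zip", ["Vi har kontorer i Oslo ."])` would then embed one sentence with the tokenized Wikipedia model, assuming the archive has been downloaded to the working directory. For the lemmatized model (ID 210), the input should be lemmatized in the same way as the training corpus.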