'''NorELMo: Embeddings from Language Models for Norwegian'''

NorELMo is a set of bidirectional recurrent ELMo language models [Peters et al. 2018] trained from scratch on Norwegian Wikipedia. The models can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks.

Download from the NLPL Vector repository:

- ID 210: trained on the ''lemmatized'' Norwegian Wikipedia dump of September 2020 ([http://vectors.nlpl.eu/repository/20/210.zip download])

- ID 211: trained on the ''tokenized'' Norwegian Wikipedia dump of September 2020 ([http://vectors.nlpl.eu/repository/20/211.zip download])
==Training Corpus==
*[https://dumps.wikimedia.org/nowiki/latest/ Norwegian Bokmål Wikipedia] dump from September 2020 (160 million words)
==Preprocessing==
Both models were trained on the corpus tokenized using [https://ufal.mff.cuni.cz/udpipe UDPipe]. The '''lemmatized''' model was trained on a version of the corpus where raw word forms were replaced with their lemmas (''kontorer'' → ''kontor''). Depending on the task, one or the other model may perform better.
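For illustration, this tokenization and lemmatization step can be sketched with the ufal.udpipe Python bindings. This is a minimal sketch, assuming a Norwegian Bokmål UDPipe model; the model file name is a placeholder, not the actual file used.

<pre>
# Sketch: tokenize and lemmatize raw Norwegian text with UDPipe.
# The model file name is a placeholder; any Norwegian Bokmaal UDPipe model works.
from ufal.udpipe import Model, Pipeline

model = Model.load("norwegian-bokmaal-ud.udpipe")
pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT, Pipeline.DEFAULT, "conllu")

conllu = pipeline.process("Vi har mange kontorer.")

# Tokens are in column 2 and lemmas in column 3 of the CoNLL-U output.
tokens, lemmas = [], []
for line in conllu.splitlines():
    if line and not line.startswith("#"):
        fields = line.split("\t")
        tokens.append(fields[1])
        lemmas.append(fields[2])

print(tokens)  # ['Vi', 'har', 'mange', 'kontorer', '.']
print(lemmas)  # includes 'kontor' as the lemma of 'kontorer'
</pre>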
==Vocabulary==
Both models were trained with vocabularies comprising the 100 000 most frequent words in the corresponding training corpus. The vocabularies are published together with the models in the archives linked above.
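Building such a frequency-based vocabulary can be sketched as follows. This is a minimal illustration, not the actual script used; the file names are hypothetical, and the three special tokens at the top follow the convention of the ELMo training code.

<pre>
# Sketch: write the 100 000 most frequent words of a tokenized corpus
# (one sentence per line) to a vocabulary file. File names are hypothetical.
from collections import Counter

counts = Counter()
with open("wiki_tokenized.txt", encoding="utf-8") as corpus:
    for line in corpus:
        counts.update(line.split())

with open("vocab.txt", "w", encoding="utf-8") as vocab:
    # ELMo vocabularies conventionally start with these special tokens.
    for token in ["<S>", "</S>", "<UNK>"]:
        vocab.write(token + "\n")
    for word, _ in counts.most_common(100000):
        vocab.write(word + "\n")
</pre>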
==Training workflow==
Each model was trained for 3 epochs with batch size 192. We employed a [https://github.com/ltgoslo/simple_elmo_training version of the original ELMo training code updated to work better with recent TensorFlow versions]. All the hyperparameters were left at their default values, except that the LSTM dimensionality was reduced from the default 4096 to 2048.
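In terms of the options dictionary used by bilm-tf-style ELMo training code, these settings correspond roughly to the following partial sketch (only the values mentioned above are shown; consult the linked repository for the actual configuration):

<pre>
# Partial sketch of bilm-tf-style training options; all settings
# not listed here were left at the defaults of the training code.
options = {
    "bidirectional": True,
    "batch_size": 192,   # batch size used for both models
    "n_epochs": 3,       # each model was trained for 3 epochs
    "lstm": {
        "dim": 2048,     # reduced from the default 4096
    },
}
</pre>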
==Usage==
The NorELMo models are published in two formats:

1. TensorFlow checkpoints

2. HDF5 model files
We recommend using our [https://pypi.org/project/simple-elmo/ simple-elmo] Python library to load and run the NorELMo models.
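A minimal usage sketch with simple-elmo (the path is a placeholder for one of the downloaded archives, here ID 210):

<pre>
# Sketch: infer contextualized token embeddings with simple-elmo.
from simple_elmo import ElmoModel

model = ElmoModel()
model.load("210.zip")  # path to a downloaded NorELMo archive

# Input: tokenized sentences, preprocessed like the chosen model's corpus.
sentences = [["Vi", "har", "mange", "kontorer", "."]]

# Returns an array of shape (n_sentences, max_length, embedding_dim).
vectors = model.get_elmo_vectors(sentences)
print(vectors.shape)
</pre>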
