From Nordic Language Processing Laboratory
Revision as of 22:47, 24 January 2021 by Andreku (talk | contribs) (NorBERT: Bidirectional Encoder Representations from Transformers)

NorBERT: Bidirectional Encoder Representations from Transformers

NorBERT is a BERT deep learning language model [Devlin et al 2019] trained from scratch for Norwegian. It can be used to achieve state-of-the-art results on various Norwegian natural language processing tasks. The model is part of the ongoing NorLM initiative for very large contextualized Norwegian language models and associated tools and recipes. The NorBERT training setup builds on prior work on FinBERT by our collaborators at the University of Turku.

- Download from the NLPL Vector Repository

- Use with the Huggingface Transformers library

NorBERT features a custom WordPiece vocabulary of 30 000 entries that covers Norwegian words much better than the multilingual BERT (mBERT) models from Google, as the following tokenization of the same sentence shows:

Vocabulary | Example of a tokenized sentence
NorBERT | Denne gjengen håper at de sammen skal bidra til å gi kvinne ##fotball ##en i Kristiansand et lenge etterl ##engt ##et løft .
mBERT | Denne g ##jeng ##en h ##å ##per at de sammen skal bid ##ra til å gi k ##vinne ##fo ##t ##ball ##en i Kristiansand et lenge etter ##len ##gte ##t l ##ø ##ft .
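The difference between the two rows comes from which sub-word pieces each vocabulary contains. Below is a minimal sketch of greedy longest-match-first WordPiece splitting, assuming two tiny toy vocabularies that were made up for illustration (they are not the real NorBERT or mBERT vocabularies, and `wordpiece_tokenize` is a hypothetical helper, not part of any library):

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabularies (assumptions for illustration only).
norwegian_like_vocab = {"kvinne", "##fotball", "##en"}
multilingual_like_vocab = {"k", "##vinne", "##fo", "##t", "##ball", "##en"}

print(wordpiece_tokenize("kvinnefotballen", norwegian_like_vocab))
# ['kvinne', '##fotball', '##en']
print(wordpiece_tokenize("kvinnefotballen", multilingual_like_vocab))
# ['k', '##vinne', '##fo', '##t', '##ball', '##en']
```

A vocabulary that contains longer Norwegian pieces yields fewer and more meaningful tokens for the same word.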


So far, we have evaluated NorBERT on two standard benchmarks: Part-of-Speech tagging on Bokmål (taken from the Universal Dependencies project) and sentence-level binary sentiment classification (created by aggregating the fine-grained annotations in NoReC_fine and removing sentences with conflicting or no sentiment).

Data | Train | Dev | Test
POS | 15,696 | 2,409 | 1,939
Sentiment | 2,675 | 516 | 417

We fine-tune NorBERT and mBERT for 10 epochs and keep the model that performs best on the dev set. NorBERT outperforms mBERT on both tasks: by 0.7 percentage points on POS tagging (98.4 vs. 97.7) and by 14.8 points on sentiment classification (81.8 vs. 67.0).

Model/task | mBERT | NorBERT | NB-BERT-Base
Part-of-Speech tagging (accuracy) | 97.7 | 98.4 | 98.5
Sentence-level binary sentiment classification | 67.0 | 81.8 | -
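The "keep the best model on the dev set" procedure can be sketched as follows. This is only an illustration of the selection logic: `train_one_epoch` and `evaluate_on_dev` are hypothetical callables standing in for a real fine-tuning setup, and the dev scores in the demo are invented placeholders, not results from this page.

```python
def fine_tune_keep_best(train_one_epoch, evaluate_on_dev, epochs=10):
    """Train for a fixed number of epochs and keep the best dev-set checkpoint."""
    best_score, best_epoch, best_state = float("-inf"), None, None
    for epoch in range(1, epochs + 1):
        state = train_one_epoch(epoch)    # returns the model state after this epoch
        score = evaluate_on_dev(state)    # higher is better (e.g. accuracy)
        if score > best_score:
            best_score, best_epoch, best_state = score, epoch, state
    return best_epoch, best_score, best_state


# Demo with placeholder numbers: the "state" is just the epoch index.
placeholder_dev_scores = [0.70, 0.75, 0.74]
best_epoch, best_score, _ = fine_tune_keep_best(
    train_one_epoch=lambda epoch: epoch,
    evaluate_on_dev=lambda state: placeholder_dev_scores[state - 1],
    epochs=3,
)
print(best_epoch, best_score)  # 2 0.75
```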

Training Corpus

We use clean training corpora with ordered sentences:

In total, this comprises about two billion (1 907 072 909) word tokens in 203 million (202 802 665) sentences, both in Bokmål and in Nynorsk; thus, this is a joint model. In the future, separate Bokmål and Nynorsk models are planned as well.


1. Wikipedia texts were extracted using segment_wiki.

2. In NAK, the text for years up to 2005 is in the one-token-per-line format. Special delimiters signal the beginning of a new document and provide the URLs. We converted this to running text using a custom de-tokenizer.

3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.

4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not span document boundaries.
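Two of the steps above, the encoding conversion and the final file layout, can be illustrated with a small sketch. It assumes that the sentences have already been segmented (the Stanza step is not shown), and both helper names are hypothetical rather than taken from the NorBERT code:

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-1 (Latin-1) bytes as UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def to_pretraining_text(documents):
    """Lay out documents as one sentence per line, with a blank line between documents."""
    return "\n\n".join("\n".join(sentences) for sentences in documents) + "\n"


# Latin-1 stores "å" as a single byte; UTF-8 needs two.
raw = "håper".encode("latin-1")
converted = latin1_to_utf8(raw)
print(len(raw), len(converted))  # 5 6

# Each inner list is one document that has already been split into sentences.
documents = [
    ["Første setning.", "Andre setning."],
    ["Nytt dokument."],
]
print(to_pretraining_text(documents), end="")
```

The blank line is what lets a BERT pre-training data generator tell where one document ends and the next begins.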


The vocabulary for the model is of size 30 000 and contains cased entries with diacritics. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.

The vocabulary was generated using the SentencePiece algorithm and Tokenizers library (code). The resulting Tokenizers model was converted to the standard BERT WordPiece format.
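The conversion to the WordPiece format is mostly a change of marker convention: SentencePiece marks the beginning of a word with "▁", whereas BERT WordPiece marks word-internal pieces with "##". A minimal sketch of one way to do such a conversion is shown below. It is an assumption about the general technique, not the actual NorBERT conversion script; the function name and the example pieces are made up, and control symbols such as an unknown-token entry are assumed to have been filtered out beforehand.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
WORD_START = "\u2581"  # the "▁" marker used by SentencePiece


def sentencepiece_to_wordpiece(pieces):
    """Map SentencePiece-style pieces to BERT WordPiece vocabulary entries."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece.startswith(WORD_START):
            token = piece[len(WORD_START):]  # word-initial piece: drop the marker
        else:
            token = "##" + piece             # word-internal piece: add BERT's marker
        # Skip empty results (a bare marker) and duplicates.
        if token and token != "##" and token not in seen:
            vocab.append(token)
            seen.add(token)
    return vocab


example_pieces = ["\u2581kvinne", "fotball", "en", "\u2581løft", "\u2581"]
print(sentencepiece_to_wordpiece(example_pieces))
# ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', 'kvinne', '##fotball', '##en', 'løft']
```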

NorBERT Model Details


NorBERT corresponds in its configuration to Google's BERT-Base Cased for English, with 12 layers and a hidden size of 768. Configuration file
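For orientation, a BERT-Base-style configuration can be written down as follows. Only the number of layers, the hidden size and the vocabulary size come from this page; every other value is the usual BERT-Base default and is an assumption here, so the linked configuration file remains the authoritative source.

```python
import json

norbert_like_config = {
    # Stated on this page:
    "num_hidden_layers": 12,
    "hidden_size": 768,
    "vocab_size": 30000,
    # Assumed: standard BERT-Base defaults, not confirmed by this page.
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "initializer_range": 0.02,
}

# The hidden size must be divisible by the number of attention heads.
head_size = norbert_like_config["hidden_size"] // norbert_like_config["num_attention_heads"]
print(head_size)  # 64
print(json.dumps(norbert_like_config, indent=2))
```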

Training Overview

NorBERT was trained on Saga, a Norwegian academic HPC system. For most of the time, training was distributed across 4 compute nodes with 16 NVIDIA P100 GPUs in total. Training took approximately 3 weeks. Instructions for reproducing the training setup with EasyBuild

Training Code

Like the creators of FinBERT, we used the BERT implementation by NVIDIA (version 20.06.08), which allows relatively fast multi-node and multi-GPU training.

We made minor changes to this code, mostly to update it to newer TensorFlow versions (our patches).

All the utilities we used for preprocessing and training are published in our GitHub repository.

Training Workflow

Phase 1 (training with a maximum sequence length of 128) was run with a per-GPU batch size of 48, giving a global batch size of 48*16=768. Since one global batch contains 768 sentences, approximately 265 000 training steps constitute one epoch (one pass over the whole corpus of about 203 million sentences). We trained for 3 epochs, i.e. 795 000 training steps.

Phase 2 (training with a maximum sequence length of 512) was run with a per-GPU batch size of 8, giving a global batch size of 8*16=128. We aimed to mimic the original BERT, in which the model sees in Phase 2 about 1/9 of the number of sentences seen during Phase 1. Thus, we needed about 68 million sentences, which at a global batch size of 128 amounts to about 531 000 additional training steps.
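The step counts for both phases follow from the corpus size and the batch sizes given on this page. The short calculation below reproduces them; the variable names are illustrative only, and the text rounds the intermediate figures.

```python
import math

SENTENCES = 202_802_665  # sentences in the training corpus
GPUS = 16

# Phase 1: sequence length 128, batch size 48 per GPU.
phase1_batch = 48 * GPUS                                # 768
steps_per_epoch = math.ceil(SENTENCES / phase1_batch)   # 264066, rounded to 265 000 in the text
phase1_steps = 3 * 265_000                              # 795 000 steps for 3 epochs

# Phase 2: sequence length 512, batch size 8 per GPU.
phase2_batch = 8 * GPUS                                 # 128
phase2_sentences = 68_000_000                           # about 1/9 of what Phase 1 saw
phase2_steps = phase2_sentences // phase2_batch         # 531250, about 531 000 in the text

# Sanity check of the 1/9 ratio between the two phases.
phase1_sentences_seen = phase1_steps * phase1_batch     # 610 560 000
ratio = phase1_sentences_seen / phase2_sentences

print(steps_per_epoch, phase1_steps, phase2_steps, round(ratio, 2))
```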

Full logs and loss plots can be found here (training was paused on December 25 and 26 while we were resolving problems with mixed precision training).