Difference between revisions of "Vectors/norlm/norbert"

From Nordic Language Processing Laboratory
(Training Corpus)
(Updated links to NorBERT)
 
(23 intermediate revisions by the same user not shown)
Line 3: Line 3:
 
= NorBERT: Bidirectional Encoder Representations from Transformers =
 
= NorBERT: Bidirectional Encoder Representations from Transformers =
  
'''NorBERT''' is a BERT deep learning language model [[https://www.aclweb.org/anthology/N19-1423/ Devlin et al 2019]] trained from scratch for Norwegian. The model can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks.
+
'''NorBERT''' is a series of BERT deep learning language models [[https://www.aclweb.org/anthology/N19-1423/ Devlin et al 2019]] trained from scratch for Norwegian. The models can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks.
 
These models are part of the ongoing
 
These models are part of the ongoing
 
[http://norlm.nlpl.eu NorLM initiative] for very large contextualized
 
[http://norlm.nlpl.eu NorLM initiative] for very large contextualized
Line 12: Line 12:
 
[https://turkunlp.org/ University of Turku].
 
[https://turkunlp.org/ University of Turku].
  
 +
==NorBERT 1==
 
- '''[http://vectors.nlpl.eu/repository/20/216.zip Download from the NLPL Vector Repository]'''
 
- '''[http://vectors.nlpl.eu/repository/20/216.zip Download from the NLPL Vector Repository]'''
  
- '''[https://huggingface.co/ltgoslo/norbert Use with the Huggingface Transformers library]'''
+
- '''[https://huggingface.co/ltg/norbert Use with the Huggingface Transformers library]'''
  
 
- Available locally on Saga: ''/cluster/shared/nlpl/data/vectors/latest/216/''
 
- Available locally on Saga: ''/cluster/shared/nlpl/data/vectors/latest/216/''
  
'''NorBERT''' features a custom 30 000 WordPiece vocabulary that has much better coverage of Norwegian words than the multilingual BERT (mBERT) models from Google:
+
'''NorBERT 1''' features a custom 30 000 WordPiece vocabulary.
 +
 
 +
==NorBERT 2==
 +
- '''[http://vectors.nlpl.eu/repository/20/221.zip Download from the NLPL Vector Repository]'''
 +
 
 +
- '''[https://huggingface.co/ltg/norbert2 Use with the Huggingface Transformers library]'''
 +
 
 +
- Available locally on Saga: ''/cluster/shared/nlpl/data/vectors/latest/221/''
 +
 
 +
'''NorBERT 2''' features a custom 50 000 WordPiece vocabulary.
 +
 
 +
NorBERT vocabularies have much better coverage of Norwegian words than the multilingual BERT (mBERT) models from Google:
  
 
{| class="wikitable"
 
{| class="wikitable"
Line 33: Line 45:
 
== Release history ==
 
== Release history ==
  
... February 2022 - '''version 2'''. Completely new model trained from scratch on the very large corpus of Norwegian (C4 + NCC, about 20 billion word tokens). It features a 50 000 words vocabulary and was trained using Whole Word Masking.
+
7 February 2022 - '''version 2'''. NorBERT 2: a completely new model trained from scratch on the very large corpus of Norwegian (C4 + NCC, about 15 billion word tokens). It features a 50 000 words vocabulary and was trained using Whole Word Masking.
  
 
13 February 2021 - '''version 1.1'''. Fixes an issue with duplicate entries in the NorBERT vocabulary. In rare cases it could lead to warnings and errors. The model itself is unchanged.
 
13 February 2021 - '''version 1.1'''. Fixes an issue with duplicate entries in the NorBERT vocabulary. In rare cases it could lead to warnings and errors. The model itself is unchanged.
Line 41: Line 53:
 
== Evaluation ==
 
== Evaluation ==
  
We have currently evaluated NorBERT on three benchmarks: Part-of-Speech tagging on Bokmål and Nynorsk (taken from [https://universaldependencies.org/ the Universal Dependencies project]), fine-grained sentiment analysis (with data from [https://github.com/ltgoslo/norec_fine NoReC_fine]) and sentence-level binary sentiment classification (with data from aggregating the fine-grained annotations in [https://github.com/ltgoslo/norec_fine NoReC_fine] and removing sentences with conflicting or no sentiment).
+
We have currently evaluated NorBERT on four benchmarks: Part-of-Speech tagging on Bokmål and Nynorsk (taken from [https://universaldependencies.org/ the Universal Dependencies project]), fine-grained sentiment analysis (with data from [https://github.com/ltgoslo/norec_fine NoReC_fine]), sentence-level binary sentiment classification (with data from aggregating the fine-grained annotations in [https://github.com/ltgoslo/norec_fine NoReC_fine] and removing sentences with conflicting or no sentiment) and named entity recognition (with data from [https://github.com/ltgoslo/norne NorNE]).
  
 
Data amounts (in sentences):
 
Data amounts (in sentences):
Line 52: Line 64:
 
|-
 
|-
 
| POS Nynorsk || 14,174 || 1,890 || 1,511
 
| POS Nynorsk || 14,174 || 1,890 || 1,511
 +
|-
 +
| NER Bokmål || 15,696 || 2,409 || 1,939
 +
|-
 +
| NER Nynorsk || 14,174 || 1,890 || 1,511
 
|-
 
|-
 
| Sentiment || 2,675|| 516 || 417
 
| Sentiment || 2,675|| 516 || 417
 
|}
 
|}
  
For POS tagging and binary sentiment classification, we fine-tune NorBERT, [https://huggingface.co/bert-base-multilingual-cased Multilingual BERT] and [https://github.com/NBAiLab/notram NB-BERT-Base] for 20 epochs and keep the best model on the dev set. For fine-grained sentiment analysis, we use BERT token embeddings as features, with frozen model.
+
For POS tagging and binary sentiment classification, we fine-tune NorBERT, [https://huggingface.co/bert-base-multilingual-cased Multilingual BERT] and [https://github.com/NBAiLab/notram NB-BERT-Base] for 10 epochs and keep the best model on the dev set. For fine-grained sentiment analysis, we use BERT token embeddings as features, with frozen model. For NER, we fine-tune the pre-trained model for 20 epochs with early stopping.
 +
 
 
NorBERT outperforms mBERT on both tasks: on POS tagging by 0.5 percentage points, by 9.4 percentage points on binary sentiment classification, and by 2.1 points of targeted F1 score on fine-grained sentiment analysis.  
 
NorBERT outperforms mBERT on both tasks: on POS tagging by 0.5 percentage points, by 9.4 percentage points on binary sentiment classification, and by 2.1 points of targeted F1 score on fine-grained sentiment analysis.  
 
NorBERT is on par with NB-BERT-Base on POS tagging, is a bit worse in binary sentiment classification and better in fine-grained sentiment analysis.
 
NorBERT is on par with NB-BERT-Base on POS tagging, is a bit worse in binary sentiment classification and better in fine-grained sentiment analysis.
Line 62: Line 79:
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
! Model/task !! mBERT !! NorBERT !! NorBERT 2 !! NB-BERT-Base
+
! Model/task !! mBERT !! XLM-R !! NorBERT !! NorBERT 2 !! NB-BERT-Base
 
|-
 
|-
| Part-of-Speech tagging Bokmål (accuracy) || ​98.0 || 98.5 || 98.3 || '''98.7'''
+
| Part-of-Speech tagging Bokmål (accuracy) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/pos_finetuning.py code] || ​98.0 || 97.5 || 98.5 || 98.3 || '''98.7'''
 
|-
 
|-
| Part-of-Speech tagging Nynorsk (accuracy) || ​97.9 || 98.0 || 98.0 || '''98.3'''
+
| Part-of-Speech tagging Nynorsk (accuracy) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/pos_finetuning.py code] || ​97.9 || 97.3 || 98.0 || 97.7 || '''98.3'''
 
|-
 
|-
| Fine-grained sentiment analysis (Targeted F1) || ​34.8 || '''36.9''' ||  || 36.0
+
| Fine-grained sentiment analysis (Mean Targeted F1 across 5 runs) [https://github.com/jerbarnes/sentiment_graphs code] || ​34.9 || 33.9 || '''35.0''' || 33.8 || 34.4
 
|-
 
|-
| Binary sentiment analysis (F1 score) || 67.7 || 77.1 || 80.3 || '''83.9'''
+
| Binary sentiment analysis (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/sentiment_finetuning.py code] || 67.7 || 71.8 || 77.1 || '''84.2''' || 83.9
 
|-
 
|-
| Named entity recognition Bokmål (F1 score) || 78.8  || 85.5  || 88.9 ||  '''90.2'''
+
| Named entity recognition Bokmål (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/bert_ner.py code] || 78.8  || 84.5 || 85.5  || 88.2 ||  '''90.2'''
 
|-
 
|-
| Named entity recognition Nynorsk (F1 score) || 81.7 ||  82.8 ||  86.2 || '''88.6'''
+
| Named entity recognition Nynorsk (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/bert_ner.py code] || 81.7 || 86.6 ||  82.8 ||  84.5 || '''88.6'''
 
|-
 
|-
 
|}
 
|}
Line 92: Line 109:
 
===NorBERT 2===
 
===NorBERT 2===
 
*[https://huggingface.co/datasets/NbAiLab/NCC Norwegian Colossal Corpus] (NCC), non-copyrighted part; 5 billion words;
 
*[https://huggingface.co/datasets/NbAiLab/NCC Norwegian Colossal Corpus] (NCC), non-copyrighted part; 5 billion words;
*[https://aclanthology.org/2021.naacl-main.41/ C4 web-crawled corpus], Norwegian part; 9.5 billion words.
+
*[https://aclanthology.org/2021.naacl-main.41/ C4 web-crawled corpus], Norwegian part; a random sample of about 9.5 billion words.
  
 
In total, this comprises about 15 billion word tokens in about 1 billion sentences, both in Bokmål and in Nynorsk.
 
In total, this comprises about 15 billion word tokens in about 1 billion sentences, both in Bokmål and in Nynorsk.
Line 106: Line 123:
  
 
==Vocabulary==
 
==Vocabulary==
The vocabulary for the model is of size 30 000 and contains ''cased entries with diacritics''. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.
+
The vocabulary for the ''NorBERT 1'' model is of size 30 000 and contains ''cased entries with diacritics''. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.
  
 
The vocabulary was generated using the SentencePiece algorithm and Tokenizers library ([https://github.com/ltgoslo/NorBERT/blob/main/tokenization/spiece_tokenizer.py code]). The resulting [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_sentencepiece_vocab_30k.json Tokenizers model] was [https://github.com/ltgoslo/NorBERT/blob/main/tokenization/sent2wordpiece.py converted] to the standard [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_wordpiece_vocab_30k.txt BERT WordPiece format].
 
The vocabulary was generated using the SentencePiece algorithm and Tokenizers library ([https://github.com/ltgoslo/NorBERT/blob/main/tokenization/spiece_tokenizer.py code]). The resulting [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_sentencepiece_vocab_30k.json Tokenizers model] was [https://github.com/ltgoslo/NorBERT/blob/main/tokenization/sent2wordpiece.py converted] to the standard [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_wordpiece_vocab_30k.txt BERT WordPiece format].
 +
 +
The vocabulary for the ''NorBERT 2'' model is of size 50 000. It was generated using the [https://pypi.org/project/sentencepiece/ original SentencePiece library].
  
 
=NorBERT Model Details=
 
=NorBERT Model Details=
Line 116: Line 135:
  
 
==Training Overview==
 
==Training Overview==
NorBERT was trained on the Norwegian academic HPC system called [https://documentation.sigma2.no/hpc_machines/saga.html Saga]. Most of the time the training was distributed across 4 compute nodes and 16 NVIDIA P100 GPUs. Training took approximately 3 weeks. [http://wiki.nlpl.eu/index.php/Eosc/pretraining/nvidia Instructions for reproducing the training setup with EasyBuild]
+
''NorBERT 1'' was trained on the Norwegian academic HPC system called [https://documentation.sigma2.no/hpc_machines/saga.html Saga]. Most of the time the training was distributed across 4 compute nodes and 16 NVIDIA P100 GPUs. Training took approximately 3 weeks. [http://wiki.nlpl.eu/index.php/Eosc/pretraining/nvidia Instructions for reproducing the training setup with EasyBuild]
 +
 
 +
''NorBERT 2'' was trained on the Norwegian academic HPC system called [https://www.uio.no/english/services/it/research/platforms/edu-research/help/hpc/docs/fox/system-overview.md Fox]. The training was distributed across 1 compute node and 4 NVIDIA A100 GPUs. It took approximately 4 weeks.
  
 
==Training Code==
 
==Training Code==
Line 122: Line 143:
  
 
We made minor changes to this code, mostly to update it to the newer TensorFlow versions ([https://github.com/ltgoslo/NorBERT/tree/main/patches_for_NVIDIA_BERT our patches]).
 
We made minor changes to this code, mostly to update it to the newer TensorFlow versions ([https://github.com/ltgoslo/NorBERT/tree/main/patches_for_NVIDIA_BERT our patches]).
 +
 +
NorBERT 2 was trained with the [https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/LanguageModeling/BERT TensorFlow 2 branch].
  
 
All the utils we used at the preprocessing and training are published in [https://github.com/ltgoslo/NorBERT our Github repository].
 
All the utils we used at the preprocessing and training are published in [https://github.com/ltgoslo/NorBERT our Github repository].
  
 
==Training Workflow==
 
==Training Workflow==
 +
 +
===NorBERT 1===
 +
 
The Phase 1 (training with maximum sequence length of 128) was being done with batch size 48 and global batch size 48*16=768. Since one global batch contains 768 sentences, approximately 265 000 training steps constitute 1 epoch (one pass over the whole corpus). We have done 3 epochs: 795 000 training steps.
 
The Phase 1 (training with maximum sequence length of 128) was being done with batch size 48 and global batch size 48*16=768. Since one global batch contains 768 sentences, approximately 265 000 training steps constitute 1 epoch (one pass over the whole corpus). We have done 3 epochs: 795 000 training steps.
  
Line 131: Line 157:
  
 
Full logs and loss plots can be found [https://github.com/ltgoslo/NorBERT/tree/main/logs here] (the training was on pause on December 25 and 26, since we were solving problems with mixed precision training).
 
Full logs and loss plots can be found [https://github.com/ltgoslo/NorBERT/tree/main/logs here] (the training was on pause on December 25 and 26, since we were solving problems with mixed precision training).
 +
 +
===NorBERT 2===
 +
The Phase 1 (training with maximum sequence length of 128) was being done with batch size 160 and global batch size 160*4=640. Since one global batch contains 640 sentences (training instances), approximately 1 560 000 training steps constitute 1 epoch (one pass over the whole corpus). We have done 2 000 000 training steps in this phase.
 +
 +
The Phase 2 (training with maximum sequence length of 512) was being done with batch size 24 and global batch size 24*4=96. We aimed at mimicking the original BERT in that at Phase 2 the model should see about 1/9 of the number of sentences seen during Phase 1. Thus, we needed about 111 million sentences, which at the global batch size of 96 boils down to 1 160 000 training steps more. We actually did 1 400 000 training steps in this phase.

Latest revision as of 10:47, 21 March 2023


NorBERT: Bidirectional Encoder Representations from Transformers

NorBERT is a series of BERT deep learning language models [Devlin et al 2019] trained from scratch for Norwegian. The models can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks. These models are part of the ongoing NorLM initiative for very large contextualized Norwegian language models and associated tools and recipes. The NorBERT training setup builds on prior work on FinBERT by our collaborators at the University of Turku.

NorBERT 1

- Download from the NLPL Vector Repository

- Use with the Huggingface Transformers library

- Available locally on Saga: /cluster/shared/nlpl/data/vectors/latest/216/

NorBERT 1 features a custom 30 000 WordPiece vocabulary.

NorBERT 2

- Download from the NLPL Vector Repository

- Use with the Huggingface Transformers library

- Available locally on Saga: /cluster/shared/nlpl/data/vectors/latest/221/

NorBERT 2 features a custom 50 000 WordPiece vocabulary.

NorBERT vocabularies have much better coverage of Norwegian words than the multilingual BERT (mBERT) models from Google:

{| class="wikitable"
|-
! Vocabulary !! Example of a tokenized sentence
|-
| NorBERT 2 || Denne gjengen håper at de sammen skal bidra til å gi kvinne ##fotball ##en i Kristiansand et lenge etterleng ##tet løft .
|-
| NorBERT || Denne gjengen håper at de sammen skal bidra til å gi kvinne ##fotball ##en i Kristiansand et lenge etterl ##engt ##et løft .
|-
| mBERT || Denne g ##jeng ##en h ##å ##per at de sammen skal bid ##ra til å gi k ##vinne ##fo ##t ##ball ##en i Kristiansand et lenge etter ##len ##gte ##t l ##ø ##ft .
|}
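The difference in the table above comes from how WordPiece splits a word: it repeatedly takes the longest vocabulary entry that matches at the current position, marking non-initial pieces with ##. The following is a minimal sketch of that greedy longest-match-first procedure. The two toy vocabularies are hypothetical and contain only the entries needed for one word; they are not the real NorBERT or mBERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no vocabulary entry covers this position
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, for illustration only.
norwegian_like_vocab = {"kvinne", "##fotball", "##en"}
multilingual_like_vocab = {"k", "##vinne", "##fo", "##t", "##ball", "##en"}

print(wordpiece_tokenize("kvinnefotballen", norwegian_like_vocab))
print(wordpiece_tokenize("kvinnefotballen", multilingual_like_vocab))
```

With a vocabulary that contains whole Norwegian morphemes, the word is split into three meaningful pieces; with the smaller toy vocabulary it falls apart into six fragments, as in the mBERT row.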

Release history

7 February 2022 - version 2. NorBERT 2: a completely new model trained from scratch on the very large corpus of Norwegian (C4 + NCC, about 15 billion word tokens). It features a 50 000 words vocabulary and was trained using Whole Word Masking.

13 February 2021 - version 1.1. Fixes an issue with duplicate entries in the NorBERT vocabulary. In rare cases it could lead to warnings and errors. The model itself is unchanged.

13 January 2021 - version 1.0 (deprecated)

Evaluation

We have currently evaluated NorBERT on four benchmarks: Part-of-Speech tagging on Bokmål and Nynorsk (taken from the Universal Dependencies project), fine-grained sentiment analysis (with data from NoReC_fine), sentence-level binary sentiment classification (with data from aggregating the fine-grained annotations in NoReC_fine and removing sentences with conflicting or no sentiment) and named entity recognition (with data from NorNE).

Data amounts (in sentences):

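{| class="wikitable"
|-
! Data !! Train !! Dev !! Test
|-
| POS Bokmål || 15,696 || 2,409 || 1,939
|-
| POS Nynorsk || 14,174 || 1,890 || 1,511
|-
| NER Bokmål || 15,696 || 2,409 || 1,939
|-
| NER Nynorsk || 14,174 || 1,890 || 1,511
|-
| Sentiment || 2,675 || 516 || 417
|}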

For POS tagging and binary sentiment classification, we fine-tune NorBERT, Multilingual BERT and NB-BERT-Base for 10 epochs and keep the best model on the dev set. For fine-grained sentiment analysis, we use BERT token embeddings as features, with the model frozen. For NER, we fine-tune the pre-trained model for 20 epochs with early stopping.
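As an illustration of "keep the best model on the dev set" combined with early stopping, here is a minimal sketch of the selection logic only. The `patience` parameter and the example scores are assumptions made for this sketch; the page does not state which stopping criterion the benchmarking scripts use.

```python
def best_epoch_with_early_stopping(dev_scores, patience=3):
    """Return (epoch, score) of the best dev result.

    Stops scanning once `patience` epochs have passed without improvement.
    `patience` is a hypothetical setting, not taken from the NorBERT scripts.
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score


# Made-up dev scores, one per epoch.
scores = [0.80, 0.85, 0.84, 0.83, 0.82, 0.90]
print(best_epoch_with_early_stopping(scores, patience=3))
print(best_epoch_with_early_stopping(scores, patience=10))
```

With a short patience the late improvement in the last epoch is never reached, which is the usual trade-off between training cost and the chance of finding a better checkpoint.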

NorBERT outperforms mBERT on all three tasks: by 0.5 percentage points on POS tagging, by 9.4 percentage points on binary sentiment classification, and by 0.1 points of mean targeted F1 on fine-grained sentiment analysis. NorBERT is on par with NB-BERT-Base on POS tagging, is a bit worse in binary sentiment classification and better in fine-grained sentiment analysis.

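{| class="wikitable"
|-
! Model/task !! mBERT !! XLM-R !! NorBERT !! NorBERT 2 !! NB-BERT-Base
|-
| Part-of-Speech tagging Bokmål (accuracy) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/pos_finetuning.py code] || 98.0 || 97.5 || 98.5 || 98.3 || '''98.7'''
|-
| Part-of-Speech tagging Nynorsk (accuracy) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/pos_finetuning.py code] || 97.9 || 97.3 || 98.0 || 97.7 || '''98.3'''
|-
| Fine-grained sentiment analysis (Mean Targeted F1 across 5 runs) [https://github.com/jerbarnes/sentiment_graphs code] || 34.9 || 33.9 || '''35.0''' || 33.8 || 34.4
|-
| Binary sentiment analysis (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/sentiment_finetuning.py code] || 67.7 || 71.8 || 77.1 || '''84.2''' || 83.9
|-
| Named entity recognition Bokmål (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/bert_ner.py code] || 78.8 || 84.5 || 85.5 || 88.2 || '''90.2'''
|-
| Named entity recognition Nynorsk (F1 score) [https://github.com/ltgoslo/NorBERT/blob/main/benchmarking/experiments/bert_ner.py code] || 81.7 || 86.6 || 82.8 || 84.5 || '''88.6'''
|}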

Training Corpus

NorBERT 1

We use clean training corpora with ordered sentences:

In total, this comprises about two billion (1 907 072 909) word tokens in 203 million (202 802 665) sentences, both in Bokmål and in Nynorsk; thus, this is a joint model. In the future, separate Bokmål and Nynorsk models are planned as well.

NorBERT 2

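*Norwegian Colossal Corpus (NCC), non-copyrighted part; 5 billion words;
*C4 web-crawled corpus, Norwegian part; a random sample of about 9.5 billion words.

In total, this comprises about 15 billion word tokens in about 1 billion sentences, both in Bokmål and in Nynorsk.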

Preprocessing

1. Wikipedia texts were extracted using segment_wiki.

2. In NAK, for years up to 2005, the text is in the one-token-per-line format. There are special delimiters signaling the beginning of a new document and providing the URLs. We converted this to running text using a self-made de-tokenizer.

3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.

4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and sections in the case of Wikipedia) so that the "next sentence prediction" task doesn't span between documents.
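Two of the steps above are simple enough to sketch in a few lines: re-encoding Latin-1 bytes as UTF-8, and writing one sentence per line with a blank line between documents. This is a hedged illustration only; the function names are made up for this sketch and are not the utilities from the NorBERT repository, and sentence segmentation itself (done with Stanza) is assumed to have happened already.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode ISO 8859-1 ('Latin-1') bytes as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


def join_documents(documents):
    """Write one sentence per line, with a blank line between documents.

    `documents` is a list of documents, each a list of sentence strings.
    The blank line marks a document boundary for next sentence prediction.
    """
    return "\n\n".join("\n".join(sentences) for sentences in documents) + "\n"


# Hypothetical example input.
legacy_bytes = "Denne gjengen håper".encode("iso-8859-1")
print(latin1_to_utf8(legacy_bytes).decode("utf-8"))

docs = [["Første setning.", "Andre setning."], ["Nytt dokument."]]
print(join_documents(docs), end="")
```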

Vocabulary

The vocabulary for the NorBERT 1 model is of size 30 000 and contains cased entries with diacritics. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.

The vocabulary was generated using the SentencePiece algorithm and Tokenizers library (code). The resulting Tokenizers model was converted to the standard BERT WordPiece format.

The vocabulary for the NorBERT 2 model is of size 50 000. It was generated using the original SentencePiece library.

NorBERT Model Details

Configuration

NorBERT corresponds in its configuration to Google's BERT-Base Cased for English, with 12 layers and hidden size 768. Configuration file

Training Overview

NorBERT 1 was trained on the Norwegian academic HPC system called Saga. Most of the time the training was distributed across 4 compute nodes and 16 NVIDIA P100 GPUs. Training took approximately 3 weeks. Instructions for reproducing the training setup with EasyBuild

NorBERT 2 was trained on the Norwegian academic HPC system called Fox. The training was distributed across 1 compute node and 4 NVIDIA A100 GPUs. It took approximately 4 weeks.

Training Code

Similar to the creators of FinBERT, we employed the BERT implementation by NVIDIA (version 20.06.08) which allows relatively fast multi-node and multi-GPU training.

We made minor changes to this code, mostly to update it to the newer TensorFlow versions (our patches).

NorBERT 2 was trained with the TensorFlow 2 branch.

All the utilities we used for preprocessing and training are published in our GitHub repository.

Training Workflow

NorBERT 1

Phase 1 (training with a maximum sequence length of 128) used a per-GPU batch size of 48, giving a global batch size of 48*16=768. Since one global batch contains 768 sentences, approximately 265 000 training steps constitute 1 epoch (one pass over the whole corpus). We trained for 3 epochs: 795 000 training steps.

Phase 2 (training with a maximum sequence length of 512) used a per-GPU batch size of 8, giving a global batch size of 8*16=128. We aimed at mimicking the original BERT in that during Phase 2 the model should see about 1/9 of the number of sentences seen during Phase 1. Thus, we needed about 68 million sentences, which at the global batch size of 128 amounts to 531 000 additional training steps.

Full logs and loss plots can be found here (the training was on pause on December 25 and 26, since we were solving problems with mixed precision training).
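The step counts for NorBERT 1 follow from the corpus size and batch sizes given above. The sketch below redoes that arithmetic; the variable and function names are chosen for this illustration. The exact results come out slightly below the rounded figures quoted in the text (about 265 000 steps per epoch, 68 million sentences, 531 000 steps).

```python
import math


def steps_per_epoch(num_sentences, per_gpu_batch, num_gpus):
    """Training steps needed for one pass over the corpus."""
    global_batch = per_gpu_batch * num_gpus
    return math.ceil(num_sentences / global_batch)


NUM_SENTENCES = 202_802_665   # NorBERT 1 corpus size in sentences
NUM_GPUS = 16

phase1_global_batch = 48 * NUM_GPUS
phase1_epoch_steps = steps_per_epoch(NUM_SENTENCES, 48, NUM_GPUS)

# Phase 1 ran for 795 000 steps (3 epochs at the rounded 265 000 steps each).
phase1_steps = 795_000
phase1_sentences_seen = phase1_steps * phase1_global_batch

# Phase 2 should see about 1/9 of the sentences seen in Phase 1.
phase2_global_batch = 8 * NUM_GPUS
phase2_sentences = phase1_sentences_seen // 9
phase2_steps = phase2_sentences // phase2_global_batch

print("Phase 1 global batch:", phase1_global_batch)
print("Phase 1 steps per epoch (exact):", phase1_epoch_steps)
print("Phase 2 target sentences:", phase2_sentences)
print("Phase 2 steps:", phase2_steps)
```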

NorBERT 2

Phase 1 (training with a maximum sequence length of 128) used a per-GPU batch size of 160, giving a global batch size of 160*4=640. Since one global batch contains 640 sentences (training instances), approximately 1 560 000 training steps constitute 1 epoch (one pass over the whole corpus). We ran 2 000 000 training steps in this phase.

Phase 2 (training with a maximum sequence length of 512) used a per-GPU batch size of 24, giving a global batch size of 24*4=96. We aimed at mimicking the original BERT in that during Phase 2 the model should see about 1/9 of the number of sentences seen during Phase 1. Thus, we needed about 111 million sentences, which at the global batch size of 96 amounts to 1 160 000 additional training steps. We actually ran 1 400 000 training steps in this phase.
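For NorBERT 2 the planned and actual step counts differ, so it can be useful to check what share of Phase 1 data Phase 2 actually saw. The sketch below is a back-of-the-envelope calculation using only the step counts and batch sizes quoted above; the variable names are chosen for this illustration. Note that the 111 million target is derived from one epoch of Phase 1 (1 560 000 steps), not from the 2 000 000 steps actually run.

```python
PHASE1_GLOBAL_BATCH = 160 * 4   # 640 sentences per step
PHASE2_GLOBAL_BATCH = 24 * 4    # 96 sentences per step

# Planned Phase 2 budget: 1/9 of one Phase 1 epoch.
phase1_epoch_steps = 1_560_000
planned_phase2_sentences = phase1_epoch_steps * PHASE1_GLOBAL_BATCH // 9
planned_phase2_steps = planned_phase2_sentences / PHASE2_GLOBAL_BATCH

# What was actually run.
phase1_seen = 2_000_000 * PHASE1_GLOBAL_BATCH
phase2_seen = 1_400_000 * PHASE2_GLOBAL_BATCH
actual_ratio = phase2_seen / phase1_seen

print("Planned Phase 2 sentences:", planned_phase2_sentences)
print("Planned Phase 2 steps: %.0f" % planned_phase2_steps)
print("Sentences seen in Phase 1:", phase1_seen)
print("Sentences seen in Phase 2:", phase2_seen)
print("Actual Phase 2 / Phase 1 ratio: %.3f" % actual_ratio)
```

The actual ratio comes out close to the 1/9 target because both phases ran longer than the single-epoch plan.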