Revision as of 00:39, 12 January 2021
NorBERT: Bidirectional Encoder Representations from Transformers
NorBERT is a BERT deep learning language model trained from scratch for Norwegian. The model can be used to achieve state-of-the-art results for various Norwegian natural language processing tasks.
NorBERT features a custom 30 000 WordPiece vocabulary that has much better coverage of Norwegian words than the multilingual BERT (mBERT) models from Google:
Vocabulary | Example of a tokenized sentence |
---|---|
NorBERT | Denne gjengen håper at de sammen skal bidra til å gi kvinne ##fotball ##en i Kristiansand et lenge etterl ##engt ##et løft . |
mBERT | Denne g ##jeng ##en h ##å ##per at de sammen skal bid ##ra til å gi k ##vinne ##fo ##t ##ball ##en i Kristiansand et lenge etter ##len ##gte ##t l ##ø ##ft . |
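The splits in the table above follow the usual WordPiece scheme: each word is broken into the longest vocabulary entries that match, reading left to right, and every non-initial piece carries the `##` prefix. Below is a minimal sketch of that greedy matching. The function name and the toy vocabulary are hypothetical and chosen only to reproduce two words from the NorBERT row; this is not the real NorBERT vocabulary or tokenizer code.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into WordPiece units by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real NorBERT vocabulary.
toy_vocab = {"kvinne", "##fotball", "##en", "etterl", "##engt", "##et", "løft"}

for word in ["kvinnefotballen", "etterlengtet", "løft"]:
    print(word, "->", wordpiece_tokenize(word, toy_vocab))
```

A vocabulary with few Norwegian entries forces the same procedure to fall back on very short pieces, which is what the mBERT row shows.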
Evaluation
Model/task | mBERT | NorBERT |
---|---|---|
Part-of-Speech tagging | 97.9 | 98.5 |
Sentence-level binary sentiment classification | 66.7 | 82.3 |
Training corpus
We use clean training corpora with ordered sentences:
- Norsk Aviskorpus (NAK); 1.7 billion words;
- Bokmål Wikipedia; 160 million words;
- Nynorsk Wikipedia; 40 million words;
In total, this comprises about two billion word tokens in both Bokmål and Nynorsk, so this is a joint model. Separate Bokmål and Nynorsk models are planned for the future.
Preprocessing
1. Wikipedia texts were extracted using segment_wiki.
2. In NAK, for years up to 2005, the text is in the one-token-per-line format. There are special delimiters signaling the beginning of a new document and providing the URLs. We converted this to running text using a self-made de-tokenizer.
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.
4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not span documents.
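The steps above can be illustrated with a small standard-library sketch. It is not the project's actual code: `detokenize` is a deliberately naive stand-in for the self-made de-tokenizer, all function names are hypothetical, and the sentence lists are assumed to come from a sentence segmenter such as Stanza.

```python
import re


def latin1_to_utf8(raw_bytes):
    """Re-encode ISO 8859-1 ('Latin-1') bytes as UTF-8 bytes (step 3)."""
    return raw_bytes.decode("iso-8859-1").encode("utf-8")


def detokenize(tokens):
    """Naive de-tokenizer (step 2): join tokens, then re-attach punctuation.

    Hypothetical simplification; the real de-tokenizer is not shown here.
    """
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


def write_corpus(documents):
    """Format the corpus as in step 4.

    `documents` is a list of documents, each a list of sentence strings.
    Output has one sentence per line and a blank line between documents.
    """
    blocks = ["\n".join(sentences) for sentences in documents if sentences]
    return "\n\n".join(blocks) + "\n"


latin1_bytes = "håper på et løft".encode("iso-8859-1")
print(latin1_to_utf8(latin1_bytes).decode("utf-8"))

print(detokenize(["Dette", "er", "en", "test", "."]))

docs = [["Første setning.", "Andre setning."], ["Nytt dokument."]]
print(write_corpus(docs), end="")
```

The blank line is what lets BERT's data preparation treat each document as a separate unit when sampling sentence pairs.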
Vocabulary
The vocabulary for the model is of size 30 000 and contains cased entries with diacritics. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.
The vocabulary was generated using the SentencePiece algorithm and Tokenizers library (code). The resulting Tokenizers model was converted to the standard BERT WordPiece format.
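How the conversion was done is not described here, so the following is only one plausible sketch. It assumes the trained pieces use the SentencePiece convention of marking word-initial pieces with "▁" (U+2581), and it maps them to the BERT convention, where word-internal pieces are prefixed with `##` instead. The function name, the sample pieces and the order of the special tokens are all hypothetical.

```python
WORD_START = "\u2581"  # "▁", the SentencePiece word-boundary marker
# Standard BERT special tokens; their order here is an assumption.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def to_wordpiece(pieces):
    """Convert SentencePiece-style pieces to a BERT WordPiece vocabulary list."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece.startswith(WORD_START):
            token = piece[len(WORD_START):]  # word-initial: drop the marker
        else:
            token = "##" + piece  # word-internal: add continuation prefix
        # Skip the bare marker (empty after stripping) and duplicates.
        if token and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


# Hypothetical sample pieces, not taken from the real NorBERT vocabulary.
sample_pieces = ["▁kvinne", "fotball", "en", "▁løft", "▁"]
print("\n".join(to_wordpiece(sample_pieces)))
```

Writing the resulting list one token per line gives the plain-text `vocab.txt` layout that BERT implementations expect.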
NorBERT model
Configuration
NorBERT corresponds in its configuration to Google's BERT-Base Cased for English, with 12 layers and hidden size 768. Configuration file
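To give a sense of the model size this configuration implies, the sketch below counts encoder parameters. Only the 12 layers, the hidden size of 768 and the 30 000-entry vocabulary come from this page. The remaining values (12 attention heads, feed-forward size 3072, 512 positions, 2 segment types) are the published BERT-Base defaults and are assumed here, not read from the NorBERT configuration file. The count excludes the pre-training output heads.

```python
# Values marked "assumed" are standard BERT-Base defaults, not taken
# from the NorBERT configuration file.
config = {
    "vocab_size": 30000,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,       # assumed
    "intermediate_size": 3072,       # assumed
    "max_position_embeddings": 512,  # assumed
    "type_vocab_size": 2,            # assumed
}


def count_parameters(cfg):
    """Count weights and biases of a BERT encoder (embeddings, layers, pooler)."""
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias vectors
    embeddings = (
        cfg["vocab_size"]
        + cfg["max_position_embeddings"]
        + cfg["type_vocab_size"]
    ) * h + layer_norm
    # Query, key, value and output projections, plus a layer norm.
    attention = 4 * (h * h + h) + layer_norm
    # Two dense layers (h -> ff -> h), plus a layer norm.
    feed_forward = (h * ff + ff) + (ff * h + h) + layer_norm
    pooler = h * h + h
    layers = cfg["num_hidden_layers"] * (attention + feed_forward)
    return embeddings + layers + pooler


print(f"{count_parameters(config):,} parameters")
```

Under these assumptions the encoder has roughly 109 million parameters, of which about 23 million sit in the token embedding matrix. The number of attention heads does not affect the count, since the heads only partition the 768-dimensional projections.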
Training overview
NorBERT was trained on the Norwegian academic HPC system called Saga. Most of the time the training was distributed across 4 compute nodes and 16 NVIDIA P100 GPUs. Training took approximately 3 weeks. Instructions for reproducing the training setup with EasyBuild
Training code
Like the creators of FinBERT, we employed the BERT implementation by NVIDIA (version 20.06.08), which allows relatively fast multi-node and multi-GPU training.
We made minor changes to this code, mostly to update it for newer TensorFlow versions (our patches).