Difference between revisions of "Vectors/norlm/norbert"

From Nordic Language Processing Laboratory
Line 18: Line 18:
  
 
4. The resulting corpus was sentence-segmented using [https://stanfordnlp.github.io/stanza/performance.html Stanza]. We left blank lines between documents (and sections in the case of Wikipedia) so that the "next sentence prediction" task doesn't span between documents.
 
 +
 +
==Vocabulary==
 +
The vocabulary for the model is of size 30 000 and contains ''cased entries with diacritics''. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.
 +
 +
The vocabulary was generated using the SentencePiece algorithm and the Tokenizers library ([https://github.com/ltgoslo/NorBERT/blob/main/tokenization/spiece_tokenizer.py code]). The resulting [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_sentencepiece_vocab_30k.json Tokenizers model] was [https://github.com/ltgoslo/NorBERT/blob/main/tokenization/sent2wordpiece.py converted] to the standard [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_wordpiece_vocab_30k.txt BERT WordPiece format].

Revision as of 23:53, 11 January 2021

NorBERT: Bidirectional Encoder Representations from Transformers

Training corpus

We use clean training corpora with ordered sentences:

In total, this comprises about two billion word tokens in both Bokmål and Nynorsk; thus, this is a joint model. Separate Bokmål and Nynorsk models are planned for the future.

Preprocessing

1. Wikipedia texts were extracted using segment_wiki.

2. In NAK, the text for years up to 2005 is in a one-token-per-line format, with special delimiters that signal the beginning of a new document and provide its URL. We converted this to running text using a custom de-tokenizer.

3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.

4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and sections in the case of Wikipedia) so that the "next sentence prediction" task doesn't span between documents.
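
Steps 3 and 4 can be illustrated with a minimal sketch. It is not the code used to build the corpus: the function names are hypothetical, and the regex-based splitter is only a stand-in for Stanza so that the example runs without external dependencies.

```python
import re


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode raw ISO 8859-1 (Latin-1) bytes as UTF-8."""
    return data.decode("latin-1").encode("utf-8")


def naive_split(text: str) -> list[str]:
    """Stand-in sentence splitter (assumption): break after '.', '!' or '?'.

    The actual pipeline used Stanza for sentence segmentation.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def write_corpus(documents, split=naive_split) -> str:
    """Lay out one sentence per line, with a blank line between documents.

    The blank line marks a document boundary, so that sentence pairs built
    for next sentence prediction never cross from one document into another.
    """
    blocks = []
    for doc in documents:
        sentences = split(doc)
        if sentences:
            blocks.append("\n".join(sentences))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    raw = "Bokmål og nynorsk. To dokumenter følger.".encode("latin-1")
    first = latin1_to_utf8(raw).decode("utf-8")
    print(write_corpus([first, "Nytt dokument."]), end="")
```

Any callable that maps a string to a list of sentences can be passed as `split`, so a Stanza-based segmenter could replace the stand-in without changing the layout logic.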

Vocabulary

The vocabulary for the model is of size 30 000 and contains cased entries with diacritics. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.

The vocabulary was generated using the SentencePiece algorithm and the Tokenizers library (code). The resulting Tokenizers model was converted to the standard BERT WordPiece format.
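
The idea behind such a conversion can be sketched as follows. This is an illustration only and may differ from the conversion script linked above: the function name, the list of special tokens, and the toy input are assumptions. SentencePiece marks word-initial pieces with a leading "▁" (U+2581), whereas WordPiece leaves word-initial tokens unmarked and prefixes continuation tokens with "##".

```python
SPIECE_MARK = "\u2581"  # SentencePiece word-boundary marker
# Assumed special tokens: the usual BERT set, placed first in the vocabulary.
BERT_SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
# SentencePiece control symbols that have no WordPiece counterpart here.
SPIECE_CONTROL = {"<unk>", "<s>", "</s>", "<pad>"}


def sentencepiece_to_wordpiece(pieces):
    """Map SentencePiece pieces to a WordPiece-style vocabulary list."""
    vocab = list(BERT_SPECIALS)
    seen = set(vocab)
    for piece in pieces:
        if piece in SPIECE_CONTROL:
            continue
        if piece.startswith(SPIECE_MARK):
            # Word-initial piece: drop the marker.
            token = piece[len(SPIECE_MARK):]
        else:
            # Continuation piece: add the WordPiece prefix.
            token = "##" + piece
        # Skip the bare marker (empty token) and duplicates.
        if token and token not in seen:
            seen.add(token)
            vocab.append(token)
    return vocab


if __name__ == "__main__":
    toy_pieces = ["<unk>", SPIECE_MARK, SPIECE_MARK + "Norge", "et", SPIECE_MARK + "og"]
    for token in sentencepiece_to_wordpiece(toy_pieces):
        print(token)
```

Writing the returned list to a text file, one token per line, gives the layout that BERT-style WordPiece vocabulary files use.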