NorBERT: Bidirectional Encoder Representations from Transformers
We use clean training corpora with ordered sentences:
- Norsk Aviskorpus (NAK); 1.7 billion words;
- Bokmål Wikipedia; 160 million words;
- Nynorsk Wikipedia; 40 million words.
In total, this comprises about two billion word tokens in both Bokmål and Nynorsk, so NorBERT is a joint model. Separate Bokmål and Nynorsk models are planned as well.
1. Wikipedia texts were extracted using the segment_wiki script from Gensim.
2. In NAK, the text for years up to 2005 is in a one-token-per-line format, with special delimiters that mark the beginning of each new document and provide its URL. We converted this to running text using a custom de-tokenizer.
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.
4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not span document boundaries.
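
The NAK-specific steps above (2 to 4) can be sketched roughly as follows. This is an illustrative sketch using only the Python standard library, not the actual NorBERT pre-processing code. The delimiter prefix `##U`, the punctuation rules in the de-tokenizer and all function names are assumptions made for the example, and sentence segmentation is assumed to have been done already (for instance with Stanza), so `write_corpus` takes documents that are already lists of sentences.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical document delimiter; the real NAK marker may differ.
DOC_DELIMITER_PREFIX = "##U"

# Simplified attachment rules for the illustrative de-tokenizer.
NO_SPACE_BEFORE = {".", ",", ":", ";", "!", "?", ")", "»"}
NO_SPACE_AFTER = {"(", "«"}


def latin1_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a Latin-1 (ISO 8859-1) file as UTF-8."""
    text = src.read_text(encoding="iso-8859-1")
    dst.write_text(text, encoding="utf-8")


def split_documents(lines: Iterable[str],
                    delimiter_prefix: str = DOC_DELIMITER_PREFIX) -> List[List[str]]:
    """Group one-token-per-line input into documents (lists of tokens)."""
    docs: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        token = line.strip()
        if token.startswith(delimiter_prefix):
            # A delimiter line starts a new document; the URL is dropped.
            if current:
                docs.append(current)
                current = []
        elif token:
            current.append(token)
    if current:
        docs.append(current)
    return docs


def detokenize(tokens: Iterable[str]) -> str:
    """Join tokens into running text, attaching punctuation to its neighbour."""
    out = ""
    for token in tokens:
        if out and token not in NO_SPACE_BEFORE and out[-1] not in NO_SPACE_AFTER:
            out += " "
        out += token
    return out


def write_corpus(docs: Iterable[List[str]]) -> str:
    """One sentence per line, with a blank line between documents."""
    return "\n\n".join("\n".join(sentences) for sentences in docs) + "\n"
```

The blank line produced by `write_corpus` is what lets BERT-style data preparation treat each document as a separate unit when sampling sentence pairs for next sentence prediction.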