NorBERT: Bidirectional Encoder Representations from Transformers

Training corpus

We use clean training corpora with ordered sentences:

Norsk Aviskorpus (NAK); 1.7 billion words;
Bøkmal Wikipedia; 160 million words;
Nynorsk Wikipedia; 40 million words;

In total, this comprises about two billion word tokens, both in Bøkmal and in Nynorsk; thus, this is a joint model. In the future, separate Børmal and Nynorsk models are planned as well.

Preprocessing

1. Wikipedia texts were extracted using segment_wiki.

2. In NAK, for years up to 2005, the text is in the one-token-per-line format. There are special delimiters signaling the beginning of a new document and providing the URLs. We converted this to running text using a self-made de-tokenizer.

3. In NAK, everything up to and including 2011 is in the ISO 8859-01 encoding ('Latin-1'). These files were converted to UTF-8 before any other pre-processing.

4. The resulting corpus was sentence-segmented using Stanza. We left blank lines between documents (and sections in the case of Wikipedia) so that the "next sentence prediction" task doesn't span between documents.

Vectors/norlm/norbert

NorBERT: Bidirectional Encoder Representations from Transformers

Training corpus

Preprocessing

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools