= NorBERT: Bidirectional Encoder Representations from Transformers =

==Training corpus==
We use clean training corpora with ordered sentences:
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/ Norsk Aviskorpus] (NAK); 1.7 billion words;
*[https://dumps.wikimedia.org/nowiki/latest/ Bokmål Wikipedia]; 160 million words;
*[https://dumps.wikimedia.org/nnwiki/latest/ Nynorsk Wikipedia]; 40 million words;

In total, this comprises about two billion word tokens in both Bokmål and Nynorsk; thus, this is a ''joint'' model. Separate Bokmål and Nynorsk models are planned for the future.

==Preprocessing==
1. Wikipedia texts were extracted using [https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py segment_wiki].
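The gensim script is run on a raw Wikipedia XML dump and writes one gzipped JSON record per article. As a rough sketch only (the dump and output file names below are placeholders, not the paths we actually used), the extracted sections can be flattened to plain text like this:

<syntaxhighlight lang="python">
# Sketch: flatten segment_wiki output (one JSON article per line) into plain text.
# File names are placeholders. The JSON was produced beforehand with, e.g.:
#   python -m gensim.scripts.segment_wiki -f nowiki-latest-pages-articles.xml.bz2 -o nowiki.json.gz
import gzip
import json

with gzip.open("nowiki.json.gz", "rt", encoding="utf-8") as inp, \
        open("nowiki.txt", "w", encoding="utf-8") as out:
    for line in inp:
        article = json.loads(line)
        for section_text in article["section_texts"]:
            out.write(section_text.strip() + "\n\n")  # blank line after every section
</syntaxhighlight>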
2. In NAK, for the years up to 2005, the text is in a one-token-per-line format, with special delimiters signaling the beginning of a new document and providing its URL. We converted this to running text with a [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/detokenize.py self-made de-tokenizer].
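The actual script is linked above; the minimal sketch below only illustrates the core idea of joining one-token-per-line input back into running text with naive punctuation spacing, and it does not handle NAK's document delimiters.

<syntaxhighlight lang="python">
# Minimal de-tokenization sketch (not the actual script linked above):
# reads one token per line from stdin; an empty line ends the current sentence.
import re
import sys

def detokenize(tokens):
    """Join tokens into running text and fix spacing around punctuation."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,:;!?%)])", r"\1", text)  # no space before . , : ; ! ? % )
    text = re.sub(r"\(\s+", "(", text)              # no space after an opening bracket
    return text

tokens = []
for line in sys.stdin:
    token = line.strip()
    if token:
        tokens.append(token)
    elif tokens:
        print(detokenize(tokens))
        tokens = []
if tokens:
    print(detokenize(tokens))
</syntaxhighlight>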
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/recode.sh converted] to UTF-8 before any other pre-processing.
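The conversion was done with the linked shell script; a rough Python equivalent (file names are placeholders) is:

<syntaxhighlight lang="python">
# Sketch: re-encode a Latin-1 (ISO 8859-1) file as UTF-8.
# The file names are placeholders, not the actual NAK paths.
from pathlib import Path

src = Path("nak-2005.txt")        # hypothetical Latin-1 input file
dst = Path("nak-2005.utf8.txt")   # UTF-8 output file

dst.write_text(src.read_text(encoding="latin-1"), encoding="utf-8")
</syntaxhighlight>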
4. The resulting corpus was sentence-segmented using [https://stanfordnlp.github.io/stanza/performance.html Stanza]. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not span document boundaries.
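A minimal sketch of this step, assuming the Bokmål ('nb') Stanza model and a placeholder list of raw document strings, could look as follows:

<syntaxhighlight lang="python">
# Sketch: sentence-segment documents with Stanza, writing one sentence per line
# and a blank line after each document (so next-sentence pairs stay within a document).
# Assumes the Bokmål ('nb') model; 'documents' is a placeholder.
import stanza

stanza.download("nb")                               # only needed once
nlp = stanza.Pipeline("nb", processors="tokenize")  # tokenization + sentence splitting

documents = [
    "Dette er et dokument. Det har to setninger.",
    "Her er et annet dokument.",
]

with open("corpus_segmented.txt", "w", encoding="utf-8") as out:
    for doc_text in documents:
        doc = nlp(doc_text)
        for sentence in doc.sentences:
            out.write(sentence.text + "\n")
        out.write("\n")  # blank line marks the document boundary
</syntaxhighlight>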