= NorBERT: Bidirectional Encoder Representations from Transformers =
 
==Training corpus==

We use clean training corpora with ordered sentences:

*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/ Norsk Aviskorpus] (NAK); 1.7 billion words;
*[https://dumps.wikimedia.org/nowiki/latest/ Bokmål Wikipedia]; 160 million words;
*[https://dumps.wikimedia.org/nnwiki/latest/ Nynorsk Wikipedia]; 40 million words;

In total, this comprises about two billion word tokens, in both Bokmål and Nynorsk; thus, this is a ''joint'' model. In the future, separate Bokmål and Nynorsk models are planned as well.
==Preprocessing==

1. Wikipedia texts were extracted using [https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py segment_wiki].
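For illustration, a minimal Python sketch of this step is given below. It assumes the script's segment_and_write_all_articles entry point and its gzipped JSON-lines output; the file names are placeholders, and the exact invocation used for NorBERT may have differed.

<pre>
# Sketch only: extract plain text from a Wikipedia dump with gensim's
# segment_wiki. Assumes the segment_and_write_all_articles entry point and
# gzipped JSON-lines output with "section_texts" fields; file names are
# hypothetical.
import gzip
import json

from gensim.scripts.segment_wiki import segment_and_write_all_articles

# Convert the raw dump into gzipped JSON lines, one article per line.
segment_and_write_all_articles(
    "nowiki-latest-pages-articles.xml.bz2",
    "nowiki-latest.json.gz",
)

# Write out the plain section texts, with a blank line between articles.
with gzip.open("nowiki-latest.json.gz", "rt", encoding="utf-8") as fin, \
        open("wikipedia_bokmaal.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        article = json.loads(line)
        for section_text in article["section_texts"]:
            fout.write(section_text.strip() + "\n")
        fout.write("\n")
</pre>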
2. In NAK, for years up to 2005, the text is stored in a one-token-per-line format, with special delimiters signaling the beginning of a new document and providing the source URLs. We converted this to running text using a [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/detokenize.py self-made de-tokenizer].
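The Python sketch below only illustrates the general idea and is not the linked script; the document delimiter it checks for is hypothetical.

<pre>
# Sketch only: glue one-token-per-line input back into running text,
# keeping punctuation attached to the preceding token. The "##" document
# delimiter prefix is hypothetical, not the actual NAK markup.
import sys

NO_SPACE_BEFORE = {".", ",", ":", ";", "!", "?", ")", "]", "}", "%"}
NO_SPACE_AFTER = ("(", "[", "{")

def detokenize(tokens):
    """Join a token list into a plausible running-text string."""
    text = ""
    for tok in tokens:
        if not text or tok in NO_SPACE_BEFORE or text.endswith(NO_SPACE_AFTER):
            text += tok
        else:
            text += " " + tok
    return text

tokens = []
for line in sys.stdin:
    line = line.strip()
    if not line or line.startswith("##"):  # hypothetical document delimiter
        if tokens:
            print(detokenize(tokens))
            tokens = []
        continue
    tokens.append(line)
if tokens:
    print(detokenize(tokens))
</pre>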
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 ('Latin-1') encoding. These files were [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/recode.sh converted] to UTF-8 before any other pre-processing.
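The same re-encoding can be sketched in Python as follows (the linked recode.sh is a shell script; the directory layout and file pattern here are hypothetical).

<pre>
# Sketch only: decode every NAK file as Latin-1 and rewrite it as UTF-8.
# The "NAK" directory and the *.txt pattern are hypothetical.
from pathlib import Path

for path in Path("NAK").rglob("*.txt"):
    text = path.read_text(encoding="latin-1")
    path.write_text(text, encoding="utf-8")
</pre>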
4. The resulting corpus was sentence-segmented using [https://stanfordnlp.github.io/stanza/performance.html Stanza]. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not cross document boundaries.
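A minimal Python sketch of this step with Stanza's Bokmål tokenizer is shown below; the input documents and the output file name are placeholders.

<pre>
# Sketch only: sentence-segment documents with Stanza and write one sentence
# per line, separating documents with blank lines. Input texts and the output
# file name are hypothetical.
import stanza

stanza.download("nb")  # one-time download of the Norwegian Bokmål models
nlp = stanza.Pipeline(lang="nb", processors="tokenize")

documents = [
    "Dette er det første dokumentet. Det har to setninger.",
    "Her kommer et nytt dokument.",
]

with open("corpus_sentences.txt", "w", encoding="utf-8") as fout:
    for doc_text in documents:
        doc = nlp(doc_text)
        for sentence in doc.sentences:
            fout.write(sentence.text + "\n")
        fout.write("\n")  # blank line marks the document boundary
</pre>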
