= NorBERT: Bidirectional Encoder Representations from Transformers =

==Training corpus==
We use clean training corpora with ordered sentences:
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/ Norsk Aviskorpus] (NAK); 1.7 billion words;
*[https://dumps.wikimedia.org/nowiki/latest/ Bokmål Wikipedia]; 160 million words;
*[https://dumps.wikimedia.org/nnwiki/latest/ Nynorsk Wikipedia]; 40 million words;

In total, this comprises about two billion word tokens in both Bokmål and Nynorsk; thus, this is a ''joint'' model. Separate Bokmål and Nynorsk models are planned for the future.

==Preprocessing==
1. Wikipedia texts were extracted using [https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py segment_wiki].
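The gensim script is run on a raw Wikipedia XML dump and writes one gzipped JSON record per article. As a rough sketch only (the dump and output file names below are placeholders, not the paths we actually used), the extracted sections can be flattened to plain text like this:

<syntaxhighlight lang="python">
# Sketch: flatten segment_wiki output (one JSON article per line) into plain text.
# File names are placeholders. The JSON was produced beforehand with, e.g.:
#   python -m gensim.scripts.segment_wiki -f nowiki-latest-pages-articles.xml.bz2 -o nowiki.json.gz
import gzip
import json

with gzip.open("nowiki.json.gz", "rt", encoding="utf-8") as inp, \
        open("nowiki.txt", "w", encoding="utf-8") as out:
    for line in inp:
        article = json.loads(line)
        for section_text in article["section_texts"]:
            out.write(section_text.strip() + "\n\n")  # blank line after every section
</syntaxhighlight>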
2. In NAK, for the years up to 2005, the text is in a one-token-per-line format, with special delimiters signaling the beginning of a new document and providing its URL. We converted this to running text with a [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/detokenize.py self-made de-tokenizer].
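The actual script is linked above; the minimal sketch below only illustrates the core idea of joining one-token-per-line input back into running text with naive punctuation spacing, and it does not handle NAK's document delimiters.

<syntaxhighlight lang="python">
# Minimal de-tokenization sketch (not the actual script linked above):
# reads one token per line from stdin; an empty line ends the current sentence.
import re
import sys

def detokenize(tokens):
    """Join tokens into running text and fix spacing around punctuation."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([.,:;!?%)])", r"\1", text)  # no space before . , : ; ! ? % )
    text = re.sub(r"\(\s+", "(", text)              # no space after an opening bracket
    return text

tokens = []
for line in sys.stdin:
    token = line.strip()
    if token:
        tokens.append(token)
    elif tokens:
        print(detokenize(tokens))
        tokens = []
if tokens:
    print(detokenize(tokens))
</syntaxhighlight>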
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/recode.sh converted] to UTF-8 before any other pre-processing.
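The conversion was done with the linked shell script; a rough Python equivalent (file names are placeholders) is:

<syntaxhighlight lang="python">
# Sketch: re-encode a Latin-1 (ISO 8859-1) file as UTF-8.
# The file names are placeholders, not the actual NAK paths.
from pathlib import Path

src = Path("nak-2005.txt")        # hypothetical Latin-1 input file
dst = Path("nak-2005.utf8.txt")   # UTF-8 output file

dst.write_text(src.read_text(encoding="latin-1"), encoding="utf-8")
</syntaxhighlight>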
4. The resulting corpus was sentence-segmented using [https://stanfordnlp.github.io/stanza/performance.html Stanza]. We left blank lines between documents (and between sections in the case of Wikipedia) so that the "next sentence prediction" task does not span document boundaries.
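A minimal sketch of this step, assuming the Bokmål ('nb') Stanza model and a placeholder list of raw document strings, could look as follows:

<syntaxhighlight lang="python">
# Sketch: sentence-segment documents with Stanza, writing one sentence per line
# and a blank line after each document (so next-sentence pairs stay within a document).
# Assumes the Bokmål ('nb') model; 'documents' is a placeholder.
import stanza

stanza.download("nb")                               # only needed once
nlp = stanza.Pipeline("nb", processors="tokenize")  # tokenization + sentence splitting

documents = [
    "Dette er et dokument. Det har to setninger.",
    "Her er et annet dokument.",
]

with open("corpus_segmented.txt", "w", encoding="utf-8") as out:
    for doc_text in documents:
        doc = nlp(doc_text)
        for sentence in doc.sentences:
            out.write(sentence.text + "\n")
        out.write("\n")  # blank line marks the document boundary
</syntaxhighlight>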