Difference between revisions of "Eosc/norbert"
Revision as of 10:56, 24 September 2020
Working Notes for Norwegian BERT-Like Models
Available Bokmål Text Corpora
- Norsk Aviskorpus; 1.7 billion words; sentences are ordered; clean;
- Norwegian Wikipedia; 160 million words; sentences are ordered; clean (more or less);
- noWAC; 700 million words; sentences are ordered; semi-clean;
- CommonCrawl from CoNLL 2017; 1.3 billion words; sentences are shuffled; not clean;
- NB Digital; 200 million words; sentences are ordered; semi-clean (OCR quality varies).
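To get a rough sense of how much text is available in total, the figures from the list above can be summed. This is only a sketch: the dictionary layout and variable names are made up for illustration, the sizes are the approximate word counts listed above, and any overlap between the corpora is ignored, so the totals are upper bounds. The split by sentence order is included on the assumption that it matters for pretraining objectives that rely on consecutive sentences.

```python
# Approximate sizes in millions of words, copied from the list above.
# The second field records whether sentence order is preserved.
corpora = {
    "Norsk Aviskorpus": (1700, True),
    "Norwegian Wikipedia": (160, True),
    "noWAC": (700, True),
    "CommonCrawl from CoNLL 2017": (1300, False),
    "NB Digital": (200, True),
}

# Upper bound on the total, since overlap between corpora is not accounted for.
total_million = sum(size for size, _ in corpora.values())

# Only the corpora whose sentences are kept in document order.
ordered_million = sum(size for size, is_ordered in corpora.values() if is_ordered)

print(f"total:   {total_million / 1000:.2f} billion words")
print(f"ordered: {ordered_million / 1000:.2f} billion words")
```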
Preprocessing and Tokenization
The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
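A character count of this kind can be cross-checked independently of SentencePiece. The sketch below is not how SentencePiece computes its alphabet; it simply counts distinct characters in an iterable of lines. The function name `char_inventory` is hypothetical, and applying NFKC is an assumption meant to roughly mirror SentencePiece's default normalization, so the resulting number may differ somewhat from the 157 reported above.

```python
import unicodedata
from collections import Counter


def char_inventory(lines, normalize=True):
    """Count how often each character occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        text = line.rstrip("\n")
        if normalize:
            # Assumption: NFKC approximates SentencePiece's default normalization.
            text = unicodedata.normalize("NFKC", text)
        counts.update(text)
    return counts


# Tiny in-memory sample; for a real corpus, pass an open file object instead.
sample = ["På østsiden av vannet,\n", "er det en godkjent bålplass.\n"]
inventory = char_inventory(sample)
print(len(inventory), "unique characters")
print(inventory.most_common(5))
```

Inspecting the rarest entries of the returned counter is a quick way to spot stray symbols that might be worth filtering out before training a vocabulary.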
The UDPipe Norwegian model trained on UD 2.5 (norwegian-bokmaal-ud-2.5-191206.udpipe) appears to have tokenization issues: it always attaches punctuation marks to the preceding word, as can be checked in the online demo.
Example:
# text = På østsiden av vannet, er det en godkjent bålplass.
1 På på ADP _ _ _ _ _ _
2 østsiden østsid NOUN _ Definite=Def|Gender=Masc|Number=Sing _ _ _ _
3 av av ADP _ _ _ _ _ _
4 vannet, $vannet, PUNCT _ _ _ _ _ _
5 er være AUX _ Mood=Ind|Tense=Pres|VerbForm=Fin _ _ _ _
6 det det PRON _ Gender=Neut|Number=Sing|Person=3|PronType=Prs _ _ _ _
7 en en DET _ Gender=Masc|Number=Sing|PronType=Art _ _ _ _
8 godkjent godkjent ADJ _ Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing _ _ _ _
9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No
The model trained on the previous 2.4 release (norwegian-bokmaal-ud-2.4-190531.udpipe) does not exhibit such behavior.
There is an active pull request which supposedly fixes this.
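To check how widespread the problem is in a larger UDPipe output, the merged tokens can be flagged automatically. The following is a minimal sketch, assuming tab-separated CoNLL-U input with 10 columns; the function name `merged_punct_tokens` is hypothetical, and the heuristic (a form that ends in ASCII punctuation but also contains letters or digits) will also flag legitimate abbreviations, so the hits need manual inspection.

```python
import string

PUNCT = set(string.punctuation)


def merged_punct_tokens(conllu_text):
    """Return (id, form) pairs whose FORM is a word with trailing punctuation."""
    suspicious = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges, empty nodes and malformed rows
        form = cols[1]
        has_word_chars = any(ch.isalnum() for ch in form[:-1])
        if len(form) > 1 and form[-1] in PUNCT and has_word_chars:
            suspicious.append((int(cols[0]), form))
    return suspicious


# Three rows from the example above, rebuilt with tab separators.
rows = [
    "3 av av ADP _ _ _ _ _ _",
    "4 vannet, $vannet, PUNCT _ _ _ _ _ _",
    "9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No",
]
sample_conllu = "\n".join("\t".join(row.split(" ")) for row in rows)
flagged = merged_punct_tokens(sample_conllu)
print(flagged)
```

Running the same check on output from the 2.4 and 2.5 models for identical input would give a simple count-based comparison of the two.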
Evaluation
Are there Norwegian test sets for typical NLP tasks that we could use to evaluate NorBERT?