Difference between revisions of "Eosc/norbert"
Revision as of 19:58, 20 September 2020
Working Notes for Norwegian BERT-Like Models
Available Text Corpora
Preprocessing and Tokenization
The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
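As a rough cross-check of that figure, the character inventory of a corpus can also be counted directly. The sketch below is not how SentencePiece computes its number: it applies no normalization and no coverage threshold, so the result may differ from 157. The function name `count_characters` and the commented-out file name are hypothetical.

```python
from collections import Counter


def count_characters(lines):
    """Count how often each character occurs in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        # Drop only the trailing newline; spaces and punctuation are counted.
        counts.update(line.rstrip("\n"))
    return counts


if __name__ == "__main__":
    # Hypothetical usage on a plain-text dump, one sentence or paragraph per line:
    # with open("nowiki.txt", encoding="utf-8") as corpus:
    #     counts = count_characters(corpus)
    counts = count_characters(["På østsiden av vannet, er det en godkjent bålplass.\n"])
    print(len(counts), "unique characters")
    print(counts.most_common(5))
```

Sorting the counter by frequency also makes it easy to see which rare characters contribute to the inventory.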
The UDPipe Norwegian model trained on UD 2.5 (norwegian-bokmaal-ud-2.5-191206.udpipe) appears to have tokenization issues: it always merges punctuation marks with the preceding words, as can be checked at the online demo.
Example (see tokens 4 and 9):
<pre>
# text = På østsiden av vannet, er det en godkjent bålplass.
1 På på ADP _ _ _ _ _ _
2 østsiden østsid NOUN _ Definite=Def|Gender=Masc|Number=Sing _ _ _ _
3 av av ADP _ _ _ _ _ _
4 vannet, $vannet, PUNCT _ _ _ _ _ _
5 er være AUX _ Mood=Ind|Tense=Pres|VerbForm=Fin _ _ _ _
6 det det PRON _ Gender=Neut|Number=Sing|Person=3|PronType=Prs _ _ _ _
7 en en DET _ Gender=Masc|Number=Sing|PronType=Art _ _ _ _
8 godkjent godkjent ADJ _ Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing _ _ _ _
9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No
</pre>
This is surprising, since the treebank itself looks correct in this respect. The model trained on the previous 2.4 release (norwegian-bokmaal-ud-2.4-190531.udpipe) does not exhibit this behavior.
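To compare the two models on more than a single sentence, the merged tokens can be flagged automatically in the CoNLL-U output. The following is a minimal sketch, assuming the parser output has been saved as text; the function name `find_merged_punctuation` is hypothetical, and the heuristic (a word form that contains a letter or digit and ends in a punctuation character) is an assumption rather than part of UDPipe.

```python
import unicodedata


def find_merged_punctuation(conllu_text):
    """Return (token_id, form) pairs whose form looks like word + punctuation."""
    suspicious = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        # CoNLL-U is tab-separated; fall back to whitespace for pasted text.
        fields = line.split("\t") if "\t" in line else line.split()
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        form = fields[1]
        ends_in_punct = unicodedata.category(form[-1]).startswith("P")
        has_word_char = any(ch.isalnum() for ch in form)
        if ends_in_punct and has_word_char:
            suspicious.append((int(fields[0]), form))
    return suspicious


if __name__ == "__main__":
    sample = "\n".join([
        "# text = På østsiden av vannet, er det en godkjent bålplass.",
        "3 av av ADP _ _ _ _ _ _",
        "4 vannet, $vannet, PUNCT _ _ _ _ _ _",
        "9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No",
    ])
    for token_id, form in find_merged_punctuation(sample):
        print(token_id, form)
```

Legitimate abbreviations that end in a period will also be flagged, so the counts are only meaningful when the 2.4 and 2.5 outputs are compared on the same input.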
Evaluation
Are there Norwegian test sets available for typical NLP tasks that we could use to evaluate our NorBERT?