Working Notes for Norwegian BERT-Like Models

Available Text Corpora

Preprocessing and Tokenization

The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
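As a rough cross-check of that figure, the character inventory of a plain-text corpus can be counted directly. This is a minimal sketch, not the SentencePiece procedure itself: the function name `count_unique_characters` and the sample sentence are illustrative assumptions, and because SentencePiece normalizes text before building its alphabet, a raw count like this may not match its number exactly.

```python
from collections import Counter


def count_unique_characters(lines):
    """Count how often each character occurs in an iterable of text lines.

    Hypothetical helper for illustration; line breaks are not counted.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts


# Illustrative input; in practice, pass an open file handle for the corpus.
sample = ["På østsiden av vannet, er det en godkjent bålplass.\n"]
counts = count_unique_characters(sample)
print(len(counts), "unique characters")
print(counts.most_common(5))
```

Sorting the counts by frequency makes it easy to spot rare characters (stray symbols, foreign scripts) that may not deserve a place in the vocabulary.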

There seem to be tokenization issues in the UDPipe Norwegian model trained on UD 2.5 (norwegian-bokmaal-ud-2.5-191206.udpipe): it always merges punctuation marks with the preceding word, as can be checked in the online demo.

Example:

# text = På østsiden av vannet, er det en godkjent bålplass.
1	På	på	ADP	_	_	_	_	_	_
2	østsiden	østsid	NOUN	_	Definite=Def|Gender=Masc|Number=Sing	_	_	_	_
3	av	av	ADP	_	_	_	_	_	_
4	vannet,	$vannet,	PUNCT	_	_	_	_	_	_
5	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin	_	_	_	_
6	det	det	PRON	_	Gender=Neut|Number=Sing|Person=3|PronType=Prs	_	_	_	_
7	en	en	DET	_	Gender=Masc|Number=Sing|PronType=Art	_	_	_	_
8	godkjent	godkjent	ADJ	_	Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing	_	_	_	_
9	bålplass.	bålplass.	NOUN	_	Abbr=Yes	_	_	_	SpaceAfter=No

This is weird, since the treebank itself looks correct in this respect. The model trained on the previous 2.4 release (norwegian-bokmaal-ud-2.4-190531.udpipe) does not exhibit such behavior.
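To compare the two model releases on a larger sample, the merged tokens can be detected automatically in the CoNLL-U output. The sketch below assumes standard tab-separated CoNLL-U with the word form in the second column; the function name `find_merged_punctuation` is hypothetical, and the heuristic (a form containing a letter or digit and ending in a punctuation character) is only an approximation.

```python
import unicodedata


def find_merged_punctuation(conllu_text):
    """Return (token_id, form) pairs whose form ends in punctuation
    but also contains at least one letter or digit before it.

    Hypothetical helper for illustration, not part of UDPipe.
    """
    suspicious = []
    for line in conllu_text.splitlines():
        # Skip blank lines (sentence breaks) and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) < 2:
            continue
        token_id, form = columns[0], columns[1]
        ends_in_punct = unicodedata.category(form[-1]).startswith("P")
        has_word_chars = any(ch.isalnum() for ch in form[:-1])
        if ends_in_punct and has_word_chars:
            suspicious.append((token_id, form))
    return suspicious


# Two well-formed rows and two affected rows from the example above.
rows = [
    "# text = På østsiden av vannet, er det en godkjent bålplass.",
    "\t".join(["1", "På", "på", "ADP"] + ["_"] * 6),
    "\t".join(["4", "vannet,", "$vannet,", "PUNCT"] + ["_"] * 6),
    "\t".join(["5", "er", "være", "AUX"] + ["_"] * 6),
    "\t".join(["9", "bålplass.", "bålplass.", "NOUN"] + ["_"] * 6),
]
for token_id, form in find_merged_punctuation("\n".join(rows)):
    print(token_id, form)
```

Legitimate abbreviations that end in a period will also be flagged, so the result is best read as a list of candidates to inspect rather than a definitive error count.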

Evaluation

Do we have Norwegian test sets available for typical NLP tasks, so that we can evaluate our NorBERT?
