Working Notes for Norwegian BERT-Like Models
Available Bokmål Text Corpora
- Norsk Aviskorpus; 1.7 billion words; sentences are ordered; clean;
- Norwegian Wikipedia; 160 million words; sentences are ordered; clean (more or less);
- NoWaC; 700 million words; sentences are ordered; semi-clean;
- CommonCrawl from CoNLL 2017; 1.3 billion words; sentences are shuffled; not clean;
- NB Digital; 200 million words; sentences are ordered; semi-clean (OCR quality varies).
Preprocessing and Tokenization
The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
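For reference, here is a minimal sketch of such a run with the sentencepiece Python bindings, plus an independent character count; the input file name and vocabulary size are assumptions, not values from these notes:

import sentencepiece as spm

# Train a subword model on the dump (one sentence per line).
# character_coverage=1.0 keeps the full character inventory.
spm.SentencePieceTrainer.train(
    input="no_wikipedia_sentences.txt",   # assumed file name
    model_prefix="no_wiki_sp",
    vocab_size=30000,                     # assumed size
    character_coverage=1.0,
)

# Independent sanity check of the unique-character count.
chars = set()
with open("no_wikipedia_sentences.txt", encoding="utf-8") as f:
    for line in f:
        chars.update(line.rstrip("\n"))
print(len(chars), "unique characters")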
Input file format:
1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because BERT uses sentence boundaries for the "next sentence prediction" task.
2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task does not pair sentences from different documents. A small sample follows below.
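For illustration, a valid input file could look as follows (the sentences are invented for the example):

Dette er den første setningen i dokument A.
Her kommer den andre setningen i samme dokument.

Dokument B begynner her med sin første setning.
Og dette er den neste setningen i dokument B.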
There appear to be tokenization issues in the UDPipe Norwegian model trained on UD 2.5 (norwegian-bokmaal-ud-2.5-191206.udpipe): it consistently merges punctuation marks with the preceding word, as can be verified in the online demo.
Example:
# text = På østsiden av vannet, er det en godkjent bålplass.
1	På	på	ADP	_	_	_	_	_	_
2	østsiden	østsid	NOUN	_	Definite=Def|Gender=Masc|Number=Sing	_	_	_	_
3	av	av	ADP	_	_	_	_	_	_
4	vannet,	$vannet,	PUNCT	_	_	_	_	_	_
5	er	være	AUX	_	Mood=Ind|Tense=Pres|VerbForm=Fin	_	_	_	_
6	det	det	PRON	_	Gender=Neut|Number=Sing|Person=3|PronType=Prs	_	_	_	_
7	en	en	DET	_	Gender=Masc|Number=Sing|PronType=Art	_	_	_	_
8	godkjent	godkjent	ADJ	_	Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing	_	_	_	_
9	bålplass.	bålplass.	NOUN	_	Abbr=Yes	_	_	_	SpaceAfter=No
The model trained on the previous 2.4 release (norwegian-bokmaal-ud-2.4-190531.udpipe) does not exhibit such behavior.
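The comparison can be reproduced locally with the official ufal.udpipe Python bindings; a minimal sketch, assuming both model files have been downloaded from the LINDAT repository into the working directory:

from ufal.udpipe import Model, Pipeline

sentence = "På østsiden av vannet, er det en godkjent bålplass."

for path in ("norwegian-bokmaal-ud-2.4-190531.udpipe",
             "norwegian-bokmaal-ud-2.5-191206.udpipe"):
    model = Model.load(path)  # returns None on failure
    if model is None:
        raise RuntimeError("cannot load model from " + path)
    # Tokenize and tag the raw text (no parsing), emit CoNLL-U.
    pipeline = Pipeline(model, "tokenize",
                        Pipeline.DEFAULT, Pipeline.NONE, "conllu")
    print("###", path)
    print(pipeline.process(sentence))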
There is an active pull request which supposedly fixes this.
Evaluation
Are there Norwegian test sets available for typical NLP tasks that we can use to evaluate our NorBERT?