Difference between revisions of "Eosc/norbert"

From Nordic Language Processing Laboratory
(Working Notes for Norwegian BERT-Like Models)
(The tokenization issue was fixed in the UD 2.6 release)
Line 23: Line 23:
  
 
2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task does not span across documents.
 
It seems there are some tokenization issues in the [https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-3131 UDPipe Norwegian model trained on UD 2.5] (''norwegian-bokmaal-ud-2.5-191206.udpipe'').
 
It always merges punctuation marks with the preceding words, as can be checked at the [https://lindat.mff.cuni.cz/services/udpipe/ online demo].
 
 
Example:
 
<pre>
# text = På østsiden av vannet, er det en godkjent bålplass.
1 På på ADP _ _ _ _ _ _
2 østsiden østsid NOUN _ Definite=Def|Gender=Masc|Number=Sing _ _ _ _
3 av av ADP _ _ _ _ _ _
4 vannet, $vannet, PUNCT _ _ _ _ _ _
5 er være AUX _ Mood=Ind|Tense=Pres|VerbForm=Fin _ _ _ _
6 det det PRON _ Gender=Neut|Number=Sing|Person=3|PronType=Prs _ _ _ _
7 en en DET _ Gender=Masc|Number=Sing|PronType=Art _ _ _ _
8 godkjent godkjent ADJ _ Definite=Ind|Degree=Pos|Gender=Neut|Number=Sing _ _ _ _
9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No
</pre>
 
 
The model trained on the previous 2.4 release (''norwegian-bokmaal-ud-2.4-190531.udpipe'') does not exhibit such behavior.
 
 
There is an active [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal/pull/5 pull request] which supposedly fixes this.
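
A quick way to check whether a given model shows this behavior is to scan its CoNLL-U output for tokens whose form still carries trailing punctuation. The following is only a minimal sketch: the function name <code>find_merged_punct</code> and the punctuation set are assumptions made for illustration, columns are split on whitespace for simplicity (real CoNLL-U is tab-separated), and legitimate abbreviations ending in a period would also be flagged.

```python
import string

# Characters treated as separable punctuation; this set is an assumption of the sketch.
PUNCT = set(string.punctuation) | {"«", "»", "–", "…"}


def find_merged_punct(conllu_text):
    """Return (token_id, form) pairs whose FORM mixes word characters
    with a trailing punctuation mark, e.g. 'vannet,'."""
    suspicious = []
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split()
        if len(cols) < 2 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        form = cols[1]
        if form[-1] in PUNCT and any(ch.isalnum() for ch in form):
            suspicious.append((int(cols[0]), form))
    return suspicious


# Three lines taken from the example output above.
sample = """\
# text = På østsiden av vannet, er det en godkjent bålplass.
3 av av ADP _ _ _ _ _ _
4 vannet, $vannet, PUNCT _ _ _ _ _ _
9 bålplass. bålplass. NOUN _ Abbr=Yes _ _ _ SpaceAfter=No
"""
print(find_merged_punct(sample))  # [(4, 'vannet,'), (9, 'bålplass.')]
```

Running the same check on the output of the 2.4 and 2.5 models for identical input should make the difference between them easy to see.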
 
  
 
= Evaluation =
 
Are Norwegian test sets for typical NLP tasks available for evaluating our NorBERT?
 

Revision as of 17:32, 2 December 2020

Working Notes for Norwegian BERT-Like Models

Report on the creation of FinBERT: https://arxiv.org/pdf/1912.07076.pdf

Working NVIDIA implementation workflow on Saga

Available Bokmål Text Corpora

Preprocessing and Tokenization

The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
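
As a sanity check on that figure, the raw character inventory of a corpus can be counted directly. This is only a sketch: the helper name char_inventory is hypothetical, and the raw count need not match the number SentencePiece reports, since SentencePiece normalizes the text and applies a character-coverage threshold before fixing its alphabet.

```python
from collections import Counter


def char_inventory(lines):
    """Count every distinct character (line breaks excluded) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts


# Toy input for illustration; for a real check, pass an open file handle
# for the corpus (opened with encoding="utf-8") instead of this list.
inventory = char_inventory(["På østsiden av vannet\n", "er det en bålplass\n"])
print(len(inventory))            # number of distinct characters
print(inventory.most_common(5))  # the five most frequent characters
```

Sorting the inventory by frequency is a convenient way to spot rare characters that are likely noise.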

Input file format:

1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because BERT uses the sentence boundaries for the "next sentence prediction" task.

2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task does not span across documents.
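
The two requirements above can be sketched as a small formatter. It assumes that sentence splitting has already been done upstream, so each document arrives as a list of sentence strings. The function name iter_pretraining_lines and the toy documents are made up for illustration.

```python
def iter_pretraining_lines(documents):
    """Yield one sentence per line, with a single empty line between documents.

    documents: iterable of documents, each a list of sentence strings.
    """
    first = True
    for sentences in documents:
        # Collapse internal whitespace so that no sentence spans several lines.
        cleaned = [" ".join(s.split()) for s in sentences]
        cleaned = [s for s in cleaned if s]
        if not cleaned:
            continue  # an empty document must not produce a double blank line
        if not first:
            yield ""  # blank line marks the document boundary
        yield from cleaned
        first = False


# Toy documents, invented for this example.
docs = [
    ["På østsiden av vannet er det en godkjent bålplass.", "Den er åpen hele året."],
    ["Dette er et annet dokument."],
]
text = "\n".join(iter_pretraining_lines(docs)) + "\n"
print(text)
```

Because the function is a generator, a large corpus can be streamed to disk line by line rather than held in memory.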

Evaluation

Are Norwegian test sets for typical NLP tasks available for evaluating our NorBERT?