Revision as of 17:32, 2 December 2020
Working Notes for Norwegian BERT-Like Models
Report on the creation of FinBERT: https://arxiv.org/pdf/1912.07076.pdf
Working NVIDIA implementation workflow on Saga
Available Bokmål Text Corpora
- Norsk Aviskorpus; 1.7 billion words; sentences are ordered; clean;
- Norwegian Wikipedia; 160 million words; sentences are ordered; clean (more or less);
- noWAC; 700 million words; sentences are ordered; semi-clean;
- CommonCrawl from CoNLL 2017; 1.3 billion words; sentences are shuffled; not clean;
- NB Digital; 200 million words; sentences are ordered; semi-clean (OCR quality varies).
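Summing the sizes listed above gives a rough idea of the total Bokmål data available (a quick back-of-the-envelope calculation):

```python
# Approximate corpus sizes in millions of words, taken from the list above.
corpora = {
    "Norsk Aviskorpus": 1700,
    "Norwegian Wikipedia": 160,
    "noWAC": 700,
    "CommonCrawl (CoNLL 2017)": 1300,
    "NB Digital": 200,
}

total = sum(corpora.values())
print(f"Total: {total} million words (~{total / 1000:.2f} billion)")
# → Total: 4060 million words (~4.06 billion)
```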
Preprocessing and Tokenization
The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
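Such a character inventory can be checked independently with a few lines of standard-library Python (a minimal sketch; the sample text here stands in for the full Wikipedia dump):

```python
from collections import Counter

def count_unique_chars(lines):
    """Count distinct characters across an iterable of text lines,
    ignoring newline characters."""
    counts = Counter()
    for line in lines:
        counts.update(line.rstrip("\n"))
    return counts

# Small in-memory example instead of the full Wikipedia dump:
sample = ["Dette er en setning.", "Og en til, på bokmål."]
chars = count_unique_chars(sample)
print(len(chars), "unique characters in the sample")
```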
Input file format:
1. One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because BERT uses sentence boundaries for the "next sentence prediction" task.
2. Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span across documents.
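The two requirements above can be sketched as a small formatter. The naive regex sentence split below is a placeholder assumption; a real preprocessing pipeline would use a proper sentence segmenter:

```python
import re

def to_bert_input(documents):
    """Format documents for BERT pretraining input:
    one sentence per line, one blank line between documents."""
    blocks = []
    for doc in documents:
        # Naive split after sentence-final punctuation; replace with a
        # real sentence segmenter for production preprocessing.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
        blocks.append("\n".join(sentences))
    return "\n\n".join(blocks) + "\n"

docs = [
    "Første setning. Andre setning.",
    "Et nytt dokument. Med to setninger.",
]
print(to_bert_input(docs))
```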
Evaluation
Do we have available Norwegian test sets for typical NLP tasks to evaluate our NorBERT?