Difference between revisions of "Eosc/norbert"

From Nordic Language Processing Laboratory
= Preprocessing and Tokenization =
The [https://github.com/google/sentencepiece SentencePiece] library finds '''157''' unique characters in the Norwegian Wikipedia dump.

Revision as of 18:05, 16 September 2020

Working Notes for Norwegian BERT-Like Models

Available Text Corpora

Preprocessing and Tokenization

The SentencePiece library finds 157 unique characters in the Norwegian Wikipedia dump.
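
A character count like this can be sanity-checked without SentencePiece. The snippet below is a minimal sketch using only the Python standard library, not the SentencePiece implementation: it counts distinct characters in a corpus and estimates how many of the most frequent ones are needed to reach a given share of all character occurrences. The helper names (char_counts, alphabet_for_coverage) and the sample lines are hypothetical, and the figure it yields on the real dump may differ from the one above, since SentencePiece normalizes text and handles whitespace in its own way before counting.

```python
from collections import Counter
from typing import Iterable


def char_counts(lines: Iterable[str]) -> Counter:
    """Count every character in the corpus, ignoring line terminators."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.rstrip("\r\n"))
    return counts


def alphabet_for_coverage(counts: Counter, coverage: float) -> int:
    """Return how many of the most frequent characters are needed so that
    they account for at least `coverage` of all character occurrences."""
    total = sum(counts.values())
    if total == 0:
        return 0
    covered = 0
    for kept, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= coverage:
            return kept
    return len(counts)


if __name__ == "__main__":
    # Hypothetical sample; in practice, pass an open file handle for the
    # plain-text dump instead, e.g. open(path, encoding="utf-8").
    sample = ["Blåbærsyltetøy på brød", "Æ e i a æ å"]
    counts = char_counts(sample)
    print("unique characters:", len(counts))
    print("needed for 99.95% coverage:", alphabet_for_coverage(counts, 0.9995))
```

Comparing the two numbers shows how many characters are rare enough to fall outside a coverage threshold, which is relevant when choosing a value for the character_coverage option of the SentencePiece trainer.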