Revision as of 18:05, 16 September 2020
Working Notes for Norwegian BERT-Like Models
Available Text Corpora
Preprocessing and Tokenization
The SentencePiece library (https://github.com/google/sentencepiece) finds 157 unique characters in the Norwegian Wikipedia dump.
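The statistic above can be reproduced independently of SentencePiece by scanning the corpus once and collecting the set of distinct characters. The sketch below is a minimal, hedged illustration; the sample sentences stand in for the actual Wikipedia dump, which is not included here.

```python
def count_unique_chars(lines):
    """Count distinct characters across an iterable of text lines.

    This is the same kind of character inventory SentencePiece builds
    before training a subword vocabulary.
    """
    chars = set()
    for line in lines:
        chars.update(line)
    return len(chars)


# Tiny illustration on made-up Norwegian sentences (not the real dump).
sample = [
    "Norge er et land i Europa.",
    "Språket bruker bokstavene æ, ø og å.",
]
print(count_unique_chars(sample))
```

On the full dump one would stream the file line by line (e.g. `with open("dump.txt") as f: count_unique_chars(f)`) rather than loading it into memory.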