Revision as of 18:05, 16 September 2020
Working Notes for Norwegian BERT-Like Models
Available Text Corpora
Preprocessing and Tokenization
The SentencePiece library (https://github.com/google/sentencepiece) finds 157 unique characters in the Norwegian Wikipedia dump.
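As a rough illustration of what that character count corresponds to, here is a minimal Python sketch (the helper name and sample text are hypothetical, not part of SentencePiece) that tallies distinct characters over a line-based corpus, analogous to the statistics SentencePiece gathers before building its vocabulary:

```python
# Minimal sketch: count distinct characters in a text corpus.
# This is an illustration only, not the exact SentencePiece procedure.
from collections import Counter

def count_unique_chars(lines):
    """Return a Counter mapping each character to its frequency."""
    counts = Counter()
    for line in lines:
        counts.update(line)
    return counts

# Hypothetical sample; in practice this would iterate over the
# Norwegian Wikipedia dump line by line.
sample = ["Dette er en setning.", "Blåbærsyltetøy på brød."]
chars = count_unique_chars(sample)
print(len(chars))  # number of distinct characters seen in the sample
```

Applied to the full Wikipedia dump, the length of this counter is the figure reported above; the per-character frequencies also show which rare characters a tokenizer's character coverage setting might exclude.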