Revision as of 18:05, 16 September 2020
Working Notes for Norwegian BERT-Like Models
Available Text Corpora
Preprocessing and Tokenization
The SentencePiece library (https://github.com/google/sentencepiece) finds 157 unique characters in the Norwegian Wikipedia dump.
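The statistic above can be reproduced independently of SentencePiece by scanning the corpus once and collecting the set of distinct characters. The sketch below is a minimal, hedged illustration; the sample sentences stand in for the actual Wikipedia dump, which is not included here.

```python
def count_unique_chars(lines):
    """Count distinct characters across an iterable of text lines.

    This is the same kind of character inventory SentencePiece builds
    before training a subword vocabulary.
    """
    chars = set()
    for line in lines:
        chars.update(line)
    return len(chars)


# Tiny illustration on made-up Norwegian sentences (not the real dump).
sample = [
    "Norge er et land i Europa.",
    "Språket bruker bokstavene æ, ø og å.",
]
print(count_unique_chars(sample))
```

On the full dump one would stream the file line by line (e.g. `with open("dump.txt") as f: count_unique_chars(f)`) rather than loading it into memory.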