Eosc/NorBERT3 corpus
Revision as of 21:41, 18 October 2022
Workflow
Sampling experiment
We plan to create two versions of the training corpus:
- baseline (as is)
- Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data)
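The second variant can be sketched as follows. This is a minimal illustration of the oversampling idea, assuming placeholder token counts for the quality subcorpora and for C4; the real sizes and mixing code are not specified here.

```python
# Illustrative token counts (NOT the real corpus sizes).
quality_corpora = {"Wikipedia": 0.2e9, "NCC": 0.5e9, "NAK": 0.3e9}
c4_tokens = 2.0e9  # assumed size of the C4 portion, for illustration

def build_training_mix(quality, multiplier=2):
    """Duplicate each quality subcorpus `multiplier` times (the plan
    says two) and report the resulting oversampled token total."""
    mix = [(name, multiplier) for name in quality]
    total = multiplier * sum(quality.values())
    return mix, total

mix, total = build_training_mix(quality_corpora)
# With these placeholder sizes, two copies of the 1.0e9-token quality
# data yield 2.0e9 tokens, matching the assumed C4 size.
```

Whether the doubled quality data actually matches C4 depends on the real token counts, so the multiplier may need adjusting.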
Vocabulary
Starting with 50K, following NorBERT-2. We may later experiment with other values.
To Decide
The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.