Eosc/NorBERT3 corpus
Workflow
Sampling experiment
We plan to create two versions of the training corpus:
- baseline (as is)
- Wikipedia + NCC + NAK included twice, so that their combined size roughly matches C4 (oversampling the higher-quality data)
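The two versions above could be assembled along these lines. This is only a minimal sketch: the function name `build_corpus`, the `HIGH_QUALITY` tuple, and the toy documents are hypothetical placeholders, and it assumes each sub-corpus is available as a list of documents. The real pipeline may stream from disk instead.

```python
from typing import Dict, List

# Sub-corpora treated as "quality data" in the plan (names are illustrative keys).
HIGH_QUALITY = ("wikipedia", "ncc", "nak")


def build_corpus(sources: Dict[str, List[str]], oversample_factor: int = 1) -> List[str]:
    """Concatenate all sub-corpora, repeating the high-quality ones
    `oversample_factor` times. A factor of 1 gives the baseline."""
    corpus: List[str] = []
    for name, docs in sources.items():
        repeats = oversample_factor if name in HIGH_QUALITY else 1
        corpus.extend(docs * repeats)
    return corpus


# Toy stand-ins for the real data, only to show the effect on sizes.
toy_sources = {
    "wikipedia": ["w1", "w2"],
    "ncc": ["n1"],
    "nak": ["k1"],
    "c4": ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"],
}

baseline = build_corpus(toy_sources)                           # version 1: as is
oversampled = build_corpus(toy_sources, oversample_factor=2)   # version 2: quality data x2
```

Shuffling after concatenation is left out here; whether the duplicated documents should be spread evenly across the training data is a separate decision.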
Vocabulary
We start with a vocabulary size of 50K, following NorBERT-2. We may experiment with other values later.
To Decide
What is the size of NBDigital, and should we use it? It probably overlaps heavily with NCC.
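One way to check the suspected overlap before deciding is to compare document fingerprints on samples from both collections. The sketch below is an assumption about how this might be done, not an agreed procedure: the helper names and the toy samples are hypothetical, and it only detects documents that are identical after lowercasing and whitespace normalization. Near-duplicates (e.g. different OCR runs of the same book) would need a fuzzy method instead.

```python
import hashlib
from typing import Iterable, Set


def doc_fingerprint(text: str) -> str:
    """Hash a document after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def fingerprints(docs: Iterable[str]) -> Set[str]:
    """Set of fingerprints for a collection of documents."""
    return {doc_fingerprint(d) for d in docs}


def overlap_ratio(candidate_docs: Iterable[str], reference_docs: Iterable[str]) -> float:
    """Fraction of distinct candidate documents that also occur in the reference."""
    cand = fingerprints(candidate_docs)
    if not cand:
        return 0.0
    ref = fingerprints(reference_docs)
    return len(cand & ref) / len(cand)


# Toy samples standing in for NBDigital and NCC.
nbdigital_sample = ["Det var en gang.", "det  var en GANG.", "Ny tekst"]
ncc_sample = ["Det var en gang.", "Noe annet"]

ratio = overlap_ratio(nbdigital_sample, ncc_sample)
```

A high ratio on representative samples would suggest that adding NBDigital contributes little new text beyond NCC.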