Workflow
Sampling experiment
We plan to create two versions of the training corpus:
- baseline (as is)
- Wikipedia + NCC + NAK multiplied by two to match the size of C4 (oversampling the quality data; see the sketch after this list)
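A minimal Python sketch of how the two versions could be assembled. The file names (wikipedia.txt, ncc.txt, nak.txt, c4.txt) and the plain concatenation are placeholders for illustration, not the project's actual pipeline.

    from pathlib import Path

    # Placeholder file names; the real corpus files and paths are assumptions.
    QUALITY_SOURCES = ["wikipedia.txt", "ncc.txt", "nak.txt"]
    OTHER_SOURCES = ["c4.txt"]

    def build_corpus(out_path, quality_factor=1):
        # Concatenate all sources, repeating the quality ones quality_factor times.
        with open(out_path, "w", encoding="utf-8") as out:
            for source in OTHER_SOURCES + QUALITY_SOURCES * quality_factor:
                out.write(Path(source).read_text(encoding="utf-8"))

    build_corpus("corpus_baseline.txt", quality_factor=1)     # version 1: as is
    build_corpus("corpus_oversampled.txt", quality_factor=2)  # version 2: quality data x2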
Vocabulary
Starting with 50K, following NorBERT-2. We may experiment with other values later.
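A sketch of training a 50K vocabulary with the Hugging Face tokenizers library. WordPiece and the cased setting are assumptions based on the BERT lineage, and the corpus path is a placeholder carried over from the sketch above.

    from tokenizers import BertWordPieceTokenizer

    # WordPiece is an assumption (BERT-style); keeping case, since Norwegian
    # corpora are usually left cased.
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(
        files=["corpus_baseline.txt"],  # placeholder corpus file
        vocab_size=50_000,              # starting point, following NorBERT-2
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.save_model(".", "norbert3_50k")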
To Decide
The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.
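The overlap could be measured before deciding. A rough sketch using exact paragraph hashing follows; the file names are placeholders, and a real check would likely need fuzzy matching (e.g. MinHash) on normalized text rather than exact hashes.

    import hashlib

    def paragraph_hashes(path):
        # Hash each non-empty, whitespace-normalized paragraph in the file.
        hashes = set()
        with open(path, encoding="utf-8") as f:
            for paragraph in f.read().split("\n\n"):
                text = " ".join(paragraph.split())
                if text:
                    hashes.add(hashlib.sha1(text.encode("utf-8")).digest())
        return hashes

    ncc = paragraph_hashes("ncc.txt")              # placeholder path
    nbdigital = paragraph_hashes("nbdigital.txt")  # placeholder path
    shared = len(nbdigital & ncc) / len(nbdigital)
    print(f"{shared:.1%} of NBDigital paragraphs also occur in NCC")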
How should we split the training corpora: one sentence per line, one paragraph per line, or one document per line?
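To make the options concrete, a small illustration of the three layouts for one hypothetical two-paragraph document; the regex sentence splitter is deliberately naive and only for illustration.

    import re

    doc = "First sentence. Second sentence.\n\nThird sentence."

    # One document per line: flatten all internal newlines.
    document_lines = [" ".join(doc.split())]

    # One paragraph per line: split on blank lines.
    paragraph_lines = [" ".join(p.split()) for p in doc.split("\n\n")]

    # One sentence per line: needs a sentence splitter; a naive rule shown here.
    sentence_lines = [s for p in paragraph_lines
                      for s in re.split(r"(?<=[.!?])\s+", p)]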