Difference between revisions of "Eosc/NorBERT3 corpus"

From Nordic Language Processing Laboratory

Revision as of 14:37, 21 October 2022

Workflow

Sampling experiment

We plan to create two versions of the training corpus:

  • baseline (as is)
  • Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data)
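The oversampling scheme above could be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the corpus contents and the total sizes are placeholders, and a real implementation would stream files from disk and weight by token counts rather than hold documents in memory.

```python
# Sketch of the sampling experiment: the quality corpora
# (Wikipedia, NCC, NAK) get weight 2 so their combined size
# moves closer to the C4 size; everything else stays as is.
# Corpus contents below are illustrative placeholders.

def build_training_mix(corpora, weights):
    """Repeat each corpus according to its integer weight (default 1)."""
    mix = []
    for name, docs in corpora.items():
        mix.extend(docs * weights.get(name, 1))
    return mix

corpora = {
    "wikipedia": ["wiki doc 1", "wiki doc 2"],
    "ncc": ["ncc doc 1"],
    "nak": ["nak doc 1"],
    "c4": ["c4 doc 1", "c4 doc 2", "c4 doc 3"],
}
weights = {"wikipedia": 2, "ncc": 2, "nak": 2}  # oversample quality data

mix = build_training_mix(corpora, weights)
```

With these toy numbers the quality corpora contribute two copies each while C4 appears once, matching the "multiplied by two" baseline-vs-oversampled comparison described above.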

Vocabulary

Starting with 50K, following NorBERT-2. We may later experiment with other values.

To Decide

The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.

How should we split training corpora: one sentence per line, one paragraph per line, one document per line?
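The three candidate layouts can be compared concretely with a small sketch. This is only an illustration of the options, not a proposed preprocessing script; the sentence splitter is a naive regex stand-in for a proper segmenter.

```python
import re

# Sketch of the three candidate layouts for the training corpus:
# one sentence per line, one paragraph per line, or one document
# per line. The regex sentence splitter is a placeholder; a real
# pipeline would use a trained segmenter.

def split_document(doc, granularity):
    """Return the document as a list of output lines."""
    if granularity == "document":
        # Collapse all internal whitespace into a single line.
        return [" ".join(doc.split())]
    paragraphs = [p.strip() for p in doc.split("\n\n") if p.strip()]
    if granularity == "paragraph":
        return paragraphs
    if granularity == "sentence":
        sentences = []
        for p in paragraphs:
            sentences.extend(
                s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip()
            )
        return sentences
    raise ValueError(f"unknown granularity: {granularity}")

doc = "First sentence. Second sentence.\n\nNew paragraph here."
```

The choice mainly affects how much context the model sees per training example: document-per-line preserves the most context, sentence-per-line the least.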