Difference between revisions of "Eosc/NorBERT3 corpus"

From Nordic Language Processing Laboratory

Latest revision as of 18:50, 24 October 2022

Workflow

There are other de-duplication packages

Sampling experiment

We plan to create two versions of the training corpus:

  • baseline (as is)
  • Wikipedia+NCC+NAK+NBDigital multiplied by two to match the C4 size (oversampling quality data)
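The second variant can be sketched roughly as follows. This is only an illustration of oversampling by repetition, not the actual NorBERT3 pipeline: the function name `oversample`, the in-memory list representation, and the `factor` parameter are assumptions made for this example.

```python
from typing import Dict, List, Set

# Sources treated as "quality data" in the second corpus variant.
QUALITY_SOURCES: Set[str] = {"Wikipedia", "NCC", "NAK", "NBDigital"}


def oversample(
    corpora: Dict[str, List[str]],
    quality: Set[str] = QUALITY_SOURCES,
    factor: int = 2,
) -> List[str]:
    """Concatenate all corpora, repeating the quality sources `factor` times.

    `corpora` maps a source name to its list of documents (hypothetical
    in-memory layout; a real pipeline would stream files instead).
    """
    mixed: List[str] = []
    for name, docs in corpora.items():
        repeats = factor if name in quality else 1
        mixed.extend(docs * repeats)
    return mixed
```

With `factor=1` the same function yields the baseline (as is) variant, so both versions can come from one code path.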

Vocabulary

We start with a vocabulary size of 50K, following NorBERT-2. We may experiment with other values later.

To Decide

Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.

A: No, it does not overlap much. Only 60 of the 18M paragraphs in NCC are duplicates of paragraphs in NBDigital, so we should definitely use it.
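A paragraph-level overlap count of this kind could be computed along the following lines. The page does not say how the 60 duplicates were found, so this is a minimal sketch that assumes exact matching after whitespace normalisation; the helper names `normalize`, `paragraph_hashes` and `count_duplicates` are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def normalize(paragraph: str) -> str:
    """Collapse runs of whitespace (assumed normalisation, not from the source)."""
    return " ".join(paragraph.split())


def paragraph_hashes(paragraphs: Iterable[str]) -> Set[bytes]:
    """Hash every normalised paragraph so the set stays small in memory."""
    return {
        hashlib.sha1(normalize(p).encode("utf-8")).digest()
        for p in paragraphs
    }


def count_duplicates(ncc: Iterable[str], nbdigital: Iterable[str]) -> int:
    """Count NCC paragraphs that also occur in NBDigital."""
    seen = paragraph_hashes(nbdigital)
    return sum(
        1
        for p in ncc
        if hashlib.sha1(normalize(p).encode("utf-8")).digest() in seen
    )
```

Storing digests rather than full paragraphs keeps memory use manageable at the scale of millions of paragraphs, at the cost of catching only exact (normalised) matches, not near-duplicates.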

Q: How should we split training corpora: one sentence per line, one paragraph per line, one document per line?

A: BERT assumes that there is one sentence per line.
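For illustration, the input format of the original BERT pre-training script (one sentence per line, with an empty line between documents) could be produced as sketched below. The regex-based sentence splitter is a naive placeholder chosen for this example; the page does not specify which splitter is used, and the name `to_bert_format` is hypothetical.

```python
import re
from typing import Iterable


def to_bert_format(documents: Iterable[str]) -> str:
    """Return text with one sentence per line and a blank line between documents.

    Sentences are split after '.', '!' or '?' followed by whitespace, which
    is a naive assumption; a proper sentence segmenter should replace it.
    """
    blocks = []
    for doc in documents:
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", doc.strip()) if s]
        if sentences:
            blocks.append("\n".join(sentences))
    return "\n\n".join(blocks) + "\n"
```

The blank line between documents matters because it lets the training code tell where one document ends and the next begins.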