Eosc/NorBERT3 corpus

== Workflow ==
* [https://arxiv.org/abs/2112.11446 Cleaning] with the heuristic quality filters from the Gopher paper (a sketch follows after this list)
* [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup Deduplication] ([https://pypi.org/project/mmh3/ MurmurHash]); see the second sketch after this list
* ...or another package, such as [https://github.com/ekzhu/datasketch datasketch]
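
Below is a minimal sketch of a few of the Gopher quality filters from the paper linked above (Rae et al., 2021), in Python. The thresholds are the values reported in the paper; whether they transfer unchanged to Norwegian text is an open assumption.

<pre>
# A few of the Gopher document-quality heuristics (Rae et al., 2021).
# Thresholds follow the paper; their fit for Norwegian is an assumption.
def passes_gopher_filters(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:        # word-count bounds
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:                # mean word length (characters)
        return False
    lines = text.splitlines()
    if lines:
        bullets = sum(l.lstrip().startswith(("-", "*")) for l in lines)
        if bullets / len(lines) > 0.9:         # mostly bullet-point lines
            return False
        ellipses = sum(l.rstrip().endswith("...") for l in lines)
        if ellipses / len(lines) > 0.3:        # mostly ellipsis-ended lines
            return False
    with_alpha = sum(any(c.isalpha() for c in w) for w in words)
    if with_alpha / len(words) < 0.8:          # too many symbol-only tokens
        return False
    return True
</pre>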
  
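And a sketch of near-duplicate removal with MinHash + LSH, here using datasketch with mmh3 (MurmurHash) as the hash function; the text-dedup repository implements the same idea at scale. The shingle size, the 0.8 Jaccard threshold and the greedy one-pass strategy are illustrative choices, not settled decisions.

<pre>
import mmh3
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # hash permutations per MinHash signature

def shingles(text, n=5):
    # character n-grams; word n-grams would also work
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def signature(text):
    m = MinHash(num_perm=NUM_PERM,
                hashfunc=lambda data: mmh3.hash(data, signed=False))
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

# Greedy one-pass dedup: keep a document only if no previously kept
# document exceeds the similarity threshold.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = []
documents = ["dette er et eksempel", "dette er et eksempel!", "noe helt annet"]
for doc_id, text in enumerate(documents):
    m = signature(text)
    if not lsh.query(m):      # no near-duplicate among kept documents
        lsh.insert(str(doc_id), m)
        kept.append(text)
</pre>
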
== Sampling experiment ==
We plan to create two versions of the training corpus:
* baseline (as is)
* Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data); a sketch of this follows the list
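
A minimal sketch of how the oversampled version could be assembled, assuming one document per line; the file names and the output path are hypothetical placeholders.

<pre>
# Hypothetical file names; the quality sources are written out twice so
# that Wikipedia+NCC+NAK roughly matches the C4 portion in size.
QUALITY = ["wikipedia.txt", "ncc.txt", "nak.txt"]
WEB = ["c4.txt"]

with open("train_oversampled.txt", "w", encoding="utf-8") as out:
    for path in WEB + QUALITY + QUALITY:  # QUALITY appears twice (x2)
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)
</pre>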
  
== Vocabulary ==
Starting with 50K, following NorBERT-2. We may later experiment with other values; a tokenizer-training sketch follows below.
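
A sketch of training such a vocabulary with the HuggingFace tokenizers library, assuming a BERT-style WordPiece tokenizer; the tokenizer type and the training file name (reused from the sampling sketch) are assumptions, only the 50K size comes from the plan above.

<pre>
from tokenizers import BertWordPieceTokenizer

# 50K vocabulary, following NorBERT-2; lowercase=False preserves casing.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["train_oversampled.txt"],
                vocab_size=50_000,
                min_frequency=2)
tokenizer.save_model(".")  # writes vocab.txt to the current directory
</pre>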
== To Decide ==
What is the size of NBDigital and should we use it? It probably overlaps a lot with NCC.
