Eosc/NorBERT3 corpus

== Workflow ==
* [https://arxiv.org/abs/2112.11446 Cleaning] with the heuristic quality filters from the Gopher paper (a sketch follows after this list)
* [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup Deduplication] ([https://pypi.org/project/mmh3/ MurmurHash]); see the second sketch after this list
* ...or another package, such as [https://github.com/ekzhu/datasketch datasketch]
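
Below is a minimal sketch of a few of the Gopher quality filters from the paper linked above (Rae et al., 2021), in Python. The thresholds are the values reported in the paper; whether they transfer unchanged to Norwegian text is an open assumption.

<pre>
# A few of the Gopher document-quality heuristics (Rae et al., 2021).
# Thresholds follow the paper; their fit for Norwegian is an assumption.
def passes_gopher_filters(text: str) -> bool:
    words = text.split()
    if not 50 <= len(words) <= 100_000:        # word-count bounds
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:                # mean word length (characters)
        return False
    lines = text.splitlines()
    if lines:
        bullets = sum(l.lstrip().startswith(("-", "*")) for l in lines)
        if bullets / len(lines) > 0.9:         # mostly bullet-point lines
            return False
        ellipses = sum(l.rstrip().endswith("...") for l in lines)
        if ellipses / len(lines) > 0.3:        # mostly ellipsis-ended lines
            return False
    with_alpha = sum(any(c.isalpha() for c in w) for w in words)
    if with_alpha / len(words) < 0.8:          # too many symbol-only tokens
        return False
    return True
</pre>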
  
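And a sketch of near-duplicate removal with MinHash + LSH, here using datasketch with mmh3 (MurmurHash) as the hash function; the text-dedup repository implements the same idea at scale. The shingle size, the 0.8 Jaccard threshold and the greedy one-pass strategy are illustrative choices, not settled decisions.

<pre>
import mmh3
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # hash permutations per MinHash signature

def shingles(text, n=5):
    # character n-grams; word n-grams would also work
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def signature(text):
    m = MinHash(num_perm=NUM_PERM,
                hashfunc=lambda data: mmh3.hash(data, signed=False))
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

# Greedy one-pass dedup: keep a document only if no previously kept
# document exceeds the similarity threshold.
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)
kept = []
documents = ["dette er et eksempel", "dette er et eksempel!", "noe helt annet"]
for doc_id, text in enumerate(documents):
    m = signature(text)
    if not lsh.query(m):      # no near-duplicate among kept documents
        lsh.insert(str(doc_id), m)
        kept.append(text)
</pre>
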
== Sampling experiment ==
We plan to create two versions of the training corpus:
* baseline (as is)
* Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data); a sketch of this follows the list
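
A minimal sketch of how the oversampled version could be assembled, assuming one document per line; the file names and the output path are hypothetical placeholders.

<pre>
# Hypothetical file names; the quality sources are written out twice so
# that Wikipedia+NCC+NAK roughly matches the C4 portion in size.
QUALITY = ["wikipedia.txt", "ncc.txt", "nak.txt"]
WEB = ["c4.txt"]

with open("train_oversampled.txt", "w", encoding="utf-8") as out:
    for path in WEB + QUALITY + QUALITY:  # QUALITY appears twice (x2)
        with open(path, encoding="utf-8") as f:
            for line in f:
                out.write(line)
</pre>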
  
== Vocabulary ==
Starting with 50K, following NorBERT-2. We may later experiment with other values; a tokenizer-training sketch follows below.
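
A sketch of training such a vocabulary with the HuggingFace tokenizers library, assuming a BERT-style WordPiece tokenizer; the tokenizer type and the training file name (reused from the sampling sketch) are assumptions, only the 50K size comes from the plan above.

<pre>
from tokenizers import BertWordPieceTokenizer

# 50K vocabulary, following NorBERT-2; lowercase=False preserves casing.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(files=["train_oversampled.txt"],
                vocab_size=50_000,
                min_frequency=2)
tokenizer.save_model(".")  # writes vocab.txt to the current directory
</pre>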
== To Decide ==
What is the size of NBDigital and should we use it? It probably overlaps a lot with NCC.
