Eosc/NorBERT3 corpus
== Workflow ==
* [https://arxiv.org/abs/2112.11446 Cleaning]
* [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup Deduplication] ([https://pypi.org/project/mmh3/ MurmurHash]); see the sketch after this list
* [https://github.com/ekzhu/datasketch ...or another package]
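A minimal sketch of the cleaning + deduplication steps, assuming plain-text documents. The quality thresholds are illustrative placeholders loosely inspired by the filters in the paper linked above, not decided values; deduplication uses datasketch's MinHash/MinHashLSH with mmh3 (MurmurHash3) as the hash function, and the 0.8 Jaccard threshold is likewise an assumption.

<syntaxhighlight lang="python">
import mmh3
from datasketch import MinHash, MinHashLSH

def is_clean(text):
    """Toy quality filter: sane document length and a high share of
    alphabetic words. Thresholds are illustrative, not decided."""
    words = text.split()
    if not 50 <= len(words) <= 100_000:
        return False
    alpha = sum(1 for w in words if any(c.isalpha() for c in w))
    return alpha / len(words) >= 0.8

def minhash(text, num_perm=128):
    """MinHash signature over word 5-gram shingles, hashed with MurmurHash3."""
    m = MinHash(num_perm=num_perm,
                hashfunc=lambda data: mmh3.hash(data, signed=False))
    words = text.split()
    for i in range(max(1, len(words) - 4)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def clean_and_dedup(docs, threshold=0.8):
    """Yield documents that pass the filter and are not near-duplicates
    (estimated Jaccard >= threshold) of an earlier kept document."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for idx, doc in enumerate(docs):
        if not is_clean(doc):
            continue
        sig = minhash(doc)
        if lsh.query(sig):  # near-duplicate of something already kept
            continue
        lsh.insert(str(idx), sig)
        yield doc
</syntaxhighlight>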
== Sampling experiment ==
We plan to create two versions of the training corpus (see the sketch below):
* baseline (as is)
* Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data)
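A minimal sketch of assembling the two versions, assuming one document per line in plain-text files; all file names below are placeholder assumptions, not actual corpus paths.

<syntaxhighlight lang="python">
QUALITY_SOURCES = ["wikipedia.txt", "ncc.txt", "nak.txt"]  # hypothetical paths
WEB_SOURCES = ["c4_norwegian.txt"]                         # hypothetical path

def build_corpus(out_path, quality_factor=1):
    """Write one corpus version; quality_factor=2 repeats the quality
    sources twice so they roughly match the C4 portion in size."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in WEB_SOURCES:
            with open(path, encoding="utf-8") as f:
                out.writelines(f)
        for path in QUALITY_SOURCES:
            for _ in range(quality_factor):
                with open(path, encoding="utf-8") as f:
                    out.writelines(f)

build_corpus("corpus_baseline.txt", quality_factor=1)     # version 1: as is
build_corpus("corpus_oversampled.txt", quality_factor=2)  # version 2: quality x2
</syntaxhighlight>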
== Vocabulary ==
Starting with 50K, following NorBERT-2. We may later experiment with other values.
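A minimal sketch of training the 50K subword vocabulary, assuming the HuggingFace <code>tokenizers</code> library and a BERT-style WordPiece model; <code>corpus.txt</code> is a placeholder path, and whether NorBERT-3 reuses NorBERT-2's exact tokenizer settings is still open.

<syntaxhighlight lang="python">
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)  # keep case for Norwegian
tokenizer.train(files=["corpus.txt"], vocab_size=50_000, min_frequency=2)
tokenizer.save_model(".")  # writes vocab.txt
</syntaxhighlight>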
== To Decide ==
What is the size of NBDigital and should we use it? It probably overlaps a lot with NCC.
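A rough, hypothetical way to quantify the NBDigital/NCC overlap before deciding. It reuses the <code>minhash()</code> helper from the workflow sketch above; <code>nbdigital_docs</code> and <code>ncc_docs</code> are assumed to be iterables of document strings.

<syntaxhighlight lang="python">
from datasketch import MinHashLSH

def overlap_ratio(nbdigital_docs, ncc_docs, threshold=0.8):
    """Share of NBDigital documents that near-duplicate some NCC document."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    for idx, doc in enumerate(ncc_docs):
        lsh.insert(str(idx), minhash(doc))
    hits = total = 0
    for doc in nbdigital_docs:
        total += 1
        if lsh.query(minhash(doc)):
            hits += 1
    return hits / max(total, 1)
</syntaxhighlight>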