Difference between revisions of "Eosc/NorBERT3 corpus"
(→To Decide) |
(→Sampling experiment) |
||
(3 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
== Workflow == | == Workflow == | ||
+ | * [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup De-duplication]: essentially, removing identical paragraphs using SimHash (similar to the NearDup approach in [https://aclanthology.org/2022.acl-long.577/ this paper], although they used MinHash; [https://pypi.org/project/mmh3/ MurMurHash] is another option). | ||
* [https://arxiv.org/abs/2112.11446 Cleaning] | * [https://arxiv.org/abs/2112.11446 Cleaning] | ||
− | + | ||
− | + | There are [https://github.com/ekzhu/datasketch other de-duplication packages] | |
== Sampling experiment == | == Sampling experiment == | ||
We plan to create two versions of the training corpus: | We plan to create two versions of the training corpus: | ||
* baseline (as is) | * baseline (as is) | ||
− | * Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data) | + | * Wikipedia+NCC+NAK+NBDigital multiplied by two to match the C4 size (oversampling quality data) |
== Vocabulary == | == Vocabulary == | ||
Line 13: | Line 14: | ||
== To Decide == | == To Decide == | ||
− | The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC. | + | Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC. |
+ | |||
+ | A: No, it isn't. Only 60 paragraphs out of total 18M in NCC are duplicates of paragraphs in NBDigital. Thus, we definitely should use it. | ||
+ | |||
+ | Q: How should we split training corpora: one sentence per line, one paragraph per line, one document per line? | ||
− | + | A: BERT assumes that there is one sentence per line. |
Latest revision as of 18:50, 24 October 2022
Workflow
- De-duplication: essentially, removing identical paragraphs using SimHash. This is similar to the NearDup approach in this paper, although its authors used MinHash; MurmurHash is another option.
- Cleaning
Other de-duplication packages, such as datasketch, are also available.
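As a rough illustration of the de-duplication step, here is a minimal, self-contained SimHash sketch. It is not the text-dedup implementation: the function names are hypothetical, tokens are split on whitespace, and the first 8 bytes of an MD5 digest stand in for a faster hash such as MurmurHash.

```python
import hashlib


def token_hash(token: str) -> int:
    # 64-bit token hash; MD5 is only a stand-in here (assumption),
    # a real pipeline would more likely use MurmurHash.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(text: str, bits: int = 64) -> int:
    # Every token votes +1 or -1 on each bit position;
    # the fingerprint keeps the bits with a positive total.
    counts = [0] * bits
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    # Number of bit positions in which two fingerprints differ.
    return bin(a ^ b).count("1")


def deduplicate(paragraphs, max_distance: int = 0):
    # Keep a paragraph only if no previously kept paragraph has a
    # fingerprint within max_distance bits of it.
    kept, fingerprints = [], []
    for paragraph in paragraphs:
        fp = simhash(paragraph)
        if any(hamming(fp, seen) <= max_distance for seen in fingerprints):
            continue
        fingerprints.append(fp)
        kept.append(paragraph)
    return kept
```

With `max_distance=0` only paragraphs with identical fingerprints are dropped; a larger value also removes near-duplicates. The pairwise comparison is quadratic in the number of kept paragraphs, so a corpus of this size would need an index over the fingerprints, which the dedicated packages provide.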
Sampling experiment
We plan to create two versions of the training corpus:
- baseline (as is)
- Wikipedia+NCC+NAK+NBDigital included twice to match the size of C4 (oversampling the higher-quality data)
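The two corpus versions above could be assembled along these lines. This is only a sketch: `build_corpus`, the source names used as dictionary keys, and the in-memory lists are illustrative assumptions, and real corpora would be streamed from disk instead.

```python
def build_corpus(sources, oversampled=(), factor=2):
    # sources: mapping from source name to a list of documents.
    # Sources named in `oversampled` are repeated `factor` times;
    # all other sources are included once.
    corpus = []
    for name, documents in sources.items():
        repeats = factor if name in oversampled else 1
        corpus.extend(list(documents) * repeats)
    return corpus


# Toy stand-ins for the real sources (hypothetical content).
sources = {
    "C4": ["c4 doc 1", "c4 doc 2"],
    "Wikipedia": ["wiki doc"],
    "NCC": ["ncc doc"],
}

baseline = build_corpus(sources)
oversampled = build_corpus(sources, oversampled={"Wikipedia", "NCC"})
```

Here `baseline` holds every document once, while `oversampled` holds the Wikipedia and NCC documents twice.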
Vocabulary
We start with a vocabulary of 50K, following NorBERT-2. We may experiment with other values later.
To Decide
Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.
A: No, it doesn't. Only 60 of the 18M paragraphs in NCC are duplicates of paragraphs in NBDigital, so we should definitely use it.
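An overlap count of this kind can be obtained with a simple set lookup. The sketch below assumes exact matching after whitespace and case normalization, which may differ from how the figure above was actually computed; the function names are hypothetical.

```python
def normalize(paragraph: str) -> str:
    # Collapse whitespace and lowercase so trivial differences do not hide a match.
    return " ".join(paragraph.split()).lower()


def count_duplicates(corpus_a, corpus_b) -> int:
    # Number of paragraphs in corpus_a that also occur in corpus_b.
    seen_b = {normalize(p) for p in corpus_b}
    return sum(1 for p in corpus_a if normalize(p) in seen_b)


# Share of NCC paragraphs duplicated in NBDigital, using the rounded
# figures quoted above (60 out of roughly 18M).
overlap_share = 60 / 18_000_000
```

With those figures the overlap is on the order of 0.0003% of the NCC paragraphs.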
Q: How should we split training corpora: one sentence per line, one paragraph per line, one document per line?
A: BERT assumes that there is one sentence per line.
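The original BERT pre-training script expects one sentence per line, with a blank line between documents. A minimal sketch of producing that layout follows; the regular-expression splitter is a naive assumption (it will break on abbreviations and similar cases), and a proper sentence segmenter would be preferable in practice.

```python
import re

# Naive sentence boundary: whitespace that follows ., ! or ? (assumption).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(documents) -> str:
    # One sentence per line; an empty line closes each document.
    lines = []
    for document in documents:
        for sentence in _SENTENCE_END.split(document.strip()):
            if sentence:
                lines.append(sentence)
        lines.append("")
    return "\n".join(lines)
```

For example, `to_sentence_lines(["Hei. Dette er en test.", "Ny tekst her."])` puts the two sentences of the first document on separate lines, followed by a blank line and then the second document.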