Eosc/NorBERT3 corpus
Workflow
- De-duplication: essentially, removing identical paragraphs using SimHash (similar to the NearDup approach in this paper, although they used MinHash; MurmurHash is another option).
- Cleaning
There are other de-duplication packages
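The de-duplication step above can be sketched in a few lines of standard-library Python. This is only an illustration, not the pipeline actually used: the word-level features, the MD5-based token hash, the 64-bit fingerprint and the function names (`simhash`, `hamming`, `deduplicate`) are all assumptions made for the example. With `max_distance=0` only paragraphs with identical fingerprints are dropped, which treats paragraphs as identical up to case, punctuation and word order.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint from lower-cased word features."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hypothetical choice: take the first 8 bytes of MD5 as the token hash.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(paragraphs, max_distance: int = 0):
    """Keep the first paragraph of every group of (near-)identical ones.

    The linear scan over seen fingerprints is quadratic overall and only
    meant for illustration; a real corpus would need an index.
    """
    kept, fingerprints = [], []
    for paragraph in paragraphs:
        fp = simhash(paragraph)
        if any(hamming(fp, seen) <= max_distance for seen in fingerprints):
            continue
        fingerprints.append(fp)
        kept.append(paragraph)
    return kept
```

Raising `max_distance` would also drop near-duplicates; a suitable threshold would have to be tuned on the actual data.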
Sampling experiment
We plan to create two versions of the training corpus:
- baseline (as is)
- Wikipedia+NCC+NAK+NBDigital multiplied by two to match the C4 size (oversampling quality data)
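The two corpus versions listed above differ only in how often the quality sources are repeated. A minimal sketch, assuming each source is already available as a list of documents (the `build_corpus` function and the in-memory representation are hypothetical and chosen for illustration):

```python
from typing import Dict, Iterator, List

# Sources that are doubled in the oversampled version.
QUALITY_SOURCES = {"Wikipedia", "NCC", "NAK", "NBDigital"}


def build_corpus(sources: Dict[str, List[str]], oversample: bool) -> Iterator[str]:
    """Yield documents for the baseline or the oversampled corpus version."""
    for name, documents in sources.items():
        # Quality sources are emitted twice when oversampling, everything else once.
        repeats = 2 if oversample and name in QUALITY_SOURCES else 1
        for _ in range(repeats):
            yield from documents
```

For example, `build_corpus(sources, oversample=False)` gives the baseline and `build_corpus(sources, oversample=True)` gives the oversampled version. Shuffling is left out here and would presumably happen in a later step.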
Vocabulary
Starting with 50K, following NorBERT-2. We may experiment with other values later.
To Decide
Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.
A: No, it doesn't. Only 60 of the 18M paragraphs in NCC are duplicates of paragraphs in NBDigital, so we should definitely use it.
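An overlap count like the one in this answer can be reproduced along the following lines. This is a sketch under assumptions: it counts exact matches after whitespace normalization, which may differ from how the figure above was obtained, and the names `paragraph_key` and `count_overlap` are made up for the example.

```python
import hashlib
from typing import Iterable


def paragraph_key(paragraph: str) -> bytes:
    """Hash a paragraph after collapsing whitespace, to keep memory use low."""
    normalized = " ".join(paragraph.split())
    return hashlib.sha1(normalized.encode("utf-8")).digest()


def count_overlap(corpus_a: Iterable[str], corpus_b: Iterable[str]) -> int:
    """Count paragraphs of corpus_a that also occur in corpus_b."""
    keys_b = {paragraph_key(p) for p in corpus_b}
    return sum(1 for p in corpus_a if paragraph_key(p) in keys_b)
```

Here `corpus_a` would be the NCC paragraphs and `corpus_b` the NBDigital paragraphs.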
Q: How should we split the training corpora: one sentence per line, one paragraph per line, or one document per line?
A: BERT assumes that there is one sentence per line.
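A minimal sketch of producing the one-sentence-per-line layout follows. Two things here are assumptions rather than decisions recorded above: the regex splitter is a naive placeholder for a proper sentence segmenter, and the blank line between documents follows our understanding of the input convention of the original BERT pre-processing script. The name `to_bert_lines` is hypothetical.

```python
import re
from typing import Iterable, Iterator

# Naive splitter: break after ., ! or ? when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_bert_lines(documents: Iterable[str]) -> Iterator[str]:
    """Yield one sentence per line, with an empty line after each document."""
    for document in documents:
        emitted = False
        for sentence in _SENTENCE_END.split(document.strip()):
            sentence = sentence.strip()
            if sentence:
                yield sentence
                emitted = True
        if emitted:
            yield ""  # blank line marks the document boundary
```

Writing the result is then a matter of joining the yielded lines with newlines. Abbreviations and numbers with periods will be split incorrectly by this regex, so a dedicated Norwegian sentence segmenter would be preferable in practice.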