Difference between revisions of "Eosc/NorBERT3 corpus"

Latest revision as of 18:50, 24 October 2022

Workflow

De-duplication: essentially, removing identical paragraphs using SimHash. This is similar to the NearDup approach in this paper (https://aclanthology.org/2022.acl-long.577/), although they used MinHash; MurmurHash is another option.
Cleaning

Other de-duplication packages exist as well, for example datasketch (https://github.com/ekzhu/datasketch).
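The de-duplication step can be illustrated with a minimal sketch. It is not the text-dedup implementation linked above: the function names, the word-level tokenization, the MD5-based token hash and the `max_distance` threshold are all assumptions made for illustration, and the pairwise comparison is quadratic, so it only suits small samples.

```python
import hashlib
import re


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over the word tokens of `text`."""
    weights = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Stable 64-bit token hash (truncated MD5; an assumption, any stable hash works).
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for i in range(64):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(paragraphs, max_distance: int = 0):
    """Keep the first occurrence of each paragraph.

    A later paragraph is dropped when its fingerprint is within
    `max_distance` bits of an already kept one (0 = same fingerprint).
    """
    kept, fingerprints = [], []
    for paragraph in paragraphs:
        fp = simhash(paragraph)
        if any(hamming(fp, seen) <= max_distance for seen in fingerprints):
            continue
        kept.append(paragraph)
        fingerprints.append(fp)
    return kept
```

Raising `max_distance` above 0 would also drop near-duplicates; which threshold is appropriate for this corpus is not stated on this page.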

Sampling experiment

We plan to create two versions of the training corpus:

baseline (as is)
Wikipedia+NCC+NAK+NBDigital multiplied by two to match the C4 size (oversampling quality data)
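The two corpus versions could be assembled roughly as follows. This is only a sketch: `build_corpus`, the in-memory dictionary of sources and the toy documents are hypothetical, and the real corpora would be streamed from disk rather than held in lists.

```python
def build_corpus(sources, oversampled=(), factor=2):
    """Concatenate all sources, repeating the `oversampled` ones `factor` times.

    `sources` maps a source name to a list of documents.
    """
    corpus = []
    for name, documents in sources.items():
        repeats = factor if name in oversampled else 1
        corpus.extend(list(documents) * repeats)
    return corpus


# Toy stand-ins for the real sources (hypothetical data).
sources = {
    "Wikipedia": ["w1"],
    "NCC": ["n1", "n2"],
    "NAK": ["a1"],
    "NBDigital": ["d1"],
    "C4": ["c1", "c2", "c3"],
}

baseline = build_corpus(sources)
oversampled = build_corpus(
    sources, oversampled={"Wikipedia", "NCC", "NAK", "NBDigital"}
)
```

With this toy data the baseline holds 8 documents and the oversampled version 13, since every source except C4 is counted twice.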

Vocabulary

We start with a vocabulary size of 50K, following NorBERT-2. We may experiment with other values later.

To Decide

Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.

A: No, it does not. Only 60 paragraphs out of a total of 18M in NCC are duplicates of paragraphs in NBDigital. Thus, we should definitely use it.
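For scale, 60 out of 18M paragraphs is roughly 0.0003% of NCC. A count of this kind could be obtained along the following lines. The page does not say how the overlap was measured, so exact matching after whitespace normalization is an assumption here, and the function names are hypothetical.

```python
import hashlib


def paragraph_key(paragraph: str) -> bytes:
    """Hash a paragraph after collapsing runs of whitespace."""
    normalized = " ".join(paragraph.split())
    return hashlib.sha1(normalized.encode("utf-8")).digest()


def count_overlap(corpus_a, corpus_b) -> int:
    """Count paragraphs of `corpus_a` that also occur in `corpus_b`."""
    keys_b = {paragraph_key(p) for p in corpus_b}
    return sum(1 for p in corpus_a if paragraph_key(p) in keys_b)


def overlap_share(duplicates: int, total: int) -> float:
    """Share of duplicated paragraphs, as a percentage."""
    return 100.0 * duplicates / total
```

Storing only the digests of one corpus keeps memory use proportional to the number of paragraphs rather than to their length.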

Q: How should we split training corpora: one sentence per line, one paragraph per line, one document per line?

A: BERT assumes that there is one sentence per line.
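A minimal sketch of the one-sentence-per-line layout follows. The regular-expression splitter is a naive stand-in (an assumption; a proper Norwegian sentence segmenter would be needed in practice), and the blank line between documents follows the convention of the original BERT pre-training scripts rather than anything decided on this page.

```python
import re


def to_bert_lines(documents):
    """Return one sentence per line, with an empty line after each document."""
    lines = []
    for document in documents:
        # Naive split on whitespace that follows ., ! or ? (illustration only).
        parts = re.split(r"(?<=[.!?])\s+", document.strip())
        sentences = [s.strip() for s in parts if s.strip()]
        if not sentences:
            continue
        lines.extend(sentences)
        lines.append("")  # document boundary
    return lines
```

For example, `to_bert_lines(["Hei. Hvordan går det?", "Ny tekst."])` yields the two sentences of the first document, an empty line, the single sentence of the second document, and a final empty line.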

@@ Line 1: / Line 1: @@
 == Workflow ==
+* [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup De-duplication]: essentially, removing identical paragraphs using SimHash (similar to the NearDup approach in [https://aclanthology.org/2022.acl-long.577/ this paper], although they used MinHash; [https://pypi.org/project/mmh3/ MurMurHash] is another option).
 * [https://arxiv.org/abs/2112.11446 Cleaning]
-* [https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup Deduplication] ([https://pypi.org/project/mmh3/ MurMurHash])
-* [https://github.com/ekzhu/datasketch ...or another package]
+There are [https://github.com/ekzhu/datasketch other de-duplication packages]
 == Sampling experiment ==
 We plan to create two versions of the training corpus:
 * baseline (as is)
-* Wikipedia+NCC+NAK multiplied by two to match the C4 size (oversampling quality data)
+* Wikipedia+NCC+NAK+NBDigital multiplied by two to match the C4 size (oversampling quality data)
 == Vocabulary ==
@@ Line 13: / Line 14: @@
 == To Decide ==
-The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.
+Q: The size of NBDigital is 662M tokens. Should we use it? It probably overlaps a lot with NCC.
+A: No, it isn't. Only 60 paragraphs out of total 18M in NCC are duplicates of paragraphs in NBDigital. Thus, we definitely should use it.
+Q: How should we split training corpora: one sentence per line, one paragraph per line, one document per line?
+A: BERT assumes that there is one sentence per line.
