Difference between revisions of "Eosc/NorBERT3 corpus"
Revision as of 14:20, 12 October 2022
- Cleaning: follow the procedure described in https://arxiv.org/abs/2112.11446
- Deduplication: see https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup and https://github.com/ekzhu/datasketch
- Two versions of the corpus: a baseline, and one in which Wikipedia, NCC, and NAK are repeated twice to match the size of C4
TODO: what is the size of NBDigital, and should we include it?
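
The deduplication step in the list above could look roughly like the following. This is a minimal standard-library sketch of MinHash-based near-duplicate removal, the general technique behind the linked tools. It does not use the text-dedup or datasketch APIs, and the names `minhash` and `deduplicate`, the 5-character shingles, the 64 hash functions, and the 0.7 threshold are all assumptions made for illustration, not settings taken from this page.

```python
import hashlib
import re

NUM_PERM = 64   # number of hash functions per signature (assumed value)
SHINGLE = 5     # character n-gram length (assumed value)


def shingles(text, n=SHINGLE):
    """Return the set of character n-grams of a whitespace-normalised text."""
    norm = re.sub(r"\s+", " ", text.lower()).strip()
    if len(norm) <= n:
        return {norm}
    return {norm[i:i + n] for i in range(len(norm) - n + 1)}


def minhash(text, num_perm=NUM_PERM):
    """Compute a MinHash signature: the minimum hash value per salted hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.7):
    """Keep the first document of every group of near-duplicates."""
    kept = []  # (index, signature) pairs of the documents kept so far
    for i, doc in enumerate(docs):
        sig = minhash(doc)
        if all(estimated_jaccard(sig, s) < threshold for _, s in kept):
            kept.append((i, sig))
    return [docs[i] for i, _ in kept]


docs = [
    "Oslo er hovedstaden i Norge og landets største by.",
    "Oslo er hovedstaden i Norge og landets største by!",
    "Bergen ligger på vestkysten av Norge.",
]
unique_docs = deduplicate(docs)
```

The sketch compares every document against every document already kept, which is quadratic in the number of documents. At corpus scale this would be replaced by locality-sensitive hashing over the signatures, so that only likely duplicates are compared.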