Difference between revisions of "Eosc/NorBERT3 corpus"

From Nordic Language Processing Laboratory
(Created page with "* Cleaning procedure from https://arxiv.org/abs/2112.11446 * Deduplication https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup https://github.com/ekzhu/datasketch *...")
 
Line 2:
 
* Deduplication https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup https://github.com/ekzhu/datasketch
 
 
* Two versions: baseline and wikipedia+NCC+NAK multiplied by two to match the C4 size
 
+
+ Todo: what is the size of NBDigital and should we use it?

Revision as of 14:20, 12 October 2022

Todo: what is the size of NBDigital and should we use it?
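
The deduplication item above links to text-dedup and datasketch but does not say how the step is meant to be run. As a rough illustration only, the sketch below shows MinHash-based near-duplicate filtering using nothing but the Python standard library. It is not the API of either linked library, and every name and parameter in it (`shingles`, `minhash`, `deduplicate`, word 3-grams, 64 hash functions, a 0.8 threshold) is an assumption chosen for the example, not a setting taken from this page.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of lowercase word n-grams in `text` (hypothetical choice: n=3)."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash(text, num_perm=64):
    """Build a MinHash signature: one minimum per salted hash function."""
    grams = shingles(text)
    if not grams:
        return None
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt gives a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for g in grams
        ))
    return tuple(signature)


def similarity(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if no previously kept document is at least `threshold` similar."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash(doc)
        if sig is None:
            continue  # skip empty documents
        if any(similarity(sig, other) >= threshold for other in kept_sigs):
            continue  # near-duplicate of something already kept
        kept.append(doc)
        kept_sigs.append(sig)
    return kept
```

This version compares each document against every document kept so far, which is quadratic and only suitable for small samples. At corpus scale an index such as datasketch's `MinHashLSH` would presumably take the place of the pairwise loop.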