Eosc/NorBERT3 corpus

From Nordic Language Processing Laboratory

Revision as of 14:27, 12 October 2022 by Andreku (talk | contribs)

Cleaning procedure: as described in https://arxiv.org/abs/2112.11446
Deduplication: https://github.com/ChenghaoMou/text-dedup/tree/main/text_dedup and https://github.com/ekzhu/datasketch
Two versions: a baseline, and one in which Wikipedia, NCC and NAK are oversampled by a factor of two to match the C4 size
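
The deduplication step above points to two projects that both provide MinHash-based near-duplicate detection. As a rough illustration of the idea only, here is a minimal standard-library sketch; it does not use the text-dedup or datasketch APIs, and the function names, the shingle size, the number of hash functions and the 0.7 threshold are all assumptions made for the example, not settings chosen for this corpus.

```python
import hashlib
import re

# All three constants are illustrative assumptions, not project settings.
NUM_HASHES = 64      # length of each MinHash signature
SHINGLE_SIZE = 3     # number of consecutive words per shingle
THRESHOLD = 0.7      # estimated Jaccard similarity that counts as a duplicate


def shingles(text, n=SHINGLE_SIZE):
    """Return the set of lowercase word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(text, num_hashes=NUM_HASHES):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item, digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def deduplicate(docs, threshold=THRESHOLD):
    """Keep each document unless it resembles one that was already kept.

    This compares every document with every kept document (quadratic time).
    Tools built for large corpora add locality-sensitive hashing on top of
    the signatures to avoid the pairwise comparison.
    """
    kept, kept_signatures = [], []
    for doc in docs:
        sig = minhash_signature(doc)
        if all(estimated_jaccard(sig, other) < threshold
               for other in kept_signatures):
            kept.append(doc)
            kept_signatures.append(sig)
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "an entirely unrelated sentence about corpus cleaning",
    ]
    for doc in deduplicate(sample):
        print(doc)
```

The first occurrence of each near-duplicate group is the one that survives, so the order in which sources are fed to such a filter matters when they differ in quality.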

Todo: what is the size of NBDigital and should we use it?

Todo: vocabulary size? Start with 50K