Eosc/NorBERT3 corpus

From Nordic Language Processing Laboratory

Revision as of 20:14, 18 October 2022 by Andreku (talk | contribs) (Cleaning)

Contents

1 Workflow
2 Sampling experiment
3 Vocabulary
4 To Decide

Workflow

Sampling experiment

We plan to create two versions of the training corpus:

baseline (all sources as is)
oversampled (Wikipedia+NCC+NAK repeated twice, so that the higher-quality data roughly matches C4 in size)
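
The two versions above can be sketched as a simple sampling plan. This is only an illustration, not the actual pipeline: the lowercase source names, the function `build_corpus_plan`, and the assumption that C4 is part of both versions are ours; only the factor of two comes from the plan itself.

```python
from typing import List

# Sources treated as "quality data" in the plan above (names are illustrative).
QUALITY_SOURCES = {"wikipedia", "ncc", "nak"}


def build_corpus_plan(sources: List[str], oversample: bool, factor: int = 2) -> List[str]:
    """Return the sources in concatenation order, one entry per pass over a source.

    With oversample=False every source appears once (baseline).
    With oversample=True each quality source appears `factor` times.
    """
    plan: List[str] = []
    for name in sources:
        repeats = factor if oversample and name in QUALITY_SOURCES else 1
        plan.extend([name] * repeats)
    return plan


# Assumed source list; C4 is included because the plan compares sizes against it.
sources = ["c4", "wikipedia", "ncc", "nak"]
baseline = build_corpus_plan(sources, oversample=False)
oversampled = build_corpus_plan(sources, oversample=True)
```

Repeating whole sources keeps the two versions identical in content, so any difference in downstream results can be attributed to the sampling ratio alone.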

Vocabulary

We start with a vocabulary size of 50K, following NorBERT-2. We may experiment with other values later.

To Decide

What is the size of NBDigital and should we use it? It probably overlaps a lot with NCC.
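
One way to check the suspected overlap before deciding would be to compare document fingerprints between the two collections. The sketch below is an assumption on our side, not an agreed procedure: it detects only exact duplicates after lowercasing and whitespace normalization, and the function names are hypothetical. Near-duplicates (for example, different OCR versions of the same book) would need a fuzzier method.

```python
import hashlib
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase and collapse all whitespace runs to single spaces."""
    return " ".join(text.lower().split())


def fingerprints(docs: Iterable[str]) -> Set[str]:
    """Return SHA-1 hex digests of the normalized documents."""
    return {hashlib.sha1(normalize(d).encode("utf-8")).hexdigest() for d in docs}


def overlap_ratio(candidate_docs: Iterable[str], reference_docs: Iterable[str]) -> float:
    """Share of distinct candidate documents that also occur in the reference set."""
    cand = fingerprints(candidate_docs)
    ref = fingerprints(reference_docs)
    if not cand:
        return 0.0
    return len(cand & ref) / len(cand)


# Toy data standing in for NBDigital (candidate) and NCC (reference).
ratio = overlap_ratio(["Hei  verden", "Ny tekst"], ["hei verden"])
```

A high ratio on a sample of NBDigital would suggest that adding it brings little new text beyond NCC.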
