Difference between revisions of "Eosc/norbert/benchmark"
Revision as of 11:16, 23 June 2021
Emerging Thoughts on Benchmarking
The following would be natural places to start. For most of these, we have baseline numbers to compare against but no existing set-up into which we could simply plug a Norwegian BERT and run it, so we may need to identify suitable code from existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy should have an existing set-up using mBERT that we could perhaps use.
Linguistic pipeline (for dependency parsing or PoS tagging)
- Bokmaal: https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal
- Nynorsk: https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk
- Spoken dialects: https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA
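Universal Dependencies treebanks are distributed as CoNLL-U files: one token per line with ten tab-separated columns, comment lines starting with "#", and a blank line between sentences. As a starting point for a tagging or parsing benchmark, a minimal reader might look like the sketch below. The names `Token` and `read_conllu` are hypothetical helpers introduced here, and the sample sentence is hand-written for illustration, not taken from any of the treebanks.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """Hypothetical container for the columns a tagger/parser benchmark needs."""
    form: str    # column 2: word form
    upos: str    # column 4: universal PoS tag
    head: int    # column 7: index of the syntactic head (0 = root)
    deprel: str  # column 8: dependency relation to the head


def read_conllu(text: str) -> Iterator[List[Token]]:
    """Yield one list of Token per sentence from CoNLL-U formatted text."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment / metadata
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        # HEAD may be "_" in unannotated data; map that to -1 (an assumption).
        head = int(cols[6]) if cols[6].isdigit() else -1
        sentence.append(Token(cols[1], cols[3], head, cols[7]))
    if sentence:
        yield sentence  # file did not end with a blank line


# Hand-written illustrative sample (not from a treebank).
SAMPLE = "\n".join([
    "# text = Hunden sover .",
    "\t".join(["1", "Hunden", "hund", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "sover", "sove", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    for sent in read_conllu(SAMPLE):
        print([(t.form, t.upos, t.head, t.deprel) for t in sent])
```

The resulting token lists give word forms as model input and UPOS tags or head/relation pairs as gold labels, so the same reader could serve both the tagging and the parsing set-up.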
Document classification
- NoReC; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a version other than the current official release; that version has 10k more sentences and is soon to be officially released.
- Talk of Norway
- NorDial
Other
- NoReC_fine; subset of documents from NoReC annotated with fine-grained sentiment (e.g. for predicting target expression + polarity)
- NorNE; for named entity recognition, extends NDT (also available for the UD version)
- NoReC_neg; soon to be released; adds negation cues and scopes to the same subset of sentences as in NoReC_fine.