<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Erikve</id>
	<title>Nordic Language Processing Laboratory - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.nlpl.eu/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Erikve"/>
	<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/Special:Contributions/Erikve"/>
	<updated>2026-05-16T06:23:25Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.31.10</generator>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Community/training&amp;diff=1657</id>
		<title>Community/training</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Community/training&amp;diff=1657"/>
		<updated>2024-01-30T13:22:25Z</updated>

		<summary type="html">&lt;p&gt;Erikve: panelists&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;'''HPLT &amp;amp; NLPL Winter School on Large Language Models: Creation, Customization, Evaluation, and Use'''&lt;br /&gt;
&lt;br /&gt;
[[File:Skeikampen.2023.jpg|center]]&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
Since 2023, the NLPL network and Horizon Europe&lt;br /&gt;
project ''[https://hplt-project.org High-Performance Language Technologies]'' (HPLT)&lt;br /&gt;
have joined forces to organize the successful winter school series on Web-scale NLP.&lt;br /&gt;
The winter school seeks to stimulate ''community formation'',&lt;br /&gt;
i.e. strengthening interaction and collaboration among&lt;br /&gt;
European research teams in NLP and advancing a shared level of knowledge&lt;br /&gt;
and experience in using high-performance e-infrastructures for large-scale&lt;br /&gt;
NLP research.&lt;br /&gt;
The 2024 edition of the winter school puts special emphasis on&lt;br /&gt;
NLP researchers from countries that participate in the EuroHPC&lt;br /&gt;
[https://www.lumi-supercomputer.eu/lumi-consortium/ LUMI consortium].&lt;br /&gt;
For additional background, please see the archival pages from the&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Community/training/2018 2018],&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Community/training/2019 2019],&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Community/training/2020 2020], and&lt;br /&gt;
[http://wiki.nlpl.eu/index.php/Community/training/2023 2023]&lt;br /&gt;
NLPL Winter Schools.&lt;br /&gt;
&lt;br /&gt;
HPLT will hold its 2024 winter school from Sunday, February 4, to&lt;br /&gt;
Tuesday, February 6, 2024, at a&lt;br /&gt;
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel]&lt;br /&gt;
(with skiing and walking opportunities) about two hours north of Oslo.&lt;br /&gt;
The project will organize group bus transfer from and to the Oslo&lt;br /&gt;
airport ''Gardermoen'', leaving the airport at 9:45 on Sunday morning&lt;br /&gt;
and returning there around 17:30 on Tuesday afternoon.&lt;br /&gt;
&lt;br /&gt;
The winter school is subsidized by the HPLT project: there is no fee for&lt;br /&gt;
participants and no charge for the bus transfer to and from the&lt;br /&gt;
conference hotel.&lt;br /&gt;
All participants will have to cover their own travel and accommodation&lt;br /&gt;
at Skeikampen, however.&lt;br /&gt;
Two nights at the hotel, including all meals, will come to NOK 3745 (NOK 3345 per person in a shared double room), &lt;br /&gt;
to be paid to the hotel directly.&lt;br /&gt;
&lt;br /&gt;
= Programme =&lt;br /&gt;
&lt;br /&gt;
The 2024 winter school will have a thematic focus on ''Large Language Models: Creation, Customization, Evaluation, and Use''.&lt;br /&gt;
The programme will comprise in-depth technical presentations (possibly including some&lt;br /&gt;
hands-on elements) by seasoned experts, with special emphasis on open science and European languages,&lt;br /&gt;
but will also include critical reflections on current development trends in LLM-focussed NLP.&lt;br /&gt;
The programme will be complemented with a panel discussion and a ‘walk-through’ of available&lt;br /&gt;
infrastructure on the shared EuroHPC LUMI supercomputer.&lt;br /&gt;
&lt;br /&gt;
Confirmed presenters include:&lt;br /&gt;
&lt;br /&gt;
* [http://afra.alishahi.name Afra Alishahi, Tilburg University, The Netherlands]&lt;br /&gt;
* [https://di.ku.dk/english/staff/vip/?pure=en/persons/631668 Desmond Elliott, University of Copenhagen, Denmark]&lt;br /&gt;
* [https://muennighoff.github.io/ Niklas Muennighoff, Contextual AI]&lt;br /&gt;
* [https://perso.limsi.fr/neveol/bio.html Aurélie Névéol, Interdisciplinary Laboratory of Numerical Sciences, France]&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!colspan=3|Sunday, February 4, 2024&lt;br /&gt;
|-&lt;br /&gt;
| 13:00 || 14:00 || Lunch&lt;br /&gt;
|-&lt;br /&gt;
| 14:00 || 15:30 || '''Session 1''': Analyzing and Interpreting Deep Neural Models of Language ([http://afra.alishahi.name Afra Alishahi])&lt;br /&gt;
|-&lt;br /&gt;
| 15:30 || 15:50 || Coffee Break&lt;br /&gt;
|-&lt;br /&gt;
| 16:00 || 17:30 || '''Session 2''': Analyzing and Interpreting Deep Neural Models of Language ([http://afra.alishahi.name Afra Alishahi])&lt;br /&gt;
|-&lt;br /&gt;
| 17:30 || 17:50 || Coffee Break&lt;br /&gt;
|-&lt;br /&gt;
| 17:50 || 19:20 || '''Session 3''': Scaling Data-constrained Language Models ([https://muennighoff.github.io/ Niklas Muennighoff])&lt;br /&gt;
|-&lt;br /&gt;
| 19:30 ||  || Dinner&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!colspan=3|Monday, February 5, 2024&lt;br /&gt;
|-&lt;br /&gt;
|colspan=3 | Breakfast is available from 07:30&lt;br /&gt;
|-&lt;br /&gt;
| 09:00 || 10:30 || '''Session 4''': Bias in Natural Language Processing: focus on large language models ([https://perso.limsi.fr/neveol/bio.html Aurélie Névéol]) &lt;br /&gt;
|-&lt;br /&gt;
|colspan=3| Free time (Lunch is available between 13:00 and 14:30)&lt;br /&gt;
|-&lt;br /&gt;
| 15:00 || 16:30 || '''Session 5''': Multilingual and multimodal language models ([https://di.ku.dk/english/staff/vip/?pure=en/persons/631668 Desmond Elliott]) &lt;br /&gt;
|-&lt;br /&gt;
| 16:30 || 16:50 || Coffee Break&lt;br /&gt;
|-&lt;br /&gt;
| 16:50 || 17:40 || '''Session 6''': Multilingual and multimodal language models ([https://di.ku.dk/english/staff/vip/?pure=en/persons/631668 Desmond Elliott]) &lt;br /&gt;
|-&lt;br /&gt;
| 17:40 || 18:00 || Coffee Break&lt;br /&gt;
|-&lt;br /&gt;
| 18:00 || 19:15 || '''Session 7''': «Large vs. Small»: panel discussion. Panelists: Per Kummervold (National Library of Norway), Desmond Elliott (University of Copenhagen), Evangelia Gogoulou (RISE, Sweden), Afra Alishahi (Tilburg University), Jan Hajič (Charles University in Prague), and Aurélie Névéol (LISN, France)&lt;br /&gt;
|-&lt;br /&gt;
| 19:30 ||  || Dinner&lt;br /&gt;
|-&lt;br /&gt;
| 21:00 || || '''Evening Session'''. LUMI: BERT in an Hour, GPT in a Week. Speakers: David Samuel (University of Oslo) and Risto Luukkonen (University of Turku, Silo AI)&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
!colspan=3|Tuesday, February 6, 2024&lt;br /&gt;
|-&lt;br /&gt;
|colspan=3| Breakfast is available from 07:30&lt;br /&gt;
|-&lt;br /&gt;
| 08:30 || 10:00 || '''Session 8''': Reproducibility in Natural Language Processing ([https://perso.limsi.fr/neveol/bio.html Aurélie Névéol]) &lt;br /&gt;
|-&lt;br /&gt;
| 10:00 || 10:30 || Coffee Break&lt;br /&gt;
|-&lt;br /&gt;
| 10:30 || 12:00 || '''Session 9''': Understanding and measuring the environmental impact of Natural Language Processing ([https://perso.limsi.fr/neveol/bio.html Aurélie Névéol]) &lt;br /&gt;
|-&lt;br /&gt;
| 12:30 || 13:30 || Lunch&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
= Registration =&lt;br /&gt;
&lt;br /&gt;
In total, we anticipate around 55 participants at the 2024 winter school.&lt;br /&gt;
We have received more requests for participation than we will be able to accommodate,&lt;br /&gt;
and the registration form has now been closed.&lt;br /&gt;
We processed requests for participation on a first-come, first-served basis, with an eye toward regional balance.&lt;br /&gt;
Interested parties who submitted the registration form were confirmed in three batches: on December 11, December 15,&lt;br /&gt;
and December 22, which was also the closing date for winter school registration.&lt;br /&gt;
&lt;br /&gt;
Once confirmed by the organizing team, participant names are published&lt;br /&gt;
on this page, and registration establishes a&lt;br /&gt;
''binding agreement'' with the hotel.&lt;br /&gt;
Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute&lt;br /&gt;
spaces), and no-shows will be charged the full price for at least one night&lt;br /&gt;
by the hotel.&lt;br /&gt;
&lt;br /&gt;
= Logistics = &lt;br /&gt;
&lt;br /&gt;
With a few exceptions, winter school participants travel to and from the conference hotel&lt;br /&gt;
jointly on a chartered bus (the HPLT shuttle).&lt;br /&gt;
The bus will leave OSL airport no later than 9:45 CET on Sunday, February 4.&lt;br /&gt;
Thus, please meet up by 9:30 and make your arrival known to your assigned&lt;br /&gt;
‘tour guide’ (who will introduce themselves to you by email beforehand).&lt;br /&gt;
&lt;br /&gt;
The group will gather near the DNB currency exchange booth in the downstairs&lt;br /&gt;
arrivals area, just outside the international arrivals luggage claims and slightly&lt;br /&gt;
to the left as one exits the customs area:&lt;br /&gt;
the yellow dot numbered (18) on the&lt;br /&gt;
[https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map].&lt;br /&gt;
The group will then walk over to the bus terminal, to leave the airport not long after 9:40.&lt;br /&gt;
The drive to the Skeikampen conference hotel will take us about three hours, and the bus&lt;br /&gt;
will make one stop along the way to stretch our legs and fill up on coffee.&lt;br /&gt;
&lt;br /&gt;
The winter school will end with lunch on Tuesday, February 6, before the group returns&lt;br /&gt;
to OSL airport on the HPLT shuttle.&lt;br /&gt;
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL&lt;br /&gt;
around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.&lt;br /&gt;
&lt;br /&gt;
= Organization =&lt;br /&gt;
&lt;br /&gt;
The 2024 Winter School is organized by a team of volunteers at the University&lt;br /&gt;
of Oslo, supported by a programme committee from the HPLT project, the NLPL network, and beyond;&lt;br /&gt;
please see below.&lt;br /&gt;
For all inquiries regarding registration, the programme, logistics,&lt;br /&gt;
or such, please contact &amp;lt;code&amp;gt;hplt-training@ifi.uio.no&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
The programme committee is comprised of:&lt;br /&gt;
&lt;br /&gt;
* Isabelle Augenstein (University of Copenhagen, Denmark)&lt;br /&gt;
* Emily M. Bender (University of Washington, USA)&lt;br /&gt;
* Kenneth Heafield (Edinburgh University, UK)&lt;br /&gt;
* Jindřich Helcl (Charles University, Czech Republic)&lt;br /&gt;
* Marco Kuhlmann (Linköping University, Sweden)&lt;br /&gt;
* Per Egil Kummervold (National Library of Norway)&lt;br /&gt;
* Andrey Kutuzov (University of Oslo, Norway)&lt;br /&gt;
* Joakim Nivre (RISE and Uppsala University, Sweden)&lt;br /&gt;
* Stephan Oepen (University of Oslo, Norway)&lt;br /&gt;
* Sampo Pyysalo (University of Turku, Finland)&lt;br /&gt;
* Gema Ramirez (Prompsit Language Engineering, Spain)&lt;br /&gt;
* Anna Rogers (IT University of Copenhagen, Denmark)&lt;br /&gt;
* Magnus Sahlgren (AI Sweden)&lt;br /&gt;
* David Samuel (University of Oslo, Norway)&lt;br /&gt;
* Jörg Tiedemann (University of Helsinki, Finland)&lt;br /&gt;
* Erik Velldal (University of Oslo, Norway)&lt;br /&gt;
&lt;br /&gt;
= Participants =&lt;br /&gt;
&lt;br /&gt;
# Afra Alishahi, Tilburg University (The Netherlands)&lt;br /&gt;
# Ali Allaith, University of Copenhagen (Denmark)&lt;br /&gt;
# Nikolay Arefev, University of Oslo (Norway)&lt;br /&gt;
# Joseph Attieh, University of Helsinki (Finland)&lt;br /&gt;
# Christopher Brückner, Charles University in Prague (Czech Republic)&lt;br /&gt;
# Lucas Charpentier, University of Oslo (Norway)&lt;br /&gt;
# Konstantin Dobler, Hasso Plattner Institute (Germany)&lt;br /&gt;
# Aleksei Dorkin, University of Tartu (Estonia)&lt;br /&gt;
# Luise Dürlich, Uppsala University (Sweden)&lt;br /&gt;
# Simen Eide, Schibsted (Norway)&lt;br /&gt;
# Desmond Elliott, University of Copenhagen (Denmark)&lt;br /&gt;
# Kenneth Enevoldsen, Aarhus University (Denmark)&lt;br /&gt;
# Mariia Fedorova, University of Oslo (Norway)&lt;br /&gt;
# Emilie Francis, Gothenburg University (Sweden)&lt;br /&gt;
# Evangelia Gogoulou, RISE (Sweden)&lt;br /&gt;
# Jan Hajič, Charles University in Prague (Czech Republic)&lt;br /&gt;
# Lasse Hansen, Aarhus University Hospital (Denmark)&lt;br /&gt;
# Jindřich Helcl, Charles University in Prague (Czech Republic)&lt;br /&gt;
# Yiping Jin, Pompeu Fabra University (Spain)&lt;br /&gt;
# Amanda Kann, Stockholm University (Sweden)&lt;br /&gt;
# Jan Kostkan, Aarhus University (Denmark)&lt;br /&gt;
# Per Kummervold, National Library of Norway&lt;br /&gt;
# Andrey Kutuzov, University of Oslo (Norway)&lt;br /&gt;
# Tsz Kin Lam, University of Edinburgh (UK)&lt;br /&gt;
# Wenyan Li, University of Copenhagen (Denmark)&lt;br /&gt;
# Pierre Lison, Norsk Regnesentral&lt;br /&gt;
# Jouni Luoma, University of Turku (Finland)&lt;br /&gt;
# Risto Luukkonen, University of Turku (Finland)&lt;br /&gt;
# Arianna Masciolini, Gothenburg University (Sweden)&lt;br /&gt;
# Petter Mæhlum, University of Oslo (Norway)&lt;br /&gt;
# Vladislav Mikhailov, University of Oslo (Norway)&lt;br /&gt;
# Yousuf Ali Mohammed, Gothenburg University (Sweden)&lt;br /&gt;
# Aurélie Névéol, LISN &amp;amp; CNRS (France)&lt;br /&gt;
# Tobias Norlund, AI Sweden (Sweden)&lt;br /&gt;
# Stephan Oepen, University of Oslo (Norway)&lt;br /&gt;
# Lilja Øvrelid, University of Oslo (Norway)&lt;br /&gt;
# Alberto Parola, University of Copenhagen (Denmark)&lt;br /&gt;
# Siddhesh Pawar, University of Copenhagen (Denmark)&lt;br /&gt;
# Erofili Psaltaki, University of Helsinki (Finland)&lt;br /&gt;
# Akseli Reunamo, University of Turku (Finland)&lt;br /&gt;
# David Samuel, University of Oslo (Norway)&lt;br /&gt;
# Ricardo Muñoz Sánchez, Gothenburg University (Sweden)&lt;br /&gt;
# Gautam Kishore Shahi, University of Duisburg-Essen (Germany)&lt;br /&gt;
# Janine Siewert, University of Helsinki (Finland)&lt;br /&gt;
# Étienne Simon, University of Oslo (Norway)&lt;br /&gt;
# Inguna Skadiņa, University of Latvia&lt;br /&gt;
# Ondrej Sotolar, Masaryk University (Czech Republic)&lt;br /&gt;
# Pavel Stranak, Charles University in Prague (Czech Republic)&lt;br /&gt;
# Maria Irena Szawerna, Gothenburg University (Sweden)&lt;br /&gt;
# Jörg Tiedemann, University of Helsinki (Finland)&lt;br /&gt;
# Ekaterina Uetova, Technological University Dublin (Ireland)&lt;br /&gt;
# Erik Velldal, University of Oslo (Norway)&lt;br /&gt;
# Tea Vojtěchová, Charles University in Prague (Czech Republic)&lt;br /&gt;
# Jonas Waldendorf, University of Edinburgh (UK)&lt;br /&gt;
# Jaume Zaragoza-Bernabeu, Prompsit Language Engineering (Spain)&lt;br /&gt;
# Giulio Zhou, University of Edinburgh (UK)&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1377</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1377"/>
		<updated>2021-06-23T11:56:15Z</updated>

		<summary type="html">&lt;p&gt;Erikve: /* NLP tasks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] +  NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] + NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
* NER: [https://github.com/ltgoslo/norne NorNE] (Bokmål+Nynorsk)&lt;br /&gt;
* Co-reference resolution (annotation ongoing)&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1376</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1376"/>
		<updated>2021-06-23T11:56:01Z</updated>

		<summary type="html">&lt;p&gt;Erikve: /* NLP tasks */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] +  NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] + NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
* NER: [https://github.com/ltgoslo/norne NorNE] (bokmål+nynorsk)&lt;br /&gt;
* Co-reference resolution (annotation ongoing)&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1375</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1375"/>
		<updated>2021-06-23T11:55:20Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] +  NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] + NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
* NER: [https://github.com/ltgoslo/norne NorNE]&lt;br /&gt;
* Co-reference resolution (annotation ongoing)&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1374</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1374"/>
		<updated>2021-06-23T11:53:33Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] +  NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] + NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
* Co-reference resolution (annotation ongoing)&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1373</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1373"/>
		<updated>2021-06-23T11:52:53Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] +  NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA ILA] + NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1372</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1372"/>
		<updated>2021-06-23T11:51:07Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare to, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up using mBERT that we could perhaps use. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
* PoS tagging: NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
* Dependency parsing: NDT [https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal] / [https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk] &lt;br /&gt;
* [https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon (for static models)&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1370</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1370"/>
		<updated>2021-06-23T11:47:24Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
* Structured sentiment analysis: [https://github.com/ltgoslo/norec_fine NoReC_fine]  &lt;br /&gt;
* Sentence-level 2/3-way polarity: [https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]  &lt;br /&gt;
* Negation cues and scopes (evaluation is still being developed): [https://github.com/ltgoslo/norec_neg/ NoReC_neg]&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies] (for static models)&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1367</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1367"/>
		<updated>2021-06-23T11:41:09Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
&lt;br /&gt;
=== NoReC* ===&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical  ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-synonyms Norwegian synonyms]&lt;br /&gt;
*[https://github.com/ltgoslo/norwegian-analogies Norwegian analogies]&lt;br /&gt;
*[https://github.com/ltgoslo/norsentlex NorSentLex]: Sentiment lexicon&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1366</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1366"/>
		<updated>2021-06-23T11:37:36Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
&lt;br /&gt;
=== NoReC* ===&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical semantics ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; document-level ratings.&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1365</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1365"/>
		<updated>2021-06-23T11:37:02Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
&lt;br /&gt;
=== NoReC* ===&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical semantics ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC]; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a newer version than the current official release; the newer version has 10k more sentences (and is soon to be officially released).&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1363</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1363"/>
		<updated>2021-06-23T11:34:50Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
&lt;br /&gt;
=== NoReC* ===&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical semantics ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
&lt;br /&gt;
== Text classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC]; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a newer version than the current official release; the newer version has 10k more sentences (and is soon to be officially released).&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; subset of documents from NoReC annotated with fine-grained sentiment (e.g. for predicting target expression + polarity)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;br /&gt;
*NoReC_neg; soon to be released; adds negation cues and scopes to the same subset of sentences as in NoReC_fine.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1361</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1361"/>
		<updated>2021-06-23T11:33:41Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NLP tasks ==&lt;br /&gt;
&lt;br /&gt;
=== NoReC* ===&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
=== Linguistic pipeline (dependency parsing or PoS tagging) ===&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical semantics ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
&lt;br /&gt;
== Document classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC]; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a newer version than the current official release; the newer version has 10k more sentences (and is soon to be officially released).&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; subset of documents from NoReC annotated with fine-grained sentiment (e.g. for predicting target expression + polarity)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;br /&gt;
*NoReC_neg; soon to be released; adds negation cues and scopes to the same subset of sentences as in NoReC_fine.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1360</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1360"/>
		<updated>2021-06-23T11:29:43Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing set-ups where we could simply plug in a Norwegian BERT and run it, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) that we could re-use. For the first task, though (document-level SA on NoReC), Jeremy has an existing set-up for using mBERT that we could perhaps build on. &lt;br /&gt;
&lt;br /&gt;
== NoReC* == &lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]: structured sentiment analysis  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_sentence/ NoReC_sentences]: sentence-level 2/3-way polarity  &lt;br /&gt;
*[https://github.com/ltgoslo/norec_neg/ NoReC_neg]: negation cues and scopes&lt;br /&gt;
&lt;br /&gt;
== Linguistic pipeline (dependency parsing or PoS tagging) ==&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Bokmaal Bokmaal]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-Nynorsk Nynorsk]&lt;br /&gt;
*[https://github.com/UniversalDependencies/UD_Norwegian-NynorskLIA Spoken dialects]&lt;br /&gt;
&lt;br /&gt;
== Lexical semantics ==&lt;br /&gt;
*[https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-27/ Word sense disambiguation in context]&lt;br /&gt;
&lt;br /&gt;
== Document classification ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC]; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a newer version than the current official release; the newer version has 10k more sentences (and is soon to be officially released).&lt;br /&gt;
*[https://github.com/ltgoslo/talk-of-norway Talk of Norway]&lt;br /&gt;
*[https://github.com/jerbarnes/norwegian_dialect NorDial]&lt;br /&gt;
&lt;br /&gt;
==Other ==&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; subset of documents from NoReC annotated with fine-grained sentiment (e.g. for predicting target expression + polarity)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;br /&gt;
*NoReC_neg; soon to be released; adds negation cues and scopes to the same subset of sentences as in NoReC_fine.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Vectors/norlm/norbert&amp;diff=1281</id>
		<title>Vectors/norlm/norbert</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Vectors/norlm/norbert&amp;diff=1281"/>
		<updated>2021-01-13T11:59:41Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''[http://norlm.nlpl.eu Back to NorLM]''&lt;br /&gt;
&lt;br /&gt;
= NorBERT: Bidirectional Encoder Representations from Transformers =&lt;br /&gt;
&lt;br /&gt;
'''NorBERT''' is a BERT deep-learning language model ([https://www.aclweb.org/anthology/N19-1423/ Devlin et al. 2019]) trained from scratch for Norwegian. The model can be used to achieve state-of-the-art results on various Norwegian natural language processing tasks.&lt;br /&gt;
The model is part of the ongoing&lt;br /&gt;
[http://norlm.nlpl.eu NorLM initiative] for very large contextualized&lt;br /&gt;
Norwegian language models and associated tools and recipes.&lt;br /&gt;
The NorBERT training setup builds on prior work on&lt;br /&gt;
[https://github.com/TurkuNLP/FinBERT FinBERT] &lt;br /&gt;
by our collaborators at the&lt;br /&gt;
[https://turkunlp.org/ University of Turku].&lt;br /&gt;
&lt;br /&gt;
- '''[http://vectors.nlpl.eu/repository/20/215.zip Download from the NLPL Vector Repository]'''&lt;br /&gt;
&lt;br /&gt;
- '''[https://huggingface.co/ltgoslo/norbert Use with the Huggingface Transformers library]'''&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
'''NorBERT''' features a custom 30 000 WordPiece vocabulary that has much better coverage of Norwegian words than the multilingual BERT (mBERT) models from Google:&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Vocabulary !! Example of a tokenized sentence&lt;br /&gt;
|-&lt;br /&gt;
| NorBERT || Denne gjengen håper at de sammen skal bidra til å gi kvinne ##fotball ##en i Kristiansand et lenge etterl ##engt ##et løft .&lt;br /&gt;
|-&lt;br /&gt;
| mBERT || Denne g ##jeng ##en h ##å ##per at de sammen skal bid ##ra til å gi k ##vinne ##fo ##t ##ball ##en i Kristiansand et lenge etter ##len ##gte ##t l ##ø ##ft .&lt;br /&gt;
|}&lt;br /&gt;
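As the table illustrates, WordPiece marks word-internal continuation pieces with a leading '##'; merging those pieces back together recovers the original words. A minimal sketch in plain Python, independent of any particular tokenizer library:

```python
def merge_wordpieces(pieces):
    """Merge WordPiece subtokens (continuations marked with a leading '##')
    back into whole words."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # continuation: glue onto the previous word
        else:
            words.append(piece)
    return words

# The NorBERT pieces for "kvinnefotballen" from the table above:
print(merge_wordpieces(["kvinne", "##fotball", "##en"]))  # → ['kvinnefotballen']
```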
&lt;br /&gt;
== Evaluation ==&lt;br /&gt;
&lt;br /&gt;
We have currently evaluated NorBERT on two standard benchmarks: Part-of-Speech tagging on Bokmål (taken from [https://universaldependencies.org/ the Universal Dependencies project]) and sentence-level binary sentiment classification (created by aggregating the fine-grained annotations in [https://github.com/ltgoslo/norec_fine NoReC_fine] and removing sentences with conflicting or no sentiment). &lt;br /&gt;
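The aggregation step described above can be sketched as follows. Note that the list-of-polarities representation is a hypothetical simplification for illustration, not the actual NoReC_fine annotation format:

```python
def to_binary_label(polarities):
    """Collapse the polarities of all sentiment expressions in a sentence into
    one binary label; return None for sentences with no or conflicting sentiment."""
    kinds = set(polarities)
    if kinds == {"Positive"}:
        return "positive"
    if kinds == {"Negative"}:
        return "negative"
    return None  # empty (no sentiment) or mixed (conflicting): dropped

sentences = [
    ["Positive", "Positive"],  # kept as positive
    ["Negative"],              # kept as negative
    ["Positive", "Negative"],  # conflicting, dropped
    [],                        # no sentiment, dropped
]
labels = [lab for s in sentences if (lab := to_binary_label(s)) is not None]
print(labels)  # → ['positive', 'negative']
```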
&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Data !! Train !! Dev !! Test&lt;br /&gt;
|-&lt;br /&gt;
| POS || 15,696 || 2,409 || 1,939&lt;br /&gt;
|-&lt;br /&gt;
| Sentiment || 2,675|| 516 || 417&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
We fine-tune NorBERT and mBERT for 30 epochs and keep the best model on the dev set. NorBERT outperforms mBERT on both tasks: on POS by 0.6 percentage points, and by 15.6 on sentiment.&lt;br /&gt;
&lt;br /&gt;
{| class=&amp;quot;wikitable&amp;quot;&lt;br /&gt;
|-&lt;br /&gt;
! Model/task !! mBERT !! NorBERT&lt;br /&gt;
|-&lt;br /&gt;
| Part-of-Speech tagging || 97.9 || '''98.5'''&lt;br /&gt;
|-&lt;br /&gt;
| Sentence-level binary sentiment classification || 66.7 || '''82.3'''&lt;br /&gt;
|}&lt;br /&gt;
&lt;br /&gt;
==Training Corpus==&lt;br /&gt;
We use clean training corpora with ordered sentences:&lt;br /&gt;
&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-4/ Norsk Aviskorpus] (NAK); 1.7 billion words;&lt;br /&gt;
*[https://dumps.wikimedia.org/nowiki/latest/ Bokmål Wikipedia]; 160 million words;&lt;br /&gt;
*[https://dumps.wikimedia.org/nnwiki/latest/ Nynorsk Wikipedia]; 40 million words;&lt;br /&gt;
&lt;br /&gt;
In total, this comprises about two billion (1 907 072 909) word tokens in 203 million (202 802 665) sentences, both in Bokmål and in Nynorsk; thus, this is a ''joint'' model. In the future, separate Bokmål and Nynorsk models are planned as well.&lt;br /&gt;
&lt;br /&gt;
==Preprocessing ==&lt;br /&gt;
1. Wikipedia texts were extracted using [https://github.com/RaRe-Technologies/gensim/blob/master/gensim/scripts/segment_wiki.py segment_wiki].&lt;br /&gt;
&lt;br /&gt;
2. In NAK, the texts from years up to 2005 are in a one-token-per-line format, with special delimiters signaling the beginning of a new document and providing its URL. We converted this to running text using a [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/detokenize.py custom de-tokenizer].&lt;br /&gt;
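The core idea of such a de-tokenizer can be sketched as follows; the linked script handles many more punctuation and quoting cases than this simplified version:

```python
import re

def detokenize(tokens):
    """Join one-token-per-line text back into running text, reattaching
    common punctuation to the preceding word (simplified sketch)."""
    text = " ".join(tokens)
    return re.sub(r" ([.,:;!?])", r"\1", text)  # drop the space before punctuation

print(detokenize(["Dette", "er", "en", "setning", "."]))  # → Dette er en setning.
```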
&lt;br /&gt;
3. In NAK, everything up to and including 2011 is in the ISO 8859-1 encoding ('Latin-1'). These files were [https://github.com/ltgoslo/NorBERT/blob/main/preprocessing/recode.sh converted] to UTF-8 before any other pre-processing.&lt;br /&gt;
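The conversion itself is a decode/encode round-trip; a minimal Python equivalent of what the linked shell script does could look like this:

```python
def latin1_to_utf8(data):
    """Re-encode raw bytes from ISO 8859-1 ('Latin-1') to UTF-8."""
    return data.decode("latin-1").encode("utf-8")

# 0xE5 is 'å' in Latin-1; the same character takes two bytes in UTF-8:
print(latin1_to_utf8(b"\xe5r"))  # → b'\xc3\xa5r', i.e. 'år'
```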
&lt;br /&gt;
4. The resulting corpus was sentence-segmented using [https://stanfordnlp.github.io/stanza/performance.html Stanza]. We left blank lines between documents (and sections in the case of Wikipedia) so that the &amp;quot;next sentence prediction&amp;quot; task does not cross document boundaries.&lt;br /&gt;
&lt;br /&gt;
==Vocabulary==&lt;br /&gt;
The vocabulary for the model is of size 30 000 and contains ''cased entries with diacritics''. It is generated from raw text, without, e.g., separating punctuation from word tokens. This means one can feed raw text into NorBERT.&lt;br /&gt;
&lt;br /&gt;
The vocabulary was generated using the SentencePiece algorithm and Tokenizers library ([https://github.com/ltgoslo/NorBERT/blob/main/tokenization/spiece_tokenizer.py code]). The resulting [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_sentencepiece_vocab_30k.json Tokenizers model] was [https://github.com/ltgoslo/NorBERT/blob/main/tokenization/sent2wordpiece.py converted] to the standard [https://github.com/ltgoslo/NorBERT/blob/main/vocabulary/norwegian_wordpiece_vocab_30k.txt BERT WordPiece format].&lt;br /&gt;
&lt;br /&gt;
=NorBERT Model Details=&lt;br /&gt;
&lt;br /&gt;
==Configuration==&lt;br /&gt;
NorBERT's configuration corresponds to Google's BERT-Base Cased model for English, with 12 layers and a hidden size of 768. [https://github.com/ltgoslo/NorBERT/blob/main/norbert_config.json Configuration file]&lt;br /&gt;
&lt;br /&gt;
==Training Overview==&lt;br /&gt;
NorBERT was trained on the Norwegian academic HPC system called [https://documentation.sigma2.no/hpc_machines/saga.html Saga]. Most of the time the training was distributed across 4 compute nodes and 16 NVIDIA P100 GPUs. Training took approximately 3 weeks. [http://wiki.nlpl.eu/index.php/Eosc/pretraining/nvidia Instructions for reproducing the training setup with EasyBuild]&lt;br /&gt;
&lt;br /&gt;
==Training Code==&lt;br /&gt;
Similar to the creators of [https://github.com/TurkuNLP/FinBERT FinBERT], we employed the [https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT BERT implementation by NVIDIA] (version 20.06.08), which allows relatively fast multi-node and multi-GPU training.&lt;br /&gt;
&lt;br /&gt;
We made minor changes to this code, mostly to update it to the newer TensorFlow versions ([https://github.com/ltgoslo/NorBERT/tree/main/patches_for_NVIDIA_BERT our patches]).&lt;br /&gt;
&lt;br /&gt;
All the utilities we used for preprocessing and training are published in [https://github.com/ltgoslo/NorBERT our GitHub repository].&lt;br /&gt;
&lt;br /&gt;
==Training Workflow==&lt;br /&gt;
Phase 1 (training with a maximum sequence length of 128) used a per-GPU batch size of 48, i.e. a global batch size of 48*16=768. Since one global batch contains 768 sentences, approximately 265 000 training steps constitute one epoch (one pass over the whole corpus). We trained for 3 epochs: 795 000 training steps.&lt;br /&gt;
&lt;br /&gt;
Phase 2 (training with a maximum sequence length of 512) used a per-GPU batch size of 8, i.e. a global batch size of 8*16=128. We aimed to mimic the original BERT setup, in which the model sees about 1/9 as many sentences in Phase 2 as in Phase 1. Thus, we needed about 68 million sentences, which at a global batch size of 128 amounts to about 531 000 additional training steps.&lt;br /&gt;
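As a sanity check, the step counts above follow directly from the corpus size and the batch sizes (the small discrepancies come from rounding the sentence counts):

```python
sentences = 202_802_665        # sentences in the training corpus (see above)

# Phase 1: per-GPU batch 48 on 16 GPUs, so a global batch of 768 sentences.
phase1_batch = 48 * 16
steps_per_epoch = sentences / phase1_batch
print(round(steps_per_epoch))  # 264066, i.e. approximately 265 000 per epoch

# Phase 2: per-GPU batch 8 on 16 GPUs, seeing 1/9 of the Phase 1 sentences.
phase2_batch = 8 * 16
phase2_steps = 3 * sentences / 9 / phase2_batch
print(round(phase2_steps))     # 528132; the quoted 531 000 comes from rounding
                               # the Phase 2 sentence count up to 68 million
```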
&lt;br /&gt;
Full logs and loss plots can be found [https://github.com/ltgoslo/NorBERT/tree/main/logs here] (training was paused on December 25 and 26 while we resolved problems with mixed-precision training).&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Vectors/norlm/norelmo&amp;diff=1280</id>
		<title>Vectors/norlm/norelmo</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Vectors/norlm/norelmo&amp;diff=1280"/>
		<updated>2021-01-13T11:59:17Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;''[http://norlm.nlpl.eu Back to NorLM]''&lt;br /&gt;
&lt;br /&gt;
=NorELMo: Embeddings from Language Models for Norwegian=&lt;br /&gt;
&lt;br /&gt;
'''NorELMo''' is a set of bidirectional recurrent ELMo language models trained from scratch on Norwegian Wikipedia as part of the ongoing [http://norlm.nlpl.eu NorLM initiative]. &lt;br /&gt;
ELMo was the first contextualized architecture to become widely known in the NLP community; the paper describing it ([https://www.aclweb.org/anthology/N18-1202/ Peters et al. 2018]) received the Best Paper award at NAACL 2018.  &lt;br /&gt;
&lt;br /&gt;
NorELMo models can be used to achieve state-of-the-art results on various Norwegian natural language processing tasks. In many cases, they may be a viable alternative to [[Vectors/norlm/norbert|NorBERT]], especially if computational resources are scarce.&lt;br /&gt;
&lt;br /&gt;
Download from the [http://vectors.nlpl.eu/repository/ NLPL Vector repository]:&lt;br /&gt;
&lt;br /&gt;
- ID 210: trained on ''lemmatized'' Norwegian Wikipedia Dump of September 2020 ([http://vectors.nlpl.eu/repository/20/210.zip download])&lt;br /&gt;
&lt;br /&gt;
- ID 211: trained on ''tokenized'' Norwegian Wikipedia Dump of September 2020 ([http://vectors.nlpl.eu/repository/20/211.zip download])&lt;br /&gt;
&lt;br /&gt;
==Training Corpus==&lt;br /&gt;
*[https://dumps.wikimedia.org/nowiki/latest/ Norwegian Bokmål Wikipedia] dump from September 2020; 160 million words;&lt;br /&gt;
&lt;br /&gt;
==Preprocessing==&lt;br /&gt;
Both models were trained on the corpus tokenized using [https://ufal.mff.cuni.cz/udpipe UDPipe]. The '''lemmatized''' model was trained on a version of the corpus where raw word forms were replaced with their lemmas ('kontorer' --&amp;gt; 'kontor'). Depending on the task, either model may perform better.&lt;br /&gt;
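The difference between the two corpus versions can be illustrated with a toy lemma table; the real lemmas come from UDPipe, and the dictionary below is a hypothetical stand-in:

```python
LEMMAS = {"kontorer": "kontor", "husene": "hus"}  # toy lemma table, illustrative only

def lemmatize(tokens):
    """Replace each raw word form with its lemma, keeping unknown forms as-is."""
    return [LEMMAS.get(token.lower(), token) for token in tokens]

print(lemmatize(["kontorer", "og", "husene"]))  # → ['kontor', 'og', 'hus']
```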
&lt;br /&gt;
==Vocabulary==&lt;br /&gt;
Both models were trained with vocabularies comprising 100 000 most frequent words in the corresponding training corpus. The vocabularies are published together with the models in the archives linked above.&lt;br /&gt;
&lt;br /&gt;
==Training workflow==&lt;br /&gt;
Each model was trained for 3 epochs with batch size 192. We employed a [https://github.com/ltgoslo/simple_elmo_training version of the original ELMo training code updated to work with recent TensorFlow versions]. All hyperparameters were left at their default values, except that the LSTM dimensionality was reduced from the default 4096 to 2048.&lt;br /&gt;
&lt;br /&gt;
==Usage==&lt;br /&gt;
The NorELMo models are published in two formats: &lt;br /&gt;
&lt;br /&gt;
1. TensorFlow checkpoints&lt;br /&gt;
&lt;br /&gt;
2. HDF5 model files&lt;br /&gt;
&lt;br /&gt;
We recommend using our [https://pypi.org/project/simple-elmo/ simple-elmo] Python library to work with the NorELMo models.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1263</id>
		<title>Vectors/norlm</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1263"/>
		<updated>2021-01-13T08:05:02Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Norwegian Large-scale Language Models =&lt;br /&gt;
&lt;br /&gt;
[[File:norbert.png|thumb|right|150px]]&lt;br /&gt;
Welcome to the emerging collection of large-scale contextualized&lt;br /&gt;
language models for the Norwegian language.&lt;br /&gt;
NorLM is a joint initiative of the projects &lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] (European Open Science Cloud) and&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT]&lt;br /&gt;
(Sentiment Analysis for Norwegian), &lt;br /&gt;
coordinated by the&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/groups/ltg/ Language Technology Group] (LTG)&lt;br /&gt;
at the University of Oslo.&lt;br /&gt;
&lt;br /&gt;
We are working to provide these models and supporting tools for researchers and developers in Natural&lt;br /&gt;
Language Processing (NLP) for the Norwegian language.&lt;br /&gt;
We do so in the hope of facilitating scientific experimentation with and practical applications of state-of-the-art&lt;br /&gt;
NLP architectures, as well as to enable others to develop their own large-scale models, for example for&lt;br /&gt;
domain- or application-specific tasks, language variants, or even other languages than Norwegian.&lt;br /&gt;
&lt;br /&gt;
= Available Models =&lt;br /&gt;
&lt;br /&gt;
At this initial stage of development, Norwegian models for two common architecture variants are available:&lt;br /&gt;
&lt;br /&gt;
* [[Vectors/norlm/norelmo|NorELMo: LSTM-Based Architectures]]&lt;br /&gt;
* [[Vectors/norlm/norbert|NorBERT: Transformer-Based Architectures]]&lt;br /&gt;
&lt;br /&gt;
We emphatically welcome all kinds of user feedback, including of course suggestions for improvement&lt;br /&gt;
or suggestions for additional types of Norwegian contextualized language models or associated tools.&lt;br /&gt;
Please contact us via the NorLM technical coordinator,&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/people/aca/andreku/ Andrey Kutuzov].&lt;br /&gt;
= License and Access =&lt;br /&gt;
&lt;br /&gt;
All Norwegian language models from the NorLM initiative are&lt;br /&gt;
publicly available for download from the&lt;br /&gt;
[http://vectors.nlpl.eu/repository NLPL Vectors Repository], with a [https://creativecommons.org/licenses/by/4.0/ CC BY 4.0 license].&lt;br /&gt;
The NorBERT model is also available through the&lt;br /&gt;
[https://huggingface.co/transformers/ Hugging Face Transformers library].&lt;br /&gt;
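In practice, loading NorBERT via the Transformers library looks roughly like the minimal sketch below; note that the model identifier "ltgoslo/norbert" and the example sentence are assumptions, so please check the model hub for the exact name.&lt;br /&gt;

```python
# Minimal sketch (not an official example): load NorBERT through the
# Hugging Face Transformers library and embed one Norwegian sentence.
# Assumes `pip install transformers torch`, and that the model is
# published under the (assumed) identifier "ltgoslo/norbert".
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ltgoslo/norbert")
model = AutoModel.from_pretrained("ltgoslo/norbert")

inputs = tokenizer("Nordisk språkteknologi er spennende.", return_tensors="pt")
outputs = model(**inputs)
# outputs.last_hidden_state has shape (batch, tokens, hidden_size)
print(outputs.last_hidden_state.shape)
```
&lt;br /&gt;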
&lt;br /&gt;
To receive announcements of updates and availability of additional&lt;br /&gt;
models, please self-subscribe to our very low-traffic NorLM&lt;br /&gt;
[http://lists.nlpl.eu/mailman/listinfo/norlm mailing list].&lt;br /&gt;
&lt;br /&gt;
= Acknowledgements =&lt;br /&gt;
&lt;br /&gt;
The NorLM resources are being developed on the Norwegian national supercomputing services operated by&lt;br /&gt;
[https://www.sigma2.no/ UNINETT Sigma2], the National Infrastructure for High Performance Computing and Data Storage in Norway.&lt;br /&gt;
Software provisioning was financially supported through the European&lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] project; data preparation and evaluation&lt;br /&gt;
were supported by the Norwegian&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT] project.&lt;br /&gt;
We are indebted to all funding agencies involved, the University of Oslo, and the&lt;br /&gt;
Norwegian taxpayer.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1254</id>
		<title>Vectors/norlm</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1254"/>
		<updated>2021-01-12T21:08:13Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Norwegian Large-scale Language Models =&lt;br /&gt;
&lt;br /&gt;
[[File:norbert.png|thumb|right|150px]]&lt;br /&gt;
Welcome to the emerging collection of large-scale contextualized&lt;br /&gt;
language models for the Norwegian language.&lt;br /&gt;
NorLM is a joint initiative of the projects &lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] (European Open Science Cloud) and&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT]&lt;br /&gt;
(Sentiment Analysis for Norwegian), &lt;br /&gt;
coordinated by the&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/groups/ltg/ Language Technology Group] (LTG)&lt;br /&gt;
at the University of Oslo.&lt;br /&gt;
&lt;br /&gt;
We are working to provide these models and supporting tools for researchers and developers in Natural&lt;br /&gt;
Language Processing (NLP) for the Norwegian language.&lt;br /&gt;
We do so in the hope of facilitating scientific experimentation with and practical applications of state-of-the-art&lt;br /&gt;
NLP architectures, as well as to enable others to develop their own large-scale models, for example for&lt;br /&gt;
domain- or application-specific tasks, language variants, or even languages other than Norwegian.&lt;br /&gt;
&lt;br /&gt;
= Available Models =&lt;br /&gt;
&lt;br /&gt;
At this initial stage of development, Norwegian models for two common architecture variants are available:&lt;br /&gt;
&lt;br /&gt;
* [[Vectors/norlm/elmo|NorELMo: LSTM-Based Architectures]]&lt;br /&gt;
* [[Vectors/norlm/norbert|NorBERT: Transformer-Based Architectures]]&lt;br /&gt;
&lt;br /&gt;
We warmly welcome all user feedback, including suggestions for improvements&lt;br /&gt;
or for additional types of Norwegian contextualized language models or associated tools.&lt;br /&gt;
Please contact us via the NorLM technical coordinator,&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/people/aca/andreku/ Andrey Kutuzov].&lt;br /&gt;
&lt;br /&gt;
= License and Access =&lt;br /&gt;
&lt;br /&gt;
All Norwegian language models from the NorLM initiative are&lt;br /&gt;
publicly available for download from the&lt;br /&gt;
[http://vectors.nlpl.eu/repository NLPL Vectors Repository];&lt;br /&gt;
a subset of the models is also available through the&lt;br /&gt;
[https://huggingface.co/transformers/ Hugging Face Transformers library].&lt;br /&gt;
&lt;br /&gt;
To receive announcements of updates and availability of additional&lt;br /&gt;
models, please self-subscribe to our very low-traffic NorLM&lt;br /&gt;
[http://lists.nlpl.eu/mailman/listinfo/norlm mailing list].&lt;br /&gt;
&lt;br /&gt;
= Acknowledgements =&lt;br /&gt;
&lt;br /&gt;
The NorLM resources are being developed on the Norwegian national supercomputing services operated by&lt;br /&gt;
[https://www.sigma2.no/ UNINETT Sigma2], the National Infrastructure for High Performance Computing and Data Storage in Norway.&lt;br /&gt;
Software provisioning was financially supported through the European&lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] project; data preparation and evaluation&lt;br /&gt;
were supported by the Norwegian&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT] project.&lt;br /&gt;
We are indebted to all funding agencies involved, the University of Oslo, and the&lt;br /&gt;
Norwegian taxpayer.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1253</id>
		<title>Vectors/norlm</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Vectors/norlm&amp;diff=1253"/>
		<updated>2021-01-12T20:56:15Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Norwegian Large-scale Language Models =&lt;br /&gt;
&lt;br /&gt;
[[File:norbert.png|thumb|right|150px]]&lt;br /&gt;
Welcome to the emerging collection of large-scale contextualized&lt;br /&gt;
language models for the Norwegian language.&lt;br /&gt;
NorLM is a joint initiative of the&lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] (European Open Science Cloud) and&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT]&lt;br /&gt;
(Sentiment Analysis for Norwegian) projects,&lt;br /&gt;
coordinated by the&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/groups/ltg/ Language Technology Group] (LTG)&lt;br /&gt;
at the University of Oslo.&lt;br /&gt;
&lt;br /&gt;
We are working to provide these models and supporting tools for researchers and developers in Natural&lt;br /&gt;
Language Processing (NLP) for the Norwegian language.&lt;br /&gt;
We do so in the hope of facilitating scientific experimentation with and practical applications of state-of-the-art&lt;br /&gt;
NLP architectures, as well as to enable others to develop their own large-scale models, for example for&lt;br /&gt;
domain- or application-specific tasks, language variants, or even languages other than Norwegian.&lt;br /&gt;
&lt;br /&gt;
= Available Models =&lt;br /&gt;
&lt;br /&gt;
At this initial stage of development, Norwegian models for two common architecture variants are available:&lt;br /&gt;
&lt;br /&gt;
* [[Vectors/norlm/elmo|NorELMo: LSTM-Based Architectures]]&lt;br /&gt;
* [[Vectors/norlm/norbert|NorBERT: Transformer-Based Architectures]]&lt;br /&gt;
&lt;br /&gt;
We warmly welcome all user feedback, including suggestions for improvements&lt;br /&gt;
or for additional types of Norwegian contextualized language models or associated tools.&lt;br /&gt;
Please contact us via the NorLM technical coordinator,&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/people/aca/andreku/ Andrey Kutuzov].&lt;br /&gt;
&lt;br /&gt;
= License and Access =&lt;br /&gt;
&lt;br /&gt;
All Norwegian language models from the NorLM initiative are&lt;br /&gt;
publicly available for download from the&lt;br /&gt;
[http://vectors.nlpl.eu/repository NLPL Vectors Repository];&lt;br /&gt;
a subset of the models is also available through the&lt;br /&gt;
[https://huggingface.co/transformers/ Hugging Face Transformers library].&lt;br /&gt;
&lt;br /&gt;
To receive announcements of updates and availability of additional&lt;br /&gt;
models, please self-subscribe to our very low-traffic NorLM&lt;br /&gt;
[http://lists.nlpl.eu/mailman/listinfo/norlm mailing list].&lt;br /&gt;
&lt;br /&gt;
= Acknowledgements =&lt;br /&gt;
&lt;br /&gt;
The NorLM resources are being developed on the Norwegian national supercomputing services operated by&lt;br /&gt;
[https://www.sigma2.no/ UNINETT Sigma2], the National Infrastructure for High Performance Computing and Data Storage in Norway.&lt;br /&gt;
Software provisioning was financially supported through the European&lt;br /&gt;
[https://www.eosc-nordic.eu/ EOSC-Nordic] project; data preparation and evaluation&lt;br /&gt;
were supported by the Norwegian&lt;br /&gt;
[https://www.mn.uio.no/ifi/english/research/projects/sant/index.html SANT] project.&lt;br /&gt;
We are indebted to all funding agencies involved, the University of Oslo, and the&lt;br /&gt;
Norwegian taxpayer.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1173</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1173"/>
		<updated>2020-12-03T20:50:16Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing setups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to reuse. For the first task, though (document-level SA on NoReC), Jeremy has an existing setup using mBERT that we could perhaps reuse.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; for document-level sentiment analysis (i.e. rating prediction). Note that we would want to use a newer version than the current official release; it has 10k more sentences (and is soon to be officially released).&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; subset of documents from NoReC annotated with fine-grained sentiment (e.g. for predicting target expression + polarity)&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/ NDT]; for dependency parsing or PoS tagging (perhaps best to use the UD version)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;br /&gt;
*NoReC_neg; soon to be released; adds negation cues and scopes to the same subset of sentences as in NoReC_fine.&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1172</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1172"/>
		<updated>2020-12-03T20:45:19Z</updated>

		<summary type="html">&lt;p&gt;Erikve: /* Emerging Thoughts on Benchmarking */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, while we do have baseline numbers to compare against, we do not have existing setups where we could simply plug in a Norwegian BERT and run, so we may need to identify suitable code for existing BERT-based architectures (e.g. for English) to reuse. For the first task, though (document-level SA on NoReC), Jeremy has an existing setup using mBERT that we could perhaps reuse.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; for document-level sentiment analysis (i.e. rating prediction). &lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; for fine-grained sentiment analysis (e.g. predicting target expression + polarity)&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/ NDT]; for dependency parsing or PoS tagging (perhaps best to use the UD version)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1171</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1171"/>
		<updated>2020-12-03T18:06:15Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
The following would be natural places to start. For most of these, one would need to find suitable code for existing BERT-based architectures (e.g. for English). For the first, though (document-level SA on NoReC), Jeremy has an existing setup using mBERT.&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; for document-level sentiment analysis (i.e. rating prediction). &lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; for fine-grained sentiment analysis (e.g. predicting target expression + polarity)&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/ NDT]; for dependency parsing or PoS tagging (perhaps best to use the UD version)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1170</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1170"/>
		<updated>2020-12-03T18:03:51Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
These would be natural places to start:&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; for document-level sentiment analysis (i.e. rating prediction)&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; for fine-grained sentiment analysis (e.g. predicting target expression + polarity)&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/ NDT]; for dependency parsing or PoS tagging (perhaps best to use the UD version)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1169</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1169"/>
		<updated>2020-12-03T18:03:33Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
These would be natural places to start:&lt;br /&gt;
&lt;br /&gt;
*[https://github.com/ltgoslo/norec NoReC]; for document-level sentiment analysis (i.e. rating prediction)&lt;br /&gt;
*[https://github.com/ltgoslo/norec_fine NoReC_fine]; for fine-grained sentiment analysis (e.g. predicting target expression + polarity)&lt;br /&gt;
*[https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/ NDT]; for dependency parsing or PoS tagging (perhaps best to use the UD version)&lt;br /&gt;
*[https://github.com/ltgoslo/norne NorNE]; for named entity recognition, extends NDT (also available for the UD version)&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1168</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1168"/>
		<updated>2020-12-03T18:00:33Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
These would be natural places to start:&lt;br /&gt;
&lt;br /&gt;
NoReC, for document-level sentiment analysis (i.e. rating prediction): &lt;br /&gt;
https://github.com/ltgoslo/norec&lt;br /&gt;
&lt;br /&gt;
NoReC_fine, for fine-grained sentiment analysis (e.g. predicting target expression + polarity): &lt;br /&gt;
https://github.com/ltgoslo/norec_fine&lt;br /&gt;
&lt;br /&gt;
NDT, for dependency parsing or PoS tagging (perhaps best to use the UD version):  &lt;br /&gt;
https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/&lt;br /&gt;
&lt;br /&gt;
NorNE, for named entity recognition, extends NDT (also available for the UD version):&lt;br /&gt;
https://github.com/ltgoslo/norne&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1167</id>
		<title>Eosc/norbert/benchmark</title>
		<link rel="alternate" type="text/html" href="https://wiki.nlpl.eu/index.php?title=Eosc/norbert/benchmark&amp;diff=1167"/>
		<updated>2020-12-03T18:00:05Z</updated>

		<summary type="html">&lt;p&gt;Erikve: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Emerging Thoughts on Benchmarking =&lt;br /&gt;
&lt;br /&gt;
These would be natural places to start:&lt;br /&gt;
&lt;br /&gt;
NoReC, for document-level sentiment analysis (i.e. rating prediction)&lt;br /&gt;
https://github.com/ltgoslo/norec&lt;br /&gt;
&lt;br /&gt;
NoReC_fine, for fine-grained sentiment analysis (e.g. predicting target expression + polarity)&lt;br /&gt;
https://github.com/ltgoslo/norec_fine&lt;br /&gt;
&lt;br /&gt;
NDT, for dependency parsing or PoS tagging (perhaps best to use the UD version) &lt;br /&gt;
https://www.nb.no/sprakbanken/ressurskatalog/oai-nb-no-sbr-10/&lt;br /&gt;
&lt;br /&gt;
NorNE, for named entity recognition, extends NDT (also available for the UD version)&lt;br /&gt;
https://github.com/ltgoslo/norne&lt;/div&gt;</summary>
		<author><name>Erikve</name></author>
		
	</entry>
</feed>