Difference between revisions of "Community/training"

From Nordic Language Processing Laboratory
(Programme)
(Schedule)
 
(326 intermediate revisions by 3 users not shown)
Line 1: Line 1:
[[File:skeikampen.2020.png|center]]
+
= '''Circle U, NLPL, & OpenEuroLLM 2026 Winter School on Multilinguality in LLM Development and Evaluation''' =
 +
 
 +
[[File:Winter school 2025.jpg|center|thumb|upright=2.0]]
  
 
= Background =
 
= Background =
  
A desirable side-effect of the NLPL cooperation is ''community formation'',
+
In 2026, the NLPL network and the Digital Europe
i.e. strengthening interaction and collaboration among Nordic research teams
+
project ''[https://openeurollm.eu OpenEuroLLM]''
in NLP and advancing a shared level of knowledge and experience in using
+
have joined forces to organize the successful winter school series on Web-scale NLP.
national e-Infrastructures for large-scale NLP research.
+
The winter school seeks to stimulate ''community formation'',
Towards these goals, the project organizes an annual three-day winter school.
+
i.e. strengthening interaction and collaboration among
 +
European research teams in NLP and advancing a shared level of knowledge
 +
and experience in using high-performance e-infrastructures for large-scale
 +
NLP research.
 +
This 2026 edition of the winter school puts special emphasis on
 +
NLP researchers from countries that participate in the EuroHPC
 +
[https://www.eurohpc-ju.europa.eu/supercomputers/our-supercomputers_en consortium]
 +
and is endorsed as a doctoral training event in the European
 +
[https://www.circle-u.eu Circle U university alliance].
 
For additional background, please see the archival pages from the
 
For additional background, please see the archival pages from the
[http://wiki.nlpl.eu/index.php/Community/training/2018 2018] and
+
[https://wiki.nlpl.eu/index.php/Community/training/2018 2018],
[http://wiki.nlpl.eu/index.php/Community/training/2019 2019]
+
[https://wiki.nlpl.eu/index.php/Community/training/2019 2019],
NLPL Winter Schools].
+
[https://wiki.nlpl.eu/index.php/Community/training/2020 2020],
 +
[https://wiki.nlpl.eu/index.php/Community/training/2023 2023],
 +
[https://wiki.nlpl.eu/index.php/Community/training/2024 2024], and
 +
[https://wiki.nlpl.eu/index.php/Community/training/2025 2025]
 +
NLPL Winter Schools.
  
For early 2020, NLPL will hold its winter school from Sunday, February 2, to
+
In early 2026, NLPL will hold its winter school from Monday, February 2, to
Tuesday, February 4, 2020, at a
+
Wednesday, February 4, 2026, at a
 
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel]
 
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel]
(with skiing opportunities) about two hours north of Oslo.
+
(with skiing and walking opportunities) about two hours north of Oslo.
The project will organize group bus transfer from and to the Oslo
+
The project will organize group bus transfer from and to the main Oslo
airport ''Gardermoen'', leaving the airport at 9:30 on Sunday morning
+
airport ''Gardermoen'' (OSL), leaving the airport at 9:45 on Monday morning
and returning there around 17:30 on Tuesday afternoon.
+
and returning there around 17:30 on Wednesday afternoon.
 
 
The main external instructors in 2020 will be
 
[https://u.cs.biu.ac.il/~yogo/ Yoav Goldberg] (Bar Ilan University and Allen Institute for AI)
 
and [https://thomwolf.io/ Thomas Wolf] (Huggingface).
 
Additional sessions will be contributed by NLPL project members, including
 
* Filip Ginter and Antti Virtanen, on multi-gpu training of language-specific BERTs;
 
* Joakim Nivre and Artur Kulmizev, on syntactic dependency parsing in the neural age;
 
* Stephan Oepen and Daniel Hershcovich, on the 2019 and 2020 CoNLL tasks on semantic parsing;
 
* Jörg Tiedemann and Alessandro Raganato, with a practical crash course in neural MT.
 
Some sessions will combine lecturing and hands-on exercises.
 
The winter school programme will be complemented with an
 
evening ‘research bazar’ (by participants) to stimulate academic socializing
 
and possibly a ‘walk-through’ of available software, data, and service resources
 
in the NLPL Virtual Laboratory.
 
  
The winter school is subsidized by the project: there is no fee for
+
The winter school is subsidized by the OpenEuroLLM project: there is no fee for
 
participants and no charge for the bus transfer to and from the
 
participants and no charge for the bus transfer to and from the
 
conference hotel.
 
conference hotel.
All participants will have to cover their own travel and accomodation
+
All participants will have to cover their own travel and accommodation
 
at Skeikampen, however.
 
at Skeikampen, however.
Two nights at the hotel, including all meals, will come to NOK 2865,  
+
Two nights at the hotel, including all meals, will come to NOK 3885 (NOK 3485 per person in a shared double room),  
to be paid to the hotel directly.
+
to be paid to the hotel directly upon arrival.
  
= Logistics =  
+
= Programme =
  
With a few exceptions, winter school participants travel to and from the conference hotel
+
The 2026 winter school has a thematic focus on ''Multilinguality in LLM Development and Evaluation''.
jointly on a chartered bus (the NLPL shuttle).
+
The programme comprises in-depth technical presentations (possibly including some
The bus shall leave OSL airport no later than 9:30 CET on Sunday, February 2.
+
hands-on elements) by international experts, with special emphasis on open science and European languages, but also includes critical reflections on current development trends in LLM-focused NLP.
Thus, please meet up at 9:15 and make your arrival known to your assigned
+
The programme will be complemented with a ‘walk-through’ of example EuroHPC experience
‘tour guide’ (who will introduce themselves to you by email beforehand).
+
reports from the OpenEuroLLM consortium and with reflections about current LLM-oriented activities of the National Library of Norway.  
The group will gather near the bus and taxi information booth in the downstairs
 
arrivals area, just outside the international arrivals luggage claims and slightly
 
to the right, as one exits the customs area:
 
The yellow dot numbered (17) on the
 
[https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map].
 
The group will then walk over to the bus terminal, to leave the airport by 9:30.
 
The drive to the Skeikampen conference hotel will take us about three hours, and the bus
 
will make one stop along the way to stretch our legs and fill up on coffee.
 
  
 +
Confirmed presenters and talks include:
  
The winter school will end with lunch on Tuesday, February 4, before the group returns
+
* [https://bplank.github.io Barbara Plank], Ludwig Maximilian University of Munich
to OSL on the NLPL shuttle.
+
* [https://commoncrawl.org/team/laurie-burchell Laurie Burchell] and [https://commoncrawl.org/team/pedro-ortiz-suarez Pedro Ortiz Suarez], Common Crawl
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL airport
+
* [https://www.linkedin.com/in/maximilianidahl/?originalSubdomain=de Max Idahl], ellamind
around 17:00 to 17:30 CET.
+
* [https://juliakreutzer.github.io Julia Kreutzer], Cohere Labs
 
+
* [https://geoalgo.github.io/ David Salinas], ELLIS Institute Tübingen
= Programme =
+
* [https://www.isir.upmc.fr/personnel/yvon/?lang=en François Yvon], Sorbonne Université
 
 
  
 +
= Schedule =
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
!colspan=3|Sunday, February 2, 2020
+
!colspan=3|Monday, February 2, 2026
 
|-
 
|-
 
| 13:00 || 14:00 || Lunch
 
| 13:00 || 14:00 || Lunch
 
|-
 
|-
| 14:00 || 15:30 || '''Session 1''' Yoav Goldberg: [http://svn.nlpl.eu/outreach/skeikampen/2020/goldberg1.pdf ''Introduction to Neural Network Abstractions and Encoder–Decoder Architectures'']
+
| 14:00 || 15:30 || '''Session 1''' [http://svn.nlpl.eu/outreach/skeikampen/2026/ortiz-burchell.pdf Laurie Burchell and Pedro Ortiz Suarez: Multilinguality at Common Crawl]<p  class="mw-collapsible mw-collapsed">'''Improving Language Coverage for the Largest Open Web Corpus'''<br>The Common Crawl Foundation (CCF) provides the largest open corpus of web data, enabling a wide range of scientific and technical applications including large language model (LLM) development. However, our current data processing pipeline faces challenges when processing multilingual data, decreasing language representation and impacting downstream model performance. In this talk, we will discuss CCF’s initiatives to improve multilingual coverage and language identification of our web corpus. These efforts include soliciting crowd-sourced web seeds for under-served languages, running the First Workshop for Multilingual Data Quality Signals at COLM 2025, and creating CommonLID, a community-driven, human-annotated language identification benchmark for the web domain. Throughout, we emphasise the collaborative nature of our efforts, working in partnership with members of the NLP community to improve content available in their languages.</p>
 
|-
 
|-
 
| 15:30 || 15:50 || Coffee Break
 
| 15:30 || 15:50 || Coffee Break
 
|-
 
|-
| 15:50 || 17:20 || '''Session 2''' Jörg Tiedemann & Alessandro Raganato: ''A Practical Crash Course in Neural Machine Translation''
+
| 16:00 || 17:30 || '''Session 2''' [http://svn.nlpl.eu/outreach/skeikampen/2026/yvon1.pdf François Yvon: Evaluating Large LMs and their Multilingualism]<p  class="mw-collapsible mw-collapsed">Large Language Models introduced in recent years have been found extremely helpful in advancing the state of the art in many Natural Language Applications, notably due to their ability to compute numerical, high-dimensional representations of linguistic units such as words or sentences. Multilingual language models go one step further and add the ability to handle multiple languages, sometimes even multiple scripts, with a single model. In this presentation, I will discuss multilingual language models at length, with a focus on the evaluation of their multilingual abilities, which raises two difficult questions: (a) how to evaluate their performance as if they were just a collection of monolingual models; (b) how to evaluate their performance as integrated multilingual models, capable of bridging between languages. </p>
 
|-
 
|-
| 17:20 || 17:40 || Coffee Break
+
| 17:30 || 17:50 || Coffee Break
 
|-
 
|-
| 17:40 || 19:10 || '''Session 3''' Yoav Goldberg: [http://svn.nlpl.eu/outreach/skeikampen/2020/goldberg2.pdf ''Topics in Representation Learning'']
+
| 17:50 || 19:20 || '''Session 3''' [http://svn.nlpl.eu/outreach/skeikampen/2026/kreutzer1.pdf Julia Kreutzer: Evaluating Generations Multilingually] <p  class="mw-collapsible mw-collapsed">'''Current Challenges and Lessons from Machine Translation'''<br>In this session we will dive into the particular challenge of evaluating LLMs across many languages in generative tasks. We will take a look at the "sister field" of machine translation and inspect what principles have led to advances in understanding quality across languages. </p>
 
|-
 
|-
 
| 19:30 ||  || Dinner
 
| 19:30 ||  || Dinner
|-
 
| 21:00 || || '''Research Bazaar''' Everyone: Upstairs Bar
 
 
|}
 
|}
 
  
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
!colspan=3|Monday, February 3, 2020
+
!colspan=3|Tuesday, February 3, 2026
 
|-
 
|-
 
|colspan=3 | Breakfast is available from 07:30
 
|colspan=3 | Breakfast is available from 07:30
 
|-
 
|-
| 08:30 || 10:00 || '''Session 4''' Yoav Goldberg: [http://svn.nlpl.eu/outreach/skeikampen/2020/goldberg3.pdf ''Interpretability and Black-Box NLP'']
+
| 09:00 || 10:30 || '''Session 4''' [http://svn.nlpl.eu/outreach/skeikampen/2026/yvon2.pdf François Yvon: Text Generation: Know your Options!] <p  class="mw-collapsible mw-collapsed">Text generation, contextual or non-contextual, is ubiquitous in the current LLM era, as it serves as the most basic building block in multiple application contexts, from question answering and dialog systems to text summarization and machine translation, and many more. Generation is thus equally useful to compute deterministic and highly non-deterministic mappings with various levels of output constraints. Furthermore, text generation is also used as a sub-routine of more complex generation strategies, aiming to produce syntactically well-formed (e.g. for code generation) or semantically consistent outputs, possibly through multiple steps of generation (e.g., in chain-of-thought generation) or to collect diverse samples from the generating distribution. To cover this considerable diversity of uses, multiple text generation strategies have been proposed, some less well-known than others. In this talk I will review various families of generation algorithms, from the most basic ones to the more sophisticated approaches, so as to document, as far as possible, the options available to text generation users. The final part will survey some decoding issues that are specific to multilingual models. </p>
 
|-
 
|-
|colspan=3| Lunch is available between 13:00 and 14:30
+
|colspan=3| Free time (Lunch is available between 13:00 and 14:30)
 
|-
 
|-
| 15:00 || 16:20 || '''Session 5''' Thomas Wolf: [http://svn.nlpl.eu/outreach/skeikampen/2020/wolf1.pdf ''Transfer Learning 1'']
+
| 15:30 || 17:00 || '''Session 5''' [http://svn.nlpl.eu/outreach/skeikampen/2026/idahl.pdf Max Idahl: Multilingual Model-Based Quality Filtering for LLM Pretraining]<p  class="mw-collapsible mw-collapsed">Data quality is the highest-leverage factor for LLM performance, with recent work showing significant training efficiency gains through careful curation. This presentation traces the evolution from rule-based filtering to modern model-based approaches that now work across dozens of languages. We cover the progression from basic perplexity-based filters, to FastText and encoder-based scorers, to our newly released Propella models that annotate documents across 18 properties for 57 languages at scale. The talk includes practical insights into building multilingual filtering pipelines.</p>
 
|-
 
|-
| 16:20 || 16:40 || Coffee Break
+
| 17:00 || 17:20 || Coffee Break
 
|-
 
|-
| 16:40 || 18:00 || '''Session 6''' Filip Ginter, Antti Virtanen, & Andrey Kutuzov: ''Experiences and a Hands-On Tutorial on Training BERT and ELMo from Scratch in a Multi-GPU Setting''
+
| 17:20 || 19:20 || '''Session 6''' [http://svn.nlpl.eu/outreach/skeikampen/2026/salinas.pdf David Salinas: Challenges in Evaluating Generative Models]<p  class="mw-collapsible mw-collapsed">In this talk, we will discuss the evaluation of generative models, in particular Large Language Models (LLMs). Given that such models produce open-ended output, their evaluation requires different techniques than static evaluations such as simple question-answering benchmarks. We will first discuss human annotations and their use in leaderboards such as LMArena and ComparIA. We will then focus on automatic evaluation relying on LLM judges. In particular, we will describe current challenges with LLM judges before discussing their application in multilingual settings.</p>
 
|-
 
|-
| 18:00 || 18:10 || Coffee Break
+
| 19:30 || || Dinner
|-
 
| 18:10 || 19:30 || '''Session 7''' Thomas Wolf: ''Transfer Learning 2''
 
 
|-
 
|-
| 19:30 || || Dinner
+
| 21:00 || || '''Evening Session''':<br/>
 +
[http://svn.nlpl.eu/outreach/skeikampen/2026/nb.pdf Javier de la Rosa, Rolv-Arild Braaten, Marthe Midtgaard, Angelina Zanardi: National Library of Norway]<br/>
 +
[http://svn.nlpl.eu/outreach/skeikampen/2026/openeurollm.pdf Sampo Pyysalo, Max Idahl, David Salinas, Stephan Oepen, Shenbin Qian: OpenEuroLLM, MultiSynt]
 
|}
 
|}
  
Line 116: Line 106:
 
{| class="wikitable"
 
{| class="wikitable"
 
|-
 
|-
!colspan=3|Tuesday, February 4, 2020
+
!colspan=3|Wednesday, February 4, 2026
 
|-
 
|-
 
|colspan=3| Breakfast is available from 07:30
 
|colspan=3| Breakfast is available from 07:30
 
|-
 
|-
| 08:30 || 10:00 || '''Session 8''' <ul><li>Joakim Nivre & Artur Kulmizev: ''Syntactic Dependency Parsing''</li><li>Stephan Oepen & Daniel Hershcovich: ''Meaning Representation Parsing''</li></ul>
+
| 08:30 || 10:00 || '''Session 8''' [http://svn.nlpl.eu/outreach/skeikampen/2026/plank.pdf Barbara Plank: NLP Beyond the Standard]<p  class="mw-collapsible mw-collapsed">'''Dialects, Variation, and Shared Representations in Multilingual Language Models'''<br>Multilingual language models have primarily focused on cross-lingual differences, with intra-language variation only recently gaining more attention. Dialects and non-standard varieties challenge core assumptions about data, representation, and evaluation. In this talk, I discuss what makes dialects particularly challenging for multilingual models, review approaches starting from early encoder-based methods, and give an overview of resources developed for dialectal NLP, with a focus on German dialects. I then turn to recent work on multilingual training dynamics and shared representations, analyzing when linguistic information and shared concept spaces emerge during training and where alignment breaks down. Although dialects are not yet explicitly modeled in this analysis, the findings provide insight into multilingual representation learning during pre-training. </p>
 
|-
 
|-
 
| 10:00 || 10:30 || Coffee Break
 
| 10:00 || 10:30 || Coffee Break
 
|-
 
|-
| 10:30 || 12:00 || '''Session 9''' Thomas Wolf: ''Limitations of Transfer Learning (Tentative Title)''
+
| 10:30 || 12:00 || '''Session 9''' [http://svn.nlpl.eu/outreach/skeikampen/2026/kreutzer2.pdf Julia Kreutzer: Optimizing Data for Multilingual Post-Training] <p  class="mw-collapsible mw-collapsed">In this session we will look into techniques for augmenting data collections for better multilingual coverage. We will discuss the role of translation and inference settings, and explore methods for optimizing multilingual data both on the prompt and the generation side.</p>
 
|-
 
|-
 
| 12:30 || 13:30 || Lunch
 
| 12:30 || 13:30 || Lunch
 +
|-
 +
| 13:45 || 16:45 || Bus transfer to OSL Airport
 
|}
 
|}
  
 
= Registration =
 
= Registration =
  
In total, we anticipate around 45 participants in the 2020 Winter School.
+
In total, we expect 60–70 participants at the 2026 winter school.
Please register your intent of participation through our
+
Registration for interested participants is now closed.
[https://indico.neic.no/e/skeikampen20 on-line registration form].
+
Requests for participation were processed on a first-come, first-served basis, with an eye toward regional balance.
We will process requests for participation on a first-come, first-served
+
Interested parties who submitted the registration form were confirmed in three batches, on '''November 28''', on '''December 5''',
basis; the closing date for registration is Friday, December 13, 2019.
+
and on '''December 19''', which was also the closing date for winter school registration.
Once confirmed by the organizing team, registration will establish a
+
 
binding agreement with the hotel and a cancellation fee will be
+
Once confirmed by the organizing team, participant names are published
incurred (unless we can find someone else to ‘take over’ last-minute
+
on this page, and registration establishes a
spaces).
+
''binding agreement'' with the hotel.
 +
Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute
 +
spaces), and no-shows will be charged the full price for at least one night
 +
by the hotel.
 +
 
 +
= Logistics =
  
= Contact =
+
With a few exceptions, winter school participants travel to and from the conference hotel
 +
jointly on a chartered bus (the OpenEuroLLM shuttle).
 +
The bus will leave OSL airport no later than 9:45 CET on Monday, February 2.
 +
Thus, please meet up by 9:30 and make your arrival known to your assigned
 +
‘tour guide’ (who will introduce themselves to you by email beforehand).
  
The 2020 NLPL Winter School is organized by a team of volunteers,
+
The group will gather near the DNB currency exchange booth in the downstairs
Li-Hsin Chang,
+
arrivals area, just outside the international arrivals luggage claims and slightly
Filip Ginter,
+
to the left as one exits the customs area:
Bjørn Lindi,  
+
the yellow dot numbered (18) on the
Farrokh Mehryary,
+
[https://www.avinor.no/siteassets/flyplasser/oslo-lufthavn/info/kart-over-flyplassen/kart-over-flyplassen-ankomst-oslo-lufthavn-avinor.jpg OSL arrivals map].
Joakim Nivre,
+
The group will then walk over to the bus terminal, to leave the airport not long after 9:40.
Stephan Oepen, and
+
The drive to the Skeikampen conference hotel will take us about two to three hours, and the bus
Jörg Tiedemann.
+
will make one stop along the way to stretch our legs and fill up on coffee.
 +
 
 +
The winter school will end with lunch on Wednesday, February 4, before the group returns
 +
to OSL airport on the OpenEuroLLM shuttle.
 +
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL
 +
around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.
 +
 
 +
= Organization =
 +
 
 +
The 2026 Winter School is organized by a team of volunteers at the University
 +
of Oslo, supported by a programme committee from the OpenEuroLLM, Circle U, and
 +
NLPL networks and beyond; please see below.
 
For all inquiries regarding registration, the programme, logistics,
 
For all inquiries regarding registration, the programme, logistics,
or such, please contact <code>outreach@nlpl.eu</code>.
+
or such, please contact <code>nlpl-training@ifi.uio.no</code>.
 +
 
 +
The programme committee consists of (in alphabetical order):
 +
 
 +
* Jenia Jitsev (Forschungszentrum Jülich, Germany)
 +
* Andrey Kutuzov (University of Oslo, Norway)
 +
* Alessandro Lenci (University of Pisa, Italy)
 +
* Stephan Oepen (University of Oslo, Norway)
 +
* Sampo Pyysalo (University of Turku, Finland)
 +
* David Salinas (ELLIS Institute, Germany)
 +
* Gema Ramírez-Sánchez (Prompsit Language Engineering, Spain)
 +
* Jörg Tiedemann (University of Helsinki, Finland)
 +
* Joaquin Vanschoren (Eindhoven University of Technology, The Netherlands)
 +
* Guillaume Wisniewski (Paris Cité University, France)
  
 
= Participants =
 
= Participants =
 
+
# Adam Hrin, AMD Silo AI (Finland)
# Jordi Armengol-Estapé (Barcelona)
+
# Agnes Toftgård, The National Library (Sweden)
# Pepa Atanasova (Copenhagen)
+
# Alicia Núñez Alcover, Prompsit (Spain)
# Jeremy Barnes (Oslo)
+
# Anastasia Philipps, University of Oslo (Norway)
# Ali Basirat (Uppsala)
+
# Andrey Kutuzov, University of Oslo (Norway)
# Aleksandrs Berdicevskis (Gothenburg)
+
# Angelina Zanardi, National Library of Norway
# Maja Buljan (Oslo)
+
# Anni Moisala, CSC – IT Center for Science (Finland)
# Li-Hsin Chang (Turku, co-organizer)
+
# Artūrs Znotiņš, University of Latvia (Latvia)
# Manuel Ciosici (Copenhagen)
+
# Barbara Heinisch, Eurac Research (Italy)
# Cheikh Bamba Dione (Bergen)
+
# Barbara Plank, Ludwig-Maximilians-Universität München (Germany)
# Adam Ek (Gothenburg)
+
# Charlotte Noel, LINAGORA Labs (France)
# Filip Ginter (Turku, co-organizer)
+
# Dalton Harmsen, Eindhoven University of Technology (Netherlands)
# Yoav Goldberg (Tel Aviv, presenter)
+
# David Salinas, ELLIS Institute Tübingen (Germany)
# Rob van der Goot (Copenhagen)
+
# Diana Kylymnyk, University of Exeter (United Kingdom)
# Daniel Hershcovich (Copenhagen)
+
# Elizaveta Kuzmenko, Université Libre de Bruxelles (Belgium)
# Andreas Holm (Copenhagen)
+
# Etienne Simon, University of Oslo (Norway)
# Safiqul Islam (Oslo)
+
# Faton Rekathati, The National Library (Sweden)
# Suwisa Kaewphan (Turku)
+
# Fedor Vitiugin, University of Turku (Finland)
# Jenna Kanerva (Turku)
+
# François Yvon, CNRS (France)
# Martin Krallinger (Barcelona)
+
# Fred Philippy, University of Luxembourg (Luxembourg)
# Artur Kulmizev (Uppsala)
+
# Ghulam Muhammed Khan, University of Exeter (United Kingdom)
# Maria Kunilovskaya (Wolverhampton)
+
# Gianluca Barmina, University of Southern Denmark (Denmark)
# Jenny Kunz (Linköping)
+
# Hannah Clausen, University of Oslo (Norway)
# Andrey Kutuzov (Oslo)
+
# Hannan Mahadik, ELLIS Institute Tübingen (Germany)
# Anna Lindahl (Gothenburg)
+
# Iglika Nikolova-Stoupak, Sorbonne Université (France)
# Ellinor Lindqvist (Uppsala)
+
# Jan Hajič, Charles University (Czech Republic)
# Juhani Luotolahti (Turku)
+
# Jiajing Wan, University of Bergen (Norway)
# Jan Tore Lønning (Oslo)
+
# Jindřich Helcl, University of Oslo (Norway)
# Arild Matsson (Gothenburg)
+
# Johannes Gabriel Sindlinger, IT University of Copenhagen (Denmark)
# Maite Melero (Barcelona)
+
# Jouni Luoma, AMD Silo AI (Finland)
# Farrokh Mehryary (Turku, co-organizer)
+
# Julia Kreutzer, Cohere Labs (Canada)
# Antonio Miranda (Barcelona)
+
# Justyna Sikora, The National Library (Sweden)
# Joakim Nivre (Uppsala, co-organizer)
+
# Katarina Strani, Heriot-Watt University (United Kingdom)
# Stephan Oepen (Oslo, co-organizer)
+
# Kevin Glocker, Linköping University (Sweden)
# Ildiko Pilan (Oslo)
+
# Kristýna Onderková, Charles University (Czech Republic)
# Alessandro Raganato (Helsinki)
+
# Laurène Cave, Sorbonne Université (France)
# Vinit Ravishankar (Oslo)
+
# Lisa Yankovskaya, University of Tartu (Estonia)
# Arradi Nur Rizal (Uppsala)
+
# Maja Buljan, University of Oslo (Norway)
# Samuel Rönnqvist (Turku)
+
# Markus Heiervang, National Library of Norway
# Stian Rødven Eide (Gothenburg)
+
# Marthe Midtgaard, National Library of Norway
# Jörg Tiedemann (Helsinki, co-organizer)
+
# Mattes Ruckdeschel, IT University of Copenhagen (Denmark)
# Samia Touileb (Oslo)
+
# Maximilian Idahl, ellamind (Germany)
# Erik Velldal (Oslo)
+
# Meihan Tong, University of Oslo (Norway)
# Daniel Varab (Copenhagen)
+
# Muhammad Imran, University of A Coruña (Spain)
# Marta Villegas (Barcelona)
+
# Nam Luu, Charles University (Czech Republic)
# Antti Virtanen (Turku)
+
# Neda Jamshidi, University of Siena (Italy)
# Michael Welzl (Oslo)
+
# Nikolay Arefev, University of Oslo (Norway)
# Thomas Wolf (Huggingface, presenter)
+
# Nils Grünefeld, IT University of Copenhagen (Denmark)
# Dustin Wright (Copenhagen)
+
# Pedro Ortiz Suarez, Common Crawl Foundation (USA)
# Lilja Øvrelid (Oslo)
+
# Rolv-Arild Braaten, National Library of Norway
 +
# Romina Oji, Linköping University (Sweden)
 +
# Sampo Pyysalo, University of Turku (Finland)
 +
# Shanshan Xu, University of Copenhagen (Denmark)
 +
# Shenbin Qian, University of Oslo (Norway)
 +
# Stephan Oepen, University of Oslo (Norway)
 +
# Taja Kuzman Pungeršek, Jožef Stefan Institute (Slovenia)
 +
# Tita Enstad, National Library of Norway
 +
# Tommaso Green, University of Mannheim (Germany)
 +
# Tudor Nicolae Mateiu, Prompsit (Spain)
 +
# Vladislav Mikhailov, University of Oslo (Norway)
 +
# Wafa Aissa, UCLouvain (Belgium)
 +
# Xiaorui Yu, King's College London (United Kingdom)
 +
# Yihang Lu, Sorbonne Université (France)
 +
# Yiheng Wu, University of Helsinki (Finland)
 +
# Yves Scherrer, University of Oslo (Norway)
 +
# Zihao Li, University of Helsinki (Finland)

Latest revision as of 15:30, 4 February 2026

Circle U, NLPL, & OpenEuroLLM 2026 Winter School on Multilinguality in LLM Development and Evaluation


Background

In 2026, the NLPL network and the Digital Europe project OpenEuroLLM have joined forces to organize the successful winter school series on Web-scale NLP. The winter school seeks to stimulate community formation, i.e. strengthening interaction and collaboration among European research teams in NLP and advancing a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. This 2026 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC consortium and is endorsed as a doctoral training event in the European Circle U university alliance. For additional background, please see the archival pages from the 2018, 2019, 2020, 2023, 2024, and 2025 NLPL Winter Schools.

In early 2026, NLPL will hold its winter school from Monday, February 2, to Wednesday, February 4, 2026, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize group bus transfer from and to the main Oslo airport Gardermoen (OSL), leaving the airport at 9:45 on Monday morning and returning there around 17:30 on Wednesday afternoon.

The winter school is subsidized by the OpenEuroLLM project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will have to cover their own travel and accommodation at Skeikampen, however. Two nights at the hotel, including all meals, will come to NOK 3885 (NOK 3485 per person in a shared double room), to be paid to the hotel directly upon arrival.

Programme

The 2026 winter school has a thematic focus on Multilinguality in LLM Development and Evaluation. The programme comprises in-depth technical presentations (possibly including some hands-on elements) by international experts, with special emphasis on open science and European languages, but also includes critical reflections on current development trends in LLM-focused NLP. The programme will be complemented with a ‘walk-through’ of example EuroHPC experience reports from the OpenEuroLLM consortium and with reflections about current LLM-oriented activities of the National Library of Norway.

Confirmed presenters and talks include:

Schedule

Monday, February 2, 2026
13:00 14:00 Lunch
14:00 15:30 Session 1 Laurie Burchell and Pedro Ortiz Suarez: Multilinguality at Common Crawl

Improving Language Coverage for the Largest Open Web Corpus
The Common Crawl Foundation (CCF) provides the largest open corpus of web data, enabling a wide range of scientific and technical applications including large language model (LLM) development. However, our current data processing pipeline faces challenges when processing multilingual data, decreasing language representation and impacting downstream model performance. In this talk, we will discuss CCF’s initiatives to improve multilingual coverage and language identification of our web corpus. These efforts include soliciting crowd-sourced web seeds for under-served languages, running the First Workshop for Multilingual Data Quality Signals at COLM 2025, and creating CommonLID, a community-driven, human-annotated language identification benchmark for the web domain. Throughout, we emphasise the collaborative nature of our efforts, working in partnership with members of the NLP community to improve content available in their languages.

15:30 15:50 Coffee Break
16:00 17:30 Session 2 François Yvon: Evaluating Large LMs and their Multilingualism

Large Language Models introduced in recent years have been found extremely helpful in advancing the state of the art in many Natural Language Applications, notably due to their ability to compute numerical, high-dimensional representations of linguistic units such as words or sentences. Multilingual language models go one step further and add the ability to handle multiple languages, sometimes even multiple scripts, with a single model. In this presentation, I will discuss multilingual language models at length, with a focus on the evaluation of their multilingual abilities, which raises two difficult questions: (a) how to evaluate their performance as if they were just a collection of monolingual models; (b) how to evaluate their performance as integrated multilingual models, capable of bridging between languages.

17:30 17:50 Coffee Break
17:50 19:20 Session 3 Julia Kreutzer: Evaluating Generations Multilingually

Current Challenges and Lessons from Machine Translation
In this session we will dive into the particular challenge of evaluating LLMs across many languages in generative tasks. We will take a look at the "sister field" of machine translation and inspect what principles have led to advances in understanding quality across languages.

19:30 Dinner
Tuesday, February 3, 2026
Breakfast is available from 07:30
09:00 10:30 Session 4 François Yvon: Text Generation: Know your Options!

Text generation, contextual or non-contextual, is ubiquitous in the current LLM era, as it serves as the most basic block in multiple application contexts, from question answering and dialog systems to text summarization and machine translation, and many more. Generation is thus equally useful to compute deterministic and highly non-deterministic mappings with various levels of output constraints. Furthermore, text generation is also used as a sub-routine of more complex generation strategies, aiming to produce syntactically well-formed (e.g., for code generation) or semantically consistent outputs, possibly through multiple steps of generation (e.g., in chain-of-thought generation), or to collect diverse samples from the generating distribution. To cover this considerable diversity of uses, multiple text generation strategies have been proposed, some less well-known than others. In this talk I will review various families of generation algorithms, from the most basic ones to the more sophisticated approaches, so as to document, as much as possible, the options that are available to text generation users. The final part will survey some decoding issues that are specific to multilingual models.

Free time (Lunch is available between 13:00 and 14:30)
15:30 17:00 Session 5 Max Idahl: Multilingual Model-Based Quality Filtering for LLM Pretraining

Data quality is the highest-leverage factor for LLM performance, with recent work showing significant training efficiency gains through careful curation. This presentation traces the evolution from rule-based filtering to modern model-based approaches that now work across dozens of languages. We cover the progression from basic perplexity-based filters, to FastText and encoder-based scorers, to our newly released Propella models that annotate documents across 18 properties for 57 languages at scale. The talk includes practical insights into building multilingual filtering pipelines.

17:00 17:20 Coffee Break
17:20 19:20 Session 6 David Salinas: Challenges in Evaluating Generative Models

In this talk, we will discuss the evaluation of generative models, in particular Large Language Models (LLMs). Given that such models produce open-ended output, their evaluation requires different techniques than static evaluations such as simple question-answering benchmarks. We will first discuss human annotations and their use in leaderboards such as LMArena and ComparIA. We will then focus on automatic evaluation relying on LLM judges. In particular, we will describe current challenges with LLM judges before discussing their application in multilingual settings.

19:30 Dinner
21:00 Evening Session:

Javier de la Rosa, Rolv-Arild Braaten, Marthe Midtgaard, Angelina Zanardi: National Library of Norway
Sampo Pyysalo, Max Idahl, David Salinas, Stephan Oepen, Shenbin Qian: OpenEuroLLM, MultiSynt


Wednesday, February 4, 2026
Breakfast is available from 07:30
08:30 10:00 Session 8 Barbara Plank: NLP Beyond the Standard

Dialects, Variation, and Shared Representations in Multilingual Language Models
Multilingual language models have primarily focused on cross-lingual differences, with intra-language variation only recently gaining more attention. Dialects and non-standard varieties challenge core assumptions about data, representation, and evaluation. In this talk, I discuss what makes dialects particularly challenging for multilingual models, review approaches starting from early encoder-based methods, and give an overview of resources developed for dialectal NLP, with a focus on German dialects. I then turn to recent work on multilingual training dynamics and shared representations, analyzing when linguistic information and shared concept spaces emerge during training and where alignment breaks down. Although dialects are not yet explicitly modeled in this analysis, the findings provide insight into multilingual representation learning during pre-training.

10:00 10:30 Coffee Break
10:30 12:00 Session 9 Julia Kreutzer: Optimizing Data for Multilingual Post-Training

In this session we will look into techniques for augmenting data collections for better multilingual coverage. We will discuss the role of translation and inference settings, and explore methods for optimizing multilingual data both on the prompt and the generation side.

12:30 13:30 Lunch
13:45 16:45 Bus transfer to OSL Airport

Registration

In total, we expect 60–70 participants at the 2026 winter school. Registration for interested participants is now closed. Requests for participation were processed on a first-come, first-served basis, with an eye toward regional balance. Applicants who submitted the registration form were confirmed in three batches (on November 28, December 5, and December 19); December 19 was also the closing date for winter school registration.

Once confirmed by the organizing team, participant names are published on this page, and registration establishes a binding agreement with the hotel. Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute spaces), and no-shows will be charged the full price for at least one night by the hotel.

Logistics

With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the OpenEuroLLM shuttle). The bus will leave OSL airport no later than 9:45 CET on Monday, February 2. Thus, please meet up by 9:30 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the DNB currency exchange booth in the downstairs arrivals area, just outside the international arrivals luggage claim and slightly to the left as one exits the customs area: the yellow dot numbered (18) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport not long after 9:40. The drive to the Skeikampen conference hotel will take about two to three hours, and the bus will make one stop along the way so that we can stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 4, before the group returns to OSL airport on the OpenEuroLLM shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.

Organization

The 2026 Winter School is organized by a team of volunteers at the University of Oslo, supported by a programme committee from the OpenEuroLLM, Circle U, and NLPL networks and beyond (please see below). For all inquiries regarding registration, the programme, logistics, or similar matters, please contact nlpl-training@ifi.uio.no.

The programme committee comprises (in alphabetical order):

  • Jenia Jitsev (Forschungszentrum Jülich, Germany)
  • Andrey Kutuzov (University of Oslo, Norway)
  • Alessandro Lenci (University of Pisa, Italy)
  • Stephan Oepen (University of Oslo, Norway)
  • Sampo Pyysalo (University of Turku, Finland)
  • Gema Ramírez-Sánchez (Prompsit Language Engineering, Spain)
  • David Salinas (ELLIS Institute, Germany)
  • Jörg Tiedemann (University of Helsinki, Finland)
  • Joaquin Vanschoren (Eindhoven University of Technology, The Netherlands)
  • Guillaume Wisniewski (Paris Cité University, France)

Participants

  1. Adam Hrin, AMD Silo AI (Finland)
  2. Agnes Toftgård, The National Library (Sweden)
  3. Alicia Núñez Alcover, Prompsit (Spain)
  4. Anastasia Philipps, University of Oslo (Norway)
  5. Andrey Kutuzov, University of Oslo (Norway)
  6. Angelina Zanardi, National Library of Norway
  7. Anni Moisala, CSC – IT Center for Science (Finland)
  8. Artūrs Znotiņš, University of Latvia (Latvia)
  9. Barbara Heinisch, Eurac Research (Italy)
  10. Barbara Plank, Ludwig-Maximilians-Universität München (Germany)
  11. Charlotte Noel, LINAGORA Labs (France)
  12. Dalton Harmsen, Eindhoven University of Technology (Netherlands)
  13. David Salinas, ELLIS Institute Tübingen (Germany)
  14. Diana Kylymnyk, University of Exeter (United Kingdom)
  15. Elizaveta Kuzmenko, Université Libre de Bruxelles (Belgium)
  16. Etienne Simon, University of Oslo (Norway)
  17. Faton Rekathati, The National Library (Sweden)
  18. Fedor Vitiugin, University of Turku (Finland)
  19. François Yvon, CNRS (France)
  20. Fred Philippy, University of Luxembourg (Luxembourg)
  21. Ghulam Muhammed Khan, University of Exeter (United Kingdom)
  22. Gianluca Barmina, University of Southern Denmark (Denmark)
  23. Hannah Clausen, University of Oslo (Norway)
  24. Hannan Mahadik, ELLIS Institute Tübingen (Germany)
  25. Iglika Nikolova-Stoupak, Sorbonne Université (France)
  26. Jan Hajič, Charles University (Czech Republic)
  27. Jiajing Wan, University of Bergen (Norway)
  28. Jindřich Helcl, University of Oslo (Norway)
  29. Johannes Gabriel Sindlinger, IT University of Copenhagen (Denmark)
  30. Jouni Luoma, AMD Silo AI (Finland)
  31. Julia Kreutzer, Cohere Labs (Canada)
  32. Justyna Sikora, The National Library (Sweden)
  33. Katarina Strani, Heriot-Watt University (United Kingdom)
  34. Kevin Glocker, Linköping University (Sweden)
  35. Kristýna Onderková, Charles University (Czech Republic)
  36. Laurène Cave, Sorbonne Université (France)
  37. Lisa Yankovskaya, University of Tartu (Estonia)
  38. Maja Buljan, University of Oslo (Norway)
  39. Markus Heiervang, National Library of Norway
  40. Marthe Midtgaard, National Library of Norway
  41. Mattes Ruckdeschel, IT University of Copenhagen (Denmark)
  42. Maximilian Idahl, ellamind (Germany)
  43. Meihan Tong, University of Oslo (Norway)
  44. Muhammad Imran, University of A Coruña (Spain)
  45. Nam Luu, Charles University (Czech Republic)
  46. Neda Jamshidi, University of Siena (Italy)
  47. Nikolay Arefev, University of Oslo (Norway)
  48. Nils Grünefeld, IT University of Copenhagen (Denmark)
  49. Pedro Ortiz Suarez, Common Crawl Foundation (USA)
  50. Rolv-Arild Braaten, National Library of Norway
  51. Romina Oji, Linköping University (Sweden)
  52. Sampo Pyysalo, University of Turku (Finland)
  53. Shanshan Xu, University of Copenhagen (Denmark)
  54. Shenbin Qian, University of Oslo (Norway)
  55. Stephan Oepen, University of Oslo (Norway)
  56. Taja Kuzman Pungeršek, Jožef Stefan Institute (Slovenia)
  57. Tita Enstad, National Library of Norway
  58. Tommaso Green, University of Mannheim (Germany)
  59. Tudor Nicolae Mateiu, Prompsit (Spain)
  60. Vladislav Mikhailov, University of Oslo (Norway)
  61. Wafa Aissa, UCLouvain (Belgium)
  62. Xiaorui Yu, King's College London (United Kingdom)
  63. Yihang Lu, Sorbonne Université (France)
  64. Yiheng Wu, University of Helsinki (Finland)
  65. Yves Scherrer, University of Oslo (Norway)
  66. Zihao Li, University of Helsinki (Finland)