HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation

[Image: HPLT and NLPL Winter School 2024.jpg]

Background

Since 2023, the NLPL network and the Horizon Europe project High-Performance Language Technologies (HPLT) have joined forces to organize the successful winter school series on Web-scale NLP. The winter school seeks to stimulate community formation, i.e., strengthening interaction and collaboration among European research teams in NLP and advancing a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. This 2025 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, 2020, 2023, and 2024 NLPL Winter Schools.

For early 2025, HPLT will hold its winter school from Monday, February 3, to Wednesday, February 5, 2025, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize group bus transfer from and to the Oslo airport Gardermoen, leaving the airport at 9:45 on Monday morning and returning there around 17:30 on Wednesday afternoon.

The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will have to cover their own travel and accommodation at Skeikampen, however. Two nights at the hotel, including all meals, will come to NOK 3855 (NOK 3455 per person in a shared double room), to be paid to the hotel directly upon arrival.

Programme

The 2025 winter school will have a thematic focus on Pretraining Data Quality and Multilingual LLM Evaluation. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) by seasoned experts, with special emphasis on open science and European languages, but will also include critical reflections on current development trends in LLM-focussed NLP. The programme will be complemented with a ‘walk-through’ of example experience reports on the shared EuroHPC LUMI supercomputer.

Confirmed presenters and talks include:

  • Alexandra Birch, University of Edinburgh
    EuroLLM and FinLLM – stories from the trenches
  • Jenia Jitsev and Marianna Nezhurina, Jülich Supercomputing Centre / LAION
    Open Foundation Models: Scaling Laws and Generalization
  • Guilherme Penedo, Hugging Face
    FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training
  • Gema Ramírez-Sánchez, Prompsit Language Engineering
    A look at Pre-Training Data through the Stats Glass
  • Anna Rogers, IT University of Copenhagen
    Large Language Models and Factuality
  • Pedro Ortiz Suarez and Sebastian Nagel, Common Crawl
    Data Quality, Language Coverage and Ethical Considerations in Web Crawling
  • Ahmet Üstün, Cohere AI
    Recipe for multilingual post-training: How to collect high-quality data and use them?

Schedule

Monday, February 3, 2025
13:00 14:00 Lunch
14:00 15:30 Session 1 Pedro Ortiz Suarez & Sebastian Nagel

Data Quality, Language Coverage and Ethical Considerations in Web Crawling
Common Crawl is a free, open repository of web crawl data, collected since 2008, that can be used by anyone. Throughout the years the foundation has focused on achieving a diverse and representative sample of web sites while operating an efficient and polite crawler. In recent years, with the advent of LLMs and multimodal models, the interest in obtaining large amounts of high-quality data has skyrocketed, while also raising concerns about the ethical considerations of large-scale data curation. After a quick introduction to the history of the Common Crawl Foundation, we present our recent efforts to respond to these new data requirements while also expanding the language and cultural coverage of our dataset, and addressing the practical and ethical questions that have arisen around web crawling in the era of LLMs.

Slides: https://data.hplt-project.org/transfer/commoncrawl_2025.pdf (see the illustrative sketch below, after the Monday schedule)

15:30 15:50 Coffee Break
16:00 17:30 Session 2 Anna Rogers

LLMs and Factuality: facts from LLMs
This lecture focuses on the workflows for using LLMs as information sources, the types of problems that may result from that, and the main current mitigation strategies: retrieval-augmented generation (RAG) and chain-of-thought (CoT) prompting. Finally, I will discuss the problem of detecting generated texts, and the impact of LLMs on the information ecosphere and content economy.

Slides: https://data.hplt-project.org/transfer/nlpl_rogers_pt1.pdf (see the illustrative sketch below, after the Monday schedule)

17:30 17:50 Coffee Break
17:50 19:20 Session 3 Alexandra Birch

EuroLLM and FinLLM – stories from the trenches
In this talk, we share our experiences building two large language models: EuroLLM, a multilingual model designed to serve the diverse linguistic and cultural landscape of Europe, and FinLLM, a financial LLM tailored for the UK’s highly specialized finance industry, built with our partners Aveni.ai, Lloyds, and Nationwide. We will discuss the challenges of curating high-quality training data, including data mixes, cleaning pipelines, and training recipes, as well as the challenge of creating meaningful benchmarks.

Slides: https://data.hplt-project.org/transfer/2025-02-EuroLLM_and_FinLLM_Birch.pdf

19:30 Dinner
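
Two of the Monday talks above lend themselves to small illustrations. First, the Common Crawl session mentions operating an efficient and polite crawler; the following sketch is not Common Crawl's actual code, but shows the basic courtesy steps of consulting a site's robots.txt and honouring any requested crawl delay, using only the Python standard library. The URL and the user-agent string are placeholders.

  import time
  import urllib.robotparser
  from urllib.request import Request, urlopen

  USER_AGENT = "example-research-crawler/0.1"  # placeholder identity, not a real crawler

  def polite_fetch(url, default_delay=1.0):
      """Fetch a URL only if robots.txt allows it, waiting out any requested crawl delay."""
      robots_url = "/".join(url.split("/")[:3]) + "/robots.txt"
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(robots_url)
      rp.read()                                  # download and parse robots.txt
      if not rp.can_fetch(USER_AGENT, url):
          return None                            # the site disallows this path
      time.sleep(rp.crawl_delay(USER_AGENT) or default_delay)   # be polite
      request = Request(url, headers={"User-Agent": USER_AGENT})
      with urlopen(request, timeout=30) as response:
          return response.read()

  # html = polite_fetch("https://example.org/")  # a real crawl also needs queueing and retries

Second, the factuality lecture names retrieval-augmented generation (RAG) as a mitigation strategy. Below is a minimal sketch of that workflow, not the setup from the lecture, assuming a plain list of text passages and a placeholder generate() call standing in for an actual LLM API.

  from collections import Counter

  def overlap(question, passage):
      """Crude lexical-overlap score between a question and a candidate passage."""
      q, p = Counter(question.lower().split()), Counter(passage.lower().split())
      return sum((q & p).values())

  def build_rag_prompt(question, passages, k=3):
      """Select the k best-matching passages and pack them into a grounded prompt."""
      top = sorted(passages, key=lambda p: overlap(question, p), reverse=True)[:k]
      context = "\n\n".join(top)
      return ("Answer the question using only the context below. "
              "If the context is insufficient, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

  # prompt = build_rag_prompt("When did Common Crawl start crawling?", corpus_passages)
  # answer = generate(prompt)   # 'generate' is a stand-in for whatever LLM API is used
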
Tuesday, February 4, 2025
Breakfast is available from 07:30
09:00 10:30 Session 4 Guilherme Penedo

FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training
FineWeb2 is a recent multilingual web-based dataset for large language model (LLM) pretraining that produces better-performing LLMs than other popular datasets. In this talk, we discuss in depth the many challenges involved in adapting processing pipelines commonly used for English data to over 1000 languages, including evaluation task selection for ablation experiments, language identification, filtering, and deduplication.

Slides: https://data.hplt-project.org/transfer/FineWeb2_90min.pdf (see the illustrative sketch below, after the Tuesday schedule)

Free time (Lunch is available between 13:00 and 14:30)
15:30 17:00 Session 5 Gema Ramírez-Sánchez

Having a look at pretraining data through the stats glass
At the moment of speaking, zillions of tokens of pretraining data are being collected and curated to train LLMs by several initiatives, all aiming to gather the best set to get the best model performance. These curated datasets are huge and in many cases multilingual, which makes even the smallest evaluation task an enormous undertaking. But we can always ask the stats for help, and the data will confess. In this session we will look at several pretraining (textual) datasets through the stats glass and see together what ups and downs they reveal.

Slides: https://data.hplt-project.org/transfer/Gema-Ramírez-HPLT-Winter-School-2025.pdf (see the illustrative sketch below, after the Tuesday schedule)

17:00 17:20 Coffee Break
17:20 19:20 Session 6 Jenia Jitsev & Marianna Nezhurina

Open Foundation Models: Scaling Laws and Generalization

Slides 1: https://data.hplt-project.org/transfer/Open_Foundation_Models_Scaling_Laws-pre_final_2024.pdf
Slides 2: https://data.hplt-project.org/transfer/Pitfalls_in_measuring_generalization.pdf

19:30 Dinner
21:00 Evening Session: Findings from HPLT
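
Two of the Tuesday sessions above also invite small illustrations. The FineWeb2 talk names language identification, filtering, and deduplication as core pipeline steps; the sketch below is not the FineWeb2 pipeline itself, detect_language is a placeholder for a real language identifier (for example a fastText model), and the filter threshold is invented.

  import hashlib

  def detect_language(text):
      """Placeholder: a real pipeline would call a trained language identifier here."""
      return "und"  # 'undetermined'

  def keep_document(text, language, min_words=50):
      """Toy quality filter: keep documents that are long enough and in the target language."""
      if len(text.split()) < min_words:
          return False
      return detect_language(text) == language

  def deduplicate(documents):
      """Drop exact duplicates by hashing normalized text (real pipelines also do fuzzy dedup)."""
      seen, unique = set(), []
      for doc in documents:
          digest = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
          if digest not in seen:
              seen.add(digest)
              unique.append(doc)
      return unique

  # corpus = deduplicate([d for d in raw_documents if keep_document(d, "fin")])

For the session on looking at pretraining data through the stats glass, here is a minimal sketch, not the speaker's tooling, of the kind of descriptive statistics one might start from; whitespace tokenisation is an obvious simplification.

  import statistics
  from collections import Counter

  def corpus_stats(documents):
      """Simple descriptive statistics over a list of text documents."""
      lengths = [len(doc.split()) for doc in documents]
      vocab = Counter(token for doc in documents for token in doc.split())
      total_tokens = sum(lengths)
      return {
          "documents": len(documents),
          "tokens": total_tokens,
          "median_doc_length": statistics.median(lengths) if lengths else 0,
          "longest_doc": max(lengths, default=0),
          "type_token_ratio": len(vocab) / total_tokens if total_tokens else 0.0,
          "top_tokens": vocab.most_common(10),
      }

  # print(corpus_stats(["a small example document", "another tiny example document"]))
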


Wednesday, February 5, 2025
Breakfast is available from 07:30
08:30 10:00 Session 8 Ahmet Üstün (online)

Recipe for multilingual post-training: How to collect high-quality data and use them?
Post-training is a crucial step for building state-of-the-art LLMs and aligning them with human preferences. Although many public post-training datasets are available, they are predominantly curated for English, and multilingual datasets are extremely scarce. This lecture will cover methods for collecting high-quality post-training datasets, such as human annotation, multilingual templates, and synthetic data generation. We will also complement these data-collection methods with post-training recipes from Aya-101, Aya-23, and the recently released Aya Expanse models, showing how to best leverage the curated data.

Slides: https://data.hplt-project.org/transfer/HPLT_Winter_School_Aya.pdf (see the illustrative sketch below, after the Wednesday schedule)

10:00 10:30 Coffee Break
10:30 12:00 Session 9 Anna Rogers

LLMs and Factuality: facts about LLMs
This lecture critically examines a set of common claims about modern LLMs, including claims of high performance, robustness, general-purpose technology status, and "emergent properties". I will also re-examine the "bitter lesson" as applied to LLMs, and its implications for the future of the field.

Slides: https://data.hplt-project.org/transfer/nlpl_rogers_pt2.pdf

12:30 13:30 Lunch
13:45 16:45 Bus transfer to OSL Airport
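
The post-training lecture above mentions multilingual templates as one way to collect post-training data. Below is a toy sketch of that idea, not the Aya recipe; the two templates and the example record are invented for illustration. The pattern is simply to wrap existing labelled examples in per-language instruction templates to obtain prompt/completion pairs for supervised post-training.

  # Invented per-language instruction templates (not Aya's actual templates).
  TEMPLATES = {
      "nob": "Oppsummer følgende tekst:\n{text}",
      "fin": "Tiivistä seuraava teksti:\n{text}",
  }

  def to_post_training_example(record, language):
      """Turn a (text, summary) record into a prompt/completion pair via a language template."""
      return {
          "prompt": TEMPLATES[language].format(text=record["text"]),
          "completion": record["summary"],
          "language": language,
      }

  # example = to_post_training_example(
  #     {"text": "example source text", "summary": "example summary"}, language="nob")
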

Registration

In total, we welcome 62 participants at the 2025 winter school. The winter school is over-subscribed and no longer accepting registrations. We processed requests for participation on a first-come, first-served basis, with an eye toward regional balance. Interested parties who submitted the registration form were confirmed in three batches, on December 6, December 13, and December 20, which was also the closing date for winter school registration.

Once confirmed by the organizing team, participant names are published on this page, and registration establishes a binding agreement with the hotel. Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute spaces), and no-shows will be charged the full price for at least one night by the hotel.

Logistics

With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:45 CET on Monday, February 3. Thus, please meet up by 9:30 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the DNB currency exchange booth in the downstairs arrivals area, just outside the international arrivals luggage claims and slightly to the left as one exits the customs area: the yellow dot numbered (18) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport not long after 9:40. The drive to the Skeikampen conference hotel will take us about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 5, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.

Organization

The 2025 Winter School is organized by a team of volunteers at the University of Oslo, supported by a programme committee from the HPLT and NLPL network and beyond (see below). For all inquiries regarding registration, the programme, logistics, or such, please contact hplt-training@ifi.uio.no.

The programme committee comprises:

  • Barry Haddow (University of Edinburgh, UK)
  • Andrey Kutuzov (University of Oslo, Norway)
  • Stephan Oepen (University of Oslo, Norway)
  • Sampo Pyysalo (University of Turku, Finland)
  • Jörg Tiedemann (University of Helsinki, Finland)

Participants

  1. Nikolay Arefev, University of Oslo (Norway)
  2. Maria Barrett, Silo AI (Finland)
  3. Toms Bergmanis, Tilde (Latvia)
  4. Alexandra Birch, University of Edinburgh (UK)
  5. Laurie Burchell, University of Edinburgh (UK)
  6. Lucas Charpentier, University of Oslo (Norway)
  7. Pinzhen (Patrick) Chen, University of Edinburgh (UK)
  8. Hannah Clausen, University of Oslo (Norway)
  9. Lucia Domenichelli, University of Pisa (Italy)
  10. Aleksei Dorkin, University of Tartu (Estonia)
  11. Kenneth Enevoldsen, Aarhus University (Denmark)
  12. Tita Enstad, National Library (Norway)
  13. Mariia Fedorova, University of Oslo (Norway)
  14. Yanzhu Guo, INRIA Paris (France)
  15. Arzu Burcu Güven, IT University of Copenhagen (Denmark)
  16. Barry Haddow, University of Edinburgh (UK)
  17. Jan Hajič, Charles University (Czech Republic)
  18. Jindřich Helcl, Charles University (Czech Republic)
  19. Bertram Højer, IT University of Copenhagen (Denmark)
  20. Sekh Mainul Islam, University of Copenhagen (Denmark)
  21. Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany)
  22. Márton Kardos, Aarhus University (Denmark)
  23. Anastasiia Klimashevskaia, University of Bergen (Norway)
  24. Mateusz Klimaszewski, The University of Edinburgh (UK)
  25. Ville Komulainen, University of Turku (Finland)
  26. Markus Koskela, CSC – IT Center for Science (Finland)
  27. Martins Kronis, Tilde (Latvia)
  28. Vimal Kumar Kumar, University of Limerick (Ireland)
  29. Andrey Kutuzov, University of Oslo (Norway)
  30. Hengyu Luo, University of Helsinki (Finland)
  31. Farrokh Mehryary, University of Turku (Finland)
  32. Vladislav Mikhailov, University of Oslo (Norway)
  33. Andreas Motzfeldt, IT University of Copenhagen (Denmark)
  34. Zain Muhammad Mujahid, University of Copenhagen (Denmark)
  35. Sebastian Nagel, Common Crawl Foundation (Germany)
  36. Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany)
  37. Stephan Oepen, University of Oslo (Norway)
  38. Guilherme Penedo, Hugging Face (France)
  39. Irina Proskurina, University of Lyon (France)
  40. Taido Purason, University of Tartu (Estonia)
  41. Marie Roald, National Library (Norway)
  42. Anna Rogers, IT University of Copenhagen (Denmark)
  43. Ismaël Rousseau, Orange (France)
  44. David Samuel, University of Oslo (Norway)
  45. Gema Ramírez-Sánchez, Prompsit Language Engineering (Spain)
  46. Marta Sartor, University of Pisa (Italy)
  47. Ipek Baris Schlicht, Universitat Politècnica de València (Spain)
  48. Étienne Simon, University of Oslo (Norway)
  49. Pavel Stepachev, The University of Edinburgh (UK)
  50. Pedro Ortiz Suarez, Common Crawl Foundation (France)
  51. Otto Tarkka, University of Turku (Finland)
  52. Kushal Tatariya, KU Leuven (Belgium)
  53. Jörg Tiedemann, University of Helsinki (Finland)
  54. Samia Touileb, University of Bergen (Norway)
  55. Elke Vandermeerschen, KU Leuven (Belgium)
  56. Raul Vazquez, University of Helsinki (Finland)
  57. Ramón Carreño Villar, University of Oslo (Norway)
  58. Fedor Vitiugin, Aalto University (Finland)
  59. Tea Vojtěchová, Charles University (Czech Republic)
  60. Artūrs Znotiņš, IMCS at University of Latvia (Latvia)
  61. Elaine Zosa, Silo AI (Finland)