Difference between revisions of "Community/training"
(→Programme) |
(→Schedule) |
||
(94 intermediate revisions by 3 users not shown) | |||
Line 1: | Line 1: | ||
− | '''HPLT & NLPL Winter School on | + | '''HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation''' |
− | [[File: | + | [[File:HPLT and NLPL Winter School 2024.jpg|center|thumb|upright=2.0]] |
= Background = | = Background = | ||
Line 13: | Line 13: | ||
and experience in using high-performance e-infrastructures for large-scale | and experience in using high-performance e-infrastructures for large-scale | ||
NLP research. | NLP research. | ||
− | + | This 2025 edition of the winter school puts special emphasis on | |
NLP researchers from countries that participate in the EuroHPC | NLP researchers from countries that participate in the EuroHPC | ||
[https://www.lumi-supercomputer.eu/lumi-consortium/ LUMI consortium]. | [https://www.lumi-supercomputer.eu/lumi-consortium/ LUMI consortium]. | ||
For additional background, please see the archival pages from the | For additional background, please see the archival pages from the | ||
− | [ | + | [https://wiki.nlpl.eu/index.php/Community/training/2018 2018], |
− | [ | + | [https://wiki.nlpl.eu/index.php/Community/training/2019 2019], |
− | [ | + | [https://wiki.nlpl.eu/index.php/Community/training/2020 2020], |
− | [ | + | [https://wiki.nlpl.eu/index.php/Community/training/2023 2023], and |
+ | [https://wiki.nlpl.eu/index.php/Community/training/2024 2024] | ||
NLPL Winter Schools. | NLPL Winter Schools. | ||
− | For early | + | For early 2025, HPLT will hold its winter school from Monday, February 3, to |
− | + | Wednesday, February 5, 2025, at a | |
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel] | [https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel] | ||
(with skiing and walking opportunities) about two hours north of Oslo. | (with skiing and walking opportunities) about two hours north of Oslo. | ||
The project will organize group bus transfer from and to the Oslo | The project will organize group bus transfer from and to the Oslo | ||
− | airport ''Gardermoen'', leaving the airport at 9: | + | airport ''Gardermoen'', leaving the airport at 9:45 on Monday morning |
− | and returning there around 17:30 on | + | and returning there around 17:30 on Wednesday afternoon. |
The winter school is subsidized by the HPLT project: there is no fee for | The winter school is subsidized by the HPLT project: there is no fee for | ||
participants and no charge for the bus transfer to and from the | participants and no charge for the bus transfer to and from the | ||
conference hotel. | conference hotel. | ||
− | All participants will have to cover their own travel and | + | All participants will have to cover their own travel and accommodation |
at Skeikampen, however. | at Skeikampen, however. | ||
− | Two nights at the hotel, including all meals, will come to NOK | + | Two nights at the hotel, including all meals, will come to NOK 3855 (NOK 3455 per person in a shared double room), |
− | to be paid to the hotel directly. | + | to be paid to the hotel directly upon arrival. |
= Programme = | = Programme = | ||
− | The | + | The 2025 winter school will have a thematic focus on ''Pretraining Data Quality and Multilingual LLM Evaluation''. |
The programme will be comprised of in-depth technical presentations (possibly including some | The programme will be comprised of in-depth technical presentations (possibly including some | ||
hands-on elements) by seasoned experts, with special emphasis on open science and European languages, | hands-on elements) by seasoned experts, with special emphasis on open science and European languages, | ||
but also include critical reflections on current development trends in LLM-focussed NLP. | but also include critical reflections on current development trends in LLM-focussed NLP. | ||
− | The programme will be complemented with | + | The programme will be complemented with a ‘walk-through’ of example experience |
− | + | reports on the shared EuroHPC LUMI supercomputer. | |
− | Confirmed presenters include | + | Confirmed presenters and talks include: |
− | * [ | + | * [https://sites.google.com/view/alexandra-birch Alexandra Birch], University of Edinburgh<br>'''EuroLLM and FinLLM – stories from the trenches''' |
− | * [https:// | + | * [https://laion.ai/team/ Jenia Jitsev] and [https://laion.ai/team/ Marianna Nezhurina], Jülich Supercomputing Centre / LAION<br>'''Open Foundation Models: Scaling Laws and Generalization''' |
− | * [https:// | + | * [https://huggingface.co/guipenedo Guilherme Penedo], Hugging Face<br>'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training''' |
− | * [https:// | + | * [https://scholar.google.com/citations?user=f5FSgPwAAAAJ&hl=en Gema Ramírez-Sánchez], Prompsit Language Engineering<br>'''A look at Pre-Training Data through the Stats Glass''' |
+ | * [https://annargrs.github.io Anna Rogers], IT University of Copenhagen<br>'''Large Language Models and Factuality''' | ||
+ | * [https://portizs.eu Pedro Ortiz Suarez] and [https://commoncrawl.org/team/sebastian-nagel-engineer Sebastian Nagel], Common Crawl<br>'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling''' | ||
+ | * [https://scholar.google.com.tr/citations?user=fvotcRIAAAAJ&hl=tr Ahmet Üstün], Cohere AI<br>'''Recipe for multilingual post-training: How to collect high-quality data and use them?''' | ||
+ | = Schedule = | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | !colspan=3| | + | !colspan=3|Monday, February 3, 2025 |
|- | |- | ||
| 13:00 || 14:00 || Lunch | | 13:00 || 14:00 || Lunch | ||
|- | |- | ||
− | | 14:00 || 15:30 || '''Session 1''' | + | | 14:00 || 15:30 || '''Session 1''' Pedro Ortiz Suarez & Sebastian Nagel <p class="mw-collapsible mw-collapsed">'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling'''<br> |
+ | Common Crawl is a free, open repository of web crawl data that can be used by anyone, crawled since 2008. Throughout the years the foundation has focused on achieving a balanced, diverse, and representative sample of web sites while operating an efficient and polite crawler. In recent years, with the advent of LLMs and multimodal models, the interest in obtaining large amounts of high-quality data has skyrocketed, while also raising concerns about the ethical considerations of large-scale data curation. After a quick introduction to the history of the Common Crawl Foundation, we present our recent efforts to respond to these new data requirements while also expanding the language and cultural coverage of our dataset, and addressing the practical and ethical questions that have arisen around web crawling in the era of LLMs.</p> | ||
+ | [https://data.hplt-project.org/transfer/commoncrawl_2025.pdf Slides] | ||
|- | |- | ||
| 15:30 || 15:50 || Coffee Break | | 15:30 || 15:50 || Coffee Break | ||
|- | |- | ||
− | | 16:00 || 17:30 || '''Session 2''': | + | | 16:00 || 17:30 || '''Session 2''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts from LLMs'''<br> |
+ | This lecture focuses on the workflows for using LLMs as information sources, the types of problems that may result from that, and the main current mitigation strategies (RAG and CoT). Finally, I will discuss the problem of detecting generated texts, and the impact of LLMs on the information ecosphere and content economy.</p> | ||
+ | [https://data.hplt-project.org/transfer/nlpl_rogers_pt1.pdf Slides] | ||
|- | |- | ||
| 17:30 || 17:50 || Coffee Break | | 17:30 || 17:50 || Coffee Break | ||
|- | |- | ||
− | | 17:50 || 19:20 || '''Session 3''': | + | | 17:50 || 19:20 || '''Session 3''' Alexandra Birch <p class="mw-collapsible mw-collapsed">'''EuroLLM and FinLLM – stories from the trenches'''<br> |
+ | In this talk, we share our experiences building two large language models: EuroLLM, a multilingual model designed to serve the diverse linguistic and cultural landscape of Europe, and FinLLM, a financial LLM tailored for the UK’s highly specialized finance industry with our partners Aveni.ai, Lloyds, and Nationwide. We will discuss the challenges of curating high-quality training data (data mixes, cleaning pipelines, and training recipes) and of creating meaningful benchmarks.</p> | ||
+ | [https://data.hplt-project.org/transfer/2025-02-EuroLLM_and_FinLLM_Birch.pdf Slides] | ||
|- | |- | ||
| 19:30 || || Dinner | | 19:30 || || Dinner | ||
Line 76: | Line 87: | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | !colspan=3| | + | !colspan=3|Tuesday, February 4, 2025 |
|- | |- | ||
|colspan=3 | Breakfast is available from 07:30 | |colspan=3 | Breakfast is available from 07:30 | ||
|- | |- | ||
− | | 09:00 || 10:30 || '''Session 4''': | + | | 09:00 || 10:30 || '''Session 4''' Guilherme Penedo <p class="mw-collapsible mw-collapsed">'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training'''<br>FineWeb2 is a recent multilingual web-based dataset for large language model (LLM) pretraining that produces better-performing LLMs than other popular datasets. In this talk, we discuss in depth the many challenges involved in adapting processing pipelines commonly used for English data to over 1000 languages, including evaluation task selection for ablation experiments, language identification, filtering, and deduplication.</p> |
+ | [https://data.hplt-project.org/transfer/FineWeb2_90min.pdf Slides] | ||
|- | |- | ||
|colspan=3| Free time (Lunch is available between 13:00 and 14:30) | |colspan=3| Free time (Lunch is available between 13:00 and 14:30) | ||
|- | |- | ||
− | | 15: | + | | 15:30 || 17:00 || '''Session 5''' Gema Ramírez-Sánchez <p class="mw-collapsible mw-collapsed">'''Having a look at pretraining data through the stats glass'''<br>At the moment of speaking, zillions of tokens of pretraining data are being collected and curated to train LLMs by several initiatives, all aiming at gathering the best set to get the best model performance. These curated datasets are huge and in many cases multilingual, making the smallest evaluation task an enormous task. But we can always ask stats for help, and data will confess. In this session we will have a look at several pretraining (textual) datasets through the stats glass, and see together what are the ups and downs revealed by it.</p> |
+ | [https://data.hplt-project.org/transfer/Gema-Ramírez-HPLT-Winter-School-2025.pdf Slides] | ||
|- | |- | ||
− | | | + | | 17:00 || 17:20 || Coffee Break |
|- | |- | ||
− | | | + | | 17:20 || 19:20 || '''Session 6''' Jenia Jitsev & Marianna Nezhurina <p class="mw-collapsible mw-collapsed">'''Open Foundation Models: Scaling Laws and Generalization'''</p> |
− | + | [https://data.hplt-project.org/transfer/Open_Foundation_Models_Scaling_Laws-pre_final_2024.pdf Slides 1]<br> | |
− | + | [https://data.hplt-project.org/transfer/Pitfalls_in_measuring_generalization.pdf Slides 2] | |
− | |||
− | |||
|- | |- | ||
| 19:30 || || Dinner | | 19:30 || || Dinner | ||
|- | |- | ||
− | | 21:00 || || '''Evening Session''' | + | | 21:00 || || '''Evening Session: Findings from HPLT''' |
|} | |} | ||
Line 102: | Line 113: | ||
{| class="wikitable" | {| class="wikitable" | ||
|- | |- | ||
− | !colspan=3| | + | !colspan=3|Wednesday, February 5, 2025 |
|- | |- | ||
|colspan=3| Breakfast is available from 07:30 | |colspan=3| Breakfast is available from 07:30 | ||
|- | |- | ||
− | | 08:30 || 10:00 || '''Session 8''': | + | | 08:30 || 10:00 || '''Session 8''' Ahmet Üstün (online) <p class="mw-collapsible mw-collapsed">'''Recipe for multilingual post-training: How to collect high-quality data and use them?'''<br>Post-training is a crucial step for building state-of-the-art LLMs and aligning them according to human preferences. Although many public post-training datasets are available, they are predominantly curated for English, and multilingual datasets are extremely scarce. This lecture will cover methods for collecting high-quality post-training datasets such as human annotation, multilingual templates, and synthetic data generation. We will also complement methods for high-quality data collection with post-training recipes from Aya-101, Aya-23, and recently released Aya Expanse models, to leverage the curated data best.</p> |
+ | [https://data.hplt-project.org/transfer/HPLT_Winter_School_Aya.pdf Slides] | ||
|- | |- | ||
| 10:00 || 10:30 || Coffee Break | | 10:00 || 10:30 || Coffee Break | ||
|- | |- | ||
− | | 10:30 || 12:00 || '''Session 9''': | + | | 10:30 || 12:00 || '''Session 9''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts about LLMs'''<br> |
+ | This lecture critically examines a set of common claims about modern LLMs, including the claims of their high performance, robustness, general-purpose technology status, and "emergent properties". I will also re-examine the "bitter lesson" as applied to LLMs, and its implications for the future of the field.</p> | ||
+ | [https://data.hplt-project.org/transfer/nlpl_rogers_pt2.pdf Slides] | ||
|- | |- | ||
| 12:30 || 13:30 || Lunch | | 12:30 || 13:30 || Lunch | ||
+ | |- | ||
+ | | 13:45 || 16:45 || Bus transfer to OSL Airport | ||
|} | |} | ||
= Registration = | = Registration = | ||
− | In total, we | + | In total, this year we welcome 62 participants at the 2025 winter school. |
− | + | The winter school is [https://nettskjema.no/a/381438 over-subscribed] and no longer accepting registrations. | |
− | and | + | We have processed requests for participation on a first-come, first-served basis, with an eye toward regional balance. |
− | We | + | Interested parties who had submitted the registration form have been confirmed in three batches, on '''December 6''', on '''December 13''', |
− | Interested parties who | + | and on '''December 20''', which was also the closing date for winter school registration. |
− | and on December | ||
− | Once confirmed by the organizing team, participant names | + | Once confirmed by the organizing team, participant names are published |
− | on this page, and registration | + | on this page, and registration establishes a |
''binding agreement'' with the hotel. | ''binding agreement'' with the hotel. | ||
Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute | Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute | ||
Line 135: | Line 150: | ||
With a few exceptions, winter school participants travel to and from the conference hotel | With a few exceptions, winter school participants travel to and from the conference hotel | ||
jointly on a chartered bus (the HPLT shuttle). | jointly on a chartered bus (the HPLT shuttle). | ||
− | The bus will leave OSL airport no later than 9:45 CET on | + | The bus will leave OSL airport no later than 9:45 CET on Monday, February 3. |
Thus, please meet up by 9:30 and make your arrival known to your assigned | Thus, please meet up by 9:30 and make your arrival known to your assigned | ||
‘tour guide’ (who will introduce themselves to you by email beforehand). | ‘tour guide’ (who will introduce themselves to you by email beforehand). | ||
− | The group will gather near the | + | The group will gather near the DNB currency exchange booth in the downstairs |
arrivals area, just outside the international arrivals luggage claims and slightly | arrivals area, just outside the international arrivals luggage claims and slightly | ||
− | to the | + | to the left as one exits the customs area: |
− | + | the yellow dot numbered (18) on the | |
[https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map]. | [https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map]. | ||
− | The group will then walk over to the bus terminal, to leave the airport not long after 9: | + | The group will then walk over to the bus terminal, to leave the airport not long after 9:40. |
The drive to the Skeikampen conference hotel will take us about three hours, and the bus | The drive to the Skeikampen conference hotel will take us about three hours, and the bus | ||
will make one stop along the way to stretch our legs and fill up on coffee. | will make one stop along the way to stretch our legs and fill up on coffee. | ||
− | The winter school will end with lunch on | + | The winter school will end with lunch on Wednesday, February 5, before the group returns |
to OSL airport on the HPLT shuttle. | to OSL airport on the HPLT shuttle. | ||
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL | The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL | ||
Line 155: | Line 170: | ||
= Organization = | = Organization = | ||
− | The | + | The 2025 Winter School is organized by a team of volunteers at the University |
− | of Oslo, supported by a programme committee from the NLPL network and beyond, | + | of Oslo, supported by a programme committee from the HPLT and NLPL network and beyond, |
please see below. | please see below. | ||
For all inquiries regarding registration, the programme, logistics, | For all inquiries regarding registration, the programme, logistics, | ||
Line 163: | Line 178: | ||
The programme committee is comprised of: | The programme committee is comprised of: | ||
− | * | + | * Barry Haddow (University of Edinburgh, UK) |
− | |||
− | |||
− | |||
− | |||
− | |||
* Andrey Kutuzov (University of Oslo, Norway) | * Andrey Kutuzov (University of Oslo, Norway) | ||
− | |||
* Stephan Oepen (University of Oslo, Norway) | * Stephan Oepen (University of Oslo, Norway) | ||
* Sampo Pyysalo (University of Turku, Finland) | * Sampo Pyysalo (University of Turku, Finland) | ||
− | |||
− | |||
− | |||
− | |||
* Jörg Tiedemann (University of Helsinki, Finland) | * Jörg Tiedemann (University of Helsinki, Finland) | ||
− | |||
= Participants = | = Participants = | ||
− | |||
− | |||
# Nikolay Arefev, University of Oslo (Norway) | # Nikolay Arefev, University of Oslo (Norway) | ||
− | # | + | # Maria Barrett, Silo AI (Finland) |
− | # | + | # Toms Bergmanis, Tilde (Latvia) |
− | # Lucas | + | # Alexandra Birch, University of Edinburgh (UK) |
− | # | + | # Laurie Burchell, University of Edinburgh (UK) |
+ | # Lucas Charpentier, University of Oslo (Norway) | ||
+ | # Pinzhen (Patrick) Chen, University of Edinburgh (UK) | ||
+ | # Hannah Clausen, University of Oslo (Norway) | ||
+ | # Lucia Domenichelli, University of Pisa (Italy) | ||
# Aleksei Dorkin, University of Tartu (Estonia) | # Aleksei Dorkin, University of Tartu (Estonia) | ||
− | |||
− | |||
# Kenneth Enevoldsen, Aarhus University (Denmark) | # Kenneth Enevoldsen, Aarhus University (Denmark) | ||
+ | # Tita Enstad, National Library (Norway) | ||
# Mariia Fedorova, University of Oslo (Norway) | # Mariia Fedorova, University of Oslo (Norway) | ||
− | # | + | # Yanzhu Guo, INRIA Paris (France) |
− | # | + | # Arzu Burcu Güven, IT University of Copenhagen (Denmark) |
− | # Jan Hajič, Charles University | + | # Barry Haddow, University of Edinburgh (UK) |
− | # | + | # Jan Hajič, Charles University (Czech Republic) |
− | # | + | # Jindřich Helcl, Charles University (Czech Republic) |
− | # | + | # Bertram Højer, IT University of Copenhagen (Denmark) |
− | # | + | # Sekh Mainul Islam, University of Copenhagen (Denmark) |
− | # | + | # Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany) |
− | # | + | # Márton Kardos, Aarhus University (Denmark) |
− | # | + | # Anastasiia Klimashevskaia, University of Bergen (Norway) |
+ | # Mateusz Klimaszewski, The University of Edinburgh (UK) | ||
+ | # Ville Komulainen, University of Turku (Finland) | ||
+ | # Markus Koskela, CSC – IT Center for Science (Finland) | ||
+ | # Martins Kronis, Tilde (Latvia) | ||
+ | # Vimal Kumar Kumar, University of Limerick (Ireland) | ||
# Andrey Kutuzov, University of Oslo (Norway) | # Andrey Kutuzov, University of Oslo (Norway) | ||
− | # | + | # Hengyu Luo, University of Helsinki (Finland) |
− | + | # Farrokh Mehryary, University of Turku (Finland) | |
− | |||
− | |||
− | # | ||
− | |||
− | |||
# Vladislav Mikhailov, University of Oslo (Norway) | # Vladislav Mikhailov, University of Oslo (Norway) | ||
− | # | + | # Andreas Motzfeldt, IT University of Copenhagen (Denmark) |
− | # | + | # Zain Muhammad Mujahid, University of Copenhagen (Denmark) |
− | # | + | # Sebastian Nagel, Common Crawl Foundation (Germany) |
+ | # Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany) | ||
# Stephan Oepen, University of Oslo (Norway) | # Stephan Oepen, University of Oslo (Norway) | ||
− | # | + | # Guilherme Penedo, Hugging Face (France) |
− | # | + | # Irina Proskurina, University of Lyon (France) |
− | # | + | # Taido Purason, University of Tartu (Estonia) |
− | # | + | # Marie Roald, National Library (Norway) |
− | # | + | # Anna Rogers, IT University of Copenhagen (Denmark) |
+ | # Ismaël Rousseau, Orange (France) | ||
# David Samuel, University of Oslo (Norway) | # David Samuel, University of Oslo (Norway) | ||
− | # | + | # Gema Ramírez Sánchez, Prompsit Language Engineering (Spain) |
− | # | + | # Marta Sartor, University of Pisa (Italy) |
− | # | + | # Ipek Baris Schlicht, Universitat Politècnica de València (Spain) |
− | # Étienne Simon, University of Oslo (Norway) | + | # Étienne Simon, University of Oslo (Norway) |
− | # | + | # Pavel Stepachev, The University of Edinburgh (UK) |
− | # | + | # Pedro Ortiz Suarez, Common Crawl Foundation (France) |
− | # | + | # Otto Tarkka, University of Turku (Finland) |
− | # | + | # Kushal Tatariya, KU Leuven (Belgium) |
# Jörg Tiedemann, University of Helsinki (Finland) | # Jörg Tiedemann, University of Helsinki (Finland) | ||
− | # | + | # Samia Touileb, University of Bergen (Norway) |
− | # | + | # Elke Vandermeerschen, KU Leuven (Belgium) |
− | # Tea Vojtěchová, Charles University | + | # Raul Vazquez, University of Helsinki (Finland) |
− | # | + | # Ramón Carreño Villar, University of Oslo (Norway) |
− | # | + | # Fedor Vitiugin, Aalto University (Finland) |
− | + | # Tea Vojtěchová, Charles University (Czech Republic) | |
+ | # Artūrs Znotiņš, IMCS at University of Latvia (Latvia) | ||
+ | # Elaine Zosa, Silo AI (Finland) |
Latest revision as of 09:04, 5 February 2025
HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation
Background
Since 2023, the NLPL network and Horizon Europe project High-Performance Language Technologies (HPLT) have joined forces to organize the successful winter school series on Web-scale NLP. The winter school seeks to stimulate community formation, i.e., strengthening interaction and collaboration among European research teams in NLP and advancing a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. This 2025 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, 2020, 2023, and 2024 NLPL Winter Schools.
For early 2025, HPLT will hold its winter school from Monday, February 3, to Wednesday, February 5, 2025, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize group bus transfer from and to the Oslo airport Gardermoen, leaving the airport at 9:45 on Monday morning and returning there around 17:30 on Wednesday afternoon.
The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will have to cover their own travel and accommodation at Skeikampen, however. Two nights at the hotel, including all meals, will come to NOK 3855 (NOK 3455 per person in a shared double room), to be paid to the hotel directly upon arrival.
Programme
The 2025 winter school will have a thematic focus on Pretraining Data Quality and Multilingual LLM Evaluation. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) by seasoned experts, with special emphasis on open science and European languages, and will also include critical reflections on current development trends in LLM-focussed NLP. The programme will be complemented with a ‘walk-through’ of example experience reports on the shared EuroHPC LUMI supercomputer.
Confirmed presenters and talks include:
- Alexandra Birch, University of Edinburgh
  EuroLLM and FinLLM – stories from the trenches
- Jenia Jitsev and Marianna Nezhurina, Jülich Supercomputing Centre / LAION
  Open Foundation Models: Scaling Laws and Generalization
- Guilherme Penedo, Hugging Face
  FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training
- Gema Ramírez-Sánchez, Prompsit Language Engineering
  A look at Pre-Training Data through the Stats Glass
- Anna Rogers, IT University of Copenhagen
  Large Language Models and Factuality
- Pedro Ortiz Suarez and Sebastian Nagel, Common Crawl
  Data Quality, Language Coverage and Ethical Considerations in Web Crawling
- Ahmet Üstün, Cohere AI
  Recipe for multilingual post-training: How to collect high-quality data and use them?
Schedule
| Monday, February 3, 2025 | | |
|---|---|---|
| 13:00 | 14:00 | Lunch |
| 14:00 | 15:30 | Session 1, Pedro Ortiz Suarez & Sebastian Nagel: Data Quality, Language Coverage and Ethical Considerations in Web Crawling |
| 15:30 | 15:50 | Coffee Break |
| 16:00 | 17:30 | Session 2, Anna Rogers: LLMs and Factuality: facts from LLMs |
| 17:30 | 17:50 | Coffee Break |
| 17:50 | 19:20 | Session 3, Alexandra Birch: EuroLLM and FinLLM – stories from the trenches |
| 19:30 | | Dinner |

| Tuesday, February 4, 2025 | | |
|---|---|---|
| Breakfast is available from 07:30 | | |
| 09:00 | 10:30 | Session 4, Guilherme Penedo: FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training |
| Free time (Lunch is available between 13:00 and 14:30) | | |
| 15:30 | 17:00 | Session 5, Gema Ramírez-Sánchez: Having a look at pretraining data through the stats glass |
| 17:00 | 17:20 | Coffee Break |
| 17:20 | 19:20 | Session 6, Jenia Jitsev & Marianna Nezhurina: Open Foundation Models: Scaling Laws and Generalization |
| 19:30 | | Dinner |
| 21:00 | | Evening Session: Findings from HPLT |

| Wednesday, February 5, 2025 | | |
|---|---|---|
| Breakfast is available from 07:30 | | |
| 08:30 | 10:00 | Session 8, Ahmet Üstün (online): Recipe for multilingual post-training: How to collect high-quality data and use them? |
| 10:00 | 10:30 | Coffee Break |
| 10:30 | 12:00 | Session 9, Anna Rogers: LLMs and Factuality: facts about LLMs |
| 12:30 | 13:30 | Lunch |
| 13:45 | 16:45 | Bus transfer to OSL Airport |
Registration
In total, we welcome 62 participants at the 2025 winter school. The winter school is over-subscribed and no longer accepting registrations. We processed requests for participation on a first-come, first-served basis, with an eye toward regional balance. Interested parties who submitted the registration form were confirmed in three batches, on December 6, December 13, and December 20, which was also the closing date for winter school registration.
Once confirmed by the organizing team, participant names are published on this page, and registration establishes a binding agreement with the hotel. Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute spaces), and no-shows will be charged the full price for at least one night by the hotel.
Logistics
With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:45 CET on Monday, February 3. Thus, please meet up by 9:30 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).
The group will gather near the DNB currency exchange booth in the downstairs arrivals area, just outside the international arrivals luggage claims and slightly to the left as one exits the customs area: the yellow dot numbered (18) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport not long after 9:40. The drive to the Skeikampen conference hotel will take us about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.
The winter school will end with lunch on Wednesday, February 5, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.
Organization
The 2025 Winter School is organized by a team of volunteers at the University of Oslo, supported by a programme committee from the HPLT and NLPL network and beyond; please see below. For all inquiries regarding registration, the programme, logistics, or such, please contact hplt-training@ifi.uio.no.
The programme committee is comprised of:
- Barry Haddow (University of Edinburgh, UK)
- Andrey Kutuzov (University of Oslo, Norway)
- Stephan Oepen (University of Oslo, Norway)
- Sampo Pyysalo (University of Turku, Finland)
- Jörg Tiedemann (University of Helsinki, Finland)
Participants
- Nikolay Arefev, University of Oslo (Norway)
- Maria Barrett, Silo AI (Finland)
- Toms Bergmanis, Tilde (Latvia)
- Alexandra Birch, University of Edinburgh (UK)
- Laurie Burchell, University of Edinburgh (UK)
- Lucas Charpentier, University of Oslo (Norway)
- Pinzhen (Patrick) Chen, University of Edinburgh (UK)
- Hannah Clausen, University of Oslo (Norway)
- Lucia Domenichelli, University of Pisa (Italy)
- Aleksei Dorkin, University of Tartu (Estonia)
- Kenneth Enevoldsen, Aarhus University (Denmark)
- Tita Enstad, National Library (Norway)
- Mariia Fedorova, University of Oslo (Norway)
- Yanzhu Guo, INRIA Paris (France)
- Arzu Burcu Güven, IT University of Copenhagen (Denmark)
- Barry Haddow, University of Edinburgh (UK)
- Jan Hajič, Charles University (Czech Republic)
- Jindřich Helcl, Charles University (Czech Republic)
- Bertram Højer, IT University of Copenhagen (Denmark)
- Sekh Mainul Islam, University of Copenhagen (Denmark)
- Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany)
- Márton Kardos, Aarhus University (Denmark)
- Anastasiia Klimashevskaia, University of Bergen (Norway)
- Mateusz Klimaszewski, The University of Edinburgh (UK)
- Ville Komulainen, University of Turku (Finland)
- Markus Koskela, CSC – IT Center for Science (Finland)
- Martins Kronis, Tilde (Latvia)
- Vimal Kumar Kumar, University of Limerick (Ireland)
- Andrey Kutuzov, University of Oslo (Norway)
- Hengyu Luo, University of Helsinki (Finland)
- Farrokh Mehryary, University of Turku (Finland)
- Vladislav Mikhailov, University of Oslo (Norway)
- Andreas Motzfeldt, IT University of Copenhagen (Denmark)
- Zain Muhammad Mujahid, University of Copenhagen (Denmark)
- Sebastian Nagel, Common Crawl Foundation (Germany)
- Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany)
- Stephan Oepen, University of Oslo (Norway)
- Guilherme Penedo, Hugging Face (France)
- Irina Proskurina, University of Lyon (France)
- Taido Purason, University of Tartu (Estonia)
- Marie Roald, National Library (Norway)
- Anna Rogers, IT University of Copenhagen (Denmark)
- Ismaël Rousseau, Orange (France)
- David Samuel, University of Oslo (Norway)
- Gema Ramírez Sánchez, Prompsit Language Engineering (Spain)
- Marta Sartor, University of Pisa (Italy)
- Ipek Baris Schlicht, Universitat Politècnica de València (Spain)
- Étienne Simon, University of Oslo (Norway)
- Pavel Stepachev, The University of Edinburgh (UK)
- Pedro Ortiz Suarez, Common Crawl Foundation (France)
- Otto Tarkka, University of Turku (Finland)
- Kushal Tatariya, KU Leuven (Belgium)
- Jörg Tiedemann, University of Helsinki (Finland)
- Samia Touileb, University of Bergen (Norway)
- Elke Vandermeerschen, KU Leuven (Belgium)
- Raul Vazquez, University of Helsinki (Finland)
- Ramón Carreño Villar, University of Oslo (Norway)
- Fedor Vitiugin, Aalto University (Finland)
- Tea Vojtěchová, Charles University (Czech Republic)
- Artūrs Znotiņš, IMCS at University of Latvia (Latvia)
- Elaine Zosa, Silo AI (Finland)