'''HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation'''
[[File:HPLT and NLPL Winter School 2024.jpg|center|thumb|upright=2.0]]

= Background =

Since 2023, the NLPL network and the Horizon Europe
project ''[https://hplt-project.org High-Performance Language Technologies]'' (HPLT)
have joined forces to organize the successful winter school series on web-scale NLP.
The winter school seeks to stimulate ''community formation'',
i.e. strengthening interaction and collaboration among
European research teams in NLP and advancing a shared level of knowledge
and experience in using high-performance e-infrastructures for large-scale
NLP research.
The 2025 edition of the winter school puts special emphasis on
NLP researchers from countries that participate in the EuroHPC
[https://www.lumi-supercomputer.eu/lumi-consortium/ LUMI consortium].
 
For additional background, please see the archival pages from the
[https://wiki.nlpl.eu/index.php/Community/training/2018 2018],
[https://wiki.nlpl.eu/index.php/Community/training/2019 2019],
[https://wiki.nlpl.eu/index.php/Community/training/2020 2020],
[https://wiki.nlpl.eu/index.php/Community/training/2023 2023], and
[https://wiki.nlpl.eu/index.php/Community/training/2024 2024]
NLPL Winter Schools.
  
For early 2025, HPLT will hold its winter school from Monday, February 3, to
Wednesday, February 5, 2025, at a
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel]
(with skiing and walking opportunities) about two hours north of Oslo.
The project will organize group bus transfer from and to the Oslo
airport ''Gardermoen'', leaving the airport at 9:45 on Monday morning
and returning there around 17:30 on Wednesday afternoon.
  
The winter school is subsidized by the HPLT project: there is no fee for
participants and no charge for the bus transfer to and from the
conference hotel.
All participants will have to cover their own travel and accommodation
at Skeikampen, however.
Two nights at the hotel, including all meals, will come to NOK 3855
(NOK 3455 per person in a shared double room),
to be paid to the hotel directly upon arrival.

= Programme =

The 2025 winter school will have a thematic focus on ''Pretraining Data Quality and Multilingual LLM Evaluation''.
The programme will comprise in-depth technical presentations (possibly including some
hands-on elements) by seasoned experts, with special emphasis on open science and European languages,
but will also include critical reflections on current development trends in LLM-focussed NLP.
The programme will be complemented with a ‘walk-through’ of example experience
reports on the shared EuroHPC LUMI supercomputer.

Confirmed presenters and talks include:

* [https://sites.google.com/view/alexandra-birch Alexandra Birch], University of Edinburgh<br>'''EuroLLM and FinLLM – stories from the trenches'''
* [https://laion.ai/team/ Jenia Jitsev] and [https://laion.ai/team/ Marianna Nezhurina], Jülich Supercomputing Centre / LAION<br>'''Open Foundation Models: Scaling Laws and Generalization'''
* [https://huggingface.co/guipenedo Guilherme Penedo], Hugging Face<br>'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training'''
* [https://scholar.google.com/citations?user=f5FSgPwAAAAJ&hl=en Gema Ramírez-Sánchez], Prompsit Language Engineering<br>'''A Look at Pre-Training Data through the Stats Glass'''
* [https://annargrs.github.io Anna Rogers], IT University of Copenhagen<br>'''Large Language Models and Factuality'''
* [https://portizs.eu Pedro Ortiz Suarez] and [https://commoncrawl.org/team/sebastian-nagel-engineer Sebastian Nagel], Common Crawl<br>'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling'''
* [https://scholar.google.com.tr/citations?user=fvotcRIAAAAJ&hl=tr Ahmet Üstün], Cohere AI<br>'''Recipe for multilingual post-training: How to collect high-quality data and use them?'''

= Schedule =

{| class="wikitable"
|-
!colspan=3|Monday, February 3, 2025
|-
| 13:00 || 14:00 || Lunch
|-
| 14:00 || 15:30 || '''Session 1''' Pedro Ortiz Suarez & Sebastian Nagel <p class="mw-collapsible mw-collapsed">'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling'''<br>
Common Crawl is a free, open repository of web crawl data that can be used by anyone, collected since 2008. Throughout the years, the foundation has focused on balancing a diverse and representative sample of web sites with operating an efficient and polite crawler. In recent years, with the advent of LLMs and multimodal models, interest in obtaining large amounts of high-quality data has skyrocketed, while also raising concerns about the ethics of large-scale data curation. After a quick introduction to the history of the Common Crawl Foundation, we present our recent efforts to respond to these new data requirements while also expanding the language and cultural coverage of our dataset, and addressing the practical and ethical questions that have arisen around web crawling in the era of LLMs.</p>
[https://data.hplt-project.org/transfer/commoncrawl_2025.pdf Slides]
|-
| 15:30 || 15:50 || Coffee Break
|-
| 16:00 || 17:30 || '''Session 2''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts from LLMs'''<br>
This lecture focuses on the workflows for using LLMs as information sources, the types of problems that may result from that, and the main current mitigation strategies (RAG and CoT). Finally, I will discuss the problem of detecting generated texts, and the impact of LLMs on the information ecosphere and content economy.</p>
[https://data.hplt-project.org/transfer/nlpl_rogers_pt1.pdf Slides]
|-
| 17:30 || 17:50 || Coffee Break
|-
| 17:50 || 19:20 || '''Session 3''' Alexandra Birch <p class="mw-collapsible mw-collapsed">'''EuroLLM and FinLLM – stories from the trenches'''<br>
In this talk, we share our experiences building two large language models: EuroLLM, a multilingual model designed to serve the diverse linguistic and cultural landscape of Europe, and FinLLM, a financial LLM tailored for the UK’s highly specialized finance industry with our partners Aveni.ai, Lloyds, and Nationwide. We will discuss the challenges of curating high-quality training data (data mixes, cleaning pipelines, training recipes) and of creating meaningful benchmarks.</p>
[https://data.hplt-project.org/transfer/2025-02-EuroLLM_and_FinLLM_Birch.pdf Slides]
|-
| 19:30 || || Dinner
|}

{| class="wikitable"
|-
!colspan=3|Tuesday, February 4, 2025
|-
|colspan=3| Breakfast is available from 07:30
|-
| 09:00 || 10:30 || '''Session 4''' Guilherme Penedo <p class="mw-collapsible mw-collapsed">'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training'''<br>FineWeb2 is a recent multilingual web-based dataset for large language model (LLM) pretraining that produces better-performing LLMs than other popular datasets. In this talk, we discuss in depth the many challenges involved in adapting processing pipelines commonly used for English data to over 1000 languages, including evaluation task selection for ablation experiments, language identification, filtering, and deduplication.</p>
[https://data.hplt-project.org/transfer/FineWeb2_90min.pdf Slides]
|-
|colspan=3| Free time (lunch is available between 13:00 and 14:30)
|-
| 15:30 || 17:00 || '''Session 5''' Gema Ramírez-Sánchez <p class="mw-collapsible mw-collapsed">'''Having a look at pretraining data through the stats glass'''<br>As we speak, zillions of tokens of pretraining data are being collected and curated to train LLMs by several initiatives, all aiming to gather the best set to get the best model performance. These curated datasets are huge and in many cases multilingual, making even the smallest evaluation an enormous task. But we can always ask the stats for help, and the data will confess. In this session we will look at several pretraining (textual) datasets through the stats glass and see together what ups and downs it reveals.</p>
[https://data.hplt-project.org/transfer/Gema-Ramírez-HPLT-Winter-School-2025.pdf Slides]
|-
| 17:00 || 17:20 || Coffee Break
|-
| 17:20 || 19:20 || '''Session 6''' Jenia Jitsev & Marianna Nezhurina <p class="mw-collapsible mw-collapsed">'''Open Foundation Models: Scaling Laws and Generalization'''</p>
[https://data.hplt-project.org/transfer/Open_Foundation_Models_Scaling_Laws-pre_final_2024.pdf Slides 1]<br>
[https://data.hplt-project.org/transfer/Pitfalls_in_measuring_generalization.pdf Slides 2]
|-
| 19:30 || || Dinner
|-
| 21:00 || || '''Evening Session: Findings from HPLT'''
|}

{| class="wikitable"
|-
!colspan=3|Wednesday, February 5, 2025
|-
|colspan=3| Breakfast is available from 07:30
|-
| 08:30 || 10:00 || '''Session 8''' Ahmet Üstün (online) <p class="mw-collapsible mw-collapsed">'''Recipe for multilingual post-training: How to collect high-quality data and use them?'''<br>Post-training is a crucial step for building state-of-the-art LLMs and aligning them with human preferences. Although many public post-training datasets are available, they are predominantly curated for English, and multilingual datasets are extremely scarce. This lecture will cover methods for collecting high-quality post-training datasets, such as human annotation, multilingual templates, and synthetic data generation. We will complement these data-collection methods with post-training recipes from the Aya-101, Aya-23, and recently released Aya Expanse models, showing how to best leverage the curated data.</p>
[https://data.hplt-project.org/transfer/HPLT_Winter_School_Aya.pdf Slides]
|-
| 10:00 || 10:30 || Coffee Break
|-
| 10:30 || 12:00 || '''Session 9''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts about LLMs'''<br>
This lecture critically examines a set of common claims about modern LLMs, including claims of their high performance, robustness, general-purpose technology status, and "emergent properties". I will also re-examine the "bitter lesson" as applied to LLMs, and its implications for the future of the field.</p>
[https://data.hplt-project.org/transfer/nlpl_rogers_pt2.pdf Slides]
|-
| 12:30 || 13:30 || Lunch
|-
| 13:45 || 16:45 || Bus transfer to OSL Airport
|}
  
 
= Registration =

In total, this year we welcome 62 participants at the 2025 winter school.
The winter school is [https://nettskjema.no/a/381438 over-subscribed] and no longer accepting registrations.
We have processed requests for participation on a first-come, first-served basis, with an eye toward regional balance.
Interested parties who submitted the registration form were confirmed in three batches, on '''December 6''', '''December 13''',
and '''December 20''', which was also the closing date for winter school registration.
Once confirmed by the organizing team, participant names are published
on this page, and registration establishes a
''binding agreement'' with the hotel.
Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute
spaces), and no-shows will be charged the full price for at least one night
by the hotel.
  
= Logistics =

With a few exceptions, winter school participants travel to and from the conference hotel
jointly on a chartered bus (the HPLT shuttle).
The bus will leave OSL airport no later than 9:45 CET on Monday, February 3.
Thus, please meet up by 9:30 and make your arrival known to your assigned
‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the DNB currency exchange booth in the downstairs
arrivals area, just outside the international arrivals luggage claim and slightly
to the left as one exits the customs area:
the yellow dot numbered (18) on the
[https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map].
The group will then walk over to the bus terminal, to leave the airport not long after 9:40.
The drive to the Skeikampen conference hotel will take about three hours, and the bus
will make one stop along the way to stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 5, before the group returns
to OSL airport on the HPLT shuttle.
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL
around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.

= Organization =

The 2025 Winter School is organized by a team of volunteers at the University
of Oslo, supported by a programme committee from the HPLT and NLPL networks and beyond;
please see below.
 
For all inquiries regarding registration, the programme, logistics,
or such, please contact <code>hplt-training@ifi.uio.no</code>.
 
 
The programme committee comprises:

* Barry Haddow (University of Edinburgh, UK)
* Andrey Kutuzov (University of Oslo, Norway)
* Stephan Oepen (University of Oslo, Norway)
* Sampo Pyysalo (University of Turku, Finland)
* Jörg Tiedemann (University of Helsinki, Finland)
  
 
= Participants =

# Nikolay Arefev, University of Oslo (Norway)
# Maria Barrett, Silo AI (Finland)
# Toms Bergmanis, Tilde (Latvia)
# Alexandra Birch, University of Edinburgh (UK)
# Laurie Burchell, University of Edinburgh (UK)
# Lucas Charpentie, University of Oslo (Norway)
# Pinzhen (Patrick) Chen, University of Edinburgh (UK)
# Hannah Clausen, University of Oslo (Norway)
# Lucia Domenichelli, University of Pisa (Italy)
# Aleksei Dorkin, University of Tartu (Estonia)
# Kenneth Enevoldsen, Aarhus University (Denmark)
# Tita Enstad, National Library (Norway)
# Mariia Fedorova, University of Oslo (Norway)
# Yanzhu Guo, INRIA Paris (France)
# Arzu Burcu Güven, IT University of Copenhagen (Denmark)
# Barry Haddow, University of Edinburgh (UK)
# Jan Hajič, Charles University (Czech Republic)
# Jindřich Helcl, Charles University (Czech Republic)
# Bertram Højer, IT University of Copenhagen (Denmark)
# Sekh Mainul Islam, University of Copenhagen (Denmark)
# Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany)
# Márton Kardos, Aarhus University (Denmark)
# Anastasiia Klimashevskaia, University of Bergen (Norway)
# Mateusz Klimaszewski, University of Edinburgh (UK)
# Ville Komulainen, University of Turku (Finland)
# Markus Koskela, CSC – IT Center for Science (Finland)
# Martins Kronis, Tilde (Latvia)
# Vimal Kumar Kumar, University of Limerick (Ireland)
# Andrey Kutuzov, University of Oslo (Norway)
# Hengyu Luo, University of Helsinki (Finland)
# Farrokh Mehryary, University of Turku (Finland)
# Vladislav Mikhailov, University of Oslo (Norway)
# Andreas Motzfeldt, IT University of Copenhagen (Denmark)
# Zain Muhammad Mujahid, University of Copenhagen (Denmark)
# Sebastian Nagel, Common Crawl Foundation (Germany)
# Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany)
# Stephan Oepen, University of Oslo (Norway)
# Guilherme Penedo, Hugging Face (France)
# Irina Proskurina, University of Lyon (France)
# Taido Purason, University of Tartu (Estonia)
# Marie Roald, National Library (Norway)
# Anna Rogers, IT University of Copenhagen (Denmark)
# Ismaël Rousseau, Orange (France)
# David Samuel, University of Oslo (Norway)
# Gema Ramírez Sánchez, Prompsit Language Engineering (Spain)
# Marta Sartor, University of Pisa (Italy)
# Ipek Baris Schlicht, Universitat Politècnica de València (Spain)
# Étienne Simon, University of Oslo (Norway)
# Pavel Stepachev, University of Edinburgh (UK)
# Pedro Ortiz Suarez, Common Crawl Foundation (France)
# Otto Tarkka, University of Turku (Finland)
# Kushal Tatariya, KU Leuven (Belgium)
# Jörg Tiedemann, University of Helsinki (Finland)
# Samia Touileb, University of Bergen (Norway)
# Elke Vandermeerschen, KU Leuven (Belgium)
# Raul Vazquez, University of Helsinki (Finland)
# Ramón Carreño Villar, University of Oslo (Norway)
# Fedor Vitiugin, Aalto University (Finland)
# Tea Vojtěchová, Charles University (Czech Republic)
# Artūrs Znotiņš, IMCS at University of Latvia (Latvia)
# Elaine Zosa, Silo AI (Finland)
