HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation

[Image: HPLT and NLPL Winter School 2024.jpg]

Background

Since 2023, the NLPL network and the Horizon Europe project High-Performance Language Technologies (HPLT) have joined forces to organize the successful winter school series on Web-scale NLP. The winter school seeks to stimulate community formation, i.e., strengthening interaction and collaboration among European research teams in NLP and advancing a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. This 2025 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, 2020, 2023, and 2024 NLPL Winter Schools.

For early 2025, HPLT will hold its winter school from Monday, February 3, to Wednesday, February 5, 2025, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize group bus transfer from and to the Oslo airport Gardermoen, leaving the airport at 9:45 on Monday morning and returning there around 17:30 on Wednesday afternoon.

The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will have to cover their own travel and accommodation at Skeikampen, however. Two nights at the hotel, including all meals, will come to NOK 3855 (NOK 3455 per person in a shared double room), to be paid to the hotel directly upon arrival.

Programme

The 2025 winter school will have a thematic focus on Pretraining Data Quality and Multilingual LLM Evaluation. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) by seasoned experts, with special emphasis on open science and European languages, but will also include critical reflections on current development trends in LLM-focussed NLP. The programme will be complemented with a ‘walk-through’ of example experience reports on the shared EuroHPC LUMI supercomputer.

Confirmed presenters and talks include:

  • Alexandra Birch, University of Edinburgh
    EuroLLM and FinLLM – stories from the trenches
  • Jenia Jitsev and Marianna Nezhurina, Jülich Supercomputing Centre / LAION
    Open Foundation Models: Scaling Laws and Generalization
  • Guilherme Penedo, Hugging Face
    FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training
  • Gema Ramírez-Sánchez, Prompsit Language Engineering
    A look at Pre-Training Data through the Stats Glass
  • Anna Rogers, IT University of Copenhagen
    Large Language Models and Factuality
  • Pedro Ortiz Suarez and Sebastian Nagel, Common Crawl
    Data Quality, Language Coverage and Ethical Considerations in Web Crawling
  • Ahmet Üstün, Cohere AI
    Recipe for multilingual post-training: How to collect high-quality data and use them?

Schedule

Monday, February 3, 2025
13:00 14:00 Lunch
14:00 15:30 Session 1 Pedro Ortiz Suarez & Sebastian Nagel

Data Quality, Language Coverage and Ethical Considerations in Web Crawling
Common Crawl is a free, open repository of web crawl data, collected since 2008, that can be used by anyone. Throughout the years the foundation has focused on achieving a diverse and representative sample of web sites while operating an efficient and polite crawler. In recent years, with the advent of LLMs and multimodal models, the interest in obtaining large amounts of high-quality data has skyrocketed, while also raising concerns about the ethical considerations of large-scale data curation. After a quick introduction to the history of the Common Crawl Foundation, we present our recent efforts to respond to these new data requirements while also expanding the language and cultural coverage of our dataset, and addressing the practical and ethical questions that have arisen around web crawling in the era of LLMs.

Slides: https://data.hplt-project.org/transfer/commoncrawl_2025.pdf (see the illustrative sketch below, after the Monday schedule)

15:30 15:50 Coffee Break
16:00 17:30 Session 2 Anna Rogers

LLMs and Factuality: facts from LLMs
This lecture focuses on the workflows for using LLMs as information sources, the types of problems that may result from that, and the main current mitigation strategies: retrieval-augmented generation (RAG) and chain-of-thought (CoT) prompting. Finally, I will discuss the problem of detecting generated texts, and the impact of LLMs on the information ecosphere and content economy.

Slides: https://data.hplt-project.org/transfer/nlpl_rogers_pt1.pdf (see the illustrative sketch below, after the Monday schedule)

17:30 17:50 Coffee Break
17:50 19:20 Session 3 Alexandra Birch

EuroLLM and FinLLM – stories from the trenches
In this talk, we share our experiences building two large language models: EuroLLM, a multilingual model designed to serve the diverse linguistic and cultural landscape of Europe, and FinLLM, a financial LLM tailored for the UK’s highly specialized finance industry, built with our partners Aveni.ai, Lloyds, and Nationwide. We will discuss the challenges of curating high-quality training data, including data mixes, cleaning pipelines, and training recipes, as well as the challenge of creating meaningful benchmarks.

Slides: https://data.hplt-project.org/transfer/2025-02-EuroLLM_and_FinLLM_Birch.pdf

19:30 Dinner
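
Two of the Monday talks above lend themselves to small illustrations. First, the Common Crawl session mentions operating an efficient and polite crawler; the following sketch is not Common Crawl's actual code, but shows the basic courtesy steps of consulting a site's robots.txt and honouring any requested crawl delay, using only the Python standard library. The URL and the user-agent string are placeholders.

  import time
  import urllib.robotparser
  from urllib.request import Request, urlopen

  USER_AGENT = "example-research-crawler/0.1"  # placeholder identity, not a real crawler

  def polite_fetch(url, default_delay=1.0):
      """Fetch a URL only if robots.txt allows it, waiting out any requested crawl delay."""
      robots_url = "/".join(url.split("/")[:3]) + "/robots.txt"
      rp = urllib.robotparser.RobotFileParser()
      rp.set_url(robots_url)
      rp.read()                                  # download and parse robots.txt
      if not rp.can_fetch(USER_AGENT, url):
          return None                            # the site disallows this path
      time.sleep(rp.crawl_delay(USER_AGENT) or default_delay)   # be polite
      request = Request(url, headers={"User-Agent": USER_AGENT})
      with urlopen(request, timeout=30) as response:
          return response.read()

  # html = polite_fetch("https://example.org/")  # a real crawl also needs queueing and retries

Second, the factuality lecture names retrieval-augmented generation (RAG) as a mitigation strategy. Below is a minimal sketch of that workflow, not the setup from the lecture, assuming a plain list of text passages and a placeholder generate() call standing in for an actual LLM API.

  from collections import Counter

  def overlap(question, passage):
      """Crude lexical-overlap score between a question and a candidate passage."""
      q, p = Counter(question.lower().split()), Counter(passage.lower().split())
      return sum((q & p).values())

  def build_rag_prompt(question, passages, k=3):
      """Select the k best-matching passages and pack them into a grounded prompt."""
      top = sorted(passages, key=lambda p: overlap(question, p), reverse=True)[:k]
      context = "\n\n".join(top)
      return ("Answer the question using only the context below. "
              "If the context is insufficient, say so.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

  # prompt = build_rag_prompt("When did Common Crawl start crawling?", corpus_passages)
  # answer = generate(prompt)   # 'generate' is a stand-in for whatever LLM API is used
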
Tuesday, February 4, 2025
Breakfast is available from 07:30
09:00 10:30 Session 4 Guilherme Penedo

FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training
FineWeb2 is a recent multilingual web-based dataset for large language model (LLM) pretraining that produces better-performing LLMs than other popular datasets. In this talk, we discuss in depth the many challenges involved in adapting processing pipelines commonly used for English data to over 1000 languages, including evaluation task selection for ablation experiments, language identification, filtering, and deduplication.

Slides: https://data.hplt-project.org/transfer/FineWeb2_90min.pdf (see the illustrative sketch below, after the Tuesday schedule)

Free time (Lunch is available between 13:00 and 14:30)
15:30 17:00 Session 5 Gema Ramírez-Sánchez

Having a look at pretraining data through the stats glass
At the moment of speaking, zillions of tokens of pretraining data are being collected and curated to train LLMs by several initiatives, all aiming to gather the best set to get the best model performance. These curated datasets are huge and in many cases multilingual, which makes even the smallest evaluation task an enormous undertaking. But we can always ask the stats for help, and the data will confess. In this session we will look at several pretraining (textual) datasets through the stats glass and see together what ups and downs they reveal.

Slides: https://data.hplt-project.org/transfer/Gema-Ramírez-HPLT-Winter-School-2025.pdf (see the illustrative sketch below, after the Tuesday schedule)

17:00 17:20 Coffee Break
17:20 19:20 Session 6 Jenia Jitsev & Marianna Nezhurina

Open Foundation Models: Scaling Laws and Generalization

Slides 1: https://data.hplt-project.org/transfer/Open_Foundation_Models_Scaling_Laws-pre_final_2024.pdf
Slides 2: https://data.hplt-project.org/transfer/Pitfalls_in_measuring_generalization.pdf

19:30 Dinner
21:00 Evening Session: Findings from HPLT
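
Two of the Tuesday sessions above also invite small illustrations. The FineWeb2 talk names language identification, filtering, and deduplication as core pipeline steps; the sketch below is not the FineWeb2 pipeline itself, detect_language is a placeholder for a real language identifier (for example a fastText model), and the filter threshold is invented.

  import hashlib

  def detect_language(text):
      """Placeholder: a real pipeline would call a trained language identifier here."""
      return "und"  # 'undetermined'

  def keep_document(text, language, min_words=50):
      """Toy quality filter: keep documents that are long enough and in the target language."""
      if len(text.split()) < min_words:
          return False
      return detect_language(text) == language

  def deduplicate(documents):
      """Drop exact duplicates by hashing normalized text (real pipelines also do fuzzy dedup)."""
      seen, unique = set(), []
      for doc in documents:
          digest = hashlib.md5(" ".join(doc.split()).encode("utf-8")).hexdigest()
          if digest not in seen:
              seen.add(digest)
              unique.append(doc)
      return unique

  # corpus = deduplicate([d for d in raw_documents if keep_document(d, "fin")])

For the session on looking at pretraining data through the stats glass, here is a minimal sketch, not the speaker's tooling, of the kind of descriptive statistics one might start from; whitespace tokenisation is an obvious simplification.

  import statistics
  from collections import Counter

  def corpus_stats(documents):
      """Simple descriptive statistics over a list of text documents."""
      lengths = [len(doc.split()) for doc in documents]
      vocab = Counter(token for doc in documents for token in doc.split())
      total_tokens = sum(lengths)
      return {
          "documents": len(documents),
          "tokens": total_tokens,
          "median_doc_length": statistics.median(lengths) if lengths else 0,
          "longest_doc": max(lengths, default=0),
          "type_token_ratio": len(vocab) / total_tokens if total_tokens else 0.0,
          "top_tokens": vocab.most_common(10),
      }

  # print(corpus_stats(["a small example document", "another tiny example document"]))
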


Wednesday, February 5, 2025
Breakfast is available from 07:30
08:30 10:00 Session 8 Ahmet Üstün (online)

Recipe for multilingual post-training: How to collect high-quality data and use them?
Post-training is a crucial step for building state-of-the-art LLMs and aligning them with human preferences. Although many public post-training datasets are available, they are predominantly curated for English, and multilingual datasets are extremely scarce. This lecture will cover methods for collecting high-quality post-training datasets, such as human annotation, multilingual templates, and synthetic data generation. We will also complement these data-collection methods with post-training recipes from Aya-101, Aya-23, and the recently released Aya Expanse models, showing how to best leverage the curated data.

Slides: https://data.hplt-project.org/transfer/HPLT_Winter_School_Aya.pdf (see the illustrative sketch below, after the Wednesday schedule)

10:00 10:30 Coffee Break
10:30 12:00 Session 9 Anna Rogers

LLMs and Factuality: facts about LLMs
This lecture critically examines a set of common claims about modern LLMs, including claims of high performance, robustness, general-purpose technology status, and "emergent properties". I will also re-examine the "bitter lesson" as applied to LLMs, and its implications for the future of the field.

Slides: https://data.hplt-project.org/transfer/nlpl_rogers_pt2.pdf

12:30 13:30 Lunch
13:45 16:45 Bus transfer to OSL Airport
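
The post-training lecture above mentions multilingual templates as one way to collect post-training data. Below is a toy sketch of that idea, not the Aya recipe; the two templates and the example record are invented for illustration. The pattern is simply to wrap existing labelled examples in per-language instruction templates to obtain prompt/completion pairs for supervised post-training.

  # Invented per-language instruction templates (not Aya's actual templates).
  TEMPLATES = {
      "nob": "Oppsummer følgende tekst:\n{text}",
      "fin": "Tiivistä seuraava teksti:\n{text}",
  }

  def to_post_training_example(record, language):
      """Turn a (text, summary) record into a prompt/completion pair via a language template."""
      return {
          "prompt": TEMPLATES[language].format(text=record["text"]),
          "completion": record["summary"],
          "language": language,
      }

  # example = to_post_training_example(
  #     {"text": "example source text", "summary": "example summary"}, language="nob")
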

Registration

In total, we welcome 62 participants at the 2025 winter school. The winter school is over-subscribed and no longer accepting registrations. We processed requests for participation on a first-come, first-served basis, with an eye toward regional balance. Interested parties who submitted the registration form were confirmed in three batches, on December 6, December 13, and December 20, which was also the closing date for winter school registration.

Once confirmed by the organizing team, participant names are published on this page, and registration establishes a binding agreement with the hotel. Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute spaces), and no-shows will be charged the full price for at least one night by the hotel.

Logistics

With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:45 CET on Monday, February 3. Thus, please meet up by 9:30 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the DNB currency exchange booth in the downstairs arrivals area, just outside the international arrivals luggage claims and slightly to the left as one exits the customs area: the yellow dot numbered (18) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport not long after 9:40. The drive to the Skeikampen conference hotel will take us about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 5, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.

Organization

The 2025 Winter School is organized by a team of volunteers at the University of Oslo, supported by a programme committee from the HPLT and NLPL network and beyond (see below). For all inquiries regarding registration, the programme, logistics, or such, please contact hplt-training@ifi.uio.no.

The programme committee comprises:

  • Barry Haddow (University of Edinburgh, UK)
  • Andrey Kutuzov (University of Oslo, Norway)
  • Stephan Oepen (University of Oslo, Norway)
  • Sampo Pyysalo (University of Turku, Finland)
  • Jörg Tiedemann (University of Helsinki, Finland)

Participants

  1. Nikolay Arefev, University of Oslo (Norway)
  2. Maria Barrett, Silo AI (Finland)
  3. Toms Bergmanis, Tilde (Latvia)
  4. Alexandra Birch, University of Edinburgh (UK)
  5. Laurie Burchell, University of Edinburgh (UK)
  6. Lucas Charpentier, University of Oslo (Norway)
  7. Pinzhen (Patrick) Chen, University of Edinburgh (UK)
  8. Hannah Clausen, University of Oslo (Norway)
  9. Lucia Domenichelli, University of Pisa (Italy)
  10. Aleksei Dorkin, University of Tartu (Estonia)
  11. Kenneth Enevoldsen, Aarhus University (Denmark)
  12. Tita Enstad, National Library (Norway)
  13. Mariia Fedorova, University of Oslo (Norway)
  14. Yanzhu Guo, INRIA Paris (France)
  15. Arzu Burcu Güven, IT University of Copenhagen (Denmark)
  16. Barry Haddow, University of Edinburgh (UK)
  17. Jan Hajič, Charles University (Czech Republic)
  18. Jindřich Helcl, Charles University (Czech Republic)
  19. Bertram Højer, IT University of Copenhagen (Denmark)
  20. Sekh Mainul Islam, University of Copenhagen (Denmark)
  21. Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany)
  22. Márton Kardos, Aarhus University (Denmark)
  23. Anastasiia Klimashevskaia, University of Bergen (Norway)
  24. Mateusz Klimaszewski, The University of Edinburgh (UK)
  25. Ville Komulainen, University of Turku (Finland)
  26. Markus Koskela, CSC – IT Center for Science (Finland)
  27. Martins Kronis, Tilde (Latvia)
  28. Vimal Kumar Kumar, University of Limerick (Ireland)
  29. Andrey Kutuzov, University of Oslo (Norway)
  30. Hengyu Luo, University of Helsinki (Finland)
  31. Farrokh Mehryary, University of Turku (Finland)
  32. Vladislav Mikhailov, University of Oslo (Norway)
  33. Andreas Motzfeldt, IT University of Copenhagen (Denmark)
  34. Zain Muhammad Mujahid, University of Copenhagen (Denmark)
  35. Sebastian Nagel, Common Crawl Foundation (Germany)
  36. Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany)
  37. Stephan Oepen, University of Oslo (Norway)
  38. Guilherme Penedo, Hugging Face (France)
  39. Irina Proskurina, University of Lyon (France)
  40. Taido Purason, University of Tartu (Estonia)
  41. Marie Roald, National Library (Norway)
  42. Anna Rogers, IT University of Copenhagen (Denmark)
  43. Ismaël Rousseau, Orange (France)
  44. David Samuel, University of Oslo (Norway)
  45. Gema Ramírez-Sánchez, Prompsit Language Engineering (Spain)
  46. Marta Sartor, University of Pisa (Italy)
  47. Ipek Baris Schlicht, Universitat Politècnica de València (Spain)
  48. Étienne Simon, University of Oslo (Norway)
  49. Pavel Stepachev, The University of Edinburgh (UK)
  50. Pedro Ortiz Suarez, Common Crawl Foundation (France)
  51. Otto Tarkka, University of Turku (Finland)
  52. Kushal Tatariya, KU Leuven (Belgium)
  53. Jörg Tiedemann, University of Helsinki (Finland)
  54. Samia Touileb, University of Bergen (Norway)
  55. Elke Vandermeerschen, KU Leuven (Belgium)
  56. Raul Vazquez, University of Helsinki (Finland)
  57. Ramón Carreño Villar, University of Oslo (Norway)
  58. Fedor Vitiugin, Aalto University (Finland)
  59. Tea Vojtěchová, Charles University (Czech Republic)
  60. Artūrs Znotiņš, IMCS at University of Latvia (Latvia)
  61. Elaine Zosa, Silo AI (Finland)