'''HPLT & NLPL 2025 Winter School on Pretraining Data Quality and Multilingual LLM Evaluation'''
  
[[File:HPLT and NLPL Winter School 2024.jpg|center|thumb|upright=2.0]]
  
 
= Background =
  
Since 2023, the NLPL network and the Horizon Europe
project ''[https://hplt-project.org High-Performance Language Technologies]'' (HPLT)
have joined forces to organize the successful winter school series on Web-scale NLP.
 
The winter school seeks to stimulate ''community formation'',
i.e. strengthening interaction and collaboration among
European research teams in NLP, and advancing a shared level of knowledge
and experience in using high-performance e-infrastructures for large-scale
NLP research.
The 2025 edition of the winter school puts special emphasis on
NLP researchers from countries that participate in the EuroHPC
[https://www.lumi-supercomputer.eu/lumi-consortium/ LUMI consortium].
 
For additional background, please see the archival pages from the
[https://wiki.nlpl.eu/index.php/Community/training/2018 2018],
[https://wiki.nlpl.eu/index.php/Community/training/2019 2019],
[https://wiki.nlpl.eu/index.php/Community/training/2020 2020],
[https://wiki.nlpl.eu/index.php/Community/training/2023 2023], and
[https://wiki.nlpl.eu/index.php/Community/training/2024 2024]
NLPL Winter Schools.
  
In early 2025, HPLT will hold its winter school from Monday, February 3, to
Wednesday, February 5, 2025, at a
[https://www.thonhotels.com/our-hotels/norway/skeikampen/ mountain-side hotel]
(with skiing and walking opportunities) about two hours north of Oslo.
The project will organize a group bus transfer from and to the Oslo
airport ''Gardermoen'', leaving the airport at 9:45 on Monday morning
and returning there around 17:30 on Wednesday afternoon.
  
The winter school is subsidized by the HPLT project: there is no fee for
participants and no charge for the bus transfer to and from the
conference hotel.
All participants will have to cover their own travel and accommodation
at Skeikampen, however.
Two nights at the hotel, including all meals, will come to NOK 3855 (NOK 3455 per person in a shared double room),
to be paid to the hotel directly upon arrival.
  
 
= Programme =
  
The 2025 winter school will have a thematic focus on ''Pretraining Data Quality and Multilingual LLM Evaluation''.
The programme will comprise in-depth technical presentations (possibly including some
hands-on elements) by seasoned experts, with special emphasis on open science and European languages,
but will also include critical reflections on current development trends in LLM-focussed NLP.
The programme will be complemented with a ‘walk-through’ of example experience
reports on the shared EuroHPC LUMI supercomputer.
  
Confirmed presenters and talks include:
  
* [https://sites.google.com/view/alexandra-birch Alexandra Birch], University of Edinburgh<br>'''EuroLLM and FinLLM – stories from the trenches'''
* [https://laion.ai/team/ Jenia Jitsev] and [https://laion.ai/team/ Marianna Nezhurina], Jülich Supercomputing Centre / LAION<br>'''Open Foundation Models: Scaling Laws and Generalization'''
* [https://huggingface.co/guipenedo Guilherme Penedo], Hugging Face<br>'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training'''
* [https://scholar.google.com/citations?user=f5FSgPwAAAAJ&hl=en Gema Ramírez-Sánchez], Prompsit Language Engineering<br>'''A look at Pre-Training Data through the Stats Glass'''
* [https://annargrs.github.io Anna Rogers], IT University of Copenhagen<br>'''Large Language Models and Factuality'''
* [https://portizs.eu Pedro Ortiz Suarez] and [https://commoncrawl.org/team/sebastian-nagel-engineer Sebastian Nagel], Common Crawl<br>'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling'''
* [https://scholar.google.com.tr/citations?user=fvotcRIAAAAJ&hl=tr Ahmet Üstün], Cohere AI<br>'''Recipe for multilingual post-training: How to collect high-quality data and use them?'''
 
  
= Schedule =

{| class="wikitable"
|-
!colspan=3|Monday, February 3, 2025
 
|-
| 13:00 || 14:00 || Lunch
|-
| 14:00 || 15:30 || '''Session 1''' Pedro Ortiz Suarez & Sebastian Nagel <p class="mw-collapsible mw-collapsed">'''Data Quality, Language Coverage and Ethical Considerations in Web Crawling'''<br>
Common Crawl is a free, open repository of web crawl data, collected since 2008, that can be used by anyone. Over the years, the foundation has focused on achieving a balanced, both diverse and representative, sample of web sites while operating an efficient and polite crawler. In recent years, with the advent of LLMs and multimodal models, interest in obtaining large amounts of high-quality data has skyrocketed, while also raising concerns about the ethics of large-scale data curation. After a quick introduction to the history of the Common Crawl Foundation, we present our recent efforts to respond to these new data requirements while also expanding the language and cultural coverage of our dataset, and to address the practical and ethical questions that have arisen around web crawling in the era of LLMs.</p>
[https://data.hplt-project.org/transfer/commoncrawl_2025.pdf Slides] ''(an illustrative WARC-processing sketch follows below this table)''
 
|-
| 15:30 || 15:50 || Coffee Break
|-
| 16:00 || 17:30 || '''Session 2''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts from LLMs'''<br>
This lecture focuses on workflows for using LLMs as information sources, the types of problems that may result, and the main current mitigation strategies (RAG and CoT). Finally, I will discuss the problem of detecting generated texts, and the impact of LLMs on the information ecosphere and content economy.</p>
[https://data.hplt-project.org/transfer/nlpl_rogers_pt1.pdf Slides] ''(a minimal retrieval-augmentation sketch follows below this table)''
 
|-
| 17:30 || 17:50 || Coffee Break
|-
| 17:50 || 19:20 || '''Session 3''' Alexandra Birch <p class="mw-collapsible mw-collapsed">'''EuroLLM and FinLLM – stories from the trenches'''<br>
In this talk, we share our experiences building two large language models: EuroLLM, a multilingual model designed to serve the diverse linguistic and cultural landscape of Europe, and FinLLM, a financial LLM tailored for the UK’s highly specialized finance industry, built with our partners Aveni.ai, Lloyds, and Nationwide. We will discuss the challenges of curating high-quality training data (data mixes, cleaning pipelines, training recipes) and of creating meaningful benchmarks.</p>
[https://data.hplt-project.org/transfer/2025-02-EuroLLM_and_FinLLM_Birch.pdf Slides]
 
|-
| 19:30 ||  || Dinner
|}
 
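To make Session 1 more concrete, here is a minimal sketch (not the presenters' code) of iterating over a single Common Crawl WARC file with the [https://github.com/webrecorder/warcio warcio] library. The file name is hypothetical; real WARC paths are listed in the crawl indexes at https://data.commoncrawl.org/.

<syntaxhighlight lang="python">
# Minimal sketch: extract page payloads from one (locally downloaded)
# Common Crawl WARC file. The path below is a hypothetical example.
from warcio.archiveiterator import ArchiveIterator

warc_path = "CC-MAIN-example.warc.gz"  # hypothetical local file

with open(warc_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue  # skip request and metadata records
        uri = record.rec_headers.get_header("WARC-Target-URI")
        payload = record.content_stream().read()  # raw HTTP body (bytes)
        print(uri, len(payload))
</syntaxhighlight>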
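Session 2 names retrieval-augmented generation (RAG) as a mitigation strategy. The sketch below illustrates only the retrieve-then-prompt idea, using a TF-IDF retriever from scikit-learn; the documents and question are invented, and a production system would use a dense retriever and an actual LLM.

<syntaxhighlight lang="python">
# Minimal sketch of the retrieval step in RAG (toy data, TF-IDF retriever).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "LUMI is a EuroHPC supercomputer hosted by CSC in Finland.",
    "Skeikampen is a mountain resort about two hours north of Oslo.",
    "Common Crawl has collected web data since 2008.",
]
question = "Where is the LUMI supercomputer hosted?"

vectorizer = TfidfVectorizer().fit(documents + [question])
doc_vecs = vectorizer.transform(documents)
q_vec = vectorizer.transform([question])

# Retrieve the most similar document and splice it into the prompt,
# grounding the model's answer in retrieved evidence.
best = cosine_similarity(q_vec, doc_vecs).argmax()
prompt = f"Context: {documents[best]}\n\nQuestion: {question}\nAnswer:"
print(prompt)  # this prompt would then be sent to an LLM
</syntaxhighlight>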
{| class="wikitable"
 
|-
!colspan=3|Tuesday, February 4, 2025
|-
|colspan=3 | Breakfast is available from 07:30
|-
| 09:00 || 10:30 || '''Session 4''' Guilherme Penedo <p class="mw-collapsible mw-collapsed">'''FineWeb2: Creating a Large Multilingual Dataset for LLM Pre-Training'''<br>FineWeb2 is a recent multilingual web-based dataset for large language model (LLM) pretraining that produces better-performing LLMs than other popular datasets. In this talk, we discuss in depth the many challenges involved in adapting processing pipelines commonly used for English data to over 1,000 languages, including evaluation task selection for ablation experiments, language identification, filtering, and deduplication.</p>
[https://data.hplt-project.org/transfer/FineWeb2_90min.pdf Slides] ''(an illustrative filtering-and-deduplication sketch follows below this table)''
 
|-
|colspan=3| Free time (Lunch is available between 13:00 and 14:30)
|-
| 15:30 || 17:00 || '''Session 5''' Gema Ramírez-Sánchez <p class="mw-collapsible mw-collapsed">'''Having a look at pretraining data through the stats glass'''<br>As we speak, zillions of tokens of pretraining data are being collected and curated to train LLMs by several initiatives, all aiming to gather the best dataset for the best model performance. These curated datasets are huge and in many cases multilingual, making even the smallest evaluation an enormous task. But we can always ask statistics for help, and the data will confess. In this session we will look at several pretraining (textual) datasets through the stats glass and see together what ups and downs it reveals.</p>
[https://data.hplt-project.org/transfer/Gema-Ramírez-HPLT-Winter-School-2025.pdf Slides] ''(an illustrative corpus-statistics sketch follows below this table)''
 
|-
| 17:00 || 17:20 || Coffee Break
|-
| 17:20 || 19:20 || '''Session 6''' Jenia Jitsev & Marianna Nezhurina <p class="mw-collapsible mw-collapsed">'''Open Foundation Models: Scaling Laws and Generalization'''</p>
[https://data.hplt-project.org/transfer/Open_Foundation_Models_Scaling_Laws-pre_final_2024.pdf Slides 1]<br>
[https://data.hplt-project.org/transfer/Pitfalls_in_measuring_generalization.pdf Slides 2] ''(an illustrative scaling-law fit follows below this table)''
 
 
|-
| 19:30 ||  || Dinner
|-
| 21:00 || || '''Evening Session: Findings from HPLT'''
|}
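As a complement to Session 4, here is a deliberately simplified sketch of the pipeline stages the abstract names: language identification, heuristic filtering, and deduplication. The heuristics below are invented stand-ins; real pipelines such as FineWeb2's use trained language identifiers (e.g. fastText models) and near-deduplication at scale.

<syntaxhighlight lang="python">
# Minimal sketch of web-corpus cleaning stages (toy stand-ins, not FineWeb2 code).
import hashlib

def identify_language(text: str) -> str:
    # Stand-in for a real language identifier such as a fastText LID model.
    return "en" if " the " in f" {text.lower()} " else "unknown"

def passes_filters(text: str) -> bool:
    words = text.split()
    if len(words) < 5:            # too short to be useful
        return False
    if sum(w.isalpha() for w in words) / len(words) < 0.7:
        return False              # too much non-alphabetic noise
    return True

def deduplicate(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:    # exact-duplicate removal only
            seen.add(digest)
            yield doc

corpus = [
    "The winter school takes place at Skeikampen in February.",
    "The winter school takes place at Skeikampen in February.",
    "short",
    "@@@@ ???? !!!! $$$$ 1234",
]
kept = [d for d in deduplicate(corpus)
        if identify_language(d) == "en" and passes_filters(d)]
print(kept)  # one clean English document survives
</syntaxhighlight>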
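For Session 5, a minimal sketch of the ‘stats glass’ idea: cheap descriptive statistics that often reveal problems (duplication, spam, boilerplate) in pretraining corpora. The toy corpus is invented; real analyses run over billions of documents.

<syntaxhighlight lang="python">
# Minimal sketch: descriptive statistics over a toy pretraining corpus.
from collections import Counter
from statistics import mean, median

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog.",  # suspicious duplicate
    "Buy cheap pills online now click here click here click here",
]

lengths = [len(doc.split()) for doc in corpus]
tokens = [tok.lower() for doc in corpus for tok in doc.split()]
ttr = len(set(tokens)) / len(tokens)  # type-token ratio: low = repetitive

print(f"docs={len(corpus)}  mean_len={mean(lengths):.1f}  median_len={median(lengths)}")
print(f"type-token ratio={ttr:.2f}")
print("most common tokens:", Counter(tokens).most_common(5))
</syntaxhighlight>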
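For Session 6, a small sketch of fitting a saturating power law of the form L(N) = a·N^(−b) + c, the functional form commonly used in scaling-law studies; the (model size, loss) points below are synthetic and for illustration only.

<syntaxhighlight lang="python">
# Minimal sketch: fit a saturating power law to synthetic scaling data.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** (-b) + c

# Synthetic (parameter count, validation loss) pairs.
n = np.array([1e7, 1e8, 1e9, 1e10])
loss = np.array([4.2, 3.4, 2.9, 2.6])

params, _ = curve_fit(power_law, n, loss, p0=[10.0, 0.1, 2.0], maxfev=10000)
a, b, c = params
print(f"fitted: L(N) = {a:.2f} * N^(-{b:.3f}) + {c:.2f}")
print("extrapolated loss at 1e11 params:", power_law(1e11, *params))
</syntaxhighlight>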
  
{| class="wikitable"
 
|-
!colspan=3|Wednesday, February 5, 2025
|-
|colspan=3| Breakfast is available from 07:30
|-
| 08:30 || 10:00 || '''Session 8''' Ahmet Üstün (online) <p class="mw-collapsible mw-collapsed">'''Recipe for multilingual post-training: How to collect high-quality data and use them?'''<br>Post-training is a crucial step for building state-of-the-art LLMs and aligning them with human preferences. Although many public post-training datasets are available, they are predominantly curated for English, and multilingual datasets are extremely scarce. This lecture will cover methods for collecting high-quality post-training datasets, such as human annotation, multilingual templates, and synthetic data generation. We will complement these data-collection methods with post-training recipes from the Aya-101, Aya-23, and recently released Aya Expanse models, to best leverage the curated data.</p>
[https://data.hplt-project.org/transfer/HPLT_Winter_School_Aya.pdf Slides] ''(an illustrative template-rendering sketch follows below this table)''
 
|-
| 10:00 || 10:30 || Coffee Break
|-
| 10:30 || 12:00 || '''Session 9''' Anna Rogers <p class="mw-collapsible mw-collapsed">'''LLMs and Factuality: facts about LLMs'''<br>
This lecture critically examines a set of common claims about modern LLMs, including claims of their high performance, robustness, general-purpose-technology status, and "emergent properties". I will also re-examine the "bitter lesson" as applied to LLMs, and its implications for the future of the field.</p>
[https://data.hplt-project.org/transfer/nlpl_rogers_pt2.pdf Slides]
 
|-
| 12:30 || 13:30 || Lunch
|-
| 13:45 || 16:45 || Bus transfer to OSL Airport
|}
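As a complement to Session 8, a minimal sketch of the multilingual-template idea for post-training data: a raw (question, answer) pair is rendered into per-language instruction formats. The templates and example below are invented for illustration; the Aya datasets use curated templates and human annotation.

<syntaxhighlight lang="python">
# Minimal sketch: render (question, answer) pairs into per-language
# instruction templates (invented templates, for illustration only).
TEMPLATES = {
    "en": "Question: {question}\nAnswer: {answer}",
    "fr": "Question : {question}\nRéponse : {answer}",
    "tr": "Soru: {question}\nCevap: {answer}",
}

def render(lang: str, question: str, answer: str) -> str:
    """Turn a raw (question, answer) pair into a post-training example."""
    return TEMPLATES[lang].format(question=question, answer=answer)

print(render("fr", "Quelle est la capitale de la Norvège ?", "Oslo"))
</syntaxhighlight>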
  
 
= Registration =
  
In total, we welcome 62 participants at the 2025 winter school.
The winter school is [https://nettskjema.no/a/381438 over-subscribed] and no longer accepting registrations.
We have processed requests for participation on a first-come, first-served basis, with an eye toward regional balance.
Interested parties who submitted the registration form were confirmed in three batches, on '''December 6''', on '''December 13''',
and on '''December 20''', which was also the closing date for winter school registration.
 
  
Once confirmed by the organizing team, participant names are published
on this page, and registration establishes a
''binding agreement'' with the hotel.
Therefore, a cancellation fee will be incurred (unless we can find someone else to ‘take over’ last-minute
spaces), and no-shows will be charged the full price for at least one night by the hotel.

= Logistics =

With a few exceptions, winter school participants travel to and from the conference hotel
jointly on a chartered bus (the HPLT shuttle).
The bus will leave OSL airport no later than 9:45 CET on Monday, February 3.
Thus, please meet up by 9:30 and make your arrival known to your assigned
‘tour guide’ (who will introduce themselves to you by email beforehand).
  
The group will gather near the DNB currency exchange booth in the downstairs
arrivals area, just outside the international arrivals luggage claims and slightly
to the left as one exits the customs area:
the yellow dot numbered (18) on the
[https://avinor.no/globalassets/_oslo-lufthavn/ankomst-arrivals.pdf OSL arrivals map].
The group will then walk over to the bus terminal, to leave the airport not long after 9:40.
The drive to the Skeikampen conference hotel will take us about three hours, and the bus
will make one stop along the way to stretch our legs and fill up on coffee.
  
The winter school will end with lunch on Wednesday, February 5, before the group returns
to OSL airport on the HPLT shuttle.
The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL
around 17:00 to 17:30 CET.
After stopping at the OSL airport, the bus will continue to central Oslo.

= Organization =
  
The 2025 Winter School is organized by a team of volunteers at the University
of Oslo, supported by a programme committee from the HPLT and NLPL networks and beyond;
please see below.
 
For all inquiries regarding registration, the programme, logistics,
or such, please contact <code>hplt-training@ifi.uio.no</code>.
  
The programme committee comprises:

* Barry Haddow (University of Edinburgh, UK)
* Andrey Kutuzov (University of Oslo, Norway)
* Stephan Oepen (University of Oslo, Norway)
* Sampo Pyysalo (University of Turku, Finland)
* Jörg Tiedemann (University of Helsinki, Finland)
  
 
= Participants =
  
# Nikolay Arefev, University of Oslo (Norway)
# Maria Barrett, Silo AI (Finland)
# Toms Bergmanis, Tilde (Latvia)
# Alexandra Birch, University of Edinburgh (UK)
# Laurie Burchell, University of Edinburgh (UK)
# Lucas Charpentier, University of Oslo (Norway)
# Pinzhen (Patrick) Chen, University of Edinburgh (UK)
# Hannah Clausen, University of Oslo (Norway)
# Lucia Domenichelli, University of Pisa (Italy)
# Aleksei Dorkin, University of Tartu (Estonia)
# Kenneth Enevoldsen, Aarhus University (Denmark)
# Tita Enstad, National Library (Norway)
# Mariia Fedorova, University of Oslo (Norway)
# Yanzhu Guo, INRIA Paris (France)
# Arzu Burcu Güven, IT University of Copenhagen (Denmark)
# Barry Haddow, University of Edinburgh (UK)
# Jan Hajič, Charles University (Czech Republic)
# Jindřich Helcl, Charles University (Czech Republic)
# Bertram Højer, IT University of Copenhagen (Denmark)
# Sekh Mainul Islam, University of Copenhagen (Denmark)
# Jenia Jitsev, Jülich Supercomputing Centre / LAION (Germany)
# Márton Kardos, Aarhus University (Denmark)
# Anastasiia Klimashevskaia, University of Bergen (Norway)
# Mateusz Klimaszewski, University of Edinburgh (UK)
# Ville Komulainen, University of Turku (Finland)
# Markus Koskela, CSC – IT Center for Science (Finland)
# Martins Kronis, Tilde (Latvia)
# Vimal Kumar Kumar, University of Limerick (Ireland)
# Andrey Kutuzov, University of Oslo (Norway)
# Hengyu Luo, University of Helsinki (Finland)
# Farrokh Mehryary, University of Turku (Finland)
# Vladislav Mikhailov, University of Oslo (Norway)
# Andreas Motzfeldt, IT University of Copenhagen (Denmark)
# Zain Muhammad Mujahid, University of Copenhagen (Denmark)
# Sebastian Nagel, Common Crawl Foundation (Germany)
# Marianna Nezhurina, Jülich Supercomputing Centre / LAION (Germany)
# Stephan Oepen, University of Oslo (Norway)
# Guilherme Penedo, Hugging Face (France)
# Irina Proskurina, University of Lyon (France)
# Taido Purason, University of Tartu (Estonia)
# Marie Roald, National Library (Norway)
# Anna Rogers, IT University of Copenhagen (Denmark)
# Ismaël Rousseau, Orange (France)
# David Samuel, University of Oslo (Norway)
# Gema Ramírez-Sánchez, Prompsit Language Engineering (Spain)
# Marta Sartor, University of Pisa (Italy)
# Ipek Baris Schlicht, Universitat Politècnica de València (Spain)
# Étienne Simon, University of Oslo (Norway)
# Pavel Stepachev, University of Edinburgh (UK)
# Pedro Ortiz Suarez, Common Crawl Foundation (France)
# Otto Tarkka, University of Turku (Finland)
# Kushal Tatariya, KU Leuven (Belgium)
# Jörg Tiedemann, University of Helsinki (Finland)
# Samia Touileb, University of Bergen (Norway)
# Elke Vandermeerschen, KU Leuven (Belgium)
# Raul Vazquez, University of Helsinki (Finland)
# Ramón Carreño Villar, University of Oslo (Norway)
# Fedor Vitiugin, Aalto University (Finland)
# Tea Vojtěchová, Charles University (Czech Republic)
# Artūrs Znotiņš, IMCS at University of Latvia (Latvia)
# Elaine Zosa, Silo AI (Finland)
