HPLT & NLPL Winter School on Large-Scale Language Modeling and Neural Machine Translation with Web Data
Background
After a two-year pandemic hiatus, the NLPL network and the Horizon Europe project High-Performance Language Technologies (HPLT) join forces to re-launch the successful winter school series on large-scale NLP. The winter school seeks to stimulate community formation: strengthening interaction and collaboration among Nordic and European NLP research teams, and advancing a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. The 2023 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, and 2020 NLPL Winter Schools.
HPLT will hold its winter school from Monday, February 6, to Wednesday, February 8, 2023, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize a group bus transfer from and to Oslo airport Gardermoen, leaving the airport at 9:30 on Monday morning and returning there around 17:30 on Wednesday afternoon.
The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will, however, have to cover their own travel and accommodation at Skeikampen. Two nights at the hotel, including all meals, come to NOK 3190 (NOK 2790 per person in a shared double room), to be paid to the hotel directly.
Programme
The 2023 winter school will have a thematic focus on Web Data for Large-Scale Language Modeling and Neural Machine Translation. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) from, among others, the BigScience and Common Crawl initiatives, but also include critical reflections on working with massive, uncurated language data. The programme will be complemented by a panel discussion and a ‘walk-through’ of available infrastructure on the shared EuroHPC LUMI supercomputer.
Confirmed presenters include:
- Mehdi Ali, Fraunhofer IAIS
OpenGPT-X: Development of a Gaia-X Node for Large AI Language Models and Innovative Language Application Service
The development of large language models is currently dominated by non-European organizations. The OpenGPT-X project aims to ensure European data and AI sovereignty in the development of this technology. The resulting language models will be made open source, facilitating both the use of and research on these models. In this talk, we provide an overview of the project, describe its current status, and give an outlook.
- Emily M. Bender, University of Washington
Towards Responsible Development and Application of Large Language Models
This session will begin with a problematization of the rush for scale in language models and of the "foundation model" conceptualization, and an exploration of the risks of ever larger language models. I will then turn to what can be done better, drawing on value-sensitive design, with a focus on evaluation grounded in specific use cases and on thorough documentation. Finally, I will reflect on the dangers and responsibilities that come with working in an area of intense media and corporate interest.
- Teven Le Scao, Hugging Face
Large Language Models: A How-To Starting Guide
The new capabilities of large language models (LLMs) have prompted a paradigm change in NLP. However, most are developed by resource-rich organizations and kept from the public. Within the BigScience workshop, a collaboration of hundreds of researchers dedicated to democratizing this powerful technology, we created BLOOM, a 176B-parameter open-access multilingual language model. This talk will be a tutorial sharing the lessons learned from this project, to make it easier for others to build their own large language models.
- Nikola Ljubešić, Jožef Stefan Institute & University of Ljubljana
MaCoCu Corpora: Why Top-Level-Domain Crawling and Web Data Enrichment Matter
Exploiting huge crawl dumps seems not to be the most economical approach to obtaining data for smaller languages. While one might argue that the "needle in the haystack" problem of smaller languages in crawl dumps can be circumvented by "gathering all the different needles at the same time", in practice this approach often fails for various reasons, one being that language identification tools covering many languages do not perform well enough on smaller ones. In our talk we will present the MaCoCu way of collecting web data, which, beyond crawling top-level domains in the quest for high-quality, up-to-date data, also encompasses various forms of data enrichment, crucial ingredients for understanding what kind of data we include in our language and translation models.
- Sebastian Nagel, Common Crawl
Common Crawl: Data Collection and Use Cases for NLP
The Common Crawl data sets are sample collections of web pages made accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation starts with a short outline of data collection and of the crawlers and technologies used from 2008 until today, with an emphasis on the challenges of achieving a balanced, diverse, and representative sample of web sites while operating an efficient and polite crawler. After an overview of the data formats used to store the primary web page captures, as well as text and metadata extracts, indexes, and hyperlink graphs, we showcase how Common Crawl data can be processed. We focus on three use cases for NLP: bulk processing of plain text and HTML pages, exploration and statistics based on the URL and metadata index, and the "vertical" use of data from specific sites or by content language.
- Anna Rogers, University of Copenhagen
Big Corpus Linguistics: Lessons from the BigScience Workshop
The continued growth of large language models and their wide-scale adoption in commercial applications make it increasingly important to investigate their training data, for both research and ethical reasons. However, inspecting such large corpora has been problematic due to difficulties with data access and the need for large-scale infrastructure. This talk will discuss some lessons learned during the BigScience workshop, as well as an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.
- Pedro Ortiz Suarez, University of Mannheim and DFKI
The OSCAR Project: Improving Data Quality in Multilingual Heterogeneous Web-Based Corpora
In this talk we will introduce the OSCAR project and present our recent efforts to overcome the difficulties posed by the heterogeneity, noisiness, and size of web resources, in order to produce higher-quality textual data for as many languages as possible. We will also discuss recent developments in the project, including our data-processing pipelines for annotating and classifying large amounts of textual data on constrained infrastructures, as well as our first steps toward becoming a fully open-source project and managing our growing community. Finally, we will present how the OSCAR initiative is collaborating with other projects to improve data quality and availability.
- Zeerak Talat, Simon Fraser University
NLP and Futuring the Past
Machine learning and NLP are technological projects that implicitly seek to create possible futures, and it is therefore important to consider the values that NLP, as a field, projects into the future. In this session, we will discuss the values of language technology, how they arise, and their complicated relationship with the potential for equitable futures.
- Ivan Vulić, University of Cambridge
Modular and Parameter-Efficient Adaptation of Multilingual NLP Models
A key challenge in multilingual NLP is developing general language-independent architectures that will be equally applicable to any language. However, this ambition is hindered by the large variation in 1) structural and semantic properties of the world’s languages, as well as 2) raw and task data scarcity for many different languages, tasks, and application domains. As a consequence, existing language technology is still largely limited to a handful of resource-rich languages, leaving the vast majority of the world’s 7,000+ languages and their speakers behind, thus amplifying the problem of the “digital language divide”. In this lecture, we will demonstrate that modularity enables widening the reach of multilingual NLP to minor and low-resource languages and communities, also boosting efficiency and reusability of models' constituent components: modules. We will introduce a range of recent modular and parameter-efficient techniques, additionally pointing to their high-level similarities and differences, that aim to deal with large cross-language variations and low-data learning regimes. We will also demonstrate that low-resource languages, despite very positive research trends and results achieved in recent years, still lag behind major languages in terms of performance, resources, overall representation in NLP research and other key aspects, and will outline several crucial challenges for future research in this area.
Monday, February 6, 2023

| Start | End | Programme |
|---|---|---|
| 13:00 | 14:00 | Lunch |
| 14:00 | 15:30 | Session 1: Sebastian Nagel |
| 15:30 | 15:50 | Coffee Break |
| 15:50 | 17:20 | Session 2: Nikola Ljubešić |
| 17:20 | 17:40 | Coffee Break |
| 17:40 | 19:10 | Session 3: Panel discussion "Is the end of academic NLP research in sight?" with Joakim Nivre, Marco Kuhlmann, and all participants |
| 19:30 | | Dinner |
Tuesday, February 7, 2023

| Start | End | Programme |
|---|---|---|
| from 07:30 | | Breakfast |
| 08:30 | 10:00 | Session 4: Anna Rogers |
| 13:00 | 14:30 | Lunch (available throughout) |
| 15:00 | 16:20 | Session 5: Teven Le Scao |
| 16:20 | 16:40 | Coffee Break |
| 16:40 | 18:00 | Session 6: Mehdi Ali & Pedro Ortiz Suarez |
| 18:00 | 18:10 | Coffee Break |
| 18:10 | 19:30 | Session 7: Emily M. Bender |
| 19:30 | | Dinner |
| 21:00 | | Evening Session: HPLT, LUMI, LLM & NMT |
Wednesday, February 8, 2023

| Start | End | Programme |
|---|---|---|
| from 07:30 | | Breakfast |
| 08:30 | 10:00 | Session 8: Ivan Vulić |
| 10:00 | 10:30 | Coffee Break |
| 10:30 | 12:00 | Session 9: Zeerak Talat |
| 12:30 | 13:30 | Lunch |
Registration
Registration is now closed. The 2023 winter school was heavily over-subscribed.
In total, we anticipated up to 60 participants in the 2023 Winter School. Requests for participation, submitted through our on-line registration form, were processed on a first-come, first-served basis, with an eye toward regional balance. Registrations were confirmed in three batches: one on December 5, another on December 12, and a final one after the closing date for registration, Thursday, December 15, 2022.
Once confirmed by the organizing team, participant names are published on this page, and confirmed registration establishes a binding agreement with the hotel. Cancellations therefore incur a fee (unless we can find someone else to ‘take over’ the space at short notice), and no-shows will be charged the full price of at least one night by the hotel.
Logistics
With a few exceptions, winter school participants travel to and from the conference hotel together on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:30 CET on Monday, February 6, so please meet up by 9:15 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).
The group will gather near the bus and taxi information booth in the downstairs arrivals area, just outside the international baggage claim and slightly to the right as one exits the customs area: the yellow dot numbered (17) on the OSL arrivals map. The group will then walk over to the bus terminal to leave the airport by 9:30. The drive to the Skeikampen conference hotel will take about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.
The winter school will end with lunch on Wednesday, February 8, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET. After stopping at the OSL airport, the bus will continue to central Oslo.
Organization
The 2023 Winter School is organized by a team of volunteers from the NLPL and HPLT networks; please see below. For all inquiries regarding registration, the programme, logistics, and the like, please contact hplt-training@ifi.uio.no.
The programme committee comprises (regrettably lacking in diversity):
- Hans Eide (Uninett Sigma2, Norway)
- Filip Ginter (University of Turku, Finland)
- Barry Haddow (University of Edinburgh, UK)
- Jan Hajič (Charles University in Prague, Czech Republic)
- Daniel Hershcovich (University of Copenhagen, Denmark)
- Marco Kuhlmann (Linköping University, Sweden)
- Andrey Kutuzov (University of Oslo, Norway)
- Joakim Nivre (RISE and Uppsala University, Sweden)
- Stephan Oepen (University of Oslo, Norway)
- Sampo Pyysalo (University of Turku, Finland)
- Gema Ramirez (Prompsit Language Engineering, Spain)
- Magnus Sahlgren (AI Sweden)
- David Samuel (University of Oslo, Norway)
- Jörg Tiedemann (University of Helsinki, Finland)
Participants
- Mehdi Ali (Fraunhofer IAIS)
- Chantal Amrhein (University of Zurich)
- Mark Anderson (Norsk regnesentral)
- Nikolay Arefev (University of Oslo)
- Mikko Aulamo (University of Helsinki)
- Elisa Bassignana (IT University of Copenhagen)
- Emily M. Bender (University of Washington)
- Vladimír Benko (Slovak Academy of Sciences)
- Nikolay Bogoychev (University of Edinburgh)
- Dhairya Dalal (University of Galway)
- Annerose Eichel (University of Stuttgart)
- Kenneth Enevoldsen (Aarhus University)
- Mehrdad Farahani (Chalmers University of Technology)
- Ona de Gibert (University of Helsinki)
- Janis Goldzycher (University of Zurich)
- Jan Hajič (Charles University in Prague)
- Jindřich Helcl (Charles University in Prague)
- Oskar Holmström (Linköping University)
- Sami Itkonen (University of Helsinki)
- Antonia Karamolegkou (University of Copenhagen)
- Nina Khairova (Umeå University)
- Marco Kuhlmann (Linköping University)
- Per Egil Kummervold (National Library of Norway)
- Andrey Kutuzov (University of Oslo)
- Jelmer van der Linde (University of Edinburgh)
- Pierre Lison (Norsk regnesentral)
- Nikola Ljubešić (Jožef Stefan Institute & University of Ljubljana)
- Yan Meng (University of Amsterdam)
- Max Müller-Eberstein (IT University of Copenhagen)
- Sebastian Nagel (Common Crawl)
- Graeme Nail (University of Edinburgh)
- Anna Nikiforovskaja (Université de Lorraine)
- Irina Nikishina (Universität Hamburg)
- Joakim Nivre (RISE and Uppsala University)
- Stephan Oepen (University of Oslo)
- Anders Jess Pedersen (Alexandra Institute)
- Laura Cabello Piqueras (University of Copenhagen)
- Myrthe Reuver (Vrije Universiteit Amsterdam)
- Anna Rogers (University of Copenhagen)
- Frankie Robertson (University of Jyväskylä)
- Javier De La Rosa (National Library of Norway)
- Phillip Rust (University of Copenhagen)
- Egil Rønnestad (University of Oslo)
- David Samuel (University of Oslo)
- Diana Santos (University of Oslo)
- Teven Le Scao (Hugging Face)
- Yves Scherrer (University of Helsinki)
- Edoardo Signoroni (Masaryk University)
- Michal Štefánik (Masaryk University)
- Pedro Ortiz Suarez (University of Mannheim and DFKI)
- Zeerak Talat (Simon Fraser University)
- Jörg Tiedemann (University of Helsinki)
- Samia Touileb (University of Bergen)
- Teemu Vahtola (University of Helsinki)
- Thomas Vakili (Stockholm University)
- Dušan Variš (Charles University in Prague)
- Tea Vojtěchová (Charles University in Prague)
- Ivan Vulić (University of Cambridge)
- Nicholas Walker (Norsk regnesentral)
- Sondre Wold (University of Oslo)
- Jaume Zaragoza-Bernabeu (Prompsit)