
HPLT & NLPL Winter School on Large-Scale Language Modeling and Neural Machine Translation with Web Data

[Image: Skeikampen]

Background

After a two-year pandemic hiatus, the NLPL network and the Horizon Europe project High-Performance Language Technologies (HPLT) join forces to re-launch the successful winter school series on large-scale NLP. The winter school seeks to stimulate community formation, i.e. to strengthen interaction and collaboration among Nordic and European research teams in NLP and to advance a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. The 2023 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, and 2020 NLPL Winter Schools.

In early 2023, HPLT will hold its winter school from Monday, February 6, to Wednesday, February 8, 2023, at a mountain-side hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize a group bus transfer from and to the Oslo airport Gardermoen, leaving the airport at 9:30 on Monday morning and returning there around 17:30 on Wednesday afternoon.

The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will, however, have to cover their own travel and accommodation at Skeikampen. Two nights at the hotel, including all meals, will come to NOK 3190 (NOK 2790 per person in a shared double room), to be paid to the hotel directly.

Programme

The 2023 winter school will have a thematic focus on Large-Scale Language Modeling and Neural Machine Translation with Web Data. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) from, among others, the BigScience and Common Crawl initiatives, as well as critical reflections on working with massive, uncurated language data. The programme may be complemented with an evening ‘research bazaar’ (by participants) to stimulate academic socializing and a ‘walk-through’ of the infrastructure available on the shared EuroHPC LUMI supercomputer.

Confirmed presenters include:

  • Mehdi Ali, Fraunhofer IAIS
  • Emily M. Bender, University of Washington
    Towards Responsible Development and Application of Large Language Models
    This session will begin with a problematization of the rush for scale in language models and the "foundational model" conceptualization, and an exploration of the risks of ever larger language models. I will then turn to some discussion of what can be done better, drawing on value sensitive design and with a focus on evaluation grounded in specific use cases and on thorough documentation. Finally, I will reflect on the dangers and responsibilities that come with working in an area with intense media and corporate interest.
  • Philipp Koehn, Johns Hopkins University
  • Teven Le Scao, Hugging Face
  • Nikola Ljubešić, Jožef Stefan Institute & University of Ljubljana
    MaCoCu Corpora: Why Top-Level-Domain Crawling and Web Data Enrichment Matter
    Exploitation of huge crawl dumps seems not to be the most economical approach to obtaining data for smaller languages. While one might argue that the "needle in the haystack" problem of smaller languages in crawl dumps can be circumvented by "gathering all the different needles at the same time", in practice this approach often fails for various reasons, one of which is that language identification tools covering many languages do not perform well enough on smaller languages (see the small language-identification sketch after the presenter list below). In our talk we will present the MaCoCu way of collecting web data which, beyond focusing on crawling top-level domains in the quest for high-quality, up-to-date data, also encompasses various forms of data enrichment, crucial ingredients for understanding what kind of data we include in our language and translation models.
  • Sebastian Nagel, Common Crawl
    Common Crawl: Data Collection and Use Cases for NLP
    The Common Crawl data sets are sample collections of web pages made accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation starts with a short outline of data collection and of the crawlers and technologies used from 2008 until today, with an emphasis on the challenges of obtaining a balanced, both diverse and representative sample of web sites while operating an efficient and polite crawler. After an overview of the data formats used to store the primary web page captures, as well as text and metadata extracts, indexes, and hyperlink graphs, we showcase how Common Crawl data can be processed (see the short index-and-WARC access sketch after the presenter list below). We put the focus on three use cases for NLP: bulk processing of plain text and HTML pages, exploration and statistics based on the URL and metadata index, and the "vertical" use of data from specific sites or by content language.
  • Anna Rogers, University of Copenhagen
    Big Corpus Linguistics: Lessons from the BigScience Workshop
    The continued growth of large language models and their wide-scale adoption in commercial applications make it increasingly important to investigate their training data, for both research and ethical reasons. However, inspecting such large corpora has been problematic due to difficulties with data access and the need for large-scale infrastructure. This talk will discuss some lessons learned during the BigScience workshop, as well as an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.
  • Pedro Ortiz Suarez, University of Mannheim and DFKI
  • Zeerak Talat, Simon Fraser University
  • Ivan Vulić, Cambridge University
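
As a small, unofficial illustration of the broad-coverage language identification mentioned in the MaCoCu abstract above, the following Python sketch runs fastText's publicly released lid.176 model (which covers 176 languages) over a few closely related example sentences. The sentences, the top-3 reporting, and the local model path are assumptions for demonstration only, not part of the talk.

  # Minimal sketch: off-the-shelf language identification with fastText.
  # Download the model first:
  #   https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
  import fasttext

  model = fasttext.load_model("lid.176.bin")

  # Closely related "smaller" languages are where broad-coverage identifiers
  # tend to be least reliable; inspect the top-3 labels and their confidences.
  sentences = [
      "Ovo je rečenica na hrvatskom jeziku.",  # Croatian
      "Ово је реченица на српском језику.",    # Serbian (Cyrillic script)
      "Dette er ei setning på nynorsk.",       # Norwegian Nynorsk
  ]
  for sentence in sentences:
      labels, probabilities = model.predict(sentence, k=3)
      print(sentence)
      for label, probability in zip(labels, probabilities):
          print(f"  {label.replace('__label__', '')}: {probability:.2f}")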
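As a brief, unofficial companion to the Common Crawl abstract above, the sketch below shows one common access pattern: querying the public URL index for captures of a single page and then fetching just that record from the corresponding WARC file via an HTTP range request. The crawl identifier CC-MAIN-2023-06 and the example URL are illustrative choices only; the sketch assumes the third-party requests and warcio packages are installed.

  # Minimal sketch: look up a page in the Common Crawl URL index and read
  # the corresponding WARC record with an HTTP range request.
  import io
  import json

  import requests
  from warcio.archiveiterator import ArchiveIterator

  INDEX = "https://index.commoncrawl.org/CC-MAIN-2023-06-index"

  # 1. Query the URL index for captures of a single page.
  response = requests.get(INDEX, params={"url": "commoncrawl.org/", "output": "json"})
  response.raise_for_status()
  records = [json.loads(line) for line in response.text.splitlines()]
  capture = next(r for r in records if r.get("status") == "200")

  # 2. Fetch only the bytes of that capture from the WARC file it lives in.
  start = int(capture["offset"])
  end = start + int(capture["length"]) - 1
  warc_bytes = requests.get(
      "https://data.commoncrawl.org/" + capture["filename"],
      headers={"Range": f"bytes={start}-{end}"},
  ).content

  # 3. Parse the single (gzip-compressed) WARC record and read the HTML payload.
  for record in ArchiveIterator(io.BytesIO(warc_bytes)):
      if record.rec_type == "response":
          html = record.content_stream().read()
          print(capture["url"], capture["timestamp"], len(html), "bytes of HTML")
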
Monday, February 6, 2023
13:00–14:00 Lunch
14:00–15:30 Session 1
15:30–15:50 Coffee Break
15:50–17:20 Session 2
17:20–17:40 Coffee Break
17:40–19:10 Session 3
19:30 Dinner
21:00 Evening Session 1
Tuesday, February 7, 2023
Breakfast is available from 07:30
08:30–10:00 Session 4
Lunch is available between 13:00 and 14:30
15:00–16:20 Session 5
16:20–16:40 Coffee Break
16:40–18:00 Session 6
18:00–18:10 Coffee Break
18:10–19:30 Session 7
19:30 Dinner
21:00 Evening Session 2


Wednesday, February 8, 2023
Breakfast is available from 07:30
08:30–10:00 Session 8
10:00–10:30 Coffee Break
10:30–12:00 Session 9
12:30–13:30 Lunch

Registration

Registration is now closed. The 2023 winter school was heavily over-subscribed.

In total, we anticipate up to 60 participants in the 2023 Winter School. Please register your intent to participate through our online registration form. We will process requests for participation on a first-come, first-served basis, with an eye toward regional balance. Interested parties who have submitted the registration form will be confirmed in three batches: one on December 5, another on December 12, and a final one after the closing date for registration, which is Thursday, December 15, 2022.

Once confirmed by the organizing team, participant names will be published on this page, and registration will establish a binding agreement with the hotel. Cancellations will therefore incur a fee (unless we can find someone else to ‘take over’ last-minute spaces), and no-shows will be charged the full price for at least one night by the hotel.

Logistics

With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:30 CET on Monday, February 6. Please be there by 9:15 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the bus and taxi information booth in the downstairs arrivals area, just outside the international arrivals luggage claim and slightly to the right as one exits the customs area: the yellow dot numbered (17) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport by 9:30. The drive to the Skeikampen conference hotel will take about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 8, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET.

Organization

The 2023 Winter School is organized by a team of volunteers from the NLPL and HPLT networks; please see below. For all inquiries regarding registration, the programme, logistics, or other matters, please contact hplt-training@ifi.uio.no.

The programme committee (regrettably lacking in diversity) consists of:

  • Hans Eide (Uninett Sigma2, Norway)
  • Filip Ginter (University of Turku, Finland)
  • Barry Haddow (University of Edinburgh, UK)
  • Jan Hajič (Charles University in Prague, Czech Republic)
  • Daniel Hershcovich (University of Copenhagen, Denmark)
  • Marco Kuhlmann (Linköping University, Sweden)
  • Andrey Kutuzov (University of Oslo, Norway)
  • Joakim Nivre (RISE and Uppsala University, Sweden)
  • Stephan Oepen (University of Oslo, Norway)
  • Sampo Pyysalo (University of Turku, Finland)
  • Gema Ramirez (Prompsit Language Engineering, Spain)
  • Magnus Sahlgren (AI Sweden)
  • David Samuel (University of Oslo, Norway)
  • Jörg Tiedemann (University of Helsinki, Finland)

Participants

  1. Mehdi Ali (Fraunhofer IAIS)
  2. Chantal Amrhein (University of Zurich)
  3. Nikolay Arefev (University of Oslo)
  4. Mikko Aulamo (University of Helsinki)
  5. Elisa Bassignana (IT University of Copenhagen)
  6. Emily M. Bender (University of Washington)
  7. Vladimír Benko (Slovak Academy of Sciences)
  8. Nikolay Bogoychev (Edinburgh University)
  9. Dhairya Dalal (University of Galway)
  10. Annerose Eichel (University of Stuttgart)
  11. Kenneth Enevoldsen (Aarhus University)
  12. Mehrdad Farahani (Chalmers University of Technology)
  13. Ona de Gibert (University of Helsinki)
  14. Janis Goldzycher (University of Zurich)
  15. Jan Hajič (Charles University in Prague)
  16. Jindřich Helcl (Charles University in Prague)
  17. Oskar Holmström (Linköping University)
  18. Sami Itkonen (University of Helsinki)
  19. Shaoxiong Ji (University of Helsinki)
  20. Antonia Karamolegkou (University of Copenhagen)
  21. Marco Kuhlmann (Linköping University)
  22. Nina Khairova (Umeå universitet)
  23. Philipp Koehn (Johns Hopkins University)
  24. Andrey Kutuzov (University of Oslo)
  25. Jelmer van der Linde (Edinburgh University)
  26. Pierre Lison (Norsk regnesentral)
  27. Nikola Ljubešić (Jožef Stefan Institute & University of Ljubljana)
  28. Yan Meng (University of Amsterdam)
  29. Max Müller-Eberstein (IT University of Copenhagen)
  30. Sebastian Nagel (Common Crawl)
  31. Graeme Nail (Edinburgh University)
  32. Anna Nikiforovskaja (Université de Lorraine)
  33. Irina Nikishina (Universität Hamburg)
  34. Joakim Nivre (RISE and Uppsala University)
  35. Stephan Oepen (University of Oslo)
  36. Anders Jess Pedersen (Alexandra Institute)
  37. Laura Cabello Piqueras (University of Copenhagen)
  38. Myrthe Reuver (Vrije Universiteit Amsterdam)
  39. Anna Rogers (University of Copenhagen)
  40. Frankie Robertson (University of Jyväskylä)
  41. Phillip Rust (University of Copenhagen)
  42. Egil Rønnestad (University of Oslo)
  43. David Samuel (University of Oslo)
  44. Diana Santos (University of Oslo)
  45. Teven Le Scao (Hugging Face)
  46. Yves Scherrer (University of Helsinki)
  47. Edoardo Signoroni (Masaryk University)
  48. Michal Štefánik (Masaryk University)
  49. Pedro Ortiz Suarez (University of Mannheim and DFKI)
  50. Zeerak Talat (Simon Fraser University)
  51. Jörg Tiedemann (University of Helsinki)
  52. Samia Touileb (University of Bergen)
  53. Teemu Vahtola (University of Helsinki)
  54. Thomas Vakili (Stockholm University)
  55. Dušan Variš (Charles University in Prague)
  56. Tea Vojtěchová (Charles University in Prague)
  57. Ivan Vulić (University of Cambridge)
  58. Nicholas Walker (Norsk regnesentral)
  59. Sondre Wold (University of Oslo)
  60. Jaume Zaragoza-Bernabeu (Prompsit)