HPLT & NLPL Winter School on Large-Scale Language Modeling and Neural Machine Translation with Web Data

Background

After a two-year pandemic hiatus, the NLPL network and the Horizon Europe project High-Performance Language Technologies (HPLT) join forces to re-launch the successful winter school series on large-scale NLP. The winter school seeks to stimulate community formation, i.e. to strengthen interaction and collaboration among Nordic and European research teams in NLP and to advance a shared level of knowledge and experience in using high-performance e-infrastructures for large-scale NLP research. The 2023 edition of the winter school puts special emphasis on NLP researchers from countries that participate in the EuroHPC LUMI consortium. For additional background, please see the archival pages from the 2018, 2019, and 2020 NLPL Winter Schools.

For early 2023, HPLT will hold its winter school from Monday, February 6, to Wednesday, February 8, 2023, at a mountainside hotel (with skiing and walking opportunities) about two hours north of Oslo. The project will organize a group bus transfer from and to Oslo Airport Gardermoen, leaving the airport at 9:30 on Monday morning and returning there around 17:30 on Wednesday afternoon.

The winter school is subsidized by the HPLT project: there is no fee for participants and no charge for the bus transfer to and from the conference hotel. All participants will, however, have to cover their own travel and accommodation at Skeikampen. Two nights at the hotel, including all meals, will come to NOK 3190 (NOK 2790 per person in a shared double room), to be paid to the hotel directly.

Programme

The 2023 winter school will have a thematic focus on Large-Scale Language Modeling and Neural Machine Translation with Web Data. The programme will comprise in-depth technical presentations (possibly including some hands-on elements) from, among others, the BigScience and Common Crawl initiatives, but will also include critical reflections on working with massive, uncurated language data. The programme may be complemented by an evening ‘research bazaar’ (by participants) to stimulate academic socializing and by a ‘walk-through’ of the infrastructure available on the shared EuroHPC LUMI supercomputer.

Confirmed presenters include:

  • Mehdi Ali, Fraunhofer IAIS
  • Emily M. Bender, University of Washington
    Towards Responsible Development and Application of Large Language Models
    This session will begin with a problematization of the rush for scale in language models and of the "foundation model" conceptualization, and an exploration of the risks of ever larger language models. I will then turn to a discussion of what can be done better, drawing on value sensitive design, with a focus on evaluation grounded in specific use cases and on thorough documentation. Finally, I will reflect on the dangers and responsibilities that come with working in an area with intense media and corporate interest.
  • Philipp Koehn, Johns Hopkins University
  • Teven Le Scao, Hugging Face
    Large Language Models: A How-To Starting Guide
    The new capabilities of large language models (LLMs) have prompted a paradigm change in NLP. However, most are developed by resource-rich organizations and kept from the public. In the framework of the BigScience workshop, a collaboration of hundreds of researchers dedicated to democratizing this powerful technology, we created BLOOM, a 176B-parameter open-access multilingual language model. This talk will be a tutorial sharing the lessons of this project, to make it easier for others to build their own large language models. (A minimal code sketch of loading an open BLOOM checkpoint follows after this list.)
  • Nikola Ljubešić, Jožef Stefan Institute & University of Ljubljana
    MaCoCu Corpora: Why Top-Level-Domain Crawling and Web Data Enrichment Matter
    Exploitation of huge crawl dumps seems not to be the most economical approach to obtaining data for smaller languages. While one might argue that the "needle in the haystack" problem of smaller languages in crawl dumps can be circumvented by "gathering all the different needles at the same time", in practice this approach often fails for various reasons, one of which is that language identification tools covering many languages do not perform well enough on smaller languages. In our talk we will present the MaCoCu way of collecting web data, which, beyond focusing on crawling top-level domains in the quest for high-quality, up-to-date data, also encompasses various forms of data enrichment, crucial ingredients for understanding what kind of data we include in our language and translation models.
  • Sebastian Nagel, Common Crawl
    Common Crawl: Data Collection and Use Cases for NLP
    The Common Crawl data sets are sample collections of web pages made accessible free of charge to everyone interested in running machine-scale analysis on web data. The presentation starts with a short outline of data collection and of the crawlers and technologies used from 2008 until today, with an emphasis on the challenge of achieving a balanced, both diverse and representative, sample of web sites while operating an efficient and polite crawler. After an overview of the data formats used to store the primary web page captures, as well as text and metadata extracts, indexes, and hyperlink graphs, we showcase how Common Crawl data can be processed. We put the focus on three use cases for NLP: bulk processing of plain text and HTML pages, exploration and statistics based on the URL and metadata index, and the "vertical" use of data from specific sites or by content language. (A short sketch of querying the URL index follows after this list.)
  • Anna Rogers, University of Copenhagen
    Big Corpus Linguistics: Lessons from the BigScience Workshop
    The continued growth of large language models and their wide-scale adoption in commercial applications make it increasingly important to investigate their training data, both for research and ethical reasons. However, inspecting such large corpora has been problematic due to difficulties with data access and the need for large-scale infrastructure. This talk will discuss some lessons learned during the BigScience workshop, as well as an ongoing effort to investigate the 1.6 TB multilingual ROOTS corpus.
  • Pedro Ortiz Suarez, University of Mannheim and DFKI
    The OSCAR Project: Improving Data Quality in Multilingual Heterogeneous Web-Based Corpora
    In this talk we will introduce the OSCAR project and present our recent efforts in overcoming the difficulties posed by the heterogeneity, noisiness and size of web resources, in order to produce higher-quality textual data for as many languages as possible. We will also discuss recent developments in the project, including our data-processing pipelines to annotate and classify large amounts of textual data on constrained infrastructures, as well as our first attempts to become a fully open-source project and manage our growing community. Finally, we will present how the OSCAR initiative is currently collaborating with other projects in order to improve data quality and availability.
  • Zeerak Talat, Simon Fraser University
  • Ivan Vulić, University of Cambridge
    Modular and Parameter-Efficient Adaptation of Multilingual NLP Models
    A key challenge in multilingual NLP is developing general language-independent architectures that will be equally applicable to any language. However, this ambition is hindered by the large variation in 1) structural and semantic properties of the world’s languages, as well as 2) raw and task data scarcity for many different languages, tasks, and application domains. As a consequence, existing language technology is still largely limited to a handful of resource-rich languages, leaving the vast majority of the world’s 7,000+ languages and their speakers behind, thus amplifying the problem of the “digital language divide”. In this lecture, we will demonstrate that modularity enables widening the reach of multilingual NLP to minor and low-resource languages and communities, also boosting efficiency and reusability of models' constituent components: modules. We will introduce a range of recent modular and parameter-efficient techniques, additionally pointing to their high-level similarities and differences, that aim to deal with large cross-language variations and low-data learning regimes. We will also demonstrate that low-resource languages, despite very positive research trends and results achieved in recent years, still lag behind major languages in terms of performance, resources, overall representation in NLP research and other key aspects, and will outline several crucial challenges for future research in this area.
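
As a concrete companion to Teven Le Scao’s tutorial above, here is a minimal, illustrative sketch of loading one of the openly released BLOOM checkpoints through the Hugging Face transformers library. The model identifier bigscience/bloom-560m (the smallest member of the BLOOM family) is chosen purely so the example runs on modest hardware; it is not part of the winter-school materials.

```python
# Minimal sketch: text generation with an open-access BLOOM checkpoint.
# Assumes `pip install transformers torch`; bigscience/bloom-560m is the
# smallest BLOOM variant and fits comfortably on CPU or a small GPU.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=25, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```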
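Likewise, as a companion to Sebastian Nagel’s overview, here is a hedged sketch of the ‘exploration and statistics’ use case: querying the public Common Crawl URL index. The crawl label CC-MAIN-2023-06 is an assumed, illustrative identifier; current crawls are listed at https://index.commoncrawl.org/.

```python
# Minimal sketch: list a few page captures for a domain in the Common Crawl
# URL index. Responses are newline-delimited JSON records, one per capture.
import json
import requests

# Illustrative crawl identifier; replace with a current one.
API = "https://index.commoncrawl.org/CC-MAIN-2023-06-index"
params = {"url": "commoncrawl.org/*", "output": "json", "limit": "5"}

response = requests.get(API, params=params, timeout=30)
response.raise_for_status()
for line in response.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"], record.get("status"))
```
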
Monday, February 6, 2023
13:00–14:00 Lunch
14:00–15:30 Session 1
15:30–15:50 Coffee Break
15:50–17:20 Session 2
17:20–17:40 Coffee Break
17:40–19:10 Session 3
19:30 Dinner
21:00 Evening Session 1
Tuesday, February 7, 2023
Breakfast is available from 07:30
08:30–10:00 Session 4
Lunch is available between 13:00 and 14:30
15:00–16:20 Session 5
16:20–16:40 Coffee Break
16:40–18:00 Session 6
18:00–18:10 Coffee Break
18:10–19:30 Session 7
19:30 Dinner
21:00 Evening Session 2

Wednesday, February 8, 2023
Breakfast is available from 07:30
08:30–10:00 Session 8
10:00–10:30 Coffee Break
10:30–12:00 Session 9
12:30–13:30 Lunch

Registration

Registration is now closed. The 2023 winter school was heavily over-subscribed.

In total, we anticipated up to 60 participants in the 2023 Winter School. Requests for participation, submitted through our online registration form, were processed on a first-come, first-served basis, with an eye toward regional balance. Interested parties who submitted the registration form were confirmed in three batches: one on December 5, another on December 12, and a final one after the closing date for registration, Thursday, December 15, 2022.

Once confirmed by the organizing team, participant names are published on this page, and registration establishes a binding agreement with the hotel. Cancellations therefore incur a fee (unless we can find someone else to ‘take over’ a last-minute space), and no-shows will be charged the full price for at least one night by the hotel.

Logistics

With a few exceptions, winter school participants travel to and from the conference hotel jointly on a chartered bus (the HPLT shuttle). The bus will leave OSL airport no later than 9:30 CET on Monday, February 6. Please meet by 9:15 and make your arrival known to your assigned ‘tour guide’ (who will introduce themselves to you by email beforehand).

The group will gather near the bus and taxi information booth in the downstairs arrivals area, just outside the international arrivals luggage claim and slightly to the right as one exits the customs area: the yellow dot numbered (17) on the OSL arrivals map. The group will then walk over to the bus terminal, to leave the airport by 9:30. The drive to the Skeikampen conference hotel will take about three hours, and the bus will make one stop along the way to stretch our legs and fill up on coffee.

The winter school will end with lunch on Wednesday, February 8, before the group returns to OSL airport on the HPLT shuttle. The bus will leave Skeikampen at 14:00 CET, with an expected arrival time at OSL around 17:00 to 17:30 CET.

Organization

The 2023 Winter School is organized by a team of volunteers from the NLPL and HPLT networks; please see below. For all inquiries regarding registration, the programme, logistics, and the like, please contact hplt-training@ifi.uio.no.

The programme committee (regrettably lacking in diversity) comprises:

  • Hans Eide (Uninett Sigma2, Norway)
  • Filip Ginter (University of Turku, Finland)
  • Barry Haddow (University of Edinburgh, UK)
  • Jan Hajič (Charles University in Prague, Czech Republic)
  • Daniel Hershcovich (University of Copenhagen, Denmark)
  • Marco Kuhlmann (Linköping University, Sweden)
  • Andrey Kutuzov (University of Oslo, Norway)
  • Joakim Nivre (RISE and Uppsala University, Sweden)
  • Stephan Oepen (University of Oslo, Norway)
  • Sampo Pyysalo (University of Turku, Finland)
  • Gema Ramirez (Prompsit Language Engineering, Spain)
  • Magnus Sahlgren (AI Sweden)
  • David Samuel (University of Oslo, Norway)
  • Jörg Tiedemann (University of Helsinki, Finland)

Participants

  1. Mehdi Ali (Fraunhofer IAIS)
  2. Chantal Amrhein (University of Zurich)
  3. Mark Anderson (Norsk Regnesentral)
  4. Nikolay Arefev (University of Oslo)
  5. Mikko Aulamo (University of Helsinki)
  6. Elisa Bassignana (IT University of Copenhagen)
  7. Emily M. Bender (University of Washington)
  8. Vladimír Benko (Slovak Academy of Sciences)
  9. Nikolay Bogoychev (University of Edinburgh)
  10. Dhairya Dalal (University of Galway)
  11. Annerose Eichel (University of Stuttgart)
  12. Kenneth Enevoldsen (Aarhus University)
  13. Mehrdad Farahani (Chalmers University of Technology)
  14. Ona de Gibert (University of Helsinki)
  15. Janis Goldzycher (University of Zurich)
  16. Jan Hajič (Charles University in Prague)
  17. Jindřich Helcl (Charles University in Prague)
  18. Oskar Holmström (Linköping University)
  19. Sami Itkonen (University of Helsinki)
  20. Shaoxiong Ji (University of Helsinki)
  21. Antonia Karamolegkou (University of Copenhagen)
  22. Nina Khairova (Umeå University)
  23. Marco Kuhlmann (Linköping University)
  24. Per Egil Kummervold (National Library of Norway)
  25. Andrey Kutuzov (University of Oslo)
  26. Jelmer van der Linde (University of Edinburgh)
  27. Pierre Lison (Norsk Regnesentral)
  28. Nikola Ljubešić (Jožef Stefan Institute & University of Ljubljana)
  29. Yan Meng (University of Amsterdam)
  30. Max Müller-Eberstein (IT University of Copenhagen)
  31. Sebastian Nagel (Common Crawl)
  32. Graeme Nail (University of Edinburgh)
  33. Anna Nikiforovskaja (Université de Lorraine)
  34. Irina Nikishina (Universität Hamburg)
  35. Joakim Nivre (RISE and Uppsala University)
  36. Stephan Oepen (University of Oslo)
  37. Anders Jess Pedersen (Alexandra Institute)
  38. Laura Cabello Piqueras (University of Copenhagen)
  39. Myrthe Reuver (Vrije Universiteit Amsterdam)
  40. Anna Rogers (University of Copenhagen)
  41. Frankie Robertson (University of Jyväskylä)
  42. Javier De La Rosa (National Library of Norway)
  43. Phillip Rust (University of Copenhagen)
  44. Egil Rønnestad (University of Oslo)
  45. David Samuel (University of Oslo)
  46. Diana Santos (University of Oslo)
  47. Teven Le Scao (Hugging Face)
  48. Yves Scherrer (University of Helsinki)
  49. Edoardo Signoroni (Masaryk University)
  50. Michal Štefánik (Masaryk University)
  51. Pedro Ortiz Suarez (University of Mannheim and DFKI)
  52. Zeerak Talat (Simon Fraser University)
  53. Jörg Tiedemann (University of Helsinki)
  54. Samia Touileb (University of Bergen)
  55. Teemu Vahtola (University of Helsinki)
  56. Thomas Vakili (Stockholm University)
  57. Dušan Variš (Charles University in Prague)
  58. Tea Vojtěchová (Charles University in Prague)
  59. Ivan Vulić (University of Cambridge)
  60. Nicholas Walker (Norsk Regnesentral)
  61. Sondre Wold (University of Oslo)
  62. Jaume Zaragoza-Bernabeu (Prompsit)