
From Nordic Language Processing Laboratory
Revision as of 08:27, 31 January 2025 by Oe (talk | contribs) (Model Repository)

CodFace: LLM Infrastructure

This page is an internal discussion document. Working with LLMs requires collaboration infrastructure that, preferably, should be uniform across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. The current de-facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which is, after all, a closed and commercially motivated service. Following the NLPL notion of a “virtual laboratory”, it will be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems in a Nordic or European perspective, so that LLM development can increasingly build on a unified environment and duplicative effort is reduced.

Data Sharing

LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON Lines files. Common data sets – e.g. mC4, FineWeb, or HPLT – provide metadata that can be used for selection, filtering, or inspection. Similarly (if to a lesser degree), model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, typically of much smaller size.
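Metadata-based selection from such files is typically a streaming pass over compressed JSON Lines. A minimal sketch (the field names `lang` and `text` are assumptions; actual schemas differ between mC4, FineWeb, and HPLT releases):

```python
import gzip
import json
from typing import Iterator

def filter_jsonl(path: str, lang: str, min_chars: int = 200) -> Iterator[dict]:
    """Stream records from a gzip-compressed JSON Lines file, keeping
    only those whose metadata matches the given criteria."""
    with gzip.open(path, "rt", encoding="utf-8") as stream:
        for line in stream:
            record = json.loads(line)
            # 'lang' and 'text' are hypothetical field names; consult the
            # metadata documentation of the actual data set before use.
            if record.get("lang") == lang and len(record.get("text", "")) >= min_chars:
                yield record
```

Because the function yields records lazily, it can be applied to terabyte-scale shards without loading any file into memory.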

Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work. An alternative solution could be a centralized repository that is made available (read-only) on relevant HPC systems, for example using a tiered, caching file system. To keep things simple, this could be implemented without any notion of access control rights, which would limit the scope of the repository to publicly available resources. To users, ideally, this should appear like a path in the local file system. On-line sharing of LLM data across EuroHPC systems is a task in WP3 of OpenEuroLLM (coordinated by UiO).
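If the repository appears as a path in the local file system, user code only needs a small amount of glue to locate it across systems. A sketch, assuming a hypothetical `NLPL_DATA_ROOT` environment variable and made-up mount points (the actual locations would be decided per system):

```python
import os
from pathlib import Path

def resolve_data_root() -> Path:
    """Return the first existing candidate for the shared, read-only
    data repository, so user code can treat it as an ordinary local path."""
    candidates = [
        os.environ.get("NLPL_DATA_ROOT"),        # hypothetical explicit override
        "/cluster/shared/nlpl/data",             # made-up Saga-style mount point
        "/scratch/project_000000/nlpl/data",     # made-up LUMI-style mount point
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_dir():
            return Path(candidate)
    raise FileNotFoundError("no shared NLPL data repository mounted on this system")
```

With this convention, the same script runs unchanged on any system where the repository (or a cached tier of it) is mounted.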

Model Repository

Abstractly parallel to data sharing, but with a larger and more diverse target user group, pre-trained LLMs are now part of a broad range of AI work. MSc and PhD student projects often want to build on off-the-shelf models in a framework like Hugging Face Transformers.

The NLPL Vectors Repository was an early attempt at building and sharing a systematic collection of (old-school) language models, including some structured metadata. The existing repository contains about 220 models (most of them “classic” or contextualized word embeddings), available for download through a faceted search interface and for direct loading from the NLPL community directories on Saga and Puhti. However, each model is packaged as a single Zip file, which is not directly interoperable with the popular Hugging Face ecosystem.
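A redesigned repository would presumably pair each model with a structured metadata record that drives faceted search and interoperability checks. A minimal sketch of what such a record and its validation could look like (all field names are assumptions, not the existing NLPL schema):

```python
# Hypothetical facets, loosely inspired by the existing repository's faceted
# search interface; the real schema would be fixed during the redesign.
REQUIRED_FIELDS = {"id", "name", "framework", "languages", "license", "version"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems with a model metadata record
    (an empty list means the record is acceptable)."""
    problems = [f"missing field: {field}"
                for field in sorted(REQUIRED_FIELDS - record.keys())]
    if not isinstance(record.get("languages", []), list):
        problems.append("'languages' must be a list of language codes")
    return problems

# Illustrative record with made-up values.
example = {
    "id": 300,
    "name": "example-encoder-base",
    "framework": "transformers",
    "languages": ["nb", "nn"],
    "license": "apache-2.0",
    "version": "1.0",
}
```

Validating records at contribution time would keep the collection uniform enough for automated synchronization and indexing.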

The organization of models and metadata should be redesigned, hosting and synchronization across systems consolidated, and the “vectors repository” re-branded as something more modern, e.g. the Nordic Language Model Repository. This is a task in WP6 of HPLT (coordinated by UiO).

Implementation Notes

* Management of metadata and versioned updates?
* Master repository in Git, object storage, a managed directory collection, or something else?
* How to facilitate shared responsibility (over time) and a high degree of “community self-help”?
* Mechanisms for user contributions, community discussion, ticket management, and such?
* Sharing across clusters and to compute nodes without external access, e.g. CernVM-FS?

Additionally, one probably wants a download service and a searchable index.
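A searchable index over per-model metadata can start very simply, e.g. as an inverted index from facet values to model identifiers. A sketch under the same assumed field names as above:

```python
from collections import defaultdict

def build_index(records: list[dict]) -> dict[str, set]:
    """Map 'field=value' facet strings to the set of model ids that match,
    which is enough to back a basic faceted search interface."""
    index = defaultdict(set)
    for record in records:
        for field, value in record.items():
            # List-valued facets (e.g. languages) contribute one entry per value.
            values = value if isinstance(value, list) else [value]
            for v in values:
                index[f"{field}={v}"].add(record["id"])
    return index
```

Intersecting the sets for several facet strings then answers conjunctive queries ("all Transformers models covering Norwegian") without any external search infrastructure.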

Inference Endpoints