Difference between revisions of "Infrastructure/codface"
Revision as of 21:43, 26 June 2025
(0) CodFace: LLM Infrastructure
This page is an internal discussion document. Working with LLMs requires some collaboration infrastructure that, preferably, should be uniform across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. A current de facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which after all is a closed and commercially motivated service. Following on from the NLPL notion of a “virtual laboratory”, it will be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems in a Nordic or European perspective, such that LLM development can increasingly build on a unified environment and duplicative effort is reduced. The UiO Language Technology Group (LTG) coordinates data management in the European HPLT (where Sigma2 is also a consortium member) and OpenEuroLLM projects, where the work plan minimally calls for providing an LLM download repository. In a broader perspective, open and self-hosted LLM infrastructures will also be required in other initiatives, e.g. the Norwegian AI Factory and the Language Model Factory at the Norwegian National Library.
(1) Sharing of Training (and Evaluation) Data
LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON Lines files. Common data sets – such as mC4, FineWeb, or HPLT – provide metadata that can be used for selection, filtering, or inspection. Similarly (if to a lesser degree), model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, though these are typically of much smaller size.
Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work. An alternative solution would be a centralized repository that is made available (read-only and world-readable) on relevant HPC systems, for example using a tiered, caching file system. To keep things simple, this could be implemented without any notion of access control, which would limit the scope of the repository to publicly available resources. To users, ideally, this should appear like a path in the local file system. Such online sharing of LLM data across EuroHPC systems is a task in WP3 of OpenEuroLLM (coordinated by UiO), initially in the form of a curated community directory mirrored across systems.
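As a purely illustrative sketch of how users might consume such a community directory, the snippet below streams a compressed JSON Lines shard and filters documents by metadata. The field names (`lang`, `doc_scores`) are assumptions modelled loosely on HPLT-style annotations, not a fixed schema:

```python
import gzip
import json
import os
import tempfile

# Hypothetical records in the style of HPLT/FineWeb shards: one JSON object
# per line, with document text plus metadata usable for filtering.  The
# field names ("lang", "doc_scores") are illustrative assumptions.
records = [
    {"text": "Hei verden", "lang": "nob", "doc_scores": [0.9]},
    {"text": "Hello world", "lang": "eng", "doc_scores": [0.8]},
    {"text": "Hallo Welt", "lang": "deu", "doc_scores": [0.4]},
]

shard = os.path.join(tempfile.mkdtemp(), "shard.jsonl.gz")
with gzip.open(shard, "wt", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

def select(path, lang, min_score):
    """Stream one compressed shard, yielding documents that match."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if doc["lang"] == lang and doc["doc_scores"][0] >= min_score:
                yield doc

kept = list(select(shard, "nob", 0.5))
print(len(kept))  # 1
```

Because the shard is streamed line by line, the same pattern scales to multi-terabyte collections without loading anything fully into memory.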
(2) Large Language Model Repository
Abstractly parallel to data sharing, but with a larger and more diverse target user group, pre-trained LLMs are now part of a broad range of AI work. MSc and PhD student projects often want to build on off-the-shelf models in a framework like Hugging Face Transformers. An independent model repository must minimally comprise two components: (i) structured and versioned storage of models, including metadata, support files (e.g. the tokenizer), and optionally interim checkpoints; and (ii) a web service for faceted browsing of available models (e.g. by language support, architecture, size) and download. For ease of use, the repository should be interoperable with standard LLM software stacks, notably the Transformers library.
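To make the interoperability requirement concrete, the sketch below checks that a model directory follows the minimal file layout that `transformers.AutoModel.from_pretrained()` can load from a local path. The exact required files vary by model type, so the `REQUIRED` set here is an illustrative assumption:

```python
import json
import os
import tempfile

# Minimal file inventory that tends to make a model directory loadable via
# transformers.AutoModel.from_pretrained(path); the exact set varies by
# model type, so REQUIRED is an illustrative assumption.
REQUIRED = {"config.json", "tokenizer.json", "model.safetensors"}

def is_transformers_layout(path):
    """Check that a local model directory contains the expected files."""
    return REQUIRED <= set(os.listdir(path))

model_dir = tempfile.mkdtemp()
for name in REQUIRED:
    open(os.path.join(model_dir, name), "w").close()
with open(os.path.join(model_dir, "config.json"), "w") as f:
    json.dump({"model_type": "bert"}, f)

print(is_transformers_layout(model_dir))  # True
```

A validation step of this kind could run on upload, so that every published model is guaranteed to load with the standard software stack.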
To reduce duplicative work, (iii) a curated subset of available models should be locally available on relevant HPC systems, for example in a form similar to the OpenEuroLLM community directory for LLM training data (see above). Finally, in a longer-term perspective, the model repository should be complemented with (iv) hosted, on-demand inference services, for users to deploy models from the repository and in a cloud-based environment; since early 2024, LTG and Sigma2 have gathered preliminary experience through a limited inference pilot.
Design and Implementation
The NLPL Vectors Repository was an early attempt at building and sharing a systematic collection of (old-school) language models, including some structured metadata. The existing repository contains about 220 models (most of them “classic” or contextualized word embeddings), available for download through a (simple) faceted search interface and for direct loading from the NLPL community directories on Saga and Puhti. However, packaging each model as a Zip file is not directly interoperable with the popular Hugging Face ecosystem. The organization of models and metadata should be redesigned, storage, download, and synchronization across systems consolidated, and the “vectors repository” re-branded as something more modern, e.g. the NLPL Model Repository. This is a task in WP6 of HPLT (coordinated by UiO, with involvement of Sigma2) that needs to be completed in the fall of 2025.
Following the Hugging Face experience, LTG suggests organizing the core of the model repository in Git, with Large File Storage (LFS) support. In this scheme, each model would be organized as one Git repository, which naturally supports core notions like ownership, management of access rights, versioning (aka revision management), and bundling of files and metadata that jointly comprise each model. For the HPLT model repository pilot, it will be desirable that Sigma2 provide and maintain a Git service, which initially would need to support at least a few thousand repositories (amounting to a few terabytes of storage) and write access by a handful of HPLT users.
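A minimal sketch of bootstrapping one such model repository: large weight files are routed to LFS through standard `.gitattributes` tracking rules, while metadata and configuration remain ordinary Git objects. The pattern list is an assumption about which weight formats a repository would contain, not a fixed policy:

```python
import os
import tempfile

# Weight-file patterns routed to Git LFS; the list is an assumption about
# which formats a model repository would contain.
LFS_PATTERNS = ["*.safetensors", "*.bin", "*.gguf"]

def write_gitattributes(repo_dir, patterns):
    """Emit the standard Git LFS tracking rules into .gitattributes."""
    lines = [f"{p} filter=lfs diff=lfs merge=lfs -text" for p in patterns]
    with open(os.path.join(repo_dir, ".gitattributes"), "w") as f:
        f.write("\n".join(lines) + "\n")
    return lines

repo = tempfile.mkdtemp()
rules = write_gitattributes(repo, LFS_PATTERNS)
print(len(rules))  # 3
```

Since `.gitattributes` is itself versioned, the tracking policy travels with each model repository and stays consistent across clones.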
For model distribution, a designated web service (for faceted browsing and download) would have to be developed. LTG and HPLT partner Prompsit could take part in the design and implementation of the web interface, but it should preferably be hosted and operated by Sigma2. Technically, one can imagine a solution where a dedicated server (e.g. a virtual machine) continually keeps an on-disk clone of all Git model repositories and runs a web service, serving dynamic pages (based on a suitable framework). The [https://opus.nlpl.eu NLPL Opus
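The indexing step that such a dedicated server could run over its on-disk clones can be sketched as follows; the `metadata.json` file name, its fields, and the example model names are all hypothetical:

```python
import json
import os
import tempfile
from collections import defaultdict

def build_facets(root):
    """Read metadata.json from each model clone and build facet indices."""
    facets = defaultdict(lambda: defaultdict(list))
    for name in sorted(os.listdir(root)):
        meta_path = os.path.join(root, name, "metadata.json")
        if not os.path.isfile(meta_path):
            continue
        with open(meta_path) as f:
            meta = json.load(f)
        # Hypothetical facet keys; a real schema would be agreed in HPLT WP6.
        for key in ("language", "architecture"):
            facets[key][meta.get(key, "unknown")].append(name)
    return facets

# Two hypothetical model repositories with minimal metadata.
root = tempfile.mkdtemp()
for name, meta in [("norbert3", {"language": "nob", "architecture": "bert"}),
                   ("hplt-gpt", {"language": "eng", "architecture": "gpt"})]:
    os.mkdir(os.path.join(root, name))
    with open(os.path.join(root, name, "metadata.json"), "w") as f:
        json.dump(meta, f)

facets = build_facets(root)
print(facets["language"]["nob"])  # ['norbert3']
```

Rebuilding the index after each synchronization pass would keep the faceted browsing view consistent with the Git clones, without requiring a separate database as the source of truth.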
How to facilitate shared responsibility (over time) and degrees of “community self-help”?
Mechanisms for user contributions, community discussion, ticket management, and such?
Sharing across clusters and to compute nodes without external access, e.g. via CernVM-FS?
(3) Inference Endpoints
The current limited LTG API Pilot is built on the Gradio framework and hosted at NIRD. Functionality and scalability are severely limited, as the current instance has only a static allocation of two V100 GPUs.