CodFace: LLM Infrastructure
This page is an internal discussion document. Working with LLMs requires collaboration infrastructure that, preferably, should be shared across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. The current de-facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which is, after all, a closed and commercially operated service. Following the NLPL notion of a “virtual laboratory”, it could be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems (in a Nordic or European perspective), so that LLM development can increasingly build on a uniform environment and duplicative effort is reduced.
Data Sharing
LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON Lines files. Common data sets – such as mC4, FineWeb, or HPLT – provide metadata that can be used for selection, filtering, or inspection. Similarly (if to a lesser degree), model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, if typically of much smaller size.
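As an illustration of the metadata-based filtering such data sets enable, the following minimal sketch streams one gzip-compressed JSON Lines shard and keeps only documents in a given language. The shard name and the field names ("language", "text") are assumptions for illustration; each corpus defines its own schema.

<syntaxhighlight lang="python">
import gzip
import json

def filter_shard(path, language="en"):
    # Stream the shard line by line; records never all sit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # Keep only documents whose metadata matches the filter.
            if record.get("language") == language:
                yield record["text"]

# Hypothetical shard name; real corpora ship thousands of such files.
for text in filter_shard("shard-00000.jsonl.gz"):
    print(text[:80])
</syntaxhighlight>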
Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work. An alternative solution could be a centralized repository that is made available (read-only) on relevant HPC systems, for example using a tiered, caching file system. To keep things simple, this could be implemented without any notion of access control, which would limit the scope of the repository to publicly available resources. Ideally, to users this should appear as a path in the local file system.
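A sketch of the intended user experience, assuming a hypothetical mount point /cluster/shared/nlpl: because the repository is just a read-only path, standard tooling works unchanged and no per-project copies are needed.

<syntaxhighlight lang="python">
from datasets import load_dataset

# Hypothetical mount point for the shared, read-only data repository.
corpus = load_dataset(
    "json",
    data_files="/cluster/shared/nlpl/data/hplt/**/*.jsonl.gz",
    split="train",
    streaming=True,  # iterate without copying terabytes to local scratch
)
for document in corpus.take(3):
    print(document.keys())
</syntaxhighlight>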
Model Repository
Abstractly parallel, but with a larger and more diverse target user group, pre-trained LLMs are now part of much and varied AI work. The NLPL Vectors Repository was an early attempt at building and sharing a systematic collection of (old-school) language models, including structured metadata. The existing repository contains some 200 models (most of them “classic” or contextualized word embeddings), available for download through a faceted search interface and for direct loading from the NLPL community directories on Saga and Puhti. However, packaging each model as a Zip file is not directly interoperable with the popular Hugging Face Transformers library.
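For comparison, Transformers can load a model directly from any local directory laid out in its native format (config.json, weight files, tokenizer files), without network access, but not from a Zip archive; hence the existing collection would need repackaging. The path below is hypothetical.

<syntaxhighlight lang="python">
from transformers import AutoModel, AutoTokenizer

# Hypothetical path into a shared model repository; any directory in
# the native Hugging Face layout loads without unpacking or downloading.
path = "/cluster/shared/nlpl/models/example-encoder"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
</syntaxhighlight>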
The organization of models and metadata should be redesigned, hosting and synchronization across systems consolidated, and the “vectors repository” re-branded as something more modern, e.g. the Nordic Language Model Repository.
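To make the redesign discussion concrete, one structured metadata record per model could look like the following sketch. Every field name here is an assumption for discussion, not an existing schema.

<syntaxhighlight lang="python">
# Hypothetical metadata record for one entry in the redesigned repository.
model_record = {
    "id": "nlm/example-encoder",   # stable, repository-wide identifier
    "version": "1.0.0",            # supports versioned updates
    "framework": "transformers",   # loadable via from_pretrained()
    "architecture": "encoder",
    "languages": ["nb", "nn", "en"],
    "license": "apache-2.0",
    "training_data": ["hplt"],     # links back to the data repository
    "checksum": "sha256:…",        # integrity check during synchronization
}
</syntaxhighlight>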
Implementation Notes
Management of metadata and versioned updates?
Master repository in Git, object storage, a managed directory collection, or something else?
How to facilitate shared responsibility (over time) and high degrees of “community self-help”?
Mechanisms for user contributions, community discussion, ticket management, and such?
Sharing across clusters and to compute nodes without external access, e.g. CernVMFS?
Additionally, one probably wants a download service and a searchable index, as sketched below.
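A hedged sketch of what such a service could look like to end users; the endpoint, its URL, and the query parameters are all hypothetical, as no such API exists yet.

<syntaxhighlight lang="python">
import requests

# Hypothetical service URL for the searchable model index.
BASE = "https://repo.example.org/api"

# Query the index by metadata, then inspect the matching entries.
hits = requests.get(f"{BASE}/models", params={"language": "nb"}).json()
for hit in hits:
    print(hit["id"], hit["version"])
</syntaxhighlight>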