CodFace: LLM Infrastructure
This page is an internal discussion document. Working with LLMs requires collaboration infrastructure that, preferably, should be shared across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. The current de-facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which is, after all, a closed and commercially operated service. Following the NLPL notion of a “virtual laboratory”, it could be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems (in a Nordic or European perspective), so that LLM development can increasingly build on a uniform environment and duplicative effort is reduced.
Data Sharing
LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON Lines files. Common data sets – such as mC4, FineWeb, or HPLT – provide metadata that can be used for selection, filtering, or inspection. Similarly (if to a lesser degree), model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, if typically of much smaller size.
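As an illustration of the metadata-based filtering such data sets enable, the following minimal sketch streams one gzip-compressed JSON Lines shard and keeps only documents in a given language. The shard name and the field names ("language", "text") are assumptions for illustration; each corpus defines its own schema.

<syntaxhighlight lang="python">
import gzip
import json

def filter_shard(path, language="en"):
    # Stream the shard line by line; records never all sit in memory.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            # Keep only documents whose metadata matches the filter.
            if record.get("language") == language:
                yield record["text"]

# Hypothetical shard name; real corpora ship thousands of such files.
for text in filter_shard("shard-00000.jsonl.gz"):
    print(text[:80])
</syntaxhighlight>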
Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work. An alternative solution could be a centralized repository that is made available (read-only) on relevant HPC systems, for example using a tiered, caching file system. To keep things simple, this could be implemented without any notion of access control, which would limit the scope of the repository to publicly available resources. Ideally, to users this should appear as a path in the local file system.
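A sketch of the intended user experience, assuming a hypothetical mount point /cluster/shared/nlpl: because the repository is just a read-only path, standard tooling works unchanged and no per-project copies are needed.

<syntaxhighlight lang="python">
from datasets import load_dataset

# Hypothetical mount point for the shared, read-only data repository.
corpus = load_dataset(
    "json",
    data_files="/cluster/shared/nlpl/data/hplt/**/*.jsonl.gz",
    split="train",
    streaming=True,  # iterate without copying terabytes to local scratch
)
for document in corpus.take(3):
    print(document.keys())
</syntaxhighlight>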
Model Repository
Abstractly parallel, but with a larger and more diverse target user group, pre-trained LLMs are now part of much and varied AI work. The NLPL Vectors Repository was an early attempt at building and sharing a systematic collection of (old-school) language models, including structured metadata. The existing repository contains some 200 models (most of them “classic” or contextualized word embeddings), available for download through a faceted search interface and for direct loading from the NLPL community directories on Saga and Puhti. However, packaging each model as a Zip file is not directly interoperable with the popular Hugging Face Transformers library.
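For comparison, Transformers can load a model directly from any local directory laid out in its native format (config.json, weight files, tokenizer files), without network access, but not from a Zip archive; hence the existing collection would need repackaging. The path below is hypothetical.

<syntaxhighlight lang="python">
from transformers import AutoModel, AutoTokenizer

# Hypothetical path into a shared model repository; any directory in
# the native Hugging Face layout loads without unpacking or downloading.
path = "/cluster/shared/nlpl/models/example-encoder"
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModel.from_pretrained(path)
</syntaxhighlight>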
The organization of models and metadata should be redesigned, hosting and synchronization across systems consolidated, and the “vectors repository” re-branded as something more modern, e.g. the Nordic Language Model Repository.
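To make the redesign discussion concrete, one structured metadata record per model could look like the following sketch. Every field name here is an assumption for discussion, not an existing schema.

<syntaxhighlight lang="python">
# Hypothetical metadata record for one entry in the redesigned repository.
model_record = {
    "id": "nlm/example-encoder",   # stable, repository-wide identifier
    "version": "1.0.0",            # supports versioned updates
    "framework": "transformers",   # loadable via from_pretrained()
    "architecture": "encoder",
    "languages": ["nb", "nn", "en"],
    "license": "apache-2.0",
    "training_data": ["hplt"],     # links back to the data repository
    "checksum": "sha256:…",        # integrity check during synchronization
}
</syntaxhighlight>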
Implementation Notes
Management of metadata and versioned updates?
Master repository in Git, object storage, a managed directory collection, or something else?
How to facilitate shared responsibility (over time) and high degrees of “community self-help”?
Mechanisms for user contributions, community discussion, ticket management, and such?
Sharing across clusters and to compute nodes without external access, e.g. CernVMFS?
Additionally, one probably wants a download service and a searchable index, as sketched below.
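A hedged sketch of what such a service could look like to end users; the endpoint, its URL, and the query parameters are all hypothetical, as no such API exists yet.

<syntaxhighlight lang="python">
import requests

# Hypothetical service URL for the searchable model index.
BASE = "https://repo.example.org/api"

# Query the index by metadata, then inspect the matching entries.
hits = requests.get(f"{BASE}/models", params={"language": "nb"}).json()
for hit in hits:
    print(hit["id"], hit["version"])
</syntaxhighlight>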