Infrastructure/codface

From Nordic Language Processing Laboratory
Revision as of 11:53, 30 January 2025 by Oe (talk | contribs) (work in progress)

CodFace: Infrastructure for LLM Development & Use

This page is an internal discussion document. Working with LLMs requires collaboration infrastructure that, preferably, should be shared across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. The current de-facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which after all is a closed and commercially operated service. Following the NLPL notion of a “virtual laboratory”, it could be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems (in a Nordic or European perspective), so that LLM development can increasingly build on a uniform environment and duplicate effort is reduced.

Data Sharing

LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON Lines files. Common data sets, such as FineWeb or HPLT, provide metadata that can be used for selection, filtering, or inspection. Similarly (if to a lesser degree), model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, though these are typically much smaller.
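As a minimal illustration of the kind of metadata-based selection such data sets support, the sketch below streams a compressed JSON Lines file and keeps only records matching a predicate. The field names (“text”, “language”) and file names are hypothetical placeholders, not the actual FineWeb or HPLT schema.

```python
import gzip
import json

def filter_jsonl(in_path, out_path, predicate):
    """Stream a gzip-compressed JSON Lines file, writing records
    that satisfy the predicate to a new compressed file."""
    kept = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            if predicate(record):
                dst.write(json.dumps(record, ensure_ascii=False) + "\n")
                kept += 1
    return kept

# Hypothetical sample with a made-up "language" metadata field.
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps({"text": "Hei verden", "language": "nob"}) + "\n")
    f.write(json.dumps({"text": "Hello world", "language": "eng"}) + "\n")

kept = filter_jsonl("sample.jsonl.gz", "filtered.jsonl.gz",
                    lambda r: r.get("language") == "nob")
print(kept)  # → 1
```

Because the file is processed line by line, this pattern scales to terabyte-sized corpora without loading anything but the current record into memory.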

Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work.

An alternative solution could be a centralized, shared data repository, mounted or mirrored across the participating HPC systems.


Model Repository

Inference Endpoints