CodFace: Infrastructure for LLM Development & Use
This page is an internal discussion document. Working with LLMs requires collaboration infrastructure that, preferably, should be shared across user communities and compute environments. Major components include (a) sharing of training and evaluation data, (b) sharing of pre-trained models, and (c) hosting of inference endpoints. The current de-facto standard is the Hugging Face Hub. For technical and strategic reasons, it is tempting to explore reduced dependency on the Hugging Face ecosystem, which is, after all, a closed and commercially operated service. Following the NLPL notion of a “virtual laboratory”, it could be worthwhile to work towards infrastructure and services for data and model sharing across HPC systems (in a Nordic or European perspective), so that LLM development can increasingly build on a uniform environment and duplication of effort is reduced.
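To make the Hub dependency concrete, the following is a minimal sketch (in Python, using the transformers library; the model identifier and the local path are placeholders, not actual shared locations) contrasting the default pull-from-Hub pattern with loading from a shared, on-system copy of the same model:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Default pattern: weights and tokenizer are fetched from the Hugging Face
    # Hub on first use and cached locally (under HF_HOME).
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    # With a shared, system-local model store, the same calls would point at a
    # filesystem path instead, removing the network dependency on the Hub.
    # The path below is hypothetical.
    shared_path = "/cluster/shared/models/gpt2"
    model = AutoModelForCausalLM.from_pretrained(shared_path)
    tokenizer = AutoTokenizer.from_pretrained(shared_path)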
Data Sharing
LLM pre-training requires large volumes of textual data, often a few terabytes of compressed JSON-lines files. Common data sets – e.g. FineWeb or HPLT – provide metadata that can be used for selection, filtering, or inspection. Similarly, if to a lesser degree, model adaptation – fine-tuning, domain or task adaptation, alignment, etc. – and evaluation also build on shared public data sets, though these are typically much smaller.
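As a minimal sketch of such metadata-driven selection (assuming gzip-compressed JSON-lines shards with hypothetical "language", "text", and "url" fields; actual field names and shard layouts differ between FineWeb, HPLT, and other collections):

    import gzip
    import json

    def filter_shard(path, language="en", min_chars=200):
        """Yield records from one compressed JSON-lines shard that match
        a language tag and a minimum document length."""
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                record = json.loads(line)
                if (record.get("language") == language
                        and len(record.get("text", "")) >= min_chars):
                    yield record

    # Example usage over a single (hypothetical) shard file.
    for record in filter_shard("shard-000.jsonl.gz"):
        print(record["url"])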
Currently, different projects (or individual users) tend to establish and manage their own copies of relevant data on the system where they work.
An alternative solution could be a centralized, read-only copy of common data sets on each system, maintained once and shared across projects.