Background

The goal is to organize the provisioning of software (for NLP research) in a manner that makes it possible and cost-efficient to maintain the exact same software stack on multiple systems. Here, systems initially means different superclusters, e.g. Puhti in Finland and Saga in Norway; sometime in 2021, we anticipate additionally supporting the LUMI environment. As part of the NLPL use case in EOSC-Nordic, we are evaluating EasyBuild for this purpose.

Desk Pilot

To get started, we set out to re-create one common stack of NLPL modules in a fully automated EasyBuild configuration, viz. Python 3.7.4, NumPy 1.18.1, the SciPy Bundle (SciPy 1.4.1, scikit-learn 0.22.1, IPython 7.11.1, Matplotlib 3.1.2, Pandas 0.23.1), and TensorFlow 1.15.2. For an additional thrill, there should be two versions of NumPy, one installed with the MKL backend, the other without (using the default, which we believe is OpenBLAS). All modules should be maximally optimized for the available hardware, and TensorFlow should be built on top of CUDA/10.0 and cuDNN/7.6.4.

Ideally, this choice near the bottom of the dependency tree should not propagate into the higher-level modules, i.e. we would hope to have only one instance of the SciPy bundle or TensorFlow, and they would interoperate seamlessly with either choice for NumPy. Furthermore, we are interested in re-using system-wide modules on Saga, i.e. preferably the NLPL add-on module stack should not include its own version of the core Python interpreter, nor of the MKL, CUDA, or cuDNN libraries. To distinguish system-wide from NLPL-specific modules, we would want to prefix the names of our own modules with 'nlpl-'. At the same time, module identities should not be unnecessarily specific: for example, CUDA versions are independent of toolchains, so their modules should be toolchain-agnostic.

Toward a Blueprint

pilot very successful; conclusion is that EasyBuild is a good tool.

currently ongoing trial usage by UiO research group, including in teaching

validation on a second system remains to be done

now need to decide on a design for production usage, i.e. module naming, whether to use hierarchical Lmod, or maybe have multiple independent module collections

in principle, we want to be able to provide, say, TensorFlow with different versions of Python, different versions of CUDA, and different toolchains (with or without MKL); in other words, there are some parameters that we want to be able to vary flexibly, yielding a multi-dimensional matrix of different specific configurations. need to decide which dimensions to support.

not all dimensions apply to all modules: Python does not apply to non-Python code; MKL only to code that requires BLAS-like optimization; CUDA only to GPU-enabled code.
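
To get a feel for how quickly the matrix grows, the following is a minimal Python sketch that enumerates configurations per module. The dimension values (including a second Python and CUDA version) and the module names are hypothetical and serve only as illustration; which dimensions apply to which module follows the note above.

```python
from itertools import product

# Hypothetical dimension values, for illustration only.
DIMENSIONS = {
    "python": ["3.7.4", "3.8.2"],
    "cuda": ["10.0", "10.1"],
    "blas": ["mkl", "openblas"],
}

# Assumed applicability of dimensions per module (hypothetical module names).
APPLICABLE = {
    "tensorflow": ["python", "cuda", "blas"],  # Python, GPU-enabled, BLAS-heavy
    "numpy": ["python", "blas"],               # Python, BLAS-heavy, no GPU
    "some-c-tool": [],                         # non-Python, no BLAS, no GPU
}

def configurations(module):
    """Return one dict per point in the matrix spanned by the module's dimensions."""
    dims = APPLICABLE[module]
    return [dict(zip(dims, values))
            for values in product(*(DIMENSIONS[d] for d in dims))]

if __name__ == "__main__":
    for module in APPLICABLE:
        print(module, len(configurations(module)))
```

A module with no applicable dimensions still yields exactly one (empty) configuration, so every dimension dropped from support halves or better the number of builds for the modules it applies to.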

putting all of these parameters into the module name, i.e. 'fully explicit naming', is not the most direct path to a happy user experience. that is almost what we currently have under this philosophy, though CUDA is missing from the names.
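
As an illustration of why fully explicit names become unwieldy, here is a small Python sketch that composes a module name from every parameter that applies. The name layout and the explicit_name helper are hypothetical; they do not reproduce the naming currently used on any of the systems.

```python
def explicit_name(name, version, python=None, blas=None, cuda=None, prefix="nlpl-"):
    """Compose a fully explicit module name (hypothetical layout).

    Parameters that do not apply to a module are left as None and omitted.
    """
    parts = [version]
    if python:
        parts.append("python-" + python)
    if blas:
        parts.append(blas)
    if cuda:
        parts.append("cuda-" + cuda)
    return prefix + name + "/" + "-".join(parts)

if __name__ == "__main__":
    print(explicit_name("tensorflow", "1.15.2", python="3.7.4", blas="mkl", cuda="10.0"))
    print(explicit_name("numpy", "1.18.1", python="3.7.4", blas="mkl"))
```

Every supported dimension adds another segment that users must type correctly and keep mutually consistent across all the modules they load.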

we should consider alternative ways of naming, versioning, and organizing collections of modules. one scenario is for a user to activate a specific Python version, then a CUDA version, at which point the set of modules compatible with those choices becomes available.
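
The scenario can be mimicked with a toy model in Python, in which a module only becomes visible once everything it builds on has been loaded. This is a sketch of the user-visible effect only, not of how Lmod or EasyBuild implement hierarchies; the catalogue and the available function are hypothetical.

```python
# Hypothetical catalogue: each module lists the modules that must be loaded first.
CATALOGUE = {
    "Python/3.7.4": [],
    "CUDA/10.0": [],
    "nlpl-numpy/1.18.1": ["Python/3.7.4"],
    "nlpl-tensorflow/1.15.2": ["Python/3.7.4", "CUDA/10.0"],
}

def available(loaded):
    """Return the not-yet-loaded modules whose prerequisites are all loaded."""
    loaded = set(loaded)
    return sorted(
        name
        for name, requires in CATALOGUE.items()
        if name not in loaded and loaded.issuperset(requires)
    )

if __name__ == "__main__":
    print(available([]))
    print(available(["Python/3.7.4"]))
    print(available(["Python/3.7.4", "CUDA/10.0"]))
```

In this model the module names stay short, because the Python and CUDA versions are carried by what the user has already loaded rather than by the name itself.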

Hierarchical naming scheme?