Eosc/horovod
Background
To evaluate different approaches to provisioning software collections to NLPL users in a uniform manner across different systems, this page provides a high-level description of a current user request (as of early 2020). This use case is discussed from various perspectives, including different approaches to software provisioning, interfaces to the host system and 'core' software modules, and requirements on the target compute systems (e.g. support for EasyBuild or container creation and execution).
High-Level Goal: Multi-GPU Training
Building on top of either PyTorch or TensorFlow, multiple NLPL users want to train models that require running on multiple GPUs (8 or 16, say; i.e. multiple nodes) for at least several days. One fashionable framework these days is Horovod, which among other things combines MPI and NCCL for multi-GPU and multi-node communication.
Providing a functional and effective Horovod installation is no small feat, and even the more technically sophisticated NLPL users may struggle with putting together all the right pieces. Instead, the following components should be pre-installed in the NLPL virtual laboratory, for users to activate and run with minimal effort: Horovod 0.19.0, with either PyTorch 1.4.0 or TensorFlow 1.15.2 as its deep learning (DL) backend, all in a Python 3.7.2 environment. Horovod has a number of dependencies, including a basic C++ tool chain, NCCL 2, and a suitable MPI implementation (e.g. OpenMPI 3.1.2 or 4.0.0, but not 3.1.3). The DL backends each bring their own set of dependencies, currently among other things CUDA 10.0 and cuDNN 7.6 for TensorFlow, and CUDA 10.1 for PyTorch. Additionally, irrespective of the choice of DL framework, users will need a range of discipline-specific Python add-ons, say Gensim 3.8.1 and spaCy 2.2.3 (including all its pre-trained models). It is required that all of these components can be loaded into one Python universe, i.e. the same process.
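The version constraints above can be written down as data and checked mechanically. The following is a minimal sketch only: the version pins come from the paragraph above, whereas the dictionary layout, the function name `check_stack`, and the idea of passing in the installed versions as a plain dictionary are assumptions made for illustration, not part of any existing NLPL tooling.

```python
# Version pins taken from the requirements paragraph; the names and the
# layout of these tables are hypothetical, chosen only for illustration.
REQUIRED = {
    "horovod": "0.19.0",
    "gensim": "3.8.1",
    "spacy": "2.2.3",
}
# Either DL backend is acceptable ("torch" is the package name of PyTorch).
BACKENDS = {"torch": "1.4.0", "tensorflow": "1.15.2"}
# OpenMPI releases named in the text as suitable and as unsuitable.
OPENMPI_NAMED_OK = {"3.1.2", "4.0.0"}
OPENMPI_EXCLUDED = {"3.1.3"}


def check_stack(installed, openmpi_version):
    """Return a list of problems; an empty list means the stack matches."""
    problems = []
    for name, wanted in REQUIRED.items():
        found = installed.get(name)
        if found != wanted:
            problems.append(f"{name}: wanted {wanted}, found {found}")
    if not any(installed.get(n) == v for n, v in BACKENDS.items()):
        problems.append("no DL backend in the requested version")
    if openmpi_version in OPENMPI_EXCLUDED:
        problems.append(f"OpenMPI {openmpi_version} is explicitly excluded")
    elif openmpi_version not in OPENMPI_NAMED_OK:
        problems.append(f"OpenMPI {openmpi_version} is not among the named versions")
    return problems


if __name__ == "__main__":
    example = {"horovod": "0.19.0", "gensim": "3.8.1",
               "spacy": "2.2.3", "torch": "1.4.0"}
    print(check_stack(example, "4.0.0"))
```

A check of this kind only compares version strings; it does not verify that the components were built against the same CUDA and MPI libraries, which is the harder part of the requirement that everything loads into the same process.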
Two Weeks Later: Gensim Updates
At last, Gensim 4.0 is released. Two of the most active Horovod users (one using PyTorch, the other a loyal TensorFlow user) desperately want to update their software environment, keeping everything as before but swapping out the older version of Gensim for the new release. But Gensim 4.0 does not maintain backwards compatibility with the 3.x releases, hence other NLPL users plead not to have their software environment altered until after they have submitted their doctoral theses.
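One way to picture this requirement is to treat each software environment as an immutable set of version pins, from which a new environment is derived by changing a single component while the original stays available. The sketch below is purely illustrative: the function `derive` and the environment contents are hypothetical and do not correspond to an actual module system, EasyBuild recipe, or Conda specification.

```python
from types import MappingProxyType

# Read-only description of the existing environment; the versions are
# those named on this page, the representation is an assumption.
BASE = MappingProxyType({
    "python": "3.7.2",
    "horovod": "0.19.0",
    "tensorflow": "1.15.2",
    "gensim": "3.8.1",
    "spacy": "2.2.3",
})


def derive(base, **changes):
    """Return a new read-only environment; `base` itself is never modified.

    A value of None removes the component from the derived environment.
    """
    merged = dict(base)
    for name, version in changes.items():
        if version is None:
            merged.pop(name, None)
        else:
            merged[name] = version
    return MappingProxyType(merged)


if __name__ == "__main__":
    updated = derive(BASE, gensim="4.0")
    print(BASE["gensim"], updated["gensim"])
```

Users who need stability keep loading `BASE`, while the two early adopters load the derived environment. The same pattern covers removing a component, e.g. `derive(BASE, horovod=None)`.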
Four Weeks Later: Performance Optimization
A team of NLPL developers sets out to adapt for Norwegian a large-scale experiment originally published by Google AI researchers, who report that they cumulatively used two TPU years (two weeks of exclusive access to 64 units) on this computation. From colleagues in Finland, they understand that running Horovod on top of the Intel MPI Library (instead of OpenMPI) can give a performance gain of up to 40 percent. Reliable Intel MPI support, however, requires installation of the most recent Horovod 0.20.0 release candidate (while keeping TensorFlow, Gensim, spaCy, et al. versions as before).
The Inevitable: Backgrading
At about the same time, a new NLPL user reaches out because they fail to get their code running in TensorFlow 1.15 and Python 3.7 (for reasons beyond their control). They ask for an installation of TensorFlow 1.12 in a Python 3.5 environment. They can make do without Horovod, but they badly need Gensim and spaCy.
Reflections: Responsibilities
The software stack sketched above combines a large number of modules. These can be categorized into three distinct layers, according to how closely they are tied to a specific HPC system and to what degree they serve specific subsets of users. In broad terms, these can be characterized as (a) core, (b) intermediate, or (c) custom modules.
Core components like the C++ tool chain, CUDA and associated extensions, different MPI implementations, or just a vanilla Python 3.7 interpreter arguably should be installed and maintained by the system administrators. They can be intricately linked to available hardware (CPU and GPU types, and the available interconnect) and should be expertly supported irrespective of individual disciplines or user groups.
Although less intricately tied to the host environment, general-purpose DL frameworks are not discipline-specific either and could in principle be considered part of the core software inventory provided with each HPC system. However, both PyTorch and TensorFlow provide optional and contributed extensions that are tailored for NLP (e.g. torchtext, keras-preprocessing, or the contributed CRF implementation in TensorFlow). Another deep learning framework (DyNet) appears to be used almost exclusively for natural language processing. Also, because the NLPL community in recent years appears to have been perhaps the most active user community for these frameworks on the Abel and Taito systems, it has been both convenient and efficient for the community to maintain its own installations of these DL frameworks.
Finally, tools and libraries like Gensim, spaCy, and others are specific to NLP and are in some cases co-developed by NLPL community members. There are about two dozen such tools that are widely used, i.e. will be required by multiple users. Some are easy to install, others less so; often, these tools provide pre-trained models (for a variety of languages) that need to be installed separately and can be somewhat space-consuming. These tools and their associated data should not be installed (redundantly) by users, but rather motivate the community ‘self-help’ approach of the NLPL virtual laboratory: expert members of the community are best positioned to install these tools and assist others in using them effectively.
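The three layers discussed above can be summarized as a small lookup table. This is a sketch under assumptions: the component-to-layer assignments follow the examples given in the text, but the names `Layer`, `LAYER_OF`, and `group_by_layer` are hypothetical, and components the text does not classify (Horovod itself, for instance) are deliberately left out.

```python
from enum import Enum


class Layer(Enum):
    CORE = "core"                  # maintained by system administrators
    INTERMEDIATE = "intermediate"  # general-purpose DL frameworks
    CUSTOM = "custom"              # NLP-specific, community-maintained


# Assignments follow the examples in the text; the table is illustrative
# and not an authoritative inventory.
LAYER_OF = {
    "C++ tool chain": Layer.CORE,
    "CUDA": Layer.CORE,
    "MPI": Layer.CORE,
    "Python": Layer.CORE,
    "PyTorch": Layer.INTERMEDIATE,
    "TensorFlow": Layer.INTERMEDIATE,
    "DyNet": Layer.INTERMEDIATE,
    "Gensim": Layer.CUSTOM,
    "spaCy": Layer.CUSTOM,
}


def group_by_layer(components):
    """Group component names by layer; unknown names are reported separately."""
    groups = {layer: [] for layer in Layer}
    unknown = []
    for name in components:
        layer = LAYER_OF.get(name)
        if layer is None:
            unknown.append(name)
        else:
            groups[layer].append(name)
    return groups, unknown


if __name__ == "__main__":
    groups, unknown = group_by_layer(["CUDA", "TensorFlow", "Gensim", "Horovod"])
    for layer, names in groups.items():
        print(layer.value, names)
    print("unclassified:", unknown)
```

Reporting unclassified components separately mirrors the open question in the text: for some modules it is a matter of discussion, not of fact, which party should maintain them.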
Questions: Installation vs. Deployment
Provisioning the above software stack on multiple HPC systems (e.g. the Puhti and Saga superclusters) raises questions of modularization, automated installation, package management, portability, and replicability. So far, project partners have discussed three broad types of technologies: (a) automated compilation, building on a common foundation of core components, using frameworks like EasyBuild or Spack; (b) a package manager and repository, e.g. establishing something like an NLPConda channel; and (c) containerization, e.g. using Singularity images for portable deployment across different HPC (and possibly also cloud) systems.
A full solution will possibly require more than one of these technologies. While attractive for portability, it is not immediately clear how container construction and management can support the fine-grained modularization and ‘mixing and matching’ of components under user control required in the above Horovod scenario. Containers in and of themselves do not address the build and installation requirements, and in early 2020 at least it is unclear what levels of support are available on the current superclusters, and what to expect on future systems (notably the LUMI environment). For these reasons, containerization appears more like a candidate mid-term deployment vehicle than the primary solution for software maintenance per se.