Eosc/horovod

Background

To evaluate different approaches to provisioning software collections to NLPL users in a uniform manner across different systems, this page gives a high-level description of a current request (in early 2020). This goal is then discussed from several perspectives, including candidate provisioning technologies, interfaces to the host system and its 'core' software modules, and requirements on the target compute systems (e.g. support for EasyBuild or for container creation and execution).

High-Level Goal: Multi-GPU Training

Building on top of either PyTorch or TensorFlow, multiple NLPL users want to train models that require running on multiple GPUs (8 or 16, say, i.e. across multiple nodes) for at least several days. One fashionable framework these days is Horovod, which among other things combines MPI and NCCL for multi-GPU and multi-node communication.
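
As a hedged illustration of how such a framework presents itself to the user, the sketch below shows distributed data-parallel training with Horovod on the PyTorch backend; the linear model, random data, and hyper-parameters are placeholders for illustration only, not part of the original request.

<pre>
# Minimal sketch of multi-GPU training with Horovod on the PyTorch backend.
# The model, data, and hyper-parameters are illustrative placeholders.
import torch
import horovod.torch as hvd

hvd.init()                                # initialize Horovod (MPI/NCCL underneath)
torch.cuda.set_device(hvd.local_rank())   # pin one GPU per worker process

model = torch.nn.Linear(100, 10).cuda()   # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Average gradients across all workers on each step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start all workers from identical parameters and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):                   # placeholder training loop
    inputs = torch.randn(32, 100).cuda()
    targets = torch.randint(0, 10, (32,)).cuda()
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0 and step % 10 == 0:
        print("step", step, "loss", loss.item())
</pre>

Such a script would typically be launched across nodes with something like horovodrun -np 16 python train.py (or directly via mpirun), which is where the MPI implementation and NCCL libraries discussed below enter the picture.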

Providing a functional and effective Horovod installation is no small feat, and even the more technically sophisticated NLPL users may struggle to put together all the right pieces. Instead, the following components should be pre-installed in the NLPL virtual laboratory, for users to activate and run with minimal effort: Horovod 0.19.0, with either PyTorch 1.4.0 or TensorFlow 1.15.2 as its deep learning (DL) backend, all in a Python 3.7.2 environment. Horovod has a number of dependencies, including a basic C++ tool chain, NCCL 2, and a suitable MPI implementation (e.g. OpenMPI 3.1.2 or 4.0.0, but not 3.1.3). The DL backends each bring their own set of dependencies, currently, among other things, CUDA 10.0 and cuDNN 7.6 for TensorFlow, and CUDA 10.1 for PyTorch. Additionally, irrespective of the choice of DL framework, users will need a range of discipline-specific Python add-ons, say Gensim 3.8.1 and spaCy 2.2.3 (including all of its pre-trained models). It is required that all of these components can be loaded into one Python universe, i.e. the same process.
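
As a rough sanity check that all of these pieces really do coexist in one process, a user could run something like the following sketch; it assumes the environment has already been activated, and the exact activation mechanism (environment modules, virtualenv, etc.) is deliberately left unspecified.

<pre>
# Sketch: verify that all requested components load into a single Python process.
# Version numbers in the comments are the targets named above.
import sys
import torch                    # or: import tensorflow as tf, for the TF backend
import horovod
import horovod.torch as hvd     # importing the bindings verifies they load at all
import gensim
import spacy

print("Python :", sys.version.split()[0])   # expected 3.7.2
print("PyTorch:", torch.__version__)        # expected 1.4.0
print("Horovod:", horovod.__version__)      # expected 0.19.0
print("Gensim :", gensim.__version__)       # expected 3.8.1
print("spaCy  :", spacy.__version__)        # expected 2.2.3
print("CUDA   :", torch.version.cuda)       # expected 10.1 for the PyTorch build
</pre>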

Two Weeks Later: Gensim Updates

At last, Gensim 4.0 is released. Two of the most active Horovod users (one using PyTorch, the other a loyal TensorFlow user) desperately want to update their software environment, keeping everything as before but swapping out the older version of Gensim for the new release. But Gensim 4.0 does not maintain backwards compatibility with the 3.x releases, and hence other NLPL users plead not to have their software environment altered until after they have submitted their doctoral theses.
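
Until both generations of the library can be provisioned side by side, a defensive check along the following lines is one way for the thesis-writing users to fail fast if the wrong environment is active; this is a sketch of a possible convention, not an existing NLPL mechanism.

<pre>
# Sketch: guard code written against the Gensim 3.x API so that it fails
# early (and informatively) if a Gensim 4.0 environment is active.
import gensim

major_version = int(gensim.__version__.split(".")[0])
if major_version >= 4:
    raise RuntimeError(
        "This code targets Gensim 3.x; please activate the 3.8.1 environment "
        "or port the code to the Gensim 4.0 API.")
</pre>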

Four Weeks Later: Performance Optimization

A team of NLPL developers sets out to adapt for Norwegian a large-scale experiment originally published by Google AI researchers, who report that they cumulatively used two TPU years (two weeks of exclusive access to 64 units) on this computation. From colleagues in Finland, they understand that running Horovod on top of the Intel MPI Library (instead of OpenMPI) can give a performance gain of up to 40 percent. Reliable Intel MPI support, however, requires installation of the most recent Horovod 0.20.0 release candidate (while keeping TensorFlow, Gensim, spaCy, et al. versions as before).
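
When experimenting with different MPI back-ends, it can be useful to confirm what the installed Horovod was actually built against. The sketch below relies on the *_built() helpers that, to the best of recollection, are part of the Horovod API in the 0.19/0.20 series; their availability should be double-checked against the installed version.

<pre>
# Sketch: report which communication back-ends this Horovod build supports
# (e.g. after rebuilding against Intel MPI instead of OpenMPI).
import horovod
import horovod.torch as hvd

print("Horovod version:", horovod.__version__)
print("MPI support built in: ", bool(hvd.mpi_built()))
print("NCCL support built in:", bool(hvd.nccl_built()))
print("Gloo support built in:", bool(hvd.gloo_built()))
</pre>

Newer horovodrun releases also report this information via the --check-build option.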

The Inevitable: Backgrading

At about the same time, a new NLPL user reaches out because they cannot get their code running under TensorFlow 1.15 and Python 3.7 (for reasons beyond their control). They ask for an installation of TensorFlow 1.12 in a Python 3.5 environment. They may be able to make do without Horovod, but they badly need Gensim and spaCy.

Reflections: Responsibilities

The software stack sketched above combines a large number of modules. These can be categorized into three distinct layers, according to how closely they are tied to a specific HPC system and to what degree they serve specific subsets of users. In broad terms, these can be characterized as (a) core, (b) intermediate, or (c) custom modules.

Core components like the C++ tool chain, CUDA and associated extensions, different MPI implementations, or just a vanilla Python 3.7 interpreter arguably should be installed and maintained by the system administrators. They can be intricately linked to the available hardware (CPU and GPU types, and the interconnect) and should be expertly supported irrespective of individual disciplines or user groups.

Although less intricately tied to the host environment, general-purpose DL frameworks are not discipline-specific either and could in principle be considered part of the core software inventory provided with each HPC system. However, both PyTorch and TensorFlow provide optional and contributed extensions that are tailored to NLP (e.g. torchtext, keras-preprocessing, or the contributed CRF implementation in TensorFlow), and another deep learning framework, DyNet (https://github.com/clab/dynet), appears to be used near-exclusively for natural language processing. Also, because the NLPL community in recent years appears to have been perhaps the most active user community for these frameworks on the Abel and Taito systems, it has been both convenient and efficient for the community to maintain its own installations of these DL frameworks.

Finally, tools and libraries like Gensim, spaCy, and others are specific to NLP and are in some cases co-developed by NLPL community members. There are about two dozen such tools that are widely used, i.e. required by multiple users. Some are easy to install, others less so; often, these tools provide pre-trained models (for a variety of languages) that need to be installed separately and can be somewhat space-consuming. These tools and their associated data should not be installed (redundantly) by individual users; rather, they motivate the community ‘self-help’ approach of the NLPL virtual laboratory: expert members of the community are best positioned to install these tools and assist others in using them effectively.
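
As a hedged illustration of the 'install once, use many times' argument, the sketch below loads pre-trained resources through the standard spaCy and Gensim interfaces; the model name is a commonly published one, and the shared data path is purely hypothetical.

<pre>
# Sketch: use centrally installed tools and pre-trained models rather than
# per-user copies. "en_core_web_sm" is a standard published spaCy model;
# the path under /cluster/shared/nlpl/ is purely hypothetical.
import gensim
import spacy

# spaCy resolves installed model packages by name, wherever they live on disk.
nlp = spacy.load("en_core_web_sm")
doc = nlp("The NLPL virtual laboratory provisions software for its users.")
print([(token.text, token.pos_) for token in doc])

# Gensim can read word embeddings directly from a shared, read-only location.
vectors = gensim.models.KeyedVectors.load_word2vec_format(
    "/cluster/shared/nlpl/vectors.bin", binary=True)   # hypothetical path
print(vectors.most_similar("language", topn=3))
</pre>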

Questions: Installation vs. Deployment

Provisioning the above software stack on multiple HPC systems (e.g. the Puhti and Saga superclusters) raises questions of modularization, automated installation, package management, portability, and replicability. So far, project partners have discussed three broad types of technology: (a) automated compilation, building on a common foundation of core components, using frameworks like EasyBuild (https://easybuild.readthedocs.io/en/latest/) or Spack (https://spack.io/); (b) a package manager and repository, e.g. establishing something like an NLPConda channel; and (c) containerization, e.g. using Singularity images for portable deployment across different HPC (and possibly also cloud) systems.

A full solution will possibly require more than one of these technologies. While attractive for portability, it is not immediately clear how container construction and management can support the fine-grained modularization and ‘mixing and matching’ of components under user control required in the above Horovod scenario. Containers in and of themselves do not address the build and installation requirements, and, in early 2020 at least, it is unclear what levels of support are available on the current superclusters and what to expect on future systems (notably the LUMI environment). For these reasons, containerization appears more like a candidate mid-term deployment vehicle than the primary solution for software maintenance per se.