Eosc/horovod
Revision as of 22:23, 30 January 2020

Background

To evaluate different approaches to provisioning software collections to NLPL users in a uniform manner across different systems, this page provides a high-level description of a current request (in early 2020). The goal is then discussed from various perspectives, including different approaches to software provisioning, interfaces to the host system and 'core' software modules, and requirements on the target compute systems (e.g. support for EasyBuild or for container creation and execution).

High-Level Goal: Multi-GPU Training

Building on top of either PyTorch or TensorFlow, multiple NLPL users want to train models that require running on multiple gpus (say 8 or 16, i.e. spread across multiple nodes) for at least several days. One fashionable framework these days is Horovod, which among other things combines MPI and NCCL for multi-gpu and multi-node communication.
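From the user's perspective, such a setup might be exercised roughly as in the following sketch of Horovod-style data-parallel training with the PyTorch backend. The model, data, and hyper-parameters are placeholder assumptions for illustration only, not part of the request itself.

 # Minimal sketch: one process per gpu, gradients averaged across processes
 # (MPI for process coordination, NCCL for the actual gpu-to-gpu reductions).
 import torch
 import horovod.torch as hvd

 hvd.init()
 torch.cuda.set_device(hvd.local_rank())        # pin each process to its gpu

 model = torch.nn.Linear(1024, 2).cuda()        # stand-in for a real model
 optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
 optimizer = hvd.DistributedOptimizer(optimizer,
                                      named_parameters=model.named_parameters())
 hvd.broadcast_parameters(model.state_dict(), root_rank=0)

 for step in range(100):
     data = torch.randn(32, 1024).cuda()        # stand-in for a real data loader
     target = torch.randint(0, 2, (32,)).cuda()
     optimizer.zero_grad()
     loss = torch.nn.functional.cross_entropy(model(data), target)
     loss.backward()
     optimizer.step()

Such a script would typically be launched with horovodrun or mpirun, one process per gpu, across however many nodes the job requests.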

Providing a functional and effective Horovod installation is no small feat, and even the more technically sophisticated NLPL users may struggle with putting together all the right pieces. Instead, the following components should be pre-installed in the NLPL virtual laboratory, for users to activate and run with minimal effort: Horovod 0.19.0, with either PyTorch 1.4.0 or TensorFlow 1.15.2 as its deep learning (DL) backend, all in a Python 3.7.2 environment. Horovod has a number of dependencies, including a basic tool chain, NCCL 2, and a suitable MPI implementation (e.g. OpenMPI 3.1.2 or 4.0.0, but not 3.1.3). The DL backends each bring their own set of dependencies, currently among other things CUDA 10.0 and cuDNN 7.6 for TensorFlow, and CUDA 10.1 for PyTorch. Additionally, irrespective of the choice of DL framework, users will need a range of discipline-specific Python add-ons, say Gensim 3.8.1 and spaCy 2.2.3 (including all its pre-trained models). It is required that all of these components can be loaded into one Python universe, i.e. the same process.
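Stated operationally, the 'one Python universe' requirement means that a single interpreter in the activated environment should be able to import all of the above side by side. A minimal sanity check along these lines might look as follows; the exact activation mechanism is left open, and which of the introspection helpers report True depends on how Horovod was actually built.

 # Sanity check: all requested components importable in one process.
 import sys
 import torch                                   # or: import tensorflow as tf
 import horovod
 import horovod.torch as hvd                    # or: import horovod.tensorflow as hvd
 import gensim
 import spacy

 print("python ", sys.version.split()[0])       # expected 3.7.2
 print("torch  ", torch.__version__)            # expected 1.4.0
 print("horovod", horovod.__version__)          # expected 0.19.0
 print("gensim ", gensim.__version__)           # expected 3.8.1
 print("spacy  ", spacy.__version__)            # expected 2.2.3
 print("cuda   ", torch.version.cuda)           # expected 10.1 for the PyTorch stack
 print("nccl   ", hvd.nccl_built())             # Horovod compiled against NCCL?
 print("mpi    ", hvd.mpi_built())              # Horovod compiled against MPI?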

Two Weeks Later: Gensim Updates

At last, Gensim 4.0 is released, and two NLPL users (one using PyTorch, the other a loyal TensorFlow user) desperately want to update their software environment, keeping everything as before but swapping out the older version of Gensim for the new release. Gensim 4.0, however, is not backwards compatible with the 3.8 releases, so other NLPL users plead not to have their software environments altered until after they have submitted their doctoral theses.
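To illustrate the kind of incompatibility at stake, the following hypothetical user script spells out the same Word2Vec training call for both APIs; the specific renames are recalled from the Gensim 4.0 migration notes and serve only as an example, not as part of the request.

 # Hypothetical user code: only one of the two branches works under a given release.
 import gensim
 from gensim.models import Word2Vec

 sentences = [["hello", "world"], ["horovod", "scales"]]

 if gensim.__version__.startswith("3."):
     # 3.8-style keyword arguments.
     model = Word2Vec(sentences, size=100, iter=5, min_count=1)
     vocabulary = list(model.wv.vocab)
 else:
     # 4.0 renamed 'size' to 'vector_size' and 'iter' to 'epochs',
     # and replaced wv.vocab with wv.key_to_index.
     model = Word2Vec(sentences, vector_size=100, epochs=5, min_count=1)
     vocabulary = list(model.wv.key_to_index)

 print(sorted(vocabulary))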

Four Weeks Later: Performance Optimization

A team of NLPL developers sets out to adapt for Norwegian a large-scale experiment originally published by Google AI researchers, who report that they cumulatively used two TPU years (two weeks with exclusive access to 64 units) on this computation. From colleagues in Finland, they understand that running Horovod on top of the Intel MPI Library (instead of OpenMPI) can give a performance gain of up to 40 percent. Reliable Intel MPI utilization, however, requires installation of the most recent Horovod 0.20.0 release candidate.

Reflections: Responsibilities

Core components like the C++ tool chain, CUDA and associated extensions, different MPI implementations, or just a vanilla Python 3.7 interpreter arguably should be installed and maintained by the system administrators. They can be intricately linked to available hardware (cpu and gpu types, and the available interconnect) and should be expertly supported irrespective of individual disciplines or user groups.

A similar argument could be made for general-purpose DL frameworks like PyTorch and TensorFlow, but