Background

This page provides an informal, technically oriented survey of available (and commonly used) architectures and implementations for large-scale pre-training (and fine-tuning) of contextualized neural language models.

The NLPL use case will install, validate, and maintain a selection of these implementations, in an automated and uniform manner, on multiple HPC systems.

Design

Which systems to target? At least two of the following would seem desirable: Saga, Puhti, eX3, the quad-V100 Power9 node at UiO.

Including Saga would allow comparison with its (older) P100 cards, which do support half-precision operations and XLA. The Power9 node may be interesting because of its non-Intel CPU architecture.
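Half-precision support matters mostly because of mixed-precision training. The sketch below uses only the Python standard library to round values to IEEE 754 half precision, showing why very small gradients vanish in fp16 unless they are scaled first. The gradient value and the loss-scale factor of 1024 are made up for illustration and are not taken from any of the implementations surveyed here.

```python
import struct

def to_half(value):
    """Round a Python float to IEEE 754 half precision (fp16) and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

small_gradient = 1e-8   # hypothetical gradient magnitude
loss_scale = 1024.0     # hypothetical loss-scaling factor

# fp16 has a 10-bit fraction, so integers above 2048 are no longer all representable.
print(to_half(2049.0))
# Values far below the smallest fp16 subnormal (2**-24) are flushed to zero.
print(to_half(small_gradient))
# Scaling before the cast keeps the value non-zero; it is divided back out in fp32.
print(to_half(small_gradient * loss_scale))
```

This is only meant to show the numerical effect; the frameworks listed below handle the scaling automatically when mixed precision is enabled.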

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a deep language model jointly conditioned on both left and right context in all layers (Devlin et al 2019). It is based on the Transformer neural architecture.

BERT is the de facto standard for contextualized representations in modern NLP.

Available implementations

Reference implementation in TensorFlow.

Requirements: 1.11 <= TensorFlow < 2.0.

Developed by Google.

Multi-GPU training: Not officially supported, but reportedly achievable with TensorFlow distributed training or with Horovod.

Multi-node training: Not officially supported, but reportedly achievable with TensorFlow distributed training or with Horovod.

Training time: training on 3.3 billion words for 40 epochs takes "four days on 4 to 16 Cloud TPUs".
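Since validation across several HPC systems largely comes down to checking version windows like the one quoted for the reference implementation (1.11 <= TensorFlow < 2.0), here is a minimal sketch of such a check. The helper names `parse_version` and `in_window` are hypothetical, and the parsing is deliberately naive: pre-release suffixes such as `rc1` are ignored.

```python
import re

def parse_version(text):
    """Turn '1.15.2' into (1, 15, 2); pad short versions like '1.11' to (1, 11, 0)."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    parts += [0] * (3 - len(parts))  # no-op when there are already 3+ components
    return tuple(parts)

def in_window(installed, minimum, below):
    """True if minimum <= installed < below, comparing numeric release tuples."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(below)

# Window quoted above for the reference BERT implementation.
BERT_REFERENCE_TF = ("1.11", "2.0")

for candidate in ("1.10.0", "1.11.0", "1.15.2", "2.0.0"):
    print(candidate, in_window(candidate, *BERT_REFERENCE_TF))
```

In a real installation recipe the installed version would come from the framework itself (for example its `__version__` attribute) rather than from a hard-coded list.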

HuggingFace Transformers implementation.

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Developed by HuggingFace (no corporations involved :)).

Multi-GPU training: Yes, PyTorch+NCCL

Multi-node training: Yes, PyTorch+NCCL

Training time: training on 160 million words for 2 epochs takes 8-9 days on 4 NVIDIA P100 GPUs.

NVIDIA BERT for TF.

Adds multi-node and multi-GPU support, XLA, and mixed precision; recommended by our role models. Requirements: Docker, TensorFlow >= 1.11, networkx.

Developed by NVIDIA.

Multi-GPU training: Yes, TensorFlow+Horovod+NCCL

Multi-node training: Yes, TensorFlow+Horovod, requires Enroot and Pyxis.

Training time: training on 3.3 billion words for 40 epochs takes 3 days with 16 NVIDIA V100 GPUs or 12 days with 8 NVIDIA V100 GPUs.

NVIDIA BERT for PyTorch.

Adds multi-node and multi-GPU support, XLA, and mixed precision. Requirements: Docker, PyTorch NGC container from NVIDIA.

Developed by NVIDIA.

Multi-GPU training: Yes, PyTorch+NCCL

Multi-node training: Yes, PyTorch+NCCL, requires Enroot and Pyxis.

Training time: training on 3.3 billion words for 40 epochs takes 3 days with 16 NVIDIA V100 GPUs

Chainer implementation.

Of little interest to us, since it supports only inference, not training.
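To put the reported BERT training times on a common scale, the sketch below converts them into words processed per second. The figures are copied from the entries above and taken at face value. The runs differ in model size, sequence length, and hardware, so the numbers are a rough orientation rather than a benchmark. The Cloud TPU run is left out because its device count is only given as a range.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (label, corpus size in words, epochs, days, number of GPUs), as reported above.
REPORTED_RUNS = [
    ("NVIDIA BERT (TF and PyTorch), 16 x V100", 3.3e9, 40, 3, 16),
    ("NVIDIA BERT (TF), 8 x V100", 3.3e9, 40, 12, 8),
    ("HuggingFace Transformers, 4 x P100", 160e6, 2, 8.5, 4),  # 8.5 = midpoint of "8-9 days"
]

def words_per_second(words, epochs, days):
    """Total words seen during training divided by wall-clock seconds."""
    return words * epochs / (days * SECONDS_PER_DAY)

for label, words, epochs, days, gpus in REPORTED_RUNS:
    total = words_per_second(words, epochs, days)
    print(f"{label}: {total:,.0f} words/s overall, {total / gpus:,.0f} words/s per GPU")
```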

ELMo

Embeddings from Language Models (ELMo) use bidirectional LSTM language models to produce contextualized word token representations (Peters et al 2018).

ELMo is the only architecture in this list that uses recurrent neural networks rather than Transformers. Despite being much less computationally demanding, it often performs on par with BERT.

Available implementations

Reference TensorFlow implementation.

Requirements: Python >=3.5, 1.2 < TensorFlow < 1.13 (later versions produce too many deprecation warnings), h5py.

Created (but not actively maintained) by Allen AI.

Multi-GPU training: Yes (TensorFlow native support)

Multi-node training: unknown

Training time: one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs (batch size 192)

LTG implementation.

Based on the reference implementation, but with improved data loading and hyper-parameter handling, and with the code updated to more recent versions of TensorFlow. Requirements: Python >=3.5, 1.15 <= TensorFlow < 2.0 (a 2.0 version is planned), h5py, smart_open. A tutorial is available. A PyPI module is planned.

Created by UiO LTG.

Multi-GPU training: Yes (TensorFlow native support)

Multi-node training: unknown

Training time: one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs (batch size 192)

PyTorch implementation in AllenNLP.

Of little interest to us, since it supports only inference, not training. Requirements: Python >= 3.6, 1.6 <= PyTorch < 1.7.

RoBERTa

Robustly Optimized BERT (RoBERTa) is a BERT variant by Facebook. The most important changes are removing the next sentence prediction objective and dynamically changing the masking pattern applied to the training data. Otherwise, it is just BERT on steroids (longer training, bigger batches, longer sequences). Interestingly, the RoBERTa paper was rejected by ICLR 2020.
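The difference between a fixed and a dynamically changing masking pattern can be sketched in a few lines. This is a simplified illustration, not the Fairseq code: it always substitutes `[MASK]` (ignoring BERT's 80/10/10 replacement rule), works on whitespace tokens, and the function names are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_fraction=0.15):
    """Return a copy of tokens with a random subset replaced by MASK."""
    count = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), count))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, epochs, seed=0):
    """Mask once during preprocessing and reuse the same pattern in every epoch."""
    masked = mask_tokens(tokens, random.Random(seed))
    return [masked for _ in range(epochs)]

def dynamic_masking(tokens, epochs, seed=0):
    """Draw a fresh masking pattern each time the sequence is fed to the model."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(epochs)]

sentence = "the quick brown fox jumps over the lazy dog near the old river bank".split()
for epoch, masked in enumerate(dynamic_masking(sentence, epochs=3)):
    print(epoch, " ".join(masked))
```

With dynamic masking the model sees different prediction targets for the same sentence across epochs, which matters more the longer the training runs.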

Available implementations

Reference implementation in Fairseq.

Requirements: Python >= 3.6, PyTorch >= 1.4, NCCL.

Multi-GPU training:

Multi-node training:

Training time:

HuggingFace Transformers implementation.

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Multi-GPU training:

Multi-node training:

Training time:

ELECTRA

In ELECTRA, a discriminator model tries to detect which tokens in the input were replaced by a small generator language model. It is claimed to be computationally efficient in comparison to other Transformer models (Clark et al 2019).
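How the discriminator's training examples are built can be sketched as follows. The generator here is a hypothetical lookup table standing in for the small generator language model, and the function names are made up for illustration. Note that a position where the generator happens to reproduce the original token is labeled as original, not as replaced.

```python
def replaced_token_detection_example(tokens, masked_positions, generator):
    """Build (corrupted input, per-token labels) for the discriminator.

    Label 1 means the token was replaced, 0 means it matches the original.
    """
    corrupted = list(tokens)
    for position in masked_positions:
        corrupted[position] = generator(tokens, position)
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels

def toy_generator(tokens, position):
    """Hypothetical stand-in for the generator LM: fixed guesses per position."""
    guesses = {1: "chef", 2: "ate"}
    return guesses.get(position, tokens[position])

tokens = ["the", "chef", "cooked", "the", "meal"]
corrupted, labels = replaced_token_detection_example(tokens, [1, 2], toy_generator)
print(corrupted)
print(labels)
```

Because every input position carries a label, the discriminator receives a training signal from all tokens rather than only from the masked subset, which is the usual explanation for the claimed efficiency.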

Available implementations

Reference Google implementation in TensorFlow.

Single-GPU training only. Requirements: Python 3, 1.15 <= TensorFlow < 2.0.

Multi-GPU training: No (single-GPU only)

Multi-node training: No (single-GPU only)

Training time:

HuggingFace Transformers implementation.

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Multi-GPU training:

Multi-node training:

Training time:
