Eosc/pretraining

From Nordic Language Processing Laboratory
Revision as of 10:48, 23 September 2020 by Andreku (talk | contribs) (Tokenization)

Background

This page provides an informal, technically oriented survey of available (and commonly used) architectures and implementations for large-scale pre-training (and fine-tuning) of contextualized neural language models.

The NLPL use case will install, validate, and maintain a selection of these implementations, in an automated and uniform manner, on multiple HPC systems.

Tokenization

There are several established tokenization workflows for large pre-trained language models; they are described below.

  • ELMo does not use any sub-word tokenization per se.

It splits text on white space and then represents each token as a sequence of UTF-8 code units (at most 50 by default, 8 bits each). The final (non-contextual) token embedding is produced by running a simple CNN over this sequence. This naturally handles OOV words, since they are composed of the same UTF-8 code units.
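As a rough illustration of this character-level input, the sketch below splits a sentence on white space and maps each token to a fixed-length sequence of UTF-8 byte values. The function names and the padding id are assumptions made for this example only; the reference implementation also reserves special ids for word and sentence boundaries, which are left out here.

```python
MAX_CHARS = 50  # maximum number of code units per token (the default mentioned above)
PAD = 256       # assumed padding id, chosen outside the 0..255 byte range


def encode_token(token, max_chars=MAX_CHARS, pad=PAD):
    """Map one token to a fixed-length list of UTF-8 byte values."""
    units = list(token.encode("utf-8"))[:max_chars]  # truncate long tokens
    return units + [pad] * (max_chars - len(units))  # pad short tokens


def encode_sentence(sentence):
    """Split on white space and encode every token."""
    return [encode_token(token) for token in sentence.split()]


ids = encode_sentence("Jeg spiser blåbær")
# Show only the first 8 positions of each row; "å" and "æ" take two bytes each.
for row in ids:
    print(row[:8])
```

Because every possible byte value has an id, a word never seen in training still gets a valid input sequence, which is why no OOV handling is needed at this level.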

  • BERT and company employ sub-word tokenization.

The original BERT uses WordPiece: a variant of the standard character-level BPE encoding in which some form of language modeling is employed to select sub-words.
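For reference, plain character-level BPE can be sketched in a few lines. Note that this is the frequency-based baseline, not WordPiece itself: WordPiece replaces the "most frequent pair" criterion with a language-model-based one. The toy word counts, the function name and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} dictionary."""
    # Every word starts out as a tuple of single characters.
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken alphabetically
        # (an arbitrary choice that keeps the sketch deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, segmented = learn_bpe(toy_counts, num_merges=3)
print(merges)
print(segmented)
```

The number of merges is the main hyper-parameter: together with the initial character inventory it determines the final vocabulary size.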

The English BERT model from Google employs a vocabulary of 30 000 "word pieces".

Google does not provide the code they used to learn a new WordPiece vocabulary (and neither does NVIDIA). Instead, both suggest using Google's open-source SentencePiece library. SentencePiece is even superior to WordPiece in some respects: for example, it does not require pre-segmentation of words. After training on a text corpus, it produces ".model" and ".vocab" files, where the former contains character merges and the latter contains the sub-word vocabulary itself. The SentencePiece output can be converted to a BERT-compatible vocabulary using https://github.com/spyysalo/sent2wordpiece.
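The core idea of such a conversion can be sketched as follows. This is not the sent2wordpiece code, only an illustration under simple assumptions: SentencePiece marks word-initial pieces with "▁", BERT marks word-internal pieces with "##", each ".vocab" line holds a piece and a score separated by a tab, and anything of the form "<...>" is treated as a SentencePiece control symbol. The function names are hypothetical.

```python
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # BERT special tokens
WORD_START = "\u2581"  # the "▁" marker SentencePiece puts on word-initial pieces


def read_pieces(lines):
    """Take the first tab-separated field of each ".vocab" line (the piece)."""
    return [line.rstrip("\n").split("\t")[0] for line in lines if line.strip()]


def sentencepiece_to_wordpiece(pieces):
    """Convert SentencePiece pieces to a BERT-style vocabulary list."""
    vocab = list(SPECIAL_TOKENS)
    seen = set(vocab)
    for piece in pieces:
        if piece.startswith("<") and piece.endswith(">"):
            continue  # assumed control symbol such as <unk>, <s>, </s>
        if piece.startswith(WORD_START):
            token = piece[len(WORD_START):]  # word-initial: drop the marker
        else:
            token = "##" + piece             # word-internal: add the BERT prefix
        if token and token not in seen:      # skip the bare "▁" and duplicates
            vocab.append(token)
            seen.add(token)
    return vocab


sample_lines = ["<unk>\t0", "<s>\t0", "</s>\t0",
                "\u2581hus\t-3.1", "et\t-3.5", "\u2581\t-2.0", "\u2581bil\t-4.2"]
print(sentencepiece_to_wordpiece(read_pieces(sample_lines)))
```

A real conversion has to deal with further details (for instance pieces that should be available both word-initially and word-internally), so the linked tool should be preferred in practice.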

This is what the Turku folks did to train FinBERT. They also provide a SentencePiece-generated, BERT-compatible vocabulary trained on Norwegian Wikipedia. We can use this vocabulary or train our own. Note that SentencePiece does not remove diacritics from tokens, so the vocabulary provided by Turku also contains diacritics. In their presentation at the NLPL Winter School, they hinted that this can be a problem for BERT, but I do not see why.

Finally, HuggingFace provides its own fast Tokenizers library. It implements:

  • CharBPETokenizer: The original char-level Byte Pair Encoding; training on Norwegian Wikipedia takes 13 minutes;
  • SentencePieceBPETokenizer: A BPE implementation from SentencePiece; training on Norwegian Wikipedia takes 25 minutes;
  • BertWordPieceTokenizer: Reimplementation of the BERT WordPiece tokenizer; removes diacritics; training on Norwegian Wikipedia takes 15 minutes;
  • ByteLevelBPETokenizer: The byte level version of the BPE (recommended by HuggingFace, because it ensures that all tokens will always be known); training on Norwegian Wikipedia takes 15 minutes.

It seems that the Tokenizers library is the best choice, since it is well integrated into the widely used Transformers package. As for the particular tokenizer for a future Norwegian BERT, SentencePieceBPETokenizer looks like the best option. The problem with BertWordPieceTokenizer is that it removes diacritics, which are essential in Norwegian; however, it remains to be seen whether diacritics cause problems for BERT training. The problem with ByteLevelBPETokenizer is that its output can be difficult to examine and interpret, since it works with bytes, not characters. Workarounds can be designed, but they would introduce an additional layer of non-transparency, especially for users who are not IT-savvy.
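To see what diacritic removal means for Norwegian, the sketch below mimics the accent stripping done in the original BERT tokenization code, as far as we understand it: decompose the text with Unicode NFD normalization and drop all combining marks. Only the Python standard library is used; the function name and the example words are our own.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


for word in ["blåbær", "på", "øl", "idé"]:
    print(word, "->", strip_accents(word))
```

With this procedure "å" loses its ring and collapses into "a", while "ø" and "æ" survive because they have no canonical decomposition. So the three extra Norwegian letters are treated inconsistently, and pairs such as "på" and "pa" become indistinguishable.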

  • RoBERTa uses a byte-level BPE tokenizer (seemingly identical to the one used in GPT-2 and to ByteLevelBPETokenizer from HuggingFace) with a vocabulary of 50 000 sub-word units.
  • ELECTRA models from Google simply use the vocabulary from Google's BERT (thus, WordPiece). They should work with other sub-word tokenization schemes as well.
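The readability concern about byte-level BPE mentioned above can be made concrete. The sketch below reproduces, to the best of our knowledge, the byte-to-symbol table used by GPT-2-style tokenizers: printable bytes keep their own character, all other bytes are shifted to code points from 256 upwards. The function names are our own, and the reverse mapping shows the kind of workaround needed to read such a vocabulary.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to one visible Unicode character."""
    # Bytes that already correspond to printable, non-space characters.
    printable = (list(range(0x21, 0x7F)) + list(range(0xA1, 0xAD))
                 + list(range(0xAE, 0x100)))
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes (space, control bytes, ...) get fresh symbols.
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def render_bytes(text):
    """Show text the way it appears in a byte-level BPE vocabulary."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def restore_text(rendered):
    """Invert render_bytes to recover the original string."""
    return bytes(CHAR_TO_BYTE[c] for c in rendered).decode("utf-8")


shown = render_bytes("på tur")
print(shown)
print(restore_text(shown))
```

Every non-ASCII letter turns into two unrelated symbols and every space into a special marker, so vocabulary entries for Norwegian words are hard to read without a decoding step like restore_text.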

Design

Which systems to target? At least two of the following would seem desirable: Saga, Puhti, eX3, the quad-V100 Power9 node at UiO.

Inclusion of Saga would allow comparison to (older) P100 cards; they do support half-precision operations and XLA. The Power9 node may be interesting because of its non-Intel CPU architecture.

BERT

Bidirectional Encoder Representations from Transformers (BERT) is a deep language model jointly conditioned on both left and right context in all layers. It is based on the Transformer neural architecture (Devlin et al 2019).

It is the de facto standard for contextualized representations in modern NLP.

Available implementations

Requirements: 1.11 <= TensorFlow < 2.0.

Developed by Google.

Multi-GPU training: Not officially supported, but supposedly can be achieved with Distributed training or with Horovod.

Multi-node training: Not officially supported, but supposedly can be achieved with Distributed training or with Horovod.

Training time: training on 3.3 billion words for 40 epochs takes "four days on 4 to 16 Cloud TPUs".

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Developed by HuggingFace (no corporations involved :)).

Multi-GPU training: Yes, PyTorch+NCCL

Multi-node training: Yes, PyTorch+NCCL

Training time: training on 160 million words for 2 epochs takes 8-9 days on 4 NVIDIA P100 GPUs.

Adds multi-node and multi-GPU support, XLA, and mixed precision; recommended by our role models. Requirements: Docker, TensorFlow >= 1.11, networkx.

Developed by NVIDIA.

Multi-GPU training: Yes, TensorFlow+Horovod+NCCL

Multi-node training: Yes, TensorFlow+Horovod, requires Enroot and Pyxis.

Training time: training on 3.3 billion words for 40 epochs takes 3 days with 16 NVIDIA V100 GPUs or 12 days with 8 NVIDIA V100 GPUs.

Adds multi-node and multi-GPU support, XLA, and mixed precision. Requirements: Docker, the PyTorch NGC container from NVIDIA.

Developed by NVIDIA.

Multi-GPU training: Yes, PyTorch+NCCL

Multi-node training: Yes, PyTorch+NCCL, requires Enroot and Pyxis.

Training time: training on 3.3 billion words for 40 epochs takes 3 days with 16 NVIDIA V100 GPUs

Not very interesting to us, since it supports only inference, not training.

ELMo

Embeddings from Language Models (ELMo) use bidirectional LSTM language models to produce contextualized word token representations (Peters et al 2018).

ELMo is the only architecture in this list that uses recurrent neural networks rather than Transformers. Despite being much less computationally demanding, it often performs on par with BERT.

Available implementations

Requirements: Python >=3.5, 1.2 < TensorFlow < 1.13 (later versions produce too many deprecation warnings), h5py.

Developed (but not actively maintained) by Allen AI.

Multi-GPU training: Yes (TensorFlow native support)

Multi-node training: unknown (arguably not required for ELMo)

Training time: one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs (batch size 192). Three epochs already give reasonable performance on NLP tasks.
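These figures allow a back-of-the-envelope estimate for other corpus sizes. The sketch below assumes that training time scales linearly with the number of tokens on the same hardware and batch size, which is only an approximation; the 1 billion token corpus is a hypothetical example, not a measured run.

```python
# Reported reference point: one epoch over 100 million tokens in 3 hours
# on 2 NVIDIA P100 GPUs.
REF_TOKENS = 100_000_000
REF_HOURS = 3


def estimate_hours(corpus_tokens, epochs):
    """Estimate wall-clock hours, assuming linear scaling in token count."""
    return corpus_tokens * epochs * REF_HOURS / REF_TOKENS


tokens_per_second = REF_TOKENS / (REF_HOURS * 3600)
print(f"throughput: about {tokens_per_second:.0f} tokens per second")
print(f"3 epochs over the reference corpus: {estimate_hours(REF_TOKENS, 3):.1f} hours")
# Hypothetical corpus of 1 billion tokens, same hardware and settings.
print(f"3 epochs over 1 billion tokens: {estimate_hours(1_000_000_000, 3):.1f} hours")
```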

Based on the reference implementation, but with improved data loading, hyper-parameter handling, and the code updated to more recent versions of TensorFlow. Requirements: Python >=3.5, 1.15 <= TensorFlow < 2.0 (2.0 version is planned), h5py, smart_open.

A tutorial is available. A PyPI module is planned.

Developed by UiO LTG.

Multi-GPU training: Yes (TensorFlow native support)

Multi-node training: unknown (arguably not required for ELMo)

Training time: one epoch over 100 million word tokens takes 3 hours with 2 NVIDIA P100 GPUs (batch size 192). Three epochs already give reasonable performance on NLP tasks.

Not very interesting to us, since it supports only inference, not training. Requirements: Python >= 3.6, 1.6 <= PyTorch < 1.7.

RoBERTa

Robustly Optimized BERT (RoBERTa) is a BERT variation by Facebook. The most important changes are removing the next sentence prediction objective and dynamically changing the masking pattern applied to the training data. Otherwise, it is just BERT on steroids (training longer, bigger batches, longer sequences).
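The difference between static and dynamic masking can be sketched as follows. The masking routine follows the usual BERT recipe (about 15% of positions selected; of those, 80% replaced by a mask token, 10% by a random token, 10% left unchanged), but it samples each position independently for simplicity, which is our assumption and not necessarily what the reference code does. The sentence, the toy vocabulary and the function name are made up for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, selected_positions) using the 80/10/10 rule."""
    masked = list(tokens)
    positions = []
    for i in range(len(tokens)):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        positions.append(i)
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, positions


tokens = "jeg liker å lese bøker om språk og maskinlæring hver dag".split()
vocab = sorted(set(tokens))
rng = random.Random(0)

# Static masking: the pattern is drawn once and reused in every epoch.
static_view, _ = mask_tokens(tokens, vocab, rng)
for epoch in range(3):
    # Dynamic masking: a fresh pattern is drawn each time the sequence is used.
    dynamic_view, _ = mask_tokens(tokens, vocab, rng)
    print("epoch", epoch)
    print("  static :", " ".join(static_view))
    print("  dynamic:", " ".join(dynamic_view))
```

With dynamic masking the model sees different prediction targets for the same sentence across epochs, which matters more the longer training runs.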

Interestingly, the RoBERTa paper was rejected by ICLR 2020.

Available implementations

Requirements: Python >= 3.6, PyTorch >= 1.4, NCCL.

Developed by Facebook.

Multi-GPU training: Yes, PyTorch + NCCL

Multi-node training: Yes, PyTorch + NCCL

Training time: not reported (essentially, they just recommend training for as long as you can).

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Developed by HuggingFace.

Multi-GPU training: Yes, PyTorch + NCCL

Multi-node training: Yes, PyTorch + NCCL

Training time: unknown

ELECTRA

In ELECTRA, a discriminator model tries to detect which tokens in the input were replaced by a small generator language model. It is claimed to be computationally efficient in comparison to other Transformer models (Clark et al 2019).
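The training signal for the discriminator can be illustrated with a toy example. Here a random sampler stands in for the generator (in ELECTRA proper the generator is a small masked language model), and the discriminator targets are simply "replaced" or "original" for every position. All names, the replacement probability and the example sentence are assumptions made for this sketch.

```python
import random


def corrupt(tokens, vocab, rng, replace_prob=0.15):
    """Stand-in for the generator: swap some tokens for random vocabulary items."""
    return [rng.choice(vocab) if rng.random() < replace_prob else token
            for token in tokens]


def discriminator_labels(original, corrupted):
    """Label each position: 1 if the token was changed, 0 if it is original."""
    # A sampled token that happens to equal the original counts as original.
    return [int(o != c) for o, c in zip(original, corrupted)]


tokens = "kokken lagde maten på kjøkkenet".split()
vocab = ["spiste", "bilen", "maten", "kokken", "på"]
rng = random.Random(0)

corrupted = corrupt(tokens, vocab, rng, replace_prob=0.3)
labels = discriminator_labels(tokens, corrupted)
for token, new_token, label in zip(tokens, corrupted, labels):
    print(f"{token:12s} {new_token:12s} {label}")
```

Since every position receives a label, the discriminator learns from all input tokens rather than only from the masked subset, which is the usual explanation for the claimed efficiency.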

Available implementations

Requirements: Python 3, 1.15 <= TensorFlow < 2.0.

Developed by Google.

Multi-GPU training: Not supported.

Multi-node training: Not officially supported, but supposedly can be achieved with Distributed training or with Horovod.

Training time: training on 18 billion words takes 4 days on 1 NVIDIA V100 GPU.

Can train either with TensorFlow or with PyTorch. Requirements: Python >=3.6, TensorFlow >= 2.0, PyTorch >=1.3.1.

Developed by HuggingFace (well, strictly speaking it is still in development)

Multi-GPU training: Yes, PyTorch + NCCL

Multi-node training: Yes, PyTorch + NCCL

Training time: Should be approximately the same as the reference implementation, but not directly reported.