Eosc/pretraining
Background
This page provides an informal, technically oriented survey of available (and commonly used) architectures and implementations for large-scale pre-training (and fine-tuning) of contextualized neural language models.
The NLPL use case will install, validate, and maintain a selection of these implementations, in an automated and uniform manner, on multiple HPC systems.
BERT
Bidirectional Encoder Representations from Transformers (BERT) is a deep language model jointly conditioned on both left and right context in all layers. It is based on the Transformer neural architecture (Devlin et al 2019).
It is the de facto standard for contextualized representations in modern NLP.
Available implementations
* Reference Google implementation in TensorFlow (https://github.com/google-research/bert). Requirements: 1.11 <= TensorFlow < 2.0.
* HuggingFace Transformers implementation (https://huggingface.co/transformers/model_doc/bert.html). Can train either with TensorFlow or with PyTorch; see the sketch after this list. Requirements: Python >= 3.6, TensorFlow >= 2.0, PyTorch >= 1.3.1.
* NVIDIA BERT for TensorFlow (https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/BERT). Adds multi-node and multi-GPU support, XLA, and mixed precision; recommended by our role models (https://github.com/TurkuNLP/FinBERT/blob/master/nlpl_tutorial/training_bert.md). Requirements: Docker, TensorFlow >= 1.11, networkx, Enroot (https://github.com/NVIDIA/enroot) and Pyxis (https://github.com/NVIDIA/pyxis) for multi-node training.
* NVIDIA BERT for PyTorch (https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT). Adds multi-node and multi-GPU support, XLA, and mixed precision. Requirements: Docker, the PyTorch NGC container from NVIDIA, Enroot and Pyxis for multi-node training.
* Chainer implementation (https://github.com/soskek/bert-chainer). Of little interest to us, since it supports only inference, not training.
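To give a flavour of the HuggingFace route, below is a minimal inference sketch that runs a pre-trained BERT masked-LM head with the PyTorch backend. It assumes the transformers and torch packages are installed; the bert-base-uncased checkpoint and the example sentence are purely illustrative.

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Load a public pre-trained checkpoint (illustrative choice).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Predict the masked token in a toy sentence.
inputs = tokenizer("The capital of Norway is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # shape: (batch, sequence length, vocabulary size)

mask_position = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
predicted_id = int(logits[0, mask_position].argmax())
print(tokenizer.convert_ids_to_tokens([predicted_id]))

Pre-training proper goes through the training scripts of the respective implementations; this snippet only verifies that a checkpoint loads and produces sensible predictions.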
ELMo
Embeddings from Language Models (ELMo) use bidirectional LSTM language models to produce contextualized word token representations (Peters et al 2018).
It is the only architecture in this list that uses recurrent neural networks rather than Transformers. Despite being much less computationally demanding, it often performs on par with BERT.
Available implementations
* Reference TensorFlow implementation (https://github.com/allenai/bilm-tf). Created (but not much maintained) by Allen AI. Requirements: Python >= 3.5, 1.2 < TensorFlow < 1.13 (later versions produce too many deprecation warnings), h5py. Multi-node training: unknown. Training time: one epoch over 100 million word tokens takes 3 hours on 2 NVIDIA P100 GPUs (batch size 192).
* LTG implementation (https://github.com/ltgoslo/simple_elmo_training). Created by UiO LTG. Based on the reference implementation, but with improved data loading and hyper-parameter handling, and with the code updated to more recent versions of TensorFlow. A tutorial is available; a PyPI module is planned. Requirements: Python >= 3.5, 1.15 <= TensorFlow < 2.0 (a 2.0 version is planned), h5py, smart_open. Multi-node training: unknown. Training time: one epoch over 100 million word tokens takes 3 hours on 2 NVIDIA P100 GPUs (batch size 192).
* PyTorch implementation in AllenNLP (https://docs.allennlp.org/master/api/data/token_indexers/elmo_indexer/). Of little interest to us, since it supports only inference, not training; a minimal inference sketch follows this list. Requirements: Python >= 3.6, 1.6 <= PyTorch < 1.7.
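For reference, here is a minimal inference sketch with the AllenNLP module. It assumes allennlp is installed and that an ELMo model has already been trained with one of the implementations above; the options and weights paths are placeholders, not real files.

from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths to a trained biLM (produced e.g. by bilm-tf or the LTG code).
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# One output representation (a learned mix of the biLM layers), no dropout.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["This", "is", "a", "sentence", "."], ["Another", "one"]]
character_ids = batch_to_ids(sentences)         # shape: (batch, tokens, 50 characters)
output = elmo(character_ids)
embeddings = output["elmo_representations"][0]  # shape: (batch, tokens, embedding dim)
print(embeddings.shape)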
RoBERTa
Robustly Optimized BERT (RoBERTa) is a BERT variation by Facebook. The most important changes are removing the next sentence prediction objective and dynamically changing the masking pattern applied to the training data. Otherwise, it is just BERT on steroids (training longer, bigger batches, longer sequences). Interestingly, the RoBERTa paper was rejected by ICLR 2020.
Available implementations
* Reference implementation in Fairseq (https://github.com/pytorch/fairseq/tree/master/examples/roberta). Requirements: Python >= 3.6, PyTorch >= 1.4, NCCL (https://github.com/NVIDIA/nccl).
* HuggingFace Transformers implementation (https://huggingface.co/transformers/model_doc/roberta.html). Can train either with TensorFlow or with PyTorch; see the sketch after this list. Requirements: Python >= 3.6, TensorFlow >= 2.0, PyTorch >= 1.3.1.
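To illustrate the HuggingFace route, here is a minimal sketch of one masked-LM training step for a small, randomly initialized RoBERTa with the PyTorch backend. The model sizes, the reuse of the public roberta-base vocabulary, and the crude masking are illustrative only, not a recommended pre-training recipe.

import torch
from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizer

# Reuse a public vocabulary (illustrative); a real run would train its own.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

# A deliberately tiny model so the sketch runs anywhere.
config = RobertaConfig(vocab_size=tokenizer.vocab_size, hidden_size=256,
                       num_hidden_layers=4, num_attention_heads=4,
                       intermediate_size=1024)
model = RobertaForMaskedLM(config)

batch = tokenizer(["A tiny pre-training example.", "Another short sentence."],
                  return_tensors="pt", padding=True)
labels = batch["input_ids"].clone()

# Crude stand-in for dynamic masking: corrupt roughly 15% of positions. The real
# objective also skips special tokens and scores only the masked positions.
mask = torch.rand(labels.shape) < 0.15
batch["input_ids"][mask] = tokenizer.mask_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = model(**batch, labels=labels)[0]  # masked-LM cross-entropy
loss.backward()
optimizer.step()
print(float(loss))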
ELECTRA
In ELECTRA, a discriminator model tries to detect which tokens in the input were replaced by a small generator language model. It is claimed to be computationally efficient in comparison to other Transformer models (Clark et al 2019).
Available implementations
* Reference Google implementation in TensorFlow (https://github.com/google-research/electra). Single-GPU training only. Requirements: Python 3, 1.15 <= TensorFlow < 2.0.
* HuggingFace Transformers implementation (https://huggingface.co/transformers/model_doc/electra.html). Can train either with TensorFlow or with PyTorch; see the sketch after this list. Requirements: Python >= 3.6, TensorFlow >= 2.0, PyTorch >= 1.3.1.
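To make the replaced-token-detection objective concrete, here is a minimal sketch that runs the HuggingFace ELECTRA discriminator head over a sentence containing one implausible token. It assumes transformers and torch are installed; the google/electra-small-discriminator checkpoint is an illustrative public choice.

import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("google/electra-small-discriminator")
model = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
model.eval()

# "fake" stands in for a token substituted by a small generator model.
sentence = "the quick brown fox fake over the lazy dog"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]  # one replaced-vs-original score per token

flags = (torch.sigmoid(logits) > 0.5).long()[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, flags)))  # tokens flagged 1 are judged "replaced"

During pre-training proper, the generator and discriminator are trained jointly; this sketch only exercises the discriminator side of the objective.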