Eosc/pretraining/nvidia

From Nordic Language Processing Laboratory

Background

This page provides a recipe for large-scale pre-training of a BERT neural language model, using the high-efficiency NVIDIA BERT implementation (which is based on TensorFlow and NCCL, among others, in contrast to the NVIDIA Megatron code).

Software Installation

Our software (the so-called "NLPL Virtual Laboratory") is installed with EasyBuild. We provide configuration files and scripts for uniform software deployment across different HPC clusters.

All the modules are provided in two mutually exclusive versions: for the foss toolchain and the gomkl toolchain (these strings are always in the name of the module). The foss toolchain uses OpenBLAS 0.3.7; the gomkl toolchain uses Intel Math Kernel Library (IMKL) 2019.1.144. If your system is equipped with AMD CPUs, your only option is foss; if your system is equipped with Intel CPUs, you can try both toolchains. As a rule, gomkl is somewhat faster than foss for typical NLP loads.
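
The toolchain rule above can be sketched in Python. This is only an illustration and not part of the NLPL scripts: the function names are hypothetical, and it assumes a Linux host whose /proc/cpuinfo reports the vendor_id string GenuineIntel for Intel CPUs.

```python
from pathlib import Path


def choose_toolchain(vendor_id: str) -> str:
    """Return the toolchain suggested by the rule above for a CPU vendor string.

    Hypothetical helper: Intel CPUs can use either toolchain (gomkl is
    usually somewhat faster); every other vendor should use foss.
    """
    return "gomkl" if vendor_id.strip() == "GenuineIntel" else "foss"


def read_vendor_id(cpuinfo: str = "/proc/cpuinfo") -> str:
    """Read the first vendor_id entry from a Linux cpuinfo file ('' if absent)."""
    path = Path(cpuinfo)
    if not path.exists():
        return ""
    for line in path.read_text().splitlines():
        if line.startswith("vendor_id"):
            return line.split(":", 1)[1].strip()
    return ""


if __name__ == "__main__":
    print(choose_toolchain(read_vendor_id()))
```

On Intel systems it may still be worth building both variants and comparing them on your own workload.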

Prerequisites

We assume that EasyBuild and Lmod are already installed on the host machine.

If this is not the case, we recommend installing EasyBuild using the bootstrapping procedure.

We also assume that core software (compilers, most toolchains, CUDA drivers, etc.) is already installed system-wide as well, or at least that the corresponding easyconfigs (module description files) are available to the system-wide EasyBuild installation. If your system already provides exactly the same version of a software package that we need, we will use it. If not, it will be built from scratch.

Finally, the host machine must have an Internet connection.

Important note for NVIDIA A100 GPUs

NVIDIA A100 GPUs (and any other GPUs with CUDA compute capability 8) are designed to work best with CUDA 11 and cuDNN 8. Applications built with CUDA 10 and cuDNN 7 can in principle run on A100 (by JIT-compiling PTX code into GPU binary code on every run), but NVIDIA does not recommend this.

TensorFlow supports CUDA 11 and cuDNN 8 only starting from TF 2.4. Earlier TensorFlow versions (and definitely TF 1) are not guaranteed to compile with CUDA 11. In practice, our attempts to do this indeed failed, and the same is true for other practitioners. It might still be possible to build TensorFlow 1.15 with CUDA 11, but this will arguably require a significant amount of tinkering.

The BERT training code below is based on the NVIDIA BERT implementation, which uses TensorFlow 1 (for example, 1.15.2 is well tested by us). It does not work with TensorFlow 2 without significant rewriting. Thus, it is bound to libraries built with CUDA 10. When this software stack is run on A100 GPUs, unexpected behavior can occur: warnings, errors and failures.

We are still looking for ways to cope with this, but as of now the BERT training recipe below is 100% guaranteed to run only on NVIDIA P100 (Pascal) and V100 (Volta) architectures.

Setting things up

  • Clone our repository: git clone https://source.coderefinery.org/nlpl/easybuild.git
  • Its directory ('easybuild') will serve as your building factory. Rename it to whatever you think fits well. Change to this directory.
  • To use the same procedure across different systems we provide a custom preparation script.
  • To prepare directories and get the path settings, run it: ./setdir.sh
  • If your EasyBuild version is different from 4.3.3, change the version in the setdir.sh script.
  • By default, your custom modules will be stored in the modules subdirectory, and the compiled software itself will be stored in the software subdirectory. If needed, you can change the EASYBUILD_INSTALLPATH_MODULES and EASYBUILD_INSTALLPATH_SOFTWARE variables in the setdir.sh script.
  • The script will create the SETUP.local file with the settings you are going to use in the future (by simply running source SETUP.local after loading EasyBuild).
  • It will also print a command for your users to run if they want to be able to load your custom modules (module use -a PATH-TO-MODULES). You can add it to your .bashrc file, or put it in all your SLURM scripts.

Building software and installing custom modules

  • Load EasyBuild (e.g., module load EasyBuild/4.3.3)
  • Load your settings: source SETUP.local
  • Check your settings: eb --show-config
  • Simulate the NVIDIA BERT installation (a dry run that builds nothing) by running

eb --robot nlpl-nvidia-bert-tf-20.06.08-gomkl-2019b-Python3.7.4.eb --dry-run

  • or eb --robot nlpl-nvidia-bert-tf-20.06.08-foss-2019b-Python3.7.4.eb --dry-run if your CPUs are not made by Intel (for example, AMD)
  • EasyBuild will show the list of required modules, marking those which have to be installed from scratch (by downloading and building the corresponding software).
  • If no warnings or errors were shown, build everything required by NVIDIA BERT:

eb --robot nlpl-nvidia-bert-tf-20.06.08-gomkl-2019b-Python3.7.4.eb

  • or eb --robot nlpl-nvidia-bert-tf-20.06.08-foss-2019b-Python3.7.4.eb if your CPUs are not made by Intel (for example, AMD)
  • After the process is finished, your modules will be visible along with the system-provided ones via module avail, and can be loaded with module load.
  • Approximate building time for all the required modules is several hours (from 3 to 5).
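
The dry run followed by the real build can be scripted. The sketch below is a hedged illustration, not part of our repository: eb_command and build are hypothetical names, the easyconfig file name is copied from the commands above, and the script assumes it is started from a shell where EasyBuild is loaded and SETUP.local has been sourced.

```python
import subprocess

# Easyconfig name pattern taken from the commands above.
EASYCONFIG = "nlpl-nvidia-bert-tf-20.06.08-{toolchain}-2019b-Python3.7.4.eb"


def eb_command(toolchain: str, dry_run: bool = False) -> list:
    """Build the eb command line for the given toolchain ('gomkl' or 'foss')."""
    if toolchain not in ("gomkl", "foss"):
        raise ValueError("toolchain must be 'gomkl' or 'foss'")
    cmd = ["eb", "--robot", EASYCONFIG.format(toolchain=toolchain)]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def build(toolchain: str) -> None:
    """Run the dry run first and start the real build only if it exits cleanly."""
    subprocess.run(eb_command(toolchain, dry_run=True), check=True)
    subprocess.run(eb_command(toolchain), check=True)
```

Calling build("foss") would run both steps in sequence. A non-zero exit status of the dry run stops the script, but warnings printed by eb still have to be read by a human.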

Fine-tuning

An important parameter to tune to your needs is the list of CUDA compute capability levels that TensorFlow and PyTorch are built for. By default, our easyconfigs build for levels 6.0 (P100) and 7.0 (V100). If your system has only one type of GPU (which is most probably the case), choose only the corresponding level; this reduces both the build time and the size of the resulting binaries. To do so, change this line in the TensorFlow or PyTorch easyconfig:

cuda_compute_capabilities = ["6.0", "7.0"]
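
As an illustration, the line for a given set of GPU models can be generated as follows. The helper name capabilities_line is hypothetical; the mapping covers only the two architectures this recipe supports (P100 has compute capability 6.0, V100 has 7.0).

```python
# Compute capabilities of the GPU generations supported by this recipe.
COMPUTE_CAPABILITY = {"P100": "6.0", "V100": "7.0"}


def capabilities_line(gpus):
    """Return the easyconfig line for the given GPU models, without duplicates."""
    levels = sorted({COMPUTE_CAPABILITY[gpu] for gpu in gpus})
    quoted = ", ".join('"{}"'.format(level) for level in levels)
    return "cuda_compute_capabilities = [{}]".format(quoted)


print(capabilities_line(["V100"]))
```

For a V100-only cluster this prints cuda_compute_capabilities = ["7.0"], which can be pasted into the easyconfig.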

Data Preparation

To train BERT, three pieces of data are required:

  • a training corpus (CORPUS), a collection of plain text files (can be gzip-compressed)
  • a WordPiece vocabulary (VOCAB), a plain text file
  • a BERT configuration (CONFIG), a JSON file defining the model hyperparameters

Ready-to-use toy example data can be found in the tests/text_data subdirectory:

  • no_wiki/: a directory with texts from Norwegian Wikipedia (about 1.2 million words)
  • norwegian_wordpiece_vocab_20k.txt: Norwegian WordPiece vocabulary (20 000 entries)
  • norbert_config.json: BERT configuration file replicating BERT-Small for English (adapted to the number of entries in the vocabulary)
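
Since the configuration has to match the number of entries in the vocabulary, a quick consistency check before training can save a failed run. This is a minimal sketch under two assumptions: the configuration stores the vocabulary size under the vocab_size key (as the original BERT configuration files do), and the vocabulary has one token per line. The function names are hypothetical.

```python
import json


def count_vocab_entries(vocab_path):
    """Count non-empty lines in a WordPiece vocabulary (one token per line)."""
    with open(vocab_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def vocab_matches_config(vocab_path, config_path):
    """Return True if the config's vocab_size equals the vocabulary size."""
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config.get("vocab_size") == count_vocab_entries(vocab_path)
```

For the toy data, the call would be vocab_matches_config("tests/text_data/norwegian_wordpiece_vocab_20k.txt", "tests/text_data/norbert_config.json").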

Training Example

  • Extend your $MODULEPATH with the path to your custom modules, using the command suggested by the setdir.sh script:
  • module use -a PATH_TO_YOUR_REPOSITORY/modules/all/
  • Load the NVIDIA BERT module:
  • module load nlpl-nvidia-bert/20.06.8-gomkl-2019b-tensorflow-1.15.2-Python-3.7.4
  • or module load nlpl-nvidia-bert/20.06.8-foss-2019b-tensorflow-1.15.2-Python-3.7.4 (if not using Intel CPUs)
  • Run the training script:
  • train_bert.sh CORPUS VOCAB CONFIG
  • This will convert your text data into TFRecord files (stored in data/tfrecords/) and then train a BERT model with batch size 48 for 1000 training steps (the model will be saved in model/)
  • Training on the toy data above takes no more than an hour on 4 Saga GPUs.
  • We use 4 GPUs by default (to test hardware and software as much as possible). Modify the train_bert.sh script to change this or other BERT training parameters.
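
For the toy data described under Data Preparation, the three arguments could be filled in as sketched below. The helper training_command is hypothetical, and the sketch assumes that CORPUS may be given as the path to the no_wiki directory and that the command is run from the repository root.

```python
import os


def training_command(data_dir="tests/text_data"):
    """Return the train_bert.sh call for the toy data shipped with the repository."""
    corpus = os.path.join(data_dir, "no_wiki")
    vocab = os.path.join(data_dir, "norwegian_wordpiece_vocab_20k.txt")
    config = os.path.join(data_dir, "norbert_config.json")
    return ["train_bert.sh", corpus, vocab, config]


print(" ".join(training_command()))
```

The printed line can be pasted into a SLURM script after the module load command.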

Testing

It is important to make sure that NumPy is built with accelerated linear algebra libraries (OpenBLAS or MKL). To test this, we provide the numpy_test.py script. Its execution time with our nlpl-numpy* modules should not be more than 10 minutes (usually much less).
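
The idea behind such a check can be sketched as follows. This is not the content of numpy_test.py, only a hedged illustration: it prints the build configuration of NumPy and times a dense matrix product, which is slow without an accelerated BLAS. The function name time_matmul and the matrix size are arbitrary choices.

```python
import time

try:
    import numpy as np
except ImportError:  # NumPy module not loaded
    np = None


def time_matmul(size=1000, repeats=3):
    """Return the best wall-clock time (seconds) of a size x size matrix product."""
    if np is None:
        return None
    rng = np.random.RandomState(0)
    a = rng.rand(size, size)
    b = rng.rand(size, size)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a.dot(b)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    if np is not None:
        np.show_config()  # lists the BLAS/LAPACK libraries NumPy was built against
    print("best matmul time (s):", time_matmul())
```

The output of np.show_config() should mention OpenBLAS or MKL, depending on the toolchain.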

It is also important to make sure that TensorFlow is built with proper GPU support. To test this, we provide the tensorflow_test.py script. It will tell you whether TensorFlow sees your GPU and will run simple calculations. The execution time should not be more than 10 minutes (usually much less).
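
A minimal GPU visibility check could look like the sketch below. It is not the tensorflow_test.py script itself; gpu_report is a hypothetical name, and the sketch relies only on tf.test.is_gpu_available and tf.test.gpu_device_name, both of which are available in TensorFlow 1.15.

```python
try:
    import tensorflow as tf
except ImportError:  # TensorFlow module not loaded
    tf = None


def gpu_report():
    """Return (gpu_available, device_name), or None if TensorFlow is missing."""
    if tf is None:
        return None
    # is_gpu_available is deprecated in TensorFlow 2 but present in 1.15.
    return tf.test.is_gpu_available(), tf.test.gpu_device_name()


if __name__ == "__main__":
    print(gpu_report())
```

On a GPU node the first element should be True and the second a device name; an empty device name means TensorFlow found no GPU.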

Possible building issues

Too restrictive login nodes

On most HPC clusters, our modules can be built on login nodes. However, on some systems, login nodes might be too restrictive in terms of the maximum number of threads or the memory limit. In this case, we recommend building heavy modules (Java, Bazel, TensorFlow, PyTorch) on compute nodes.

Distributed filesystems

There is a known problem with building some software (GCC, Java) on a distributed file system such as Lustre. If you hit it, simply change the EasyBuild build directory to a local path. For example, on the Puhti cluster one can use $TMPDIR (a local ext4 partition). When building on a compute node, the build directory should be set to $LOCAL_SCRATCH.
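
The choice of a local build directory can be sketched as follows. The helper pick_build_path is hypothetical; the sketch assumes that $LOCAL_SCRATCH and $TMPDIR point to node-local storage on your cluster (this differs between systems) and uses EASYBUILD_BUILDPATH, the environment-variable form of the eb --buildpath option.

```python
import os


def pick_build_path(env):
    """Pick a node-local build directory from an environment mapping.

    Prefers LOCAL_SCRATCH (compute nodes), then TMPDIR; returns None if
    neither is set.
    """
    for name in ("LOCAL_SCRATCH", "TMPDIR"):
        value = env.get(name)
        if value:
            return value
    return None


if __name__ == "__main__":
    path = pick_build_path(os.environ)
    if path:
        # Print a line that can be evaluated in the shell before running eb.
        print("export EASYBUILD_BUILDPATH={}".format(path))
```

The printed export line has to be executed in the shell that later runs eb.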

Another issue that can arise on distributed filesystems is that they might not support the fallocate() syscall, leading to various odd errors when compiling software (usually at the linking stage). The solution here is to change the GCC easyconfig (GCCcore-8.3.0.eb) by adding this line:

use_gold_linker = False

After this, GCC must be rebuilt.

Another solution is to add the following line to the binutils easyconfig (binutils-2.32-GCCcore-8.3.0.eb):

configopts = ' --enable-gold=no'

Binary blobs

Most modules in our virtual laboratory will be built in a completely unsupervised way, and all the necessary source archives for them will be downloaded automatically. However, some software packages require manually downloading binary blobs from their websites after registration. Our scripts cannot do that for you. Make sure to put the downloaded archives into the blobs subdirectory. Examples include:

  • jdk-8u212-linux-x64.tar.gz is required to build Java, which in turn is needed to build Bazel (for TensorFlow),
  • cudnn-10.1-linux-x64-v7.6.4.38.tgz is required to build cuDNN (for TensorFlow),
  • nccl_2.6.4-1+cuda10.1_x86_64.txz is required to build NCCL (for TensorFlow).

Your system might already provide the necessary versions of Java, cuDNN and NCCL as modules. In this case, you do not have to download anything.
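
Before starting a long build, it can help to check that the archives are in place. The sketch below is a hypothetical helper: it only looks for the three file names listed above in the blobs subdirectory and does not know whether your system already provides the corresponding modules.

```python
import os

# Archive names listed above; other module versions may need different files.
REQUIRED_BLOBS = (
    "jdk-8u212-linux-x64.tar.gz",
    "cudnn-10.1-linux-x64-v7.6.4.38.tgz",
    "nccl_2.6.4-1+cuda10.1_x86_64.txz",
)


def missing_blobs(blobs_dir="blobs"):
    """Return the required archives that are not present in blobs_dir."""
    return [name for name in REQUIRED_BLOBS
            if not os.path.isfile(os.path.join(blobs_dir, name))]


if __name__ == "__main__":
    for name in missing_blobs():
        print("missing:", name)
```

An empty result means all three archives were found.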