This page provides a recipe for large-scale pre-training of a BERT neural language model using the high-efficiency NVIDIA BERT implementation (which is based on TensorFlow, in contrast to the NVIDIA Megatron code).
We assume that core software (compilers, most toolchains, CUDA drivers, etc.) is already installed system-wide, or at least that its easyconfigs are available to the system-wide EasyBuild installation.
Finally, the host machine must have an Internet connection.
Setting things up
- Clone our repository: git clone firstname.lastname@example.org:nlpl/easybuild.git
- Its directory ('easybuild') will serve as your build area. Rename it if you like, then change into it.
- To keep the procedure the same across different systems, we provide a custom preparation script (setdir.sh).
- Run it to prepare the directories and obtain the path settings.
- It creates the file SETUP.local with the settings you will use from then on (simply run source SETUP.local after loading EasyBuild).
- It will also print a command for your users to run if they want to be able to load your custom modules.
Building software and installing custom modules
- Load EasyBuild (e.g., module load EasyBuild/4.3.0)
- Load your settings: source SETUP.local
- Check your settings: eb --show-config
- Simulate the NVIDIA BERT installation with a dry run:
eb --robot nlpl-nvidia-bert-tf-20.06.08-gomkl-2019b-Python3.7.4.eb --dry-run
- EasyBuild will show the list of required modules, marking those which have to be installed from scratch (by downloading and building the corresponding software).
- If no warnings or errors are shown, build everything required by NVIDIA BERT:
eb --robot nlpl-nvidia-bert-tf-20.06.08-gomkl-2019b-Python3.7.4.eb
- After the process is finished, your modules will be visible along with the system-provided ones via module avail, and can be loaded with module load.
- Building all the required modules takes approximately 3 to 5 hours.
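The dry-run-then-build sequence above can be wrapped in a small shell function. This is only a sketch: build_nvidia_bert is a hypothetical helper name, not part of the repository, and it assumes that EasyBuild is already loaded and SETUP.local has been sourced. It uses the exit status of the dry run as a proxy for "no errors"; warnings printed by EasyBuild still have to be read by a human.

```bash
# Easyconfig file name as given on this page.
EASYCONFIG="nlpl-nvidia-bert-tf-20.06.08-gomkl-2019b-Python3.7.4.eb"

# Hypothetical helper: start the real build only if the dry run succeeds.
build_nvidia_bert() {
    if eb --robot "$EASYCONFIG" --dry-run; then
        # The dry run exited cleanly, so run the actual (multi-hour) build.
        eb --robot "$EASYCONFIG"
    else
        echo "Dry run failed; fix the reported problems before building." >&2
        return 1
    fi
}
```

Calling build_nvidia_bert from the build directory then runs both steps in order and stops after the dry run if it reports an error.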
To train BERT, three pieces of data are required:
- a training corpus (CORPUS), a collection of plain text files (optionally gzip-compressed)
- a WordPiece vocabulary (VOCAB), a plain text file
- a BERT configuration (CONFIG), a JSON file defining the model hyperparameters
Ready-to-use toy example data can be found in the tests/text_data subdirectory:
- no_wiki/: a directory with texts from Norwegian Wikipedia (about 1.2 million words)
- norwegian_wordpiece_vocab_20k.txt: Norwegian WordPiece vocabulary (20 000 entries)
- norbert_config.json: BERT configuration file replicating BERT-Small for English (adapted to the number of entries in the vocabulary)
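Because the configuration has to match the number of entries in the vocabulary, it can be worth checking the two files against each other before training. The sketch below assumes one WordPiece entry per line in VOCAB and that CONFIG stores the size under the key "vocab_size" (the key used in standard BERT configuration files; whether norbert_config.json follows this is an assumption). check_vocab_size is a hypothetical helper, not part of the repository.

```bash
# Hypothetical helper: check_vocab_size VOCAB CONFIG
check_vocab_size() {
    vocab_file=$1
    config_file=$2
    # Count vocabulary lines (also counts a last line without a trailing newline).
    vocab_lines=$(grep -c '' "$vocab_file")
    # Extract the integer that follows "vocab_size" (key name is an assumption).
    config_size=$(sed -n 's/.*"vocab_size"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$config_file" | head -n 1)
    if [ "$vocab_lines" = "$config_size" ]; then
        echo "OK: $vocab_lines entries"
    else
        echo "Mismatch: vocabulary has $vocab_lines lines, config says ${config_size:-unknown}" >&2
        return 1
    fi
}
```

For the toy data, the call would be check_vocab_size norwegian_wordpiece_vocab_20k.txt norbert_config.json, run from the tests/text_data subdirectory.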
- Extend your $MODULEPATH with the path to your custom modules, using the command suggested by the setdir.sh script:
module use -a PATH_TO_YOUR_REPOSITORY/easybuild/install/modules/all/
- Load the NVIDIA BERT module:
module load nlpl-nvidia-bert/20.06.8-gomkl-2019b-tensorflow-1.15.2-Python-3.7.4
- Run the training script:
train_bert.sh CORPUS VOCAB CONFIG
- This will convert your text data into TFRecord files (stored in data/tfrecords/) and then train a BERT model with batch size 48 for 1000 training steps (the model will be saved in model/).
- Training on the toy data above takes no more than an hour on 4 Saga GPUs.
- We use 4 GPUs by default (to test hardware and software as much as possible). Modify the train_bert.sh script to change this or other BERT training parameters.
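As an illustration, a run on the toy data could be wrapped as follows. This is a sketch under stated assumptions: run_toy_training is a hypothetical helper, the default data directory assumes you are in the root of the repository, and train_bert.sh is assumed to be on your PATH once the NVIDIA BERT module is loaded. The function only checks that the three inputs exist and then calls the training script with them in the CORPUS VOCAB CONFIG order.

```bash
# Hypothetical helper: run_toy_training [DATA_DIR]
run_toy_training() {
    data_dir=${1:-tests/text_data}
    corpus="$data_dir/no_wiki/"
    vocab="$data_dir/norwegian_wordpiece_vocab_20k.txt"
    config="$data_dir/norbert_config.json"

    # Fail early if any of the three inputs is missing.
    [ -d "$corpus" ] || { echo "Missing corpus directory: $corpus" >&2; return 1; }
    for f in "$vocab" "$config"; do
        [ -f "$f" ] || { echo "Missing file: $f" >&2; return 1; }
    done

    # Assumes the NVIDIA BERT module is loaded, so train_bert.sh is on PATH.
    train_bert.sh "$corpus" "$vocab" "$config"
}
```

Run it after the module load step; the converted data and the trained model should then appear in data/tfrecords/ and model/ as described above.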