Important stuff to remember
module load EasyBuild/4.3.0
Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak
Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/
(or just source PATH.local)
03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.
04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243
19/11/2020: gomkl toolchain built with Intel MKL 2019.1.144
21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the gomkl toolchain.
22/11/2020: built Horovod and made sure the TensorFlow+Horovod combination is able to train a BERT model.
24/11/2020: built the NVIDIA BERT module.
25/11/2020: solved the branding issue.
27/11/2020: documentation to reproduce everything from scratch ('LUMI challenge') is ready.
28/11/2020: benchmarking finalized.
module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/
module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4
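A quick sanity check after loading the module (hypothetical session on a GPU node; uses the TF 1.x API):

```shell
# Verify the loaded TensorFlow version and that it can see a GPU:
python -c 'import tensorflow as tf; print(tf.__version__)'
python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'
```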
- TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130: attempts to use the latter failed. We should find a way to make EasyBuild look for CUDA in a non-standard location.
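One possible route for the non-standard CUDA location (an untested sketch, not something we have verified): EasyBuild can pick up software installed outside its control if the easyconfig declares it as an external module and a metadata file maps that module name to the actual install prefix. Roughly, assuming a hypothetical system CUDA under /appl/opt/cuda/10.1.243:

```shell
# Hypothetical: map an external module name to a non-standard CUDA prefix.
cat > cuda_metadata.ini <<'EOF'
[CUDA/10.1.243]
name = CUDA
version = 10.1.243
prefix = /appl/opt/cuda/10.1.243
EOF
# The easyconfig must list CUDA/10.1.243 as an EXTERNAL_MODULE dependency.
eb TensorFlow-1.15.2-gomkl-2019b-Python-3.7.4.eb --robot \
   --external-modules-metadata=cuda_metadata.ini
```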
Deployment on Puhti
- Install EasyBuild (we need its easyconfigs in robot-paths)
- module use -a /projappl/nlpl/software/andreku/modules/all
- module load EasyBuild
- source SETUP.local
We are building everything from scratch, borrowing the necessary binary blobs from Saga; on Puhti they are stored in /projappl/nlpl/software/andreku/extdownloads. In fact, only three such tarballs are needed to eventually build the NVIDIA BERT implementation:
- jdk-8u212-linux-x64.tar.gz to build Java to build Bazel to build TensorFlow,
- cudnn-10.1-linux-x64-v184.108.40.206.tgz to build cuDNN to build TensorFlow,
- nccl_2.6.4-1+cuda10.1_x86_64.txz to build NCCL to build TensorFlow.
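With the tarballs in place, EasyBuild can be pointed at them instead of trying to download (sketch; the easyconfig filename here is an assumption based on the module name above):

```shell
# Let EasyBuild find the pre-downloaded blobs instead of fetching them:
eb TensorFlow-1.15.2-gomkl-2019b-Python-3.7.4.eb --robot \
   --sourcepath=/projappl/nlpl/software/andreku/extdownloads
```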
All the required easyconfigs, scripts, etc. can be found in the `scratch` branch.
There was an issue with building GCC, fixed by manual intervention in the MPC tarball. A more general solution (applicable not only to GCC but also to Java and other packages) is to move the EasyBuild build directory to a location outside the Lustre distributed file system; one can simply use $TMPDIR, which is a local ext4 partition.
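For example (sketch; EasyBuild reads any configuration option from a matching EASYBUILD_* environment variable):

```shell
# Build in node-local $TMPDIR instead of on Lustre:
export EASYBUILD_BUILDPATH=$TMPDIR/$USER/easybuild
eb GCC-8.3.0.eb --robot
```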
Bazel, TensorFlow and PyTorch must be built on a compute node: the Puhti login nodes appear to be too restrictive with respect to the maximum number of threads. In this case, the build directory should be set to $LOCAL_SCRATCH:
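A batch-job sketch for such builds (resource values and the account name are placeholders; on Puhti, requesting node-local NVMe via --gres=nvme is what makes $LOCAL_SCRATCH available):

```shell
#!/bin/bash
#SBATCH --account=<project>       # placeholder: your CSC project
#SBATCH --partition=small
#SBATCH --cpus-per-task=16
#SBATCH --time=12:00:00
#SBATCH --mem=32G
#SBATCH --gres=nvme:100           # node-local NVMe, exposed as $LOCAL_SCRATCH

module use -a /projappl/nlpl/software/andreku/modules/all
module load EasyBuild
# Build on the node-local disk, not on Lustre:
export EASYBUILD_BUILDPATH=$LOCAL_SCRATCH/easybuild
eb Bazel-0.26.1-foss-2019b.eb --robot   # easyconfig name per the build log above
```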