Important stuff to remember
module load EasyBuild/4.3.0
Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak
Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/
(or just source PATH.local)
03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.
04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243
19/11/2020: gomkl toolchain built with Intel MKL 2019.1.144
21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the gomkl toolchain.
22/11/2020: built Horovod and made sure the TensorFlow+Horovod combination is able to train a BERT model.
24/11/2020: built the NVIDIA BERT module.
25/11/2020: solved the branding issue.
27/11/2020: documentation to reproduce from scratch ('LUMI challenge') is ready
28/11/2020: benchmarking finalized
module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/
module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4
- TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130; attempts to use the latter failed. TODO: find a way to make EasyBuild look for CUDA in a non-standard location.
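After loading the module, a quick sanity check can confirm which TensorFlow is picked up and whether it was built with CUDA. This is only a sketch: the helper name check_tf is hypothetical, and it assumes the module puts a python3 or python executable on PATH.

```shell
# Hypothetical helper: report the TensorFlow version and whether it was built with CUDA.
check_tf() {
    # Pick whichever Python interpreter is on PATH (assumption: the module provides one).
    py=$(command -v python3 || command -v python) || { echo "no python found"; return 1; }
    # tf.test.is_built_with_cuda() reports whether this TensorFlow build has CUDA support.
    "$py" -c 'import tensorflow as tf; print(tf.__version__, tf.test.is_built_with_cuda())' 2>/dev/null \
        || { echo "TensorFlow not importable; load the module first"; return 1; }
}

check_tf || true
```

Run without the module loaded, the helper only prints a hint and returns a non-zero status.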
Deployment on Puhti
- module use -a /projappl/nlpl/software/andreku/modules/all
- module load EasyBuild
- source SETUP.local
We are building everything from scratch. We borrow the necessary binary blobs (for example, the JDK tarball) from Saga; they are stored on Puhti in /projappl/nlpl/software/andreku/extdownloads
There is an issue with building GCC, which was fixed by manual intervention in the MPC tarball. A more general solution (applicable not only to Java, but to other packages as well) is to change the EasyBuild build directory to any location that is not on the Lustre distributed file system. I simply use /tmp/andreku/, which is local ext4.
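One way to redirect the build directory is through EasyBuild's build path setting, available as the --buildpath option or the EASYBUILD_BUILDPATH environment variable. The sketch below is not necessarily how it was done here; the directory name easybuild-build is an assumption, and the filesystem check relies on GNU stat.

```shell
# Assumption: node-local scratch under /tmp (the notes above use /tmp/andreku/).
BUILD_DIR="/tmp/${USER:-$(id -un)}/easybuild-build"   # hypothetical directory name
mkdir -p "$BUILD_DIR"

# Print the filesystem type (GNU stat); it should not report "lustre".
fs_type=$(stat -f -c %T "$BUILD_DIR" 2>/dev/null || echo unknown)
echo "build dir: $BUILD_DIR (filesystem: $fs_type)"

# EASYBUILD_BUILDPATH is the environment-variable form of eb's --buildpath option.
export EASYBUILD_BUILDPATH="$BUILD_DIR"
```

Because /tmp is local to each node, the export has to be repeated on whichever node actually runs eb.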
Bazel and TensorFlow must be built on a compute node; it seems that the Puhti login nodes are too restrictive in terms of the maximum number of threads.
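To compare the limits on a login node and a compute node, the per-user process/thread limit and the visible core count can be printed on each. This is a minimal sketch; the Slurm command in the comment uses placeholder values that are assumptions, not the settings actually used on Puhti.

```shell
# Per-user limit on processes/threads; some shells do not support "ulimit -u".
max_procs=$(ulimit -u 2>/dev/null || echo unknown)

# Number of cores visible to this shell.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

echo "max user processes: $max_procs, cores: $cores"

# To get an interactive shell on a compute node (placeholders are assumptions):
#   srun --account=<project> --partition=<partition> --cpus-per-task=<n> --time=<hh:mm:ss> --pty bash
```

Running the same lines on both node types shows whether the login node's limit is the lower one.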