Difference between revisions of "Eosc/easybuild/andreku"

From Nordic Language Processing Laboratory
(Status)
(Deployment on Puhti)
 
(18 intermediate revisions by the same user not shown)
Line 7: Line 7:
 
Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak
 
  
'''export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/shared/nlpl/software/easybuild_ak'''
+
Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/
 +
 
 +
'''export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/'''
  
 
(or just '''source PATH.local''')
 
Line 23: Line 25:
  
 
22/11/2020: built Horovod and made sure the TensorFlow+Horovod combination is able to train a BERT model.
 
 +
 +
24/11/2020: built the [https://source.coderefinery.org/nlpl/easybuild/-/issues/12 NVIDIA BERT module].
 +
 +
25/11/2020: solved the [https://source.coderefinery.org/nlpl/easybuild/-/issues/4 branding issue].
 +
 +
27/11/2020: documentation to reproduce from scratch ('LUMI challenge') [https://source.coderefinery.org/nlpl/easybuild/-/issues/13 is ready]
 +
 +
28/11/2020: [https://source.coderefinery.org/nlpl/easybuild/-/issues/11 benchmarking finalized]
  
 
= To use: =
 
'''module use -a /cluster/shared/nlpl/software/easybuild_ak/easybuild/install/modules/all/'''
+
'''module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/'''
  
 
'''module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4'''
 
  
 
= Remaining issues =
 
* [https://source.coderefinery.org/nlpl/easybuild/-/issues/12 NVIDIA Bert implementation packaged as a module]
 
* In order to keep the modules NLPL-branded, [https://source.coderefinery.org/nlpl/easybuild/-/issues/4#note_13895 environment variables must be added manually to the module files]. Without that, modules load fine, but cannot be used as dependencies in building other modules.
 
 
* TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130. Attempts to use the latter [https://source.coderefinery.org/nlpl/easybuild/-/issues/9 failed]. We still need to find a way to make EasyBuild look for CUDA in a non-standard location.
 
* Check whether using '''gompi''' instead of '''gompic''' (with CUDA) [https://source.coderefinery.org/nlpl/easybuild/-/issues/9#note_13894 leads to problems with multi-node training]. Multi-GPU training on a single node is confirmed to work.
+
 
* [https://source.coderefinery.org/nlpl/easybuild/-/issues/13 Documentation]
+
= Deployment on Puhti =
 +
Location: /projappl/nlpl/software/andreku
 +
 
 +
Initial setup:
 +
 
 +
* Install EasyBuild (we need its easyconfigs in '''robot-paths''')
 +
* '''module use -a /projappl/nlpl/software/andreku/modules/all'''
 +
* '''module load EasyBuild'''
 +
* '''source SETUP.local'''
 +
 
 +
We are building everything from scratch. We borrow the necessary binary blobs from Saga; they are stored on Puhti in /projappl/nlpl/software/andreku/extdownloads. In fact, only three such tarballs are needed to eventually build the NVIDIA BERT implementation:
 +
# jdk-8u212-linux-x64.tar.gz to build Java to build Bazel to build TensorFlow,
 +
# cudnn-10.1-linux-x64-v7.6.4.38.tgz to build cuDNN to build TensorFlow,
 +
# nccl_2.6.4-1+cuda10.1_x86_64.txz to build NCCL to build TensorFlow.
 +
 
 +
All the required easyconfigs, scripts, etc. can be found in the [https://source.coderefinery.org/nlpl/easybuild/-/tree/scratch `scratch` branch].
 +
 
 +
There was an [https://github.com/easybuilders/easybuild-easyconfigs/issues/12321 issue with building GCC], fixed by manual intervention in the MPC tarball. A more general solution (applicable not only to GCC but also to Java and other packages) is to change the EasyBuild build directory to any location that is not on the Lustre distributed file system. One can simply use '''$TMPDIR''' (this is a local ext4 partition).
 +
 
 +
Bazel, TensorFlow and PyTorch must be built on a compute node; it seems that the Puhti login nodes limit the maximum number of threads too strictly. In this case, the build directory should be set to $LOCAL_SCRATCH:
 +
 
 +
'''export EASYBUILD_BUILDPATH=$LOCAL_SCRATCH'''

Latest revision as of 18:55, 15 April 2021

Important stuff to remember

export EB_PYTHON=python3

module load EasyBuild/4.3.0

Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak

Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/

export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/

(or just source PATH.local)
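
The settings above can be collected into one file to source. The following is only a sketch of what a file like PATH.local might contain; the real file is not shown on this page, and the variable names SYSTEM_EASYCONFIGS and PROJECT_ROOT are illustrative assumptions.

```shell
# Hypothetical contents for a file like PATH.local (the real file is not shown here).
export EB_PYTHON=python3

# System easyconfigs first, then the project tree (paths taken from the notes above).
SYSTEM_EASYCONFIGS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs
PROJECT_ROOT=/cluster/projects/nn9851k/software/
export EASYBUILD_ROBOT_PATHS="${SYSTEM_EASYCONFIGS}:${PROJECT_ROOT}"

# Load EasyBuild only where the module command is available (i.e. on the cluster).
if command -v module >/dev/null 2>&1; then
    module load EasyBuild/4.3.0
fi
```

Sourcing such a file (rather than executing it) keeps the exported variables in the current shell.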

Repository: https://source.coderefinery.org/nlpl/easybuild/-/tree/ak-dev

Status

03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.

04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243

19/11/2020: gomkl toolchain built with Intel MKL 2019.1.144

21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the gomkl toolchain.

22/11/2020: built Horovod and made sure the TensorFlow+Horovod combination is able to train a BERT model.

24/11/2020: built the NVIDIA BERT module.

25/11/2020: solved the branding issue.

27/11/2020: documentation to reproduce from scratch ('LUMI challenge') is ready

28/11/2020: benchmarking finalized

To use:

module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/

module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4
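
A quick way to confirm that the module works after loading is to import TensorFlow and print its version. This is a minimal sketch, assuming the `module` command is available as on Saga and that `python` resolves to the interpreter provided by the module; the variable names MODULE_ROOT and MODULE_NAME are illustrative.

```shell
# Module tree and module name as given above.
MODULE_ROOT=/cluster/projects/nn9851k/software/easybuild/install/modules/all/
MODULE_NAME=NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4

if command -v module >/dev/null 2>&1; then
    module use -a "$MODULE_ROOT"
    module load "$MODULE_NAME"
    # Print the TensorFlow version and whether a GPU is visible to it.
    python -c 'import tensorflow as tf; print(tf.__version__, tf.test.is_gpu_available())'
else
    echo "module command not found; run this on the cluster" >&2
fi
```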

Remaining issues

  • TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130. Attempts to use the latter failed. We still need to find a way to make EasyBuild look for CUDA in a non-standard location.

Deployment on Puhti

Location: /projappl/nlpl/software/andreku

Initial setup:

  • Install EasyBuild (we need its easyconfigs in robot-paths)
  • module use -a /projappl/nlpl/software/andreku/modules/all
  • module load EasyBuild
  • source SETUP.local

We are building everything from scratch. We borrow the necessary binary blobs from Saga; they are stored on Puhti in /projappl/nlpl/software/andreku/extdownloads. In fact, only three such tarballs are needed to eventually build the NVIDIA BERT implementation:

  1. jdk-8u212-linux-x64.tar.gz to build Java to build Bazel to build TensorFlow,
  2. cudnn-10.1-linux-x64-v7.6.4.38.tgz to build cuDNN to build TensorFlow,
  3. nccl_2.6.4-1+cuda10.1_x86_64.txz to build NCCL to build TensorFlow.
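
Before starting a long build, it may help to check that the three tarballs listed above are actually present. The following is a small sketch; the function name missing_tarballs and the EXTDOWNLOADS override are illustrative and not part of the repository.

```shell
# Directory from the text above; can be overridden through the environment.
EXTDOWNLOADS="${EXTDOWNLOADS:-/projappl/nlpl/software/andreku/extdownloads}"

# Print the name of every required tarball that is absent from directory $1.
missing_tarballs() {
    for tarball in \
        jdk-8u212-linux-x64.tar.gz \
        cudnn-10.1-linux-x64-v7.6.4.38.tgz \
        "nccl_2.6.4-1+cuda10.1_x86_64.txz"
    do
        [ -f "$1/$tarball" ] || echo "$tarball"
    done
}

missing_tarballs "$EXTDOWNLOADS"
```

An empty output means that all three files were found.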

All the required easyconfigs, scripts, etc. can be found in the `scratch` branch.

There was an issue with building GCC, fixed by manual intervention in the MPC tarball. A more general solution (applicable not only to GCC but also to Java and other packages) is to change the EasyBuild build directory to any location that is not on the Lustre distributed file system. One can simply use $TMPDIR (this is a local ext4 partition).
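
One way to verify that a candidate build directory is not on Lustre is to ask for its file system type. This is a sketch assuming GNU coreutils `stat` (as found on typical Linux clusters), which reports the type as a short name such as `lustre` or `ext2/ext3`; the helper name fs_type and the variable BUILD_DIR are illustrative.

```shell
# Report the file system type of a directory (GNU coreutils stat assumed).
fs_type() {
    stat -f -c %T "$1"
}

# Use $TMPDIR as the build directory, falling back to /tmp if it is unset.
BUILD_DIR="${TMPDIR:-/tmp}"
if [ "$(fs_type "$BUILD_DIR")" = "lustre" ]; then
    echo "warning: $BUILD_DIR is on Lustre; choose another build directory" >&2
else
    export EASYBUILD_BUILDPATH="$BUILD_DIR"
fi
```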

Bazel, TensorFlow and PyTorch must be built on a compute node; it seems that the Puhti login nodes limit the maximum number of threads too strictly. In this case, the build directory should be set to $LOCAL_SCRATCH:

export EASYBUILD_BUILDPATH=$LOCAL_SCRATCH
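
Since $LOCAL_SCRATCH is presumably only set inside a job on a compute node, the same setup script can fall back to $TMPDIR elsewhere. A minimal sketch, where the function name pick_buildpath is an illustrative assumption:

```shell
# Pick a build directory: node-local scratch if present, else $TMPDIR, else /tmp.
pick_buildpath() {
    if [ -n "${LOCAL_SCRATCH:-}" ] && [ -d "$LOCAL_SCRATCH" ]; then
        echo "$LOCAL_SCRATCH"
    else
        echo "${TMPDIR:-/tmp}"
    fi
}

export EASYBUILD_BUILDPATH="$(pick_buildpath)"
echo "EasyBuild will build in: $EASYBUILD_BUILDPATH"
```

This keeps one setup script usable both on login nodes (for light builds) and inside compute-node jobs (for Bazel, TensorFlow and PyTorch).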