Difference between revisions of "Eosc/easybuild/andreku"

From Nordic Language Processing Laboratory
(Created page with "= Important stuff to remember = '''export EB_PYTHON=python3''' '''module load EasyBuild/4.3.0''' Playground on Saga: /cluster/shared/nlpl/software/easybuild4 '''export EA...")
 
(Remaining issues)
 
(24 intermediate revisions by the same user not shown)
Line 1: Line 1:
  = Important stuff to remember =
  '''export EB_PYTHON=python3'''
Line 6: Line 5:
  '''module load EasyBuild/4.3.0'''
− Playground on Saga: /cluster/shared/nlpl/software/easybuild4
+ Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak
+ Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/
+ '''export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/'''
+ (or just '''source PATH.local''')
+ Repository: https://source.coderefinery.org/nlpl/easybuild/-/tree/ak-dev
+ = Status =
+ 03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.
+ 04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243
+ 19/11/2020: '''gomkl''' toolchain built with Intel MKL 2019.1.144
+ 21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the '''gomkl''' toolchain.
+ 22/11/2020: built Horovod and made sure the TensorFlow+Horovod combination is able to train a Bert model.
+ 24/11/2020: built the [https://source.coderefinery.org/nlpl/easybuild/-/issues/12 NVIDIA BERT module].
+ 25/11/2020: solved the [https://source.coderefinery.org/nlpl/easybuild/-/issues/4 branding issue].
+ 27/11/2020: documentation to reproduce from scratch ('LUMI challenge') [https://source.coderefinery.org/nlpl/easybuild/-/issues/13 is ready]
+ 28/11/2020: [https://source.coderefinery.org/nlpl/easybuild/-/issues/11 benchmarking finalized]
+ = To use: =
+ '''module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/'''
− '''export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/shared/nlpl/software/easybuild4'''
+ '''module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4'''
− Repository: https://source.coderefinery.org/nlpl/easybuild
+ = Remaining issues =
+ * TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130. Attempts to use the latter [https://source.coderefinery.org/nlpl/easybuild/-/issues/9 failed]. Should find a way to make EasyBuild look for a non-standard CUDA location.
Latest revision as of 13:18, 28 January 2021

Important stuff to remember

export EB_PYTHON=python3

module load EasyBuild/4.3.0

Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak

Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/

export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/

(or just source PATH.local)
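
The settings above can be collected in one small file that is sourced before building. What follows is only a sketch of what such a file might look like: the actual contents of PATH.local are not shown on this page, the file name setup_easybuild.sh is made up, and the guard around `module` is an addition so that the file can also be sourced on machines without environment modules.

```shell
# Hypothetical setup file (a sketch; the real PATH.local may differ).
# Meant to be sourced, not executed:  source ./setup_easybuild.sh

# Make EasyBuild run under Python 3.
export EB_PYTHON=python3

# Load EasyBuild only where the `module` command exists (e.g. on Saga).
if type module >/dev/null 2>&1; then
    module load EasyBuild/4.3.0
else
    echo "module command not found; skipping 'module load EasyBuild/4.3.0'" >&2
fi

# Directories searched for easyconfigs when resolving dependencies:
# the system easyconfigs first, then the NLPL project tree.
export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/
```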

Repository: https://source.coderefinery.org/nlpl/easybuild/-/tree/ak-dev

Status

03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.

04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243.

19/11/2020: gomkl toolchain built with Intel MKL 2019.1.144.

21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the gomkl toolchain.

22/11/2020: built Horovod and verified that the TensorFlow+Horovod combination can train a BERT model.

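24/11/2020: built the NVIDIA BERT module (https://source.coderefinery.org/nlpl/easybuild/-/issues/12).

25/11/2020: solved the branding issue (https://source.coderefinery.org/nlpl/easybuild/-/issues/4).

27/11/2020: documentation to reproduce from scratch ('LUMI challenge') is ready (https://source.coderefinery.org/nlpl/easybuild/-/issues/13).

28/11/2020: benchmarking finalized (https://source.coderefinery.org/nlpl/easybuild/-/issues/11).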
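
Before repeating any of these builds, it may help to check how EasyBuild would resolve the dependencies without building anything. The sketch below assumes the `eb` command is available after loading EasyBuild/4.3.0 and that the robot paths are set as described earlier; the easyconfig file name is an assumption derived from the module name and may not match the file in the repository.

```shell
# Sketch: print the build plan for one easyconfig without building it.
# The easyconfig file name is an assumption derived from the module name.
EASYCONFIG=NLPL-TensorFlow-1.15.2-gomkl-2019b-Python-3.7.4.eb

if command -v eb >/dev/null 2>&1; then
    # --robot resolves dependencies; --dry-run only lists what would be built.
    eb "$EASYCONFIG" --robot --dry-run
else
    echo "eb not found; load EasyBuild/4.3.0 first" >&2
fi
```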

To use:

module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/

module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4
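
A quick way to confirm that the module works is to import TensorFlow and print its version. This is a sketch that assumes a shell on Saga with environment modules available; the variable names and the guard are illustrative additions, not part of the original instructions.

```shell
# Sketch: load the NLPL TensorFlow module and check that it imports.
NLPL_MODULE_DIR=/cluster/projects/nn9851k/software/easybuild/install/modules/all/
NLPL_TF_MODULE=NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4

if type module >/dev/null 2>&1; then
    # -a appends the directory to MODULEPATH instead of prepending it.
    module use -a "$NLPL_MODULE_DIR"
    module load "$NLPL_TF_MODULE"
    # Prints the TensorFlow version if the module loaded correctly.
    python -c 'import tensorflow as tf; print(tf.__version__)'
else
    echo "module command not found; run this on Saga" >&2
fi
```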

Remaining issues

  • TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130. Attempts to use the latter failed (https://source.coderefinery.org/nlpl/easybuild/-/issues/9). We still need a way to make EasyBuild look for CUDA in a non-standard location.