Difference between revisions of "Eosc/easybuild/andreku"
Revision as of 12:58, 5 January 2021
Important stuff to remember
export EB_PYTHON=python3
module load EasyBuild/4.3.0
Playground on Saga: /cluster/shared/nlpl/software/easybuild_ak
Real module location: /cluster/projects/nn9851k/software/easybuild/install/modules/all/
export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/
(or just source PATH.local)
Repository: https://source.coderefinery.org/nlpl/easybuild/-/tree/ak-dev
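The settings above can be collected into one file that is sourced at the start of a session. The following is a minimal sketch of what such a file might look like. The actual contents of PATH.local are not shown on this page, so the layout here is an assumption; only the variable values and the module name are taken from the notes above. The guard around the module command is also an addition, so that the file can be sourced on a machine without Lmod.

```shell
# Hypothetical contents for a file like PATH.local (the real file is not shown here).
# Run EasyBuild under Python 3.
export EB_PYTHON=python3

# Robot search path: the system easyconfigs first, then the project tree.
export EASYBUILD_ROBOT_PATHS=/cluster/software/EasyBuild/4.3.0/easybuild/easyconfigs:/cluster/projects/nn9851k/software/

# Load EasyBuild only where the module command exists (for example on Saga).
if command -v module >/dev/null 2>&1; then
    module load EasyBuild/4.3.0
else
    echo "module command not found; skipping 'module load EasyBuild/4.3.0'" >&2
fi
```

The file has to be sourced rather than executed, so that the exported variables end up in the current shell.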
Status
03/11/2020: successfully built cython-0.29.21-foss-2019b-Python-3.7.4, numpy-1.18.1-foss-2019b-Python-3.7.4, SciPy-bundle-2020.03-foss-2019b-Python-3.7.4, Bazel-0.26.1-foss-2019b, h5py-2.10.0-foss-2019b-Python-3.7.4.
04/11/2020: TensorFlow 1.15.2 successfully built and installed, using CUDA 10.1.243
19/11/2020: gomkl toolchain built with Intel MKL 2019.1.144
21/11/2020: successfully built everything (including TensorFlow 1.15.2) with the gomkl toolchain.
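22/11/2020: built Horovod and verified that the TensorFlow+Horovod combination can train a BERT model.
24/11/2020: built the NVIDIA BERT module (https://source.coderefinery.org/nlpl/easybuild/-/issues/12).
25/11/2020: solved the branding issue (https://source.coderefinery.org/nlpl/easybuild/-/issues/4).
27/11/2020: documentation for reproducing the setup from scratch (the 'LUMI challenge') is ready (https://source.coderefinery.org/nlpl/easybuild/-/issues/13).
28/11/2020: benchmarking finalized (https://source.coderefinery.org/nlpl/easybuild/-/issues/11).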
To use:
module use -a /cluster/projects/nn9851k/software/easybuild/install/modules/all/
module load NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4
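A quick way to confirm that the module works after loading it is to import TensorFlow and print its version. The sketch below wraps the two commands above with that check. The variable names MODULE_ROOT and TF_MODULE are introduced here for illustration only, and the fallback message for machines without the module command is an addition.

```shell
# Illustrative variable names; the values are the ones given above.
MODULE_ROOT=/cluster/projects/nn9851k/software/easybuild/install/modules/all/
TF_MODULE=NLPL-TensorFlow/1.15.2-gomkl-2019b-Python-3.7.4

if command -v module >/dev/null 2>&1; then
    # Append the NLPL module tree to the search path and load TensorFlow.
    module use -a "$MODULE_ROOT"
    module load "$TF_MODULE"
    # Sanity check: import TensorFlow and print the version it reports.
    python3 -c 'import tensorflow as tf; print(tf.__version__)'
else
    echo "module command not found; run this on a system with environment modules" >&2
fi
```

If the import fails, `module list` shows which modules are actually loaded in the current shell.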
Remaining issues
- TensorFlow is built with CUDA 10.1.243, not CUDA 10.0.130. Attempts to use the latter failed (https://source.coderefinery.org/nlpl/easybuild/-/issues/9). We still need a way to make EasyBuild look for CUDA in a non-standard location.
- Add easyconfigs for TensorFlow 2.0 and Transformers.