Infrastructure/software/python

Background

In mid-2018 at least, the Python programming language plays a central role in much data science and machine learning work. Comprehensive programming environments like SciKitLearn, PyTorch, or TensorFlow often make it possible to deploy advanced machine learning techniques with a few lines of ‘glue’ code in Python. The Python ecosystem is characterized by rapid evolution and fragmentation: the core language continues to evolve, two mutually incompatible major versions remain in common use, and even between minor versions there have at times been incompatible changes. On top of this dynamically evolving core, there is a vast landscape of independently developed add-on modules, some of which are near-universally used (e.g. NumPy or MatPlotLib). For these reasons, maintaining a Python environment for NLPL that is to some degree standardized and parallel across the various instances of the virtual laboratory is no small challenge.

Basic Principles

NLPL standardizes on Python 3.x. Throughout 2018, we used Python version 3.5 as the default starting point for NLPL-specific installations of add-on modules, on both Abel and Taito. In early 2019, we are adding experimental support for multiple Python versions, notably 3.7 (as the new default) in addition to 3.5. The general philosophy is to emphasize (a) parallelism, (b) modularity, and (c) replicability. The NLPL software environment should behave the same across the different systems on which the virtual laboratory is instantiated. Within reason, it should be possible to ‘mix and match’ modules from the NLPL software (and data) repository. Finally, a specific version of a module should not change (in a user-visible way) once installed and announced. From these principles, it follows that NLPL avoids module ‘bundling’ as far as possible; individual add-on components are provisioned as separate modules, each with its own version history.

Typically, users may not care deeply about specific module versions and just ask for the (current) default version. However, as newer versions are installed, the default will change over time. Preserving older modules unchanged (rather than revising them incrementally) means that users retain full control to select (combinations of) specific module versions. With great power comes great responsibility.
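
For replicability, the pinned form is what experiments should record; a minimal sketch of the two usage styles (module names follow the name/version/python scheme used throughout this page; default resolution is handled by the module system):

module load nlpl-gensim              # unpinned: resolves to the current default version
module load nlpl-gensim/3.7.3/3.7   # pinned: GenSim 3.7.3 built against Python 3.7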

Mixing and Matching

The ‘right’ way of balancing modularity against user convenience remains to be determined. In mid-2018, for example, the Keras abstraction layer is tightly coupled to TensorFlow; hence, Keras is not currently isolated as a separate module. Likewise, some add-on modules are intricately linked to the core language, almost to the point of forming an ‘extended standard library’; NumPy is a prime example in this category. Middle-ground candidates for modularization, arguably, are SciKitLearn and what is at times called the SciPy ecosystem, which (besides NumPy) bundles, among others, MatPlotLib, IPython, and Pandas. The NLPL infrastructure task force will be grateful for feedback on how to best cut this pie! For example, to make the NLPL module repository visible and load a specific combination of modules:

module use -a /proj*/nlpl/software/modulefiles
module load nlpl-nltk/3.4/3.7 nlpl-gensim/3.7.3/3.7 nlpl-pytorch/1.1.0/3.7

Packaging Ideas

To avoid ‘balkanizing’ the NLPL module inventory too much, we plan on maintaining a few ‘natural’ bundles, for example the so-called SciPy ecosystem. As the SciPy bundle comprises a number of independent components that are each developed and released individually, an abstract scheme is required to assign version numbers to the bundle at large. Aiming for somewhat regular updates, say at four-month intervals, version numbers will be composed of six-digit YYYYMM identifiers, i.e. four digits for the calendar year followed by a two-digit number for the calendar month; the bundle released in January 2019, for example, carries the version 201901.
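
Since the identifier is simply the release date truncated to the month, it can be derived mechanically; in shell:

date +%Y%m    # prints, e.g., 201901 when run in January 2019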

Component      201810   201901   201906
NumPy          1.14.6   —
SciPy Library  1.1.0    1.2.0
MatPlotLib     3.0.0    3.0.2
IPython        7.0.1    7.2.0
Pandas         0.23.4   0.24.0
SciKit-Learn   —        0.20.2
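
Loading a bundle then pins all of its components at once; for example, to select the January 2019 snapshot built against Python 3.7 (the module path is composed according to the scheme above):

module load nlpl-scipy/201901/3.7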

Automation

On Abel, there is emerging support for the largely automated installation of Python modules, in a modular manner that allows each module to be built against multiple base Python versions.

The following is an example NLPL-specific module specification, from /projects/nlpl/operations/python/pytorch.txt (as of May 2019); by the naming scheme above, the ${dialect} variable stands for the active base Python version (e.g. 3.5 or 3.7):

#
#$ module load gcc/4.9.2 cuda/9.0
#$ module load nlpl-numpy/1.16.3/${dialect}
#$ module load nlpl-scipy/201901/${dialect}
#
torch
torchvision
torchtext

The above can be installed using a (somewhat elaborate) shell script, which will (a) create a virtual environment for the new module; (b) make sure the base Python environment is up-to-date (e.g. its pip); (c) execute the shell commands included as ‘#$’ comments in the specification file; (d) create a new module definition, including module dependencies; (e) activate the new module and pip install the requirements listed in the module specification; and, finally, (f) adjust the contents of the bin/ sub-directory in the new module, moving all commands and scripts into a lower sub-directory reflecting the base Python version. For example:

# build the pytorch module against each supported base Python version
for i in 3.5.5 3.7.0; do
  module purge; module load python3/$i;
  /projects/nlpl/operation/python/initialize --version 1.1.0 pytorch
done

Challenges and Ideas

The division of labor between central IT support and NLPL remains hard to define in a principled way. Since mid-2018, we have built on system-wide ‘core’ Python installations (on both Abel and Taito) and maintained everything else ourselves; so far, we have managed to get by with installing only pre-built wheels. However, the core Python environments include some add-on modules that end up superseded by newer versions in the NLPL add-ons, notably NumPy. On Abel at least (and possibly on Taito too; I have yet to check), the core NumPy was built locally, against the MKL implementation of BLAS and LAPACK; this MKL linkage is absent from the newer NLPL modules that include NumPy (e.g. the SciPy bundle and TensorFlow):

# inspect which NumPy is active and which BLAS/LAPACK it was built against
module purge; module load nlpl-tensorflow
python3 -c "import numpy as np; print(np.__version__); np.__config__.show();"

In principle, one could reduce most of the virtual environments to just the wheels that they provide, i.e. remove everything else from the standard virtual environment creation, including most binaries (that are not module-specific). Because wheels are organized in a directory hierarchy relative to each Python minor version, this should make it possible to install the code for add-on components (like MatPlotLib, GenSim, or NLTK) for multiple Python versions into the same module. Some modules also install their own binaries, of course, so for this scheme to scale (over time) it might make sense to establish separate bin/ sub-folders for each ‘base’ Python version.
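
A hypothetical on-disk layout for such a multi-version module might look as follows (the module name and paths are invented for illustration):

nlpl-gensim/3.7.3/
  lib/python3.5/site-packages/...
  lib/python3.7/site-packages/...
  bin/3.5/
  bin/3.7/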

As the NLPL collection of modules grows, it may well be helpful to define common ‘environment’ modules, e.g. a combination of individual components (in specific versions) that are often used in conjunction and are known to be mutually interoperable.
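
As a sketch, such an environment module could reduce a known-good combination to a single command (the nlpl-environment name and its contents are invented for illustration):

module load nlpl-environment/201901
# hypothetically equivalent to, say:
# module load nlpl-numpy/1.16.3/3.7 nlpl-scipy/201901/3.7 nlpl-pytorch/1.1.0/3.7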

The module definitions (in Tcl or Lua) should in principle be able to inspect the (module) environment and select which specific version of a Python add-on to activate based on the active ‘base’ Python interpreter.
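
The intended user-visible behavior might then look as follows (this resolution logic is a proposal, not an implemented feature):

module load python3/3.7.0
module load nlpl-gensim/3.7.3    # the modulefile would detect Python 3.7 and activate the 3.7 build
module purge; module load python3/3.5.5
module load nlpl-gensim/3.7.3    # same modulefile, but now activating the 3.5 build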