Difference between revisions of "Infrastructure/software/python"

From Nordic Language Processing Laboratory

Revision as of 14:01, 2 October 2018

Background

In mid-2018 at least, the Python programming language takes a central role in much data science and machine learning work. Comprehensive programming environments like SciKitLearn, PyTorch, or TensorFlow often make it possible to deploy advanced machine learning techniques with a few lines of ‘glue’ code in Python. The Python ecosystem is characterized by high degrees of evolution and fragmentation: the core language is still developing, there are two mutually incompatible major versions still in common use, and even between minor versions there have at times been incompatible changes. On top of this dynamically evolving core, there is a vast landscape of independently developed add-on modules, of which some are near-universally used (e.g. NumPy or MatPlotLib). For these reasons, maintaining a Python environment for NLPL that is to some degree standardized and parallel across the various instances of the virtual laboratory is no small challenge.

Basic Principles

NLPL standardizes on Python 3.x. In mid-2018, we use Python version 3.5 as the default starting point for NLPL-specific installations of add-on modules, on both Abel and Taito. The general philosophy is to emphasize (a) parallelism, (b) modularity, and (c) replicability. The NLPL software environment should behave the same across different systems on which the virtual laboratory is instantiated. Within reason, it should be possible to ‘mix and match’ modules from the NLPL software (and data) repository. Finally, a specific version of a module should not change (in a user-visible way) once installed and announced. From these principles, it follows that NLPL avoids module ‘bundling’ to the largest possible degree; individual add-on components are provisioned as separate modules, each with its own version history.

Typically, users may not care deeply about specific module versions and just ask for the (current) default version. However, as newer versions are installed, the default version will change over time. Preserving older modules unchanged (rather than revising them incrementally) means that users have full control to select specific module versions, and combinations of them. With great power comes great responsibility.
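
Because the default version behind an unversioned module request can move over time, one way to keep an experiment replicable is to record which versions were actually in effect when it ran. The following is a minimal sketch and not part of the NLPL setup: the helper name installed_versions is hypothetical, the distribution names are only examples, and importlib.metadata requires Python 3.8 or later, which is newer than the 3.5 default mentioned above.

```python
import sys
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible to this interpreter.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Example distribution names; adjust to the modules actually loaded.
    print("python", sys.version.split()[0])
    for name, version in installed_versions(["nltk", "gensim", "tensorflow"]).items():
        print(name, version or "not installed")
```

Storing this output alongside experiment results makes it possible to request the same explicit module versions later.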

Usage Example

module use -a /proj*/nlpl/software/modulefiles
module load nlpl-nltk nlpl-gensim nlpl-tensorflow

Random Ideas

In principle, one could reduce most of the virtual environments to just the wheels that they provide, i.e. remove everything else that standard virtual environment creation puts in place, including the binaries. As the wheels are organized in a directory hierarchy relative to each Python minor version, this should make it possible to install the code for add-on components (like MatPlotLib, GenSim, or NLTK) for multiple Python versions into the same module.
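
The directory layout this idea relies on can be illustrated with a short sketch. It assumes the conventional POSIX layout lib/pythonX.Y/site-packages beneath an installation prefix; the prefix /opt/example/gensim and the helper name site_packages are hypothetical and do not describe the actual NLPL directory structure.

```python
from pathlib import PurePosixPath


def site_packages(prefix, major, minor):
    """Return the conventional POSIX site-packages directory for one Python minor version."""
    return PurePosixPath(prefix) / "lib" / "python{}.{}".format(major, minor) / "site-packages"


# Hypothetical module prefix: a single installation tree can hold one
# site-packages directory per supported Python minor version.
prefix = "/opt/example/gensim"
paths = [site_packages(prefix, 3, minor) for minor in (5, 7)]
for path in paths:
    print(path)
```

Since the directories do not overlap, code installed for different minor versions can sit side by side under one prefix; a module would then only need to expose the directory that matches the active interpreter.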