Difference between revisions of "Infrastructure/installation/guide"

From Nordic Language Processing Laboratory
(Software)
(Open Questions)
 
(20 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
= Software and Data Installation Guide =
 
= Software and Data Installation Guide =
  
= Version history
 
  
Version
+
= Purpose and General Principles =
Date
 
Note
 
0.1
 
2017-04-26
 
Initial draft by trz
 
0.2
 
2017-05-20
 
Added information on modules and accessing shared storage space, cleaned up doc (moving future topics to section at the end)
 
0.3
 
2017-05-31
 
Small fixes; 1st version to be used as guidelines
 
  
= Purpose =
+
This page provides common guidelines to install software and data at resources provided within the Nordic Language Processing Laboratory.
  
Guidelines to install software and data at resources provided within the Nordic Language Processing Laboratory.
+
The NLPL software and data installations comprise the core of the virtual laboratory; it is being developed incrementally and in a highly
 +
distributed manner, involving a dozen or so maintainers.
 +
To make this feasible, it is mandatory to respect certain organizational principles.
 +
 
 +
Duplication of data or names must be avoided; ''one fact, one place!''
 +
Accordingly, soft links shall not be used to allow multiple entry points to the same filesystem location.
 +
 
 +
References to (natural) languages use three-letter [http://www.loc.gov/standards/iso639-2/php/code_list.php ISO 639.2 codes], e.g. ENG or NOB for English and Norwegian Bokmål, respectively. 
 +
 
 +
NLPL documentation standardizes on American English, avoidance of contractions, and Oxford commas.
  
 
= Resources =
 
= Resources =
Line 34: Line 31:
 
<tt>infrastructure@nlpl.eu</tt> task force.
 
<tt>infrastructure@nlpl.eu</tt> task force.
  
Laboratory-specific software is stored in a location referred to as <tt>$NLPL_SOFTWARE</tt> which points to <tt>/proj/nlpl/software/</tt>
+
Laboratory-specific software is stored in a location referred to as <pre>$NLPLCODE</pre> which points to <pre>/proj/nlpl/software/</pre>
and <tt>/projects/nlpl/software/</tt> for Taito and Abel, respectively.
+
and <pre>/projects/nlpl/software/</pre> for Taito and Abel, respectively.
 
 
Under <tt>$NLPL_SOFTWARE</tt> on each system, applications and libraries are stored with the following directory layout:
 
  
: $NLPL_SOFTWARE/__APPNAME__/__APPVERSION__
+
Under <pre>$NLPLCODE</pre> on each system, software modules (e.g. binaries and libraries) are stored with the following directory layout:
  
where ''__APPNAME__'' contains only lower case symbols (with some exceptions such as R) and ''__APPVERSION__'' contains the version of the software (as provided by its vendor).
+
: $NLPLCODE/modules/''__name__''/''__version__''
  
For each installation, there should be a page on the wiki documenting the exact versions used (and how to obtain them) and steps used to build the software.
+
where ''__name__'' contains only lower case symbols (with some exceptions such as R) and ''__version__'' contains the version of the software (as provided by its vendor).
  
Optionally, for each software that is installed the sources and build scripts (if adjusted to the NLPL environment) can be stored under <tt>$NLPL_SOFTWARE/sources/<.tt>.  This directory should be ‘cleaned up’ after the installation and possibly compressed. If possible, it may be useful to create a log file of the complete installation session and preserve that file (in compressed form).
+
For each installation, there should be a separate page on the NLPL wiki (or minimally a <tt>README</tt> file) documenting the exact versions used (and how to obtain them) and the steps used to build the software.
 +
The infrastructure task force will soon make a proposal for how to manage installation scripts and notes in an NLPL-associated version control system.
  
 +
Optionally, for each software that is installed the sources and build scripts (if adjusted to the NLPL environment) can be stored under <tt>$NLPLCODE/build/</tt>.  This directory should be ‘cleaned up’ after the installation and possibly compressed.  If possible, it may be useful to create a log file of the complete installation session and preserve that file (in compressed form).
  
 
The user environment on Abel and Taito is configured using the so-called ''modules'' system which abstracts actual installation paths and sets certain environment variables in a standardized form.
 
The user environment on Abel and Taito is configured using the so-called ''modules'' system which abstracts actual installation paths and sets certain environment variables in a standardized form.
For each installed software component, there should be a module definition file under <tt>$NLPL_SOFTWARE/modulefiles/__APPNAME__/__APPVERSION__</tt>
+
For each installed software component, there should be a module definition file under <tt>$NLPLCODE/modules/etc/nlpl-''__name__''/''__version__''</tt>
 
See the separate section on the modules system below for further details on the definition files.
 
See the separate section on the modules system below for further details on the definition files.
  
Line 57: Line 54:
 
= Data =
 
= Data =
  
The data sets provided by NLPL shall be presented as uniformly as possible from a user point perspective. The data set environment is predominantly comprised of freely available data. Restricted data sets are not considered at the current stage (see section Future guide topics).
+
The data sets provided by NLPL shall be presented as uniformly as possible from a user perspective. The data set environment is predominantly comprised of freely available data. Restricted data sets are not considered at the current stage (see the section on Open Questions below).
  
Data sets shall be installed under $NLPL_DATA which points to /proj/nlpl/data and /projects/nlpl/data for Taito and Abel, respectively.
+
Data sets shall be installed under <tt>$NLPLDATA</tt> which points to <tt>/proj/nlpl/data/</tt> and <tt>/projects/nlpl/data/</tt> for Taito and Abel, respectively.
The top-level data directory is sub-divided by NLPL activities, with sub-directories translation, parsing, corpora, vectors, and opus.
+
The top-level data directory is sub-divided by NLPL activities, with sub-directories <tt>translation</tt>, <tt>parsing</tt>, <tt>corpora</tt>, <tt>vectors</tt>, and <tt>opus</tt>.
 
In principle, each activity is free to decide on the directory layout within their sub-directory, where it may or may not be
 
In principle, each activity is free to decide on the directory layout within their sub-directory, where it may or may not be
most practicaly to organize according to languages.
+
most practical to organize according to languages, for example.
 
Where language identifiers are part of directory or file names, they should follow the
 
Where language identifiers are part of directory or file names, they should follow the
 
[https://www.loc.gov/standards/iso639-2/php/code_list.php three-letter ISO codes].
 
[https://www.loc.gov/standards/iso639-2/php/code_list.php three-letter ISO codes].
 
Here as elsewhere in the NLPL directory space, upper-case letters shall be avoided in file and directory names
 
Here as elsewhere in the NLPL directory space, upper-case letters shall be avoided in file and directory names
(except maybe for de-facto standard names like Makefile or README).
+
(except maybe for de-facto standard names like <tt>Makefile</tt> or <tt>README</tt>).
  
 
= The Modules System =
 
= The Modules System =
Line 72: Line 69:
 
At start only limited documentation will be available through the module system. A subsequent version of this guide will recommend best practices for documentation.
 
At start only limited documentation will be available through the module system. A subsequent version of this guide will recommend best practices for documentation.
 
Modules
 
Modules
Modules (command line tool module) allow for a simple and common way of configuring a user's environment to get access to executables, libraries, development files, data sets, etc.
+
Modules (command line tool module) allow for a simple and common way of configuring the user environment to get access to executables, libraries, development files, data sets, etc.
  
 
The most essential commands are
 
The most essential commands are
  
module avail       Shows which modules are available.
+
; module avail
module list   Shows which modules have been loaded.
+
: Shows which modules are available.
module load name/version   Loads module with name name and version version.
+
; module list
module rm name/version   Unloads module with name name and version version.
+
: Shows which modules have been loaded.
module purge   Unloads all modules.
+
; module load name ''or'' module load name/version
module show name/version   Shows the definition of the module.
+
: Loads module with name name and version version.
module whatis name/version   Shows a brief information on what this module provides (for Abel).
+
; module rm name/version
module help name/version   Shows a brief information on what this module provides (for Taito).
+
: Unloads module with name name and version version.
module use -a PATH   Append PATH to the search path for module definition files.
+
; module purge
 +
: Unloads all modules.
 +
; module show name/version
 +
: Shows the definition of the module.
 +
; module whatis name/version
 +
: Shows a brief information on what this module provides (for Abel).
 +
; module help name/version
 +
: Shows a brief information on what this module provides (for Taito).
 +
; module use -a PATH
 +
: Append <tt>PATH</tt> to the search path for module definition files.
  
Modules are defined with (short) text files using Tcl and Lua on Abel and Taito, respectively. For modules specific to the laboratory, these files shall be stored under the above mentioned directory layout (see sections on Software and Data sets). Integrating modules in the system requires the directory containing these files be included in the search path of the module system (see module use -a above). For example, include the command
+
Modules are defined with (short) text files using Tcl and Lua on Abel and Taito, respectively.
 +
For modules specific to the laboratory, module names shall be prefixed with <tt>nlpl-</tt> and module definition files shall be stored in the abovementioned directory layout (see the sections on Software and Data Sets).
 +
The Moses software environment, for example, should be installed into <tt>$NLPLCODE/modules/moses/</tt>, and its module definition
 +
below the <tt>$NLPLCODE/modules/etc/nlpl-moses/</tt> directory.
 +
Users will activate it (in its ‘default’ version, in case multiple versions are available) using the
 +
command
 +
module load nlpl-moses
  
module use -a $NLPL_SOFTWARE/modulefiles
+
Integrating modules in the system requires the directory containing these files be included in the search path of the module system (see <tt>module use -a</tt> above). For example, include the command
  
in your .bashrc on Abel and Taito which requires that $NLPL_SOFTWARE is set (see section Software above).
+
  module use -a $NLPLCODE/modules/etc
 +
 
 +
in your <tt>.bashrc</tt> on Abel and Taito with the correct, system-specific value for <tt>$NLPLCODE</tt> (see above).
  
 
= Open Questions =
 
= Open Questions =
  
Where and how to provide documentation (likely on www.nlpl.eu)
+
* How to provide documentation (on http://www.nlpl.eu)
How to test/validate a software that it actually works (e.g., by someone else than the one who has installed it ... based on the provided documentation)
+
* How to test/validate a software that it actually works (e.g. by someone else than the one who has installed it ... based on the provided documentation)
How to test software to ensure functional identity across systems
+
* How to test software to ensure functional identity across systems
How to sync data sets across systems
+
* Pick VCS to maintain installation scripts and notes.
Handling of data sets with access restrictions
 
  
 
= Version History =
 
= Version History =
Line 112: Line 125:
 
|-
 
|-
 
| 0.4 || 2017-11-06 || oe || updates from the infrastructure task force
 
| 0.4 || 2017-11-06 || oe || updates from the infrastructure task force
 +
|-
 +
| 0.5 || 2019-09-17 || oe || preparing to re-create the infrastructure on new systems
 
|}
 
|}

Latest revision as of 12:05, 17 September 2019

Software and Data Installation Guide

Purpose and General Principles

This page provides common guidelines to install software and data at resources provided within the Nordic Language Processing Laboratory.

The NLPL software and data installations form the core of the virtual laboratory, which is being developed incrementally and in a highly distributed manner, involving a dozen or so maintainers. To make this feasible, it is mandatory to respect certain organizational principles.

Duplication of data or names must be avoided; one fact, one place! Accordingly, soft links shall not be used to allow multiple entry points to the same filesystem location.
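One way to check a directory tree against the no-soft-links rule is to list all symbolic links below it. The following is a minimal sketch, not an official NLPL tool: the function name list_soft_links, the scratch directory, and the version string 4.0 are assumptions made for illustration only.

```shell
# Hypothetical helper: print every symbolic link below the given directory.
# An empty result means the tree complies with the no-soft-links rule.
list_soft_links() {
  find "$1" -type l
}

# Self-contained demonstration in a scratch directory (not a real NLPL path).
demo=$(mktemp -d)
mkdir -p "$demo/modules/moses/4.0"                             # made-up version
ln -s "$demo/modules/moses/4.0" "$demo/modules/moses/latest"   # offending link
links=$(list_soft_links "$demo")
echo "$links"
```

Running the helper on an installation directory before announcing a new package would flag links such as the latest entry created in the demonstration.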

References to (natural) languages use three-letter ISO 639-2 codes, e.g. ENG or NOB for English and Norwegian Bokmål, respectively.

NLPL documentation standardizes on American English, avoidance of contractions, and Oxford commas.

Resources

Taito is a cluster operated at CSC (Finland). User documentation is available at https://research.csc.fi/taito-user-guide. Each user has access to different areas for storing files: $HOME with a quota of 50 GB, $USERAPPL with a quota of 50 GB, and a project area (see below) under /proj/nlpl/ (access via CSC storage project for NLPL managed by Stephan Oepen).

Abel is a cluster operated at UiO (Norway). User documentation is available at http://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/. Each user has access to different areas for storing files: $HOME with a quota of 500 GB, and a project area under /projects/nlpl/ (access via UNIX group hpc-nlpl managed by Stephan Oepen).

Software

The software environment of the laboratory shall be as uniform as possible from a user perspective. The software environment is comprised of libraries and applications installed system-wide or in a space specific for the laboratory. System-wide libraries and applications are not considered at the current stage; their installation can be requested through the infrastructure@nlpl.eu task force.

Laboratory-specific software is stored in a location referred to as $NLPLCODE, which points to /proj/nlpl/software/ and /projects/nlpl/software/ for Taito and Abel, respectively. Under $NLPLCODE on each system, software modules (e.g. binaries and libraries) are stored with the following directory layout:

$NLPLCODE/modules/__name__/__version__

where __name__ contains only lower case symbols (with some exceptions such as R) and __version__ contains the version of the software (as provided by its vendor).
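As an illustration of the layout, the installation directory for one package can be derived from its name and version. This is a sketch under stated assumptions: the package name moses and the version 4.0 are example values, and a scratch directory stands in for the real $NLPLCODE so that nothing is written to the shared project space.

```shell
# Scratch stand-in for the real, system-specific $NLPLCODE location.
NLPLCODE=$(mktemp -d)

name=moses       # example package name, lower case as required
version=4.0      # example version string; an assumption for illustration

# Directory layout from the guide: $NLPLCODE/modules/<name>/<version>
target="$NLPLCODE/modules/$name/$version"
mkdir -p "$target"
echo "$target"
```

The build system of the package would then be pointed at $target as its installation prefix.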

For each installation, there should be a separate page on the NLPL wiki (or minimally a README file) documenting the exact versions used (and how to obtain them) and the steps used to build the software. The infrastructure task force will soon make a proposal for how to manage installation scripts and notes in an NLPL-associated version control system.

Optionally, for each installed software package, the sources and build scripts (if adjusted to the NLPL environment) can be stored under $NLPLCODE/build/. This directory should be ‘cleaned up’ after the installation and possibly compressed. If possible, it may be useful to create a log file of the complete installation session and preserve that file (in compressed form).
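A log of the installation session can be captured by redirecting the output of the build commands to a file and compressing it afterwards. The sketch below assumes a scratch directory in place of $NLPLCODE/build/ and a hypothetical log file name; the echo commands are placeholders for the real build steps.

```shell
work=$(mktemp -d)                  # scratch stand-in for $NLPLCODE/build/
log="$work/install-moses.log"      # hypothetical log file name

# Capture standard output and standard error of the whole session.
{
  echo "build started"
  # the real configure, make, and install commands would go here
  echo "build finished"
} > "$log" 2>&1

gzip "$log"                        # replaces the log with install-moses.log.gz
ls "$work"
```

The compressed log can later be inspected without unpacking it, for example with gzip -dc.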

The user environment on Abel and Taito is configured using the so-called modules system, which abstracts actual installation paths and sets certain environment variables in a standardized form. For each installed software component, there should be a module definition file under $NLPLCODE/modules/etc/nlpl-__name__/__version__. See the separate section on the modules system below for further details on the definition files.

Initially, only limited documentation will be available through the modules system. A later version of this guide will recommend best practices for documentation.

Data

The data sets provided by NLPL shall be presented as uniformly as possible from a user perspective. The data set environment is predominantly comprised of freely available data. Restricted data sets are not considered at the current stage (see the section on Open Questions below).

Data sets shall be installed under $NLPLDATA, which points to /proj/nlpl/data/ and /projects/nlpl/data/ for Taito and Abel, respectively. The top-level data directory is sub-divided by NLPL activities, with sub-directories translation, parsing, corpora, vectors, and opus. In principle, each activity is free to decide on the directory layout within its sub-directory, where it may or may not be most practical to organize according to languages, for example. Where language identifiers are part of directory or file names, they should follow the three-letter ISO codes. Here as elsewhere in the NLPL directory space, upper-case letters shall be avoided in file and directory names (except maybe for de-facto standard names like Makefile or README).
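The naming rule for files and directories can be checked mechanically. The following sketch is one possible check, not part of the NLPL infrastructure: the function name is_valid_name is hypothetical, and the list of tolerated names covers only the two examples given above.

```shell
# Hypothetical check: succeed if a file or directory name contains no
# upper-case letters, or is one of the tolerated de-facto standard names.
is_valid_name() {
  case "$1" in
    Makefile|README) return 0 ;;      # exceptions named in the guide
    *[[:upper:]]*)   return 1 ;;      # any upper-case letter is rejected
    *)               return 0 ;;
  esac
}

for candidate in corpora Corpora README; do
  if is_valid_name "$candidate"; then
    echo "$candidate: ok"
  else
    echo "$candidate: contains upper-case letters"
  fi
done
```

The check uses the POSIX character class [[:upper:]] rather than a range such as A-Z, because ranges can behave differently depending on the locale.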

The Modules System

Initially, only limited documentation will be available through the modules system. A subsequent version of this guide will recommend best practices for documentation. Modules (accessed through the command-line tool module) provide a simple and common way of configuring the user environment to get access to executables, libraries, development files, data sets, etc.

The most essential commands are

module avail
Shows which modules are available.
module list
Shows which modules have been loaded.
module load name or module load name/version
Loads the module called name, in its default version if no version is given.
module rm name/version
Unloads module with name name and version version.
module purge
Unloads all modules.
module show name/version
Shows the definition of the module.
module whatis name/version
Shows brief information on what this module provides (for Abel).
module help name/version
Shows brief information on what this module provides (for Taito).
module use -a PATH
Appends PATH to the search path for module definition files.

Modules are defined with (short) text files using Tcl and Lua on Abel and Taito, respectively. For modules specific to the laboratory, module names shall be prefixed with nlpl- and module definition files shall be stored in the above-mentioned directory layout (see the sections on Software and Data). The Moses software environment, for example, should be installed into $NLPLCODE/modules/moses/, and its module definition below the $NLPLCODE/modules/etc/nlpl-moses/ directory. Users will activate it (in its ‘default’ version, in case multiple versions are available) using the command

module load nlpl-moses

Integrating modules in the system requires the directory containing these files be included in the search path of the module system (see module use -a above). For example, include the command

 module use -a $NLPLCODE/modules/etc

in your .bashrc on Abel and Taito with the correct, system-specific value for $NLPLCODE (see above).
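A single .bashrc fragment can serve both systems by testing which of the two project locations exists. This is a sketch under stated assumptions: the function name nlpl_code_dir is hypothetical, and the fragment assumes that the module command is already defined by the system when .bashrc is read.

```shell
# Hypothetical helper: print the first candidate root that contains a
# software/ directory, or fail if none of them does.
nlpl_code_dir() {
  for root in "$@"; do
    if [ -d "$root/software" ]; then
      printf '%s\n' "$root/software"
      return 0
    fi
  done
  return 1
}

# /proj/nlpl is the Taito location, /projects/nlpl the Abel location.
if NLPLCODE=$(nlpl_code_dir /proj/nlpl /projects/nlpl); then
  export NLPLCODE
  # Extend the module search path only where the module command exists.
  if command -v module >/dev/null 2>&1; then
    module use -a "$NLPLCODE/modules/etc"
  fi
fi
```

On a machine that has neither location, the fragment does nothing beyond leaving $NLPLCODE empty, so it is harmless in a shared shell configuration.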

Open Questions

  • How to provide documentation (on http://www.nlpl.eu)
  • How to test and validate that installed software actually works (e.g. by someone other than the person who installed it, based on the provided documentation)
  • How to test software to ensure functional identity across systems
  • Which version control system (VCS) to use for maintaining installation scripts and notes

Version History

{|
! Version !! Date !! Author !! Note
|-
| 0.1 || 2017-04-26 || trz || initial draft
|-
| 0.2 || 2017-05-20 || trz || added information on modules and accessing shared storage space, cleaned up doc
|-
| 0.3 || 2017-05-31 || trz || small fixes; first version to be used as guidelines
|-
| 0.4 || 2017-11-06 || oe || updates from the infrastructure task force
|-
| 0.5 || 2019-09-17 || oe || preparing to re-create the infrastructure on new systems
|}