Difference between revisions of "Infrastructure/installation/guide"

From Nordic Language Processing Laboratory
(Created page with "= Software and Data Installation Guide = Version history Version Date Note 0.1 2017-04-26 Initial draft by trz 0.2 2017-05-20 Added information on modules and accessing shar...")
 
(Software)
Line 47: Line 47:
  
 
At start only limited documentation will be available through the module system. A subsequent version of this guide will recommend best practices for documentation.
 
Data sets
 
The data sets provided by the laboratory shall be as identical as possible from a user point perspective. The data set environment is made of freely available and restricted data. Restricted data sets are not considered at the current stage (see section Future guide topics).
 
  
Data sets shall be installed under $NLPL_DATASETS which points to /proj/nlpl/datasets and /projects/nlpl/datasets for Taito and Abel respectively.
 
  
Under $NLPL_DATASETS data sets are stored with the following directory layout
+
= Data =
  
$NLPL_SOFTWARE/__DATASETNAME__/__DATASETVERSION__
+
The data sets provided by NLPL shall be presented as uniformly as possible from a user's perspective. The data set environment consists predominantly of freely available data. Restricted data sets are not considered at the current stage (see section Future guide topics).
  
where __DATASETNAME__ contains only lower case symbols and __DATASETVERSION__ contains the version of the data set (as provided by its "vendor").
+
Data sets shall be installed under $NLPL_DATA which points to /proj/nlpl/data and /projects/nlpl/data for Taito and Abel, respectively.
 +
The top-level data directory is sub-divided by NLPL activities, with sub-directories translation, parsing, corpora, vectors, and opus.
 +
In principle, each activity is free to decide on the directory layout within their sub-directory, where it may or may not be
 +
most practical to organize according to languages.
 +
Where language identifiers are part of directory or file names, they should follow the
 +
[https://www.loc.gov/standards/iso639-2/php/code_list.php three-letter ISO codes].
 +
Here as elsewhere in the NLPL directory space, upper-case letters shall be avoided in file and directory names
 +
(except maybe for de-facto standard names like Makefile or README).
 +
 
 +
= The Modules System =
  
 
At start only limited documentation will be available through the module system. A subsequent version of this guide will recommend best practices for documentation.
 
Line 79: Line 85:
  
 
in your .bashrc on Abel and Taito which requires that $NLPL_SOFTWARE is set (see section Software above).
 
Demonstration
 
For demonstration, I have set $NLPL_SOFTWARE to /cluster/home/thomarob/nlpl/software and /homeappl/home/troblitz/nlpl/software on Abel and Taito, respectively. The example software package is dynet_nlpl_demo/1.0. Run
 
  
module use -a $NLPL_SOFTWARE/modulefiles
+
= Open Questions =
module avail dynet
 
module load dynet_nlpl_demo/1.0
 
  
and try to use the software. The software is essentially copied from the existing installation, and only slightly updated to accommodate the different name and version (just to illustrate the possibilities). The same has been done with the module definition files. You may spot the differences with the command line tool diff.
 
Topics for future versions of the guide
 
 
Where and how to provide documentation (likely on www.nlpl.eu)
 
 
How to test/validate a software that it actually works (e.g., by someone else than the one who has installed it ... based on the provided documentation)
 

Revision as of 19:56, 6 November 2017

Software and Data Installation Guide

Version history

Version  Date        Note
0.1      2017-04-26  Initial draft by trz
0.2      2017-05-20  Added information on modules and accessing shared storage space, cleaned up doc (moving future topics to section at the end)
0.3      2017-05-31  Small fixes; 1st version to be used as guidelines

Purpose

Guidelines for installing software and data on the resources provided within the Nordic Language Processing Laboratory.

Resources

Taito is a cluster operated at CSC (Finland). User documentation is available at https://research.csc.fi/taito-user-guide. Each user has access to different areas for storing files: `$HOME` with a quota of 50 GB, `$USERAPPL` with a quota of 50 GB, and a project area under `/proj/nlpl/` (access via storage project NLPL managed by Stephan Oepen, oe@ifi.uio.no).

Abel is a cluster operated at UiO (Norway). User documentation is available at http://www.uio.no/english/services/it/research/hpc/abel/help/user-guide/. Each user has access to different areas for storing files: `$HOME` with a quota of 500 GB, and a project area under `/projects/nlpl/` (access via UNIX group hpc-nlpl managed by Stephan Oepen, oe@ifi.uio.no).

Software

The software environment of the laboratory shall be as identical as possible across systems from a user's perspective. It consists of libraries and applications installed either system-wide or in a space specific to the laboratory. System-wide libraries and applications are not considered at the current stage (see section Future topics).

Laboratory-specific software is stored under $NLPL_SOFTWARE, which points to /proj/nlpl/software on Taito and /projects/nlpl/software on Abel.

Under $NLPL_SOFTWARE applications and libraries are stored with the following directory layout

$NLPL_SOFTWARE/__APPNAME__/__APPVERSION__

where __APPNAME__ contains only lower case symbols (with some exceptions such as R) and __APPVERSION__ contains the version of the software (as provided by its "vendor").
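As a minimal sketch of the layout rule above, the following shell function prints the installation directory for a package and rejects names that contain upper-case letters. The function name nlpl_install_dir and the package name in the example are hypothetical, and the exceptions mentioned above (such as R) are not modelled.

```shell
# Hypothetical helper: print the installation directory for a package.
# Assumes $NLPL_SOFTWARE is already set as described above.
nlpl_install_dir() {
    name=$1
    version=$2
    case $name in
        *[[:upper:]]*)
            # Simplification: exceptions such as R are not handled here.
            echo "package name must be lower case: $name" >&2
            return 1
            ;;
    esac
    printf '%s/%s/%s\n' "$NLPL_SOFTWARE" "$name" "$version"
}

# Example use (hypothetical package name and version):
# mkdir -p "$(nlpl_install_dir mypackage 1.0)"
```

The version string is passed through unchanged, since it is meant to match whatever the "vendor" provides.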

For each installation, there should be a page on the wiki documenting the exact versions used (and how to obtain them) and steps used to build the software.

Optionally, the sources and build scripts (if adjusted to the lab environment) of each installed software package can be stored under $NLPL_SOFTWARE/sources. This directory should be ‘cleaned up’ after the installation and possibly compressed. Whether log files should be kept is still open.

A user's environment is configured using the module system which abstracts actual installation paths and sets certain environment variables in a standardised form. The necessary module files are stored under

$NLPL_SOFTWARE/modulefiles/__APPNAME__/__APPVERSION__

See section The Modules System below for further details on defining module files.

Initially, only limited documentation will be available through the module system. A subsequent version of this guide will recommend best practices for documentation.


Data

The data sets provided by NLPL shall be presented as uniformly as possible from a user's perspective. The data set environment consists predominantly of freely available data. Restricted data sets are not considered at the current stage (see section Open Questions).

Data sets shall be installed under $NLPL_DATA, which points to /proj/nlpl/data on Taito and /projects/nlpl/data on Abel. The top-level data directory is sub-divided by NLPL activities, with sub-directories translation, parsing, corpora, vectors, and opus. In principle, each activity is free to decide on the directory layout within its sub-directory, where it may or may not be most practical to organize according to languages. Where language identifiers are part of directory or file names, they should follow the three-letter ISO codes (https://www.loc.gov/standards/iso639-2/php/code_list.php). Here as elsewhere in the NLPL directory space, upper-case letters shall be avoided in file and directory names (except perhaps for de-facto standard names like Makefile or README).
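The naming rule above can be checked mechanically. The following is a sketch only: the function name nlpl_find_uppercase is hypothetical, and -mindepth is an extension available in GNU and BSD find rather than part of POSIX. It lists every file or directory below a given directory whose name contains an upper-case letter, skipping the two de-facto standard names mentioned above.

```shell
# Hypothetical helper: list entries below directory $1 whose names contain
# upper-case letters, ignoring Makefile and README.
# -mindepth 1 keeps the starting directory itself out of the result.
nlpl_find_uppercase() {
    find "$1" -mindepth 1 -name '*[[:upper:]]*' \
        ! -name Makefile ! -name README
}

# Example use, with one of the activity sub-directories named above:
# nlpl_find_uppercase "$NLPL_DATA/parsing"
```

An empty result means that the directory tree follows the convention.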

The Modules System

Modules (command line tool module) allow for a simple and common way of configuring a user's environment to get access to executables, libraries, development files, data sets, etc.

The most essential commands are

module avail                 Shows which modules are available.
module list                  Shows which modules have been loaded.
module load name/version     Loads the module with name name and version version.
module rm name/version       Unloads the module with name name and version version.
module purge                 Unloads all modules.
module show name/version     Shows the definition of the module.
module whatis name/version   Shows brief information on what this module provides (for Abel).
module help name/version     Shows brief information on what this module provides (for Taito).
module use -a PATH           Appends PATH to the search path for module definition files.

Modules are defined with (short) text files using Tcl and Lua on Abel and Taito, respectively. For modules specific to the laboratory, these files shall be stored under the above-mentioned directory layout (see sections Software and Data). Integrating modules in the system requires that the directory containing these files be included in the search path of the module system (see module use -a above). For example, include the command

module use -a $NLPL_SOFTWARE/modulefiles

in your .bashrc on Abel and Taito; this requires that $NLPL_SOFTWARE is set (see section Software above).
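One possible .bashrc fragment that works unchanged on both systems is sketched below. It assumes that exactly one of the two project paths given in section Software exists on each cluster; the function name nlpl_software_root is hypothetical, and the check for the module command is only a precaution for shells where the module system has not been initialised.

```shell
# Hypothetical helper: print the first of the given directories that exists.
nlpl_software_root() {
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# The two paths are the ones listed for Taito and Abel in section Software.
if NLPL_SOFTWARE=$(nlpl_software_root /proj/nlpl/software /projects/nlpl/software); then
    export NLPL_SOFTWARE
    # Register the NLPL module files only where the module command exists.
    if command -v module >/dev/null 2>&1; then
        module use -a "$NLPL_SOFTWARE/modulefiles"
    fi
fi
```

On a machine where neither path exists, the fragment leaves the module search path untouched.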

Open Questions

Where and how to provide documentation (likely on www.nlpl.eu)
How to test/validate that a software package actually works (e.g., by someone other than the person who installed it ... based on the provided documentation)
How to test software to ensure functional identity across systems
How to sync data sets across systems
Handling of data sets with access restrictions