Infrastructure/replication
Background
The NLPL virtual laboratory (in mid-2018) is distributed over two superclusters, viz. the Abel and Taito systems in Norway and Finland, respectively. Furthermore, NLPL enjoys a generous storage allocation on the Norwegian Infrastructure for Research Data (NIRD), which is not directly accessible on either of the two computing systems. This page documents the (still emerging) project strategy for data and software replication across the three sites, i.e. the ‘on-line’ file systems on Abel and Taito and the ‘off-line’ storage on NIRD.
Back-Up to NIRD
The NLPL project directory on Abel (/projects/nlpl/) is backed up daily at UiO, but the corresponding directory on Taito (/proj/nlpl/) is not; this is one of the reasons why the ‘on-line’ storage allocation on Taito can be considerably more generous than the one on Abel.
The NLPL project directories contain software and data installations that have been semi-manually created (in some cases following non-trivial analytical work and tinkering) and would be expensive to re-create from scratch. Thus, the complete contents of both copies of the virtual laboratory should be backed up to the ‘off-line’ NIRD storage area at least once per day, so as to be able to recover from data loss (which could include accidental deletion) quickly and without too much manual effort.
In mid-2018, the NLPL infrastructure task force landed on a daily ‘back-up’ scheme using rsync, implemented by the script operation/mirror/nird in SVN. This should limit the window of exposure to inadvertent data loss to at most 24 hours. However, this scheme still needs to be activated reliably (i.e. via cron) on Taito and, more generally, needs to be validated and made more robust (for example, by protecting against concurrent execution through file locking).
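As an illustration only, a minimal sketch of such a cron-driven rsync mirror with file locking is given below; the NIRD host name, the target path, and the lock file are assumptions made for the example and are not taken from the actual operation/mirror/nird script.

    #!/bin/bash
    # Illustrative sketch only: the NIRD host, target path, and lock file are
    # assumptions and need not match the actual operation/mirror/nird script.
    set -euo pipefail

    SOURCE=/projects/nlpl/                                 # 'on-line' copy on Abel
    TARGET=login.nird.sigma2.no:/nird/projects/nlpl/abel/  # hypothetical NIRD target
    LOCK=/var/tmp/nlpl-nird-mirror.lock

    # flock(1) guards against concurrent runs (one of the robustness measures
    # mentioned above); -n makes a second invocation exit instead of queueing.
    # The trailing slash on $SOURCE copies the directory contents rather than
    # the directory itself.
    exec flock -n "$LOCK" \
      rsync --archive --partial "$SOURCE" "$TARGET"

With flock -n, a run that overlaps a still-running previous night's transfer simply skips rather than racing it, which is the behaviour the file-locking proposal above is after.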
Replication between Saga and Puhti
The data/corpora/, data/parsing/, data/translation/, and data/vectors/ sub-directories of the ‘on-line’ project directories on Saga and Puhti are automatically synchronized. The primary copy of most of these directories resides on Saga (with the exception of the data/translation/ module), and all changes must be applied to the primary; changes made to these sub-directories on the secondary copy (i.e. Puhti for most modules) will be overwritten.
Replication is accomplished through a set of scripts that are maintained in the Subversion repository of the project, notably the top-level ‘driver’ operation/mirror/cron.sh. This script runs a sequence of module-specific replication scripts, e.g. operation/mirror/data/corpora/puhti, operation/mirror/data/parsing/puhti, operation/mirror/data/translation/saga, and operation/mirror/data/vectors/puhti. These scripts assume password-less rsync communication across the sites, which is accomplished via ssh keys (for the user oe on all systems).
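For concreteness, one of these module-specific scripts might look roughly as sketched below; the host name and both directory paths are illustrative assumptions (only the user name oe is documented above), so this should not be read as the actual operation/mirror/data/corpora/puhti script.

    #!/bin/bash
    # Rough sketch of a module-specific mirror script; the Saga host name and
    # both directory paths are assumptions, not the actual project locations.
    set -euo pipefail

    # Pull the data/corpora/ module from its primary copy on Saga into the
    # secondary copy on Puhti over password-less, ssh-keyed rsync.
    # --delete makes the secondary an exact mirror, which is why changes made
    # on the secondary are overwritten at the next run.
    rsync --archive --delete --hard-links \
      oe@saga.sigma2.no:/cluster/projects/nlpl/data/corpora/ \
      /projappl/nlpl/data/corpora/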
The top-level script is invoked by cron every night on an LTG-owned add-on node to Abel (ls.hpc.uio.no), so that the cron jobs need not be re-activated every time one of the Abel login nodes is reinstalled.
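For completeness, the nightly invocation could be registered with a crontab entry along the following lines on that node; the schedule, the location of the SVN checkout, and the log file are assumptions made for illustration.

    # Hypothetical crontab entry on ls.hpc.uio.no; schedule, checkout path,
    # and log file are assumptions.
    30 2 * * * /home/oe/svn/operation/mirror/cron.sh >> /home/oe/mirror.log 2>&1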