Infrastructure/trickle

From Nordic Language Processing Laboratory
Revision as of 19:26, 2 March 2020 by Oe (talk | contribs)

i am no big fan of the SLURM arrayrun(1) facility. what i usually do is create a large file with one command line per job i want to run, each with whatever parameters that job requires. a silly example of creating such a master job file could be something like

for i in 0 1 2 3 4 5 6 7 8 9; do
  for j in 0 1 2 3 4 5 6 7 8 9; do
    echo "sbatch /cluster/shared/nlpl/operation/tools/echo.slurm ${i} ${j}";
  done;
done > ~/echo.jobs

assuming such a file, i have a script that ‘trickles’ through the sequence of jobs, keeping up to some maximum limit of queue entries at any point in time, and filling up the queue to the limit again as jobs terminate. my idiom of setting into motion this process then goes as follows:

/cluster/shared/nlpl/operation/tools/trickle --start --limit 20 ~/echo.jobs
while true; do /cluster/shared/nlpl/operation/tools/trickle --limit 20 ~/echo.jobs ; sleep 30; done
[19-02-16 15:00:37] trickle[20]: 20 jobs; 3 running; 0 new.
[19-02-16 15:01:07] trickle[20]: 17 jobs; 0 running; 3 new.
[19-02-16 15:01:38] trickle[23]: 17 jobs; 0 running; 3 new.
[19-02-16 15:02:10] trickle[26]: 20 jobs; 3 running; 0 new.
...

the first integer (in square brackets) is the pointer into the job sequence, 20 initially, then at each step advancing by the number of new jobs submitted in that call.
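
to make the bookkeeping concrete, here is a minimal sketch of what one such step could look like. this is not the actual trickle script: the function names, the `TRICKLE_COUNT` override, the `.pointer` state file next to the job file, and the simplified status line are all assumptions made for illustration. only `squeue -h -u` is a real SLURM call.

```shell
# hypothetical sketch of one 'trickle' step; NOT the actual NLPL script.
# assumptions: the pointer is kept in "<jobs file>.pointer", and the status
# line printed at the end is a simplified stand-in for the real log format.

# count current queue entries; TRICKLE_COUNT (hypothetical) can override the
# default, which asks squeue for this user's jobs without the header line.
trickle_count() {
  if [ -n "${TRICKLE_COUNT:-}" ]; then
    eval "$TRICKLE_COUNT"
  else
    squeue -h -u "$USER" | wc -l
  fi
}

# trickle_step LIMIT JOBS: submit pending lines from JOBS until the queue
# holds LIMIT entries or the job file is exhausted.
trickle_step() {
  local limit="$1" jobs="$2" state="$2.pointer"
  local pointer=0 queued=0 total=0 new=0 line=""
  if [ -f "$state" ]; then
    pointer=$(cat "$state")
  fi
  # arithmetic expansion strips any padding that wc may add
  queued=$(( $(trickle_count) ))
  total=$(( $(wc -l < "$jobs") ))
  while [ "$queued" -lt "$limit" ] && [ "$pointer" -lt "$total" ]; do
    pointer=$((pointer + 1))
    line=$(sed -n "${pointer}p" "$jobs")
    # each line is a complete command line, e.g. 'sbatch echo.slurm 3 7'
    sh -c "$line" > /dev/null
    queued=$((queued + 1))
    new=$((new + 1))
  done
  # remember how far into the job file we have got
  echo "$pointer" > "$state"
  echo "pointer=$pointer queued=$queued new=$new"
}
```

calling `trickle_step 20 ~/echo.jobs` every 30 seconds or so would then top the queue back up to 20 entries as jobs terminate, much like the loop above. keeping the pointer in a file means each call is stateless and the loop can be interrupted and restarted.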

just in case you might find this useful ... for all i know, this script provides similar functionality to arrayrun(1), but i find it more convenient to be able to pass each job its full command line directly, without having to go through the job indices under arrayrun(1) control.
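
for comparison, the index-based indirection mentioned above could look roughly like the sketch below, where each job only receives an integer and has to look up its own parameters. the helper name `command_for_index` is hypothetical, and which environment variable carries the index depends on the array mechanism in use (native `sbatch --array` sets `SLURM_ARRAY_TASK_ID`; what arrayrun(1) provides is not covered here).

```shell
# hypothetical helper: print the command line stored at a 1-based index
# in a job file with one command line per line.
command_for_index() {
  sed -n "${1}p" "$2"
}
```

a job running under index control would then have to do something like `command_for_index "$SLURM_ARRAY_TASK_ID" ~/echo.jobs` before it knows what to run, which is the extra step the trickle approach avoids.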