Slurm Workload Manager (Fimm)
Slurm is an open-source workload manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on a set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
sinfo - reports the state of partitions and nodes managed by Slurm.
squeue - reports the state of jobs or job steps.
scontrol show partition - shows the configuration and state of the partitions.
sbatch - submits a job script for later execution.
scancel - cancels a pending or running job or job step.
srun - submits a job for execution or initiates job steps in real time.
For more information about the Slurm commands, please check their man pages.
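For day-to-day use, the commands above are often wrapped in small shell functions. The following is a minimal sketch and not part of the Fimm setup: the function names myjobs and partinfo are hypothetical, and only the -u option of squeue and the -p option of sinfo are used.

```shell
# Hypothetical helpers, e.g. for ~/.bashrc; the names are illustrative only.

# List the jobs of one user (defaults to the current user).
myjobs() {
    squeue -u "${1:-$USER}"
}

# Show node states for one partition (defaults to the "hpc" partition).
partinfo() {
    sinfo -p "${1:-hpc}"
}
```

After the functions are defined, `myjobs` lists your own jobs and `partinfo idle` shows the state of the idle partition.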
List of queue/partitions, incl. short description and limitations
The default partition for all users is "hpc". If you need to run a job in a different partition, specify that partition explicitly; otherwise the right partition will be selected automatically.
|List of queues:|
|hpc||default partition for all users||for all types of jobs|
|idle||partition that contains idle resources||for short jobs, jobs can be terminated by owner of the resources|
|t1||partition for alice experiment||limited only for alice users|
|kjem||partition for chemistry department||limited only for alice users|
NOTE: There is no need to specify a queue in the job script, the correct queue will automatically be selected.
Sequential job submission
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1000MB
#SBATCH --time=30:00
#SBATCH --output=my.stdout
#SBATCH --mail-user=email@example.com
#SBATCH --mail-type=ALL
#SBATCH --job-name="slurm_test"

hostname
env | grep SLURM
[saerda@login3 SLURM_TEST]$ sbatch test_slurm.sh
Submitted batch job 13766015
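When a submission is driven from another script, the job ID usually has to be picked out of the sbatch reply. The following is a minimal sketch, assuming the reply has the "Submitted batch job" form shown above; the function name extract_jobid is hypothetical. On a real cluster, sbatch --parsable prints the ID without the surrounding text, which makes the parsing unnecessary.

```shell
# Hypothetical helper: print the last word of an sbatch reply line.
extract_jobid() {
    printf '%s\n' "${1##* }"
}

# Sample reply copied from the submission example above.
reply="Submitted batch job 13766015"
jobid=$(extract_jobid "$reply")
echo "job id: $jobid"

# On the cluster this would typically be used as:
#   jobid=$(extract_jobid "$(sbatch test_slurm.sh)")
#   scontrol show job "$jobid"
```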
Interactive Job submission
The following command uses all default values and gives you a shell on one free node.
$ srun --pty $SHELL
The following command gives you a shell in the idle partition with 1 node, 1 core, and 5 minutes of wall time.
$ srun -p idle -I -N 1 -c 1 --pty -t 0-00:05 $SHELL
The following command gives you 1 node with 4 cores and 1 GB of memory per core.
$ srun --nodes=1 --ntasks-per-node=4 --mem-per-cpu=1024 --pty $SHELL
If you want X11, then use the -Y option.
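The examples on this page write time limits in several of the forms Slurm accepts: 30:00 (minutes:seconds), 01:00:00 (hours:minutes:seconds) and 0-00:05 (days-hours:minutes). The sketch below converts such a string to seconds, which can help when checking what limit a job will actually get. The function name slurm_time_to_seconds is hypothetical, and the parsing only covers the forms listed in the sbatch man page (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, days-hours:minutes:seconds).

```shell
# Hypothetical helper: convert a Slurm time limit string to seconds.
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; h = 0; m = 0; s = 0
        spec = $0
        dash = index(spec, "-")
        if (dash > 0) {
            # days-hours[:minutes[:seconds]]
            days = substr(spec, 1, dash - 1)
            n = split(substr(spec, dash + 1), f, ":")
            h = f[1]
            if (n >= 2) m = f[2]
            if (n >= 3) s = f[3]
        } else {
            # minutes | minutes:seconds | hours:minutes:seconds
            n = split(spec, f, ":")
            if (n == 3)      { h = f[1]; m = f[2]; s = f[3] }
            else if (n == 2) { m = f[1]; s = f[2] }
            else             { m = f[1] }
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

slurm_time_to_seconds 0-00:05    # prints 300  (5 minutes)
slurm_time_to_seconds 30:00      # prints 1800 (30 minutes)
slurm_time_to_seconds 01:00:00   # prints 3600 (1 hour)
```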
$ scontrol show jobid <jobid>
$ scontrol show jobid -dd <jobid>
Interpreting scontrol show job information
[saerda@login3 SLURM_TEST]$ sbatch test_slurm.sh
Submitted batch job 13763010
[saerda@login3 SLURM_TEST]$ scontrol show job 13763010
JobId=13763010 JobName=slurm_test
   UserId=saerda(52569) GroupId=(hpcadmin 1999)
   Priority=4294113377 Nice=0 Account=t1 QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:06 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2015-11-08T22:01:51 EligibleTime=2015-11-08T22:01:51
   StartTime=2015-11-08T22:01:51 EndTime=2015-11-08T22:31:51
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=hpc AllocNode:Sid=login3:10120
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=compute-3-7
   BatchHost=compute-3-7
   NumNodes=1 NumCPUs=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=1024,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=1024M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/fimm/home/vd/saerda/SLURM_TEST/test_slurm.sh
   WorkDir=/fimm/home/vd/saerda/SLURM_TEST
   StdErr=/fimm/home/vd/saerda/SLURM_TEST/my.stdout
   StdIn=/dev/null
   StdOut=/fimm/home/vd/saerda/SLURM_TEST/my.stdout
   Power= SICP=0
Here we will explain each field.
JobId - Each job that is submitted to Slurm is assigned a unique numerical ID. This ID appears in the output of several Slurm commands, and can be used to refer to the job for modification or cancellation.
JobName - When submitting your job, you can define a descriptive name using --job-name (or -J). Otherwise, the job name will be the name of the script that was submitted.
UserId, GroupId - Each job runs using the user credentials of the user process that submitted it. These are the same credentials indicated by the id command.
Priority - The current scheduling priority for the job, calculated based on the current scheduling policy for the cluster. Jobs with a higher priority are more likely to start sooner.
Nice - The nice value is a subtractive adjustment to a job's priority. You can voluntarily reduce your job priority using the --nice argument.
Account - Access to compute resources is moderated by the use of core-hour allocations to compute accounts. This account is specified using the --account (or -A) argument.
QOS - Slurm uses a "quality of service" system to control job properties. The QOS is selected during job submission using the --qos argument.
JobState - Slurm jobs pass through a number of different states. Common states are PENDING, RUNNING, and COMPLETED.
Reason - For PENDING jobs, an explanation for why the job is not yet RUNNING is listed here.
Dependency - If the job depends on another job (as defined by --dependency or -d), that dependency will be indicated here.
Requeue - If a job fails due to certain scheduler conditions, Slurm may re-queue the job to run at a later time. Re-queueing can be disabled using --no-requeue.
Restarts - If the job has been restarted (see Requeue above), the number of restarts will be reflected here.
BatchFlag - Whether or not the job was submitted using sbatch.
Reboot - Whether the nodes are rebooted before the job starts.
ExitCode - The exit code and terminating signal (if applicable) for exited jobs.
RunTime - How long the job has been running.
TimeLimit - The time limit for the job, specified by --time or -t.
SubmitTime - When the job was submitted.
EligibleTime - When the job became eligible to run. Examples of reasons a job might be ineligible to run include being bound to a reservation that has not started; exceeding the maximum number of jobs allowed to be run by a user, group, or account; having an unmet job dependency; or specifying a later start time using --begin.
StartTime - When the job last started.
EndTime - For a RUNNING job, this is the predicted time that the job will end, based on the time limit specified by --time or -t. For a COMPLETED or CANCELLED job, this is the time that the job ended.
PreemptTime - If the scheduler preempts a running job to allow the start of another job, the time that the job was last preempted will be recorded here.
SuspendTime - If a job is suspended (e.g., using scontrol suspend), the time that it was last suspended will be recorded here.
Partition - The partition of compute resources targeted by the job. The partition can be set manually using --partition or -p; otherwise it is selected automatically.
AllocNode:Sid - Which node the job was submitted from, along with the system id. (It's safe to ignore the system id for now.)
ReqNodeList - The list of nodes explicitly requested by the job, as specified by the --nodelist or -w argument.
ExcNodeList - The list of nodes explicitly excluded by the job, as specified by the --exclude or -x argument.
NodeList - The list of nodes that the job is currently running on.
BatchHost - The "head node" for the job. This is where the job script itself actually runs.
NumNodes - The number of nodes requested by the job. May be specified using --nodes or -N.
NumCPUs - The number of CPUs requested by the job, calculated from the nodes requested, the number of tasks requested, and the allocation of CPUs to tasks.
CPUs/Task - The number of CPU cores assigned to each task. May be specified using, for example, --ntasks (-n) and --cpus-per-task (-c).
ReqB:S:C:T - An undocumented breakout of the node hardware.
Socks/Node - Reflects the specific allocation of CPU sockets per node (where a single socketed CPU may contain many cores). This can be specified using --sockets-per-node, implied by --cores-per-socket, or affected by other node specification arguments.
NtasksPerN:B:S:C - An undocumented breakout of the tasks per node.
MinCPUsNode - The minimum number of CPU cores per node requested by the job. Useful for jobs that can run on a flexible number of processors, as specified by --mincpus.
MinMemoryCPU - The minimum amount of memory required per CPU. Set automatically by the scheduler, but explicitly configurable with --mem-per-cpu.
MinTmpDiskNode - The amount of temporary disk space required per node, as requested by --tmp. Note that Janus nodes do not have local disks attached, and it is expected that most file IO will take place in the shared parallel filesystem.
Features - Node features required by the job, as specified by --constraint or -C. Node features are not currently used in the Research Computing environment.
Gres - Generic consumable features required by the job, as specified by --gres. Generic resources are not currently used in the Research Computing environment.
Reservation - If the job is running as part of a resource reservation (using --reservation), that reservation will be identified here.
Shared - Whether or not the job can share resources with other running jobs, as specified with --share or -s.
Contiguous - Whether or not the nodes allocated for the job must be contiguous, as specified by --contiguous.
Licenses - List of licenses requested by the job, as specified by --licenses or -L. Note that Slurm is not used for license management in the Research Computing environment.
Network - System-specific network specification information. Not applicable to the Research Computing environment.
Command - The command that will be executed on the head node to start the job. (See BatchHost, above.)
WorkDir - The initial working directory for the job, as specified by --workdir or -D. By default, this will be the working directory when the job is submitted.
StdErr - The output file for the stderr stream (fd 2) of the main process of the job, running on the head node. Set by --output or -o, or explicitly by --error or -e.
StdIn - The input file for the stdin stream (fd 0) of the main process of the job, running on the head node. Set to /dev/null by default, but can be configured with --input or -i.
StdOut - The output file for the stdout stream (fd 1) of the main process of the job, running on the head node. Set by --output or -o.
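The fields described above can also be read from a script. The following is a minimal sketch, assuming the KEY=VALUE layout of the scontrol output shown earlier and values without embedded spaces; the function name job_field is hypothetical.

```shell
# Hypothetical helper: read "scontrol show job" output on stdin and print
# the value of one field. Values that contain spaces are not handled.
job_field() {
    tr -s ' ' '\n' | awk -F= -v key="$1" '
        $1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# A few lines taken from the sample output on this page.
sample='JobId=13763010 JobName=slurm_test
   JobState=RUNNING Reason=None Dependency=(null)
   RunTime=00:00:06 TimeLimit=00:30:00 TimeMin=N/A
   TRES=cpu=1,mem=1024,node=1'

state=$(printf '%s\n' "$sample" | job_field JobState)
tres=$(printf '%s\n' "$sample" | job_field TRES)
echo "state=$state tres=$tres"

# On the cluster the input would come from scontrol instead:
#   scontrol show job 13763010 | job_field JobState
```

Splitting on spaces keeps the helper short, at the cost of not supporting values such as paths with spaces; for those, a more careful parser would be needed.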
MPI job submission
You can get this simple "Hello World" MPI test program written in C and save it as wiki_mpi_example.c.
Compile it as follows:
module load openmpi
mpicc wiki_mpi_example.c -o hello_world_wiki.mpi
#!/bin/bash
# CPU accounting is not enforced currently
#SBATCH -N 2
# use --exclusive to get the whole nodes exclusively for this job
#SBATCH --exclusive
#SBATCH --time=01:00:00
#SBATCH -c 2

srun -n 10 ./hello_world_wiki.mpi
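Before submitting, it can help to check that the requested tasks fit on the requested nodes. The job script above asks for 2 nodes and 2 cores per task (-c 2) and starts 10 tasks. The sketch below does this arithmetic; the function name tasks_fit is hypothetical, the cores-per-node value has to be taken from the hardware you expect to run on, and the check is rough in that it ignores hyper-threading and any site-specific scheduler settings.

```shell
# Hypothetical helper: succeed if NTASKS tasks of CPUS_PER_TASK cores each
# fit on NODES nodes that have CORES_PER_NODE cores per node.
# Usage: tasks_fit NODES CORES_PER_NODE NTASKS CPUS_PER_TASK
tasks_fit() {
    nodes=$1; cores_per_node=$2; ntasks=$3; cpus_per_task=$4
    # Whole tasks that fit on one node (integer division).
    tasks_per_node=$(( cores_per_node / cpus_per_task ))
    [ $(( nodes * tasks_per_node )) -ge "$ntasks" ]
}

# 2 nodes, 10 tasks, 2 cores per task, as in the job script above.
if tasks_fit 2 8 10 2; then
    echo "fits on 8-core nodes"
else
    echo "does not fit on 8-core nodes"
fi

if tasks_fit 2 12 10 2; then
    echo "fits on 12-core nodes"
else
    echo "does not fit on 12-core nodes"
fi
```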
Idle queue
To use the computing resources efficiently, we have set up a special "idle" queue in the cluster which includes all compute nodes, including those nodes which are normally dedicated to specific groups.
Jobs submitted to the "idle" queue will be able to run on dedicated nodes if they are free.
Important: if the dedicated nodes are needed by the groups that own them (that is, they submit a job to them), the "idle" queue jobs using those nodes will be killed and re-queued to run at a later time.
The "idle" queue is accessible to everyone who has an account on fimm.bccs.uib.no.
The "idle" queue gives you access to the following extra resources:
|Number of nodes||CPU type||Cores per node||Memory per node|
|2||Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz||8||32GB|
|30||Quad-Core Intel(R) Xeon(R) CPU E5420 @ 2.50GHz||8||16GB|
|32||Quad-Core Intel(R) Xeon(R) CPU L5430 @ 2.66GHz||8||16GB|
|12||Six-Core AMD Opteron(tm) Processor 2431||12||32GB|
|21||Quad-Core Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz||32||128GB|
The "idle" queue is most useful when:
- The "default" queue is fully utilized and the special queues are free.
- You have short jobs with demanding resource requirements.
- Your jobs are re-runnable without manual intervention. If not, please set the "#SBATCH --no-requeue" flag.
Please keep in mind that when you submit your job to the "idle" queue it is not guaranteed that your job will finish successfully since the owner of the hardware can "take the resources back" any time they submit a job to their specific queues.
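A job script for the "idle" queue could look like the following sketch. It assumes the partition name idle from the partition table and uses the standard --requeue option. SLURM_RESTART_COUNT is set by Slurm when a job has been requeued; the job name and the two messages are placeholders for your own restart logic.

```shell
#!/bin/bash
#SBATCH --partition=idle
# allow Slurm to put the job back in the queue if it is killed
#SBATCH --requeue
#SBATCH --time=30:00
#SBATCH --job-name="idle_test"

# SLURM_RESTART_COUNT is unset on the first run; treat that as 0.
restarts=${SLURM_RESTART_COUNT:-0}

if [ "$restarts" -gt 0 ]; then
    echo "Restart number $restarts: continue from the last checkpoint here"
else
    echo "First run: start from scratch here"
fi
```

Checking the restart count lets a requeued job resume from saved state rather than repeating work that was already done before the nodes were taken back.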
Array job submission
In many cases a computational analysis job contains a number of similar, independent subtasks. The user may have several datasets that will be analyzed in the same way, or the same simulation code may be executed with a number of different parameters. These kinds of tasks are often called "embarrassingly parallel" jobs, as the work can in principle be distributed to as many processors as there are subtasks to run. In Taito, these kinds of tasks can be run effectively by using the array job function of the Slurm batch job system.
#!/bin/sh
#SBATCH --array=0-31
#SBATCH --time=03:15:00          # Run time in hh:mm:ss
#SBATCH --mem-per-cpu=1024       # Minimum memory required per CPU (in megabytes)
#SBATCH --job-name=hello-world
#SBATCH --error=job.%J.out
#SBATCH --output=job.%J.out

echo "I am task $SLURM_ARRAY_TASK_ID on node `hostname`"
sleep 60
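A common pattern is to let each array task pick its own input from the task ID. The following is a minimal sketch, assuming input files named dataset_00.dat to dataset_31.dat (hypothetical names) and a hypothetical program ./analyze. Outside of Slurm the task ID falls back to 0, so the script can be tried on a login node first.

```shell
#!/bin/sh
#SBATCH --array=0-31
#SBATCH --time=00:10:00
#SBATCH --job-name=array-input

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task (0..31 here).
task_id=${SLURM_ARRAY_TASK_ID:-0}

# Build the per-task file name, e.g. task 7 -> dataset_07.dat
input=$(printf 'dataset_%02d.dat' "$task_id")

echo "Task $task_id would process $input"
# ./analyze "$input"    # hypothetical program, replace with your own
```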