<!--#include virtual="header.txt"-->

<h1><a name="top">Overview</a></h1>

<p>Slurm is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. Slurm requires no kernel modifications for
its operation and is relatively self-contained. As a cluster workload manager,
Slurm has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates contention for resources by managing a queue of
pending work.
Optional plugins can be used for
<a href="accounting.html">accounting</a>,
<a href="reservations.html">advanced reservation</a>,
<a href="gang_scheduling.html">gang scheduling</a> (time sharing for
parallel jobs), backfill scheduling,
<a href="topology.html">topology optimized resource selection</a>,
<a href="resource_limits.html">resource limits</a> by user or bank account,
and sophisticated <a href="priority_multifactor.html">multifactor job
prioritization</a> algorithms.</p>

<h2>Architecture</h2>
<p>Slurm has a centralized manager, <b>slurmctld</b>, to monitor resources and
work. There may also be a backup manager to assume those responsibilities in the
event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which
can be compared to a remote shell: it waits for work, executes that work, returns
status, and waits for more work.
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
There is an optional <b>slurmdbd</b> (Slurm DataBase Daemon) which can be used
to record accounting information for multiple Slurm-managed clusters in a
single database.
There is an optional
<a href="rest.html"><b>slurmrestd</b> (Slurm REST API Daemon)</a>
which can be used to interact with Slurm through its
<a href="https://en.wikipedia.org/wiki/Representational_state_transfer">
REST API</a>.
User tools include <b>srun</b> to initiate jobs,
<b>scancel</b> to terminate queued or running jobs,
<b>sinfo</b> to report system status,
<b>squeue</b> to report the status of jobs, and
<b>sacct</b> to get information about jobs and job steps that are running or
have completed.
The <b>sview</b> command graphically reports system and
job status, including network topology.
There is an administrative tool, <b>scontrol</b>, available to monitor
and/or modify configuration and state information on the cluster.
The administrative tool used to manage the database is <b>sacctmgr</b>.
It can be used to identify the clusters, valid users, valid bank accounts, etc.
APIs are available for all functions.</p>
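<p>As a quick illustration of these tools, the short session below is a
sketch (the job ID shown is hypothetical): it checks node and partition
status, runs a small two-node job, then inspects and cancels work.</p>
<pre>
$ sinfo                  # report partition and node status
$ srun -N2 -l hostname   # run 'hostname' on two nodes, labeling output by task
$ squeue -u $USER        # list this user's pending and running jobs
$ sacct -j 1234          # accounting data for job 1234 (hypothetical job ID)
$ scancel 1234           # cancel job 1234 if it is no longer needed
</pre>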
<div class="figure">
  <img src="arch.gif" width="550"><br>
  Figure 1. Slurm components
</div>

<p>Slurm has a general-purpose plugin mechanism available to easily support various
infrastructures. This permits a wide variety of Slurm configurations using a
building block approach. These plugins presently include the following
(a configuration sketch follows the list):</p>
<ul>
<li><a href="accounting_storageplugins.html">Accounting Storage</a>:
  Primarily used to store historical data about jobs. When used with
  SlurmDBD (Slurm Database Daemon), it can also supply a
  limits-based system along with historical system status.
</li>

<li><a href="acct_gather_energy_plugins.html">Account Gather Energy</a>:
  Gathers energy consumption data per job or per node in the system.
  This plugin is integrated with the
  <a href="accounting_storageplugins.html">Accounting Storage</a> and
  <a href="jobacct_gatherplugins.html">Job Account Gather</a> plugins.
</li>

<li><a href="authplugins.html">Authentication of communications</a>:
  Provides an authentication mechanism between the various components of Slurm.
</li>

<li><a href="containers.html">Containers</a>:
  HPC workload container support and implementations.
</li>

<li><a href="cred_plugins.html">Credential (Digital Signature Generation)</a>:
  Mechanism used to generate a digital signature, which is used to validate
  that a job step is authorized to execute on specific nodes.
  This is distinct from the plugin used for
  <a href="authplugins.html">Authentication</a> since the job step
  request is sent from the user's srun command rather than directly from the
  slurmctld daemon, which generates the job step credential and its
  digital signature.
</li>

<li><a href="gres.html">Generic Resources</a>: Provides an interface to
  control generic resources such as Graphics Processing Units (GPUs) and Intel®
  Many Integrated Core (MIC) processors.
</li>

<li><a href="job_submit_plugins.html">Job Submit</a>:
  Custom plugin to allow site-specific control over job requirements at
  submission and update.
</li>

<li><a href="jobacct_gatherplugins.html">Job Accounting Gather</a>:
  Gathers job step resource utilization data.
</li>

<li><a href="jobcompplugins.html">Job Completion Logging</a>:
  Logs a job's termination data. This is typically a subset of the data stored
  by an <a href="accounting_storageplugins.html">Accounting Storage Plugin</a>.
</li>

<li><a href="launch_plugins.html">Launchers</a>:
  Controls the mechanism used by the <a href="srun.html">'srun'</a> command
  to launch the tasks.
</li>

<li><a href="mpiplugins.html">MPI</a>:
  Provides different hooks for the various MPI implementations.
  For example, this can set MPI-specific environment variables.
</li>

<li><a href="preempt.html">Preempt</a>:
  Determines which jobs can preempt other jobs and the preemption mechanism
  to be used.
</li>

<li><a href="priority_plugins.html">Priority</a>:
  Assigns priorities to jobs upon submission and on an ongoing basis
  (e.g. as they age).
</li>

<li><a href="proctrack_plugins.html">Process tracking (for signaling)</a>:
  Provides a mechanism for identifying the processes associated with each job.
  Used for job accounting and signaling.
</li>

<li><a href="schedplugins.html">Scheduler</a>:
  Determines how and when Slurm schedules jobs.
</li>

<li><a href="selectplugins.html">Node selection</a>:
  Plugin used to determine the resources used for a job allocation.
</li>

<li><a href="site_factor.html">Site Factor (Priority)</a>:
  Assigns a specific site_factor component of a job's multifactor priority to
  jobs upon submission and on an ongoing basis (e.g. as they age).
</li>

<li><a href="switchplugins.html">Switch or interconnect</a>:
  Plugin to interface with a switch or interconnect.
  For most systems (Ethernet or InfiniBand) this is not needed.
</li>

<li><a href="taskplugins.html">Task Affinity</a>:
  Provides a mechanism to bind a job and its individual tasks to specific
  processors.
</li>

<li><a href="topology_plugin.html">Network Topology</a>:
  Optimizes resource selection based upon the network topology.
  Used for both job allocations and advanced reservations.
</li>

</ul>
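<p>Each plugin class above is selected through a corresponding parameter in
slurm.conf. The fragment below is an illustrative sketch, not a recommended
configuration; the plugin values chosen are simply examples of valid options.</p>
<pre>
# Illustrative plugin selections in slurm.conf (example values only)
AuthType=auth/munge                                # authentication of communications
CredType=cred/munge                                # job step credential signatures
SchedulerType=sched/backfill                       # backfill scheduling
SelectType=select/cons_tres                        # node selection with trackable resources
PriorityType=priority/multifactor                  # multifactor job prioritization
TaskPlugin=task/affinity                           # bind tasks to processors
TopologyPlugin=topology/tree                       # topology optimized resource selection
JobAcctGatherType=jobacct_gather/cgroup            # job step resource utilization
AccountingStorageType=accounting_storage/slurmdbd  # historical job data via slurmdbd
</pre>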
<p>The entities managed by these Slurm daemons, shown in Figure 2, include
<b>nodes</b>, the compute resource in Slurm; <b>partitions</b>, which group
nodes into logical sets; <b>jobs</b>, or allocations of resources assigned
to a user for a specified amount of time; and <b>job steps</b>, which are
sets of (possibly parallel) tasks within a job.
Partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation, as the
script sketched after Figure 2 illustrates.
Slurm provides resource management for the processors allocated to a job,
so that multiple job steps can be simultaneously submitted and queued until
there are resources available within the job's allocation.</p>

<div class="figure">
  <img src="entities.gif" width="550"><br>
  Figure 2. Slurm entities
</div>
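<p>The batch script below sketches both patterns (the program name
my_app is hypothetical): a first job step spanning the entire four-node
allocation, followed by two concurrent steps that each use half of it.</p>
<pre>
#!/bin/bash
#SBATCH --nodes=4        # request a four-node allocation
#SBATCH --time=00:30:00  # job time limit

# One job step using every node in the allocation
srun --nodes=4 ./my_app phase1

# Two job steps, each using part of the allocation, running concurrently
srun --nodes=2 ./my_app phase2a &
srun --nodes=2 ./my_app phase2b &
wait                     # wait for both job steps to complete
</pre>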
<h2>Configurability</h2>
<p>Node state monitored includes: count of processors, size of real memory, size
of temporary disk space, and state (UP, DOWN, etc.). Additional node information
includes weight (preference in being allocated work) and features (arbitrary
information such as processor speed or type).
Nodes are grouped into partitions, which may contain overlapping nodes, so they
are best thought of as job queues.
Partition information includes: name, list of associated nodes, state (UP or DOWN),
maximum job time limit, maximum node count per job, group access list,
priority (important if nodes are in multiple partitions) and shared node access
policy with optional over-subscription level for gang scheduling
(e.g. YES, NO or FORCE:2).
Bit maps are used to represent nodes, so scheduling decisions can be made by
performing a small number of comparisons and a series of fast bit map
manipulations. A sample (partial) Slurm configuration file follows.</p>
<pre>
#
# Sample /etc/slurm.conf
#
SlurmctldHost=linux0001  # Primary server
SlurmctldHost=linux0002  # Backup server
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT CPUs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]
</pre>

<p style="text-align:center;">Last modified 7 February 2020</p>

<!--#include virtual="footer.txt"-->