<!--#include virtual="header.txt"-->

<h1><a name="top">Overview</a></h1>

<p>Slurm is an open source,
fault-tolerant, and highly scalable cluster management and job scheduling system
for large and small Linux clusters. Slurm requires no kernel modifications for
its operation and is relatively self-contained. As a cluster workload manager,
Slurm has three key functions. First, it allocates exclusive and/or non-exclusive
access to resources (compute nodes) to users for some duration of time so they
can perform work. Second, it provides a framework for starting, executing, and
monitoring work (normally a parallel job) on the set of allocated nodes.
Finally, it arbitrates contention for resources by managing a queue of
pending work.
Optional plugins can be used for
<a href="accounting.html">accounting</a>,
<a href="reservations.html">advanced reservation</a>,
<a href="gang_scheduling.html">gang scheduling</a> (time sharing for
parallel jobs), backfill scheduling,
<a href="topology.html">topology optimized resource selection</a>,
<a href="resource_limits.html">resource limits</a> by user or bank account,
and sophisticated <a href="priority_multifactor.html">multifactor job
prioritization</a> algorithms.</p>

<h2>Architecture</h2>
<p>Slurm has a centralized manager, <b>slurmctld</b>, to monitor resources and
work. There may also be a backup manager to assume those responsibilities in the
event of failure. Each compute server (node) has a <b>slurmd</b> daemon, which
can be compared to a remote shell: it waits for work, executes that work, returns
status, and waits for more work.
The <b>slurmd</b> daemons provide fault-tolerant hierarchical communications.
There is an optional <b>slurmdbd</b> (Slurm DataBase Daemon) which can be used
to record accounting information for multiple Slurm-managed clusters in a
single database.
There is an optional
<a href="rest.html"><b>slurmrestd</b> (Slurm REST API Daemon)</a>
which can be used to interact with Slurm through its
<a href="https://en.wikipedia.org/wiki/Representational_state_transfer">
REST API</a>.
User tools include <b>srun</b> to initiate jobs,
<b>scancel</b> to terminate queued or running jobs,
<b>sinfo</b> to report system status,
<b>squeue</b> to report the status of jobs, and
<b>sacct</b> to get information about jobs and job steps that are running or have completed.
The <b>sview</b> command graphically reports system and
job status, including network topology.
There is an administrative tool, <b>scontrol</b>, available to monitor
and/or modify configuration and state information on the cluster.
The administrative tool used to manage the database is <b>sacctmgr</b>.
It can be used to identify the clusters, valid users, valid bank accounts, etc.
APIs are available for all functions.</p>
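
<p>For illustration, the commands below show a few typical invocations of these
user tools. This is only a minimal sketch: the job ID and options shown are
hypothetical examples, and actual output depends on the cluster's configuration.</p>
<pre>
# Report the state of partitions and nodes:
sinfo

# Launch a three-node job that runs hostname on each allocated node,
# labeling each output line with its task number:
srun -N3 -l hostname

# List pending and running jobs, then cancel a (hypothetical) job by ID:
squeue
scancel 1234

# Summarize accounting data for that job (requires accounting to be enabled):
sacct -j 1234
</pre>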

<div class="figure">
  <img src="arch.gif" width="550"><br>
  Figure 1. Slurm components
</div>

<p>Slurm has a general-purpose plugin mechanism available to easily support various
infrastructures. This permits a wide variety of Slurm configurations using a
building block approach. These plugins presently include:</p>
<ul>
<li><a href="accounting_storageplugins.html">Accounting Storage</a>:
  Primarily used to store historical data about jobs. When used with
  SlurmDBD (Slurm Database Daemon), it can also supply a
  limits-based system along with historical system status.
</li>

<li><a href="acct_gather_energy_plugins.html">Account Gather Energy</a>:
  Gathers energy consumption data per job or per node in the system.
  This plugin is integrated with the
  <a href="accounting_storageplugins.html">Accounting Storage</a> and
  <a href="jobacct_gatherplugins.html">Job Account Gather</a> plugins.
</li>

<li><a href="authplugins.html">Authentication of communications</a>:
  Provides an authentication mechanism between the various components of Slurm.
</li>

<li><a href="containers.html">Containers</a>:
  HPC workload container support and implementations.
</li>

<li><a href="cred_plugins.html">Credential (Digital Signature
  Generation)</a>:
  Mechanism used to generate a digital signature, which is used to validate
  that a job step is authorized to execute on specific nodes.
  This is distinct from the plugin used for
  <a href="authplugins.html">Authentication</a> since the job step
  request is sent from the user's srun command rather than directly from the
  slurmctld daemon, which generates the job step credential and its
  digital signature.
</li>

<li><a href="gres.html">Generic Resources</a>: Provides an interface to
  control generic resources, such as Graphics Processing Units (GPUs) and Intel&reg;
  Many Integrated Core (MIC) processors.
</li>

<li><a href="job_submit_plugins.html">Job Submit</a>:
  Custom plugin to allow site-specific control over job requirements at
  submission and update.
</li>

<li><a href="jobacct_gatherplugins.html">Job Accounting Gather</a>:
  Gathers job step resource utilization data.
</li>

<li><a href="jobcompplugins.html">Job Completion Logging</a>:
  Logs a job's termination data. This is typically a subset of the data stored by
  an <a href="accounting_storageplugins.html">Accounting Storage Plugin</a>.
</li>

<li><a href="launch_plugins.html">Launchers</a>:
  Controls the mechanism used by the <a href="srun.html">'srun'</a> command
  to launch the tasks.
</li>

<li><a href="mpiplugins.html">MPI</a>:
  Provides different hooks for the various MPI implementations.
  For example, this can set MPI-specific environment variables.
</li>

<li><a href="preempt.html">Preempt</a>:
  Determines which jobs can preempt other jobs and the preemption mechanism
  to be used.
</li>

<li><a href="priority_plugins.html">Priority</a>:
  Assigns priorities to jobs upon submission and on an ongoing basis
  (e.g. as they age).
</li>

<li><a href="proctrack_plugins.html">Process tracking (for signaling)</a>:
  Provides a mechanism for identifying the processes associated with each job.
  Used for job accounting and signaling.
</li>

<li><a href="schedplugins.html">Scheduler</a>:
  Plugin that determines how and when Slurm schedules jobs.
</li>

<li><a href="selectplugins.html">Node selection</a>:
  Plugin used to determine the resources used for a job allocation.
</li>

<li><a href="site_factor.html">Site Factor (Priority)</a>:
  Assigns a specific site_factor component of a job's multifactor priority to
  jobs upon submission and on an ongoing basis (e.g. as they age).
</li>

<li><a href="switchplugins.html">Switch or interconnect</a>:
  Plugin to interface with a switch or interconnect.
  For most systems (Ethernet or InfiniBand) this is not needed.
</li>

<li><a href="taskplugins.html">Task Affinity</a>:
  Provides a mechanism to bind a job and its individual tasks to specific
  processors.
</li>

<li><a href="topology_plugin.html">Network Topology</a>:
  Optimizes resource selection based upon the network topology.
  Used for both job allocations and advanced reservations.
</li>

</ul>
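
<p>Plugins of these types are selected through parameters in slurm.conf.
The fragment below is a minimal, illustrative sketch rather than a recommended
configuration; the plugin names actually available depend on the Slurm version
and how it was built.</p>
<pre>
# Illustrative plugin selections in slurm.conf (not a complete file)
AuthType=auth/munge                                 # authentication of communications
CredType=cred/munge                                 # job step credential signatures
SchedulerType=sched/backfill                        # scheduler plugin
SelectType=select/cons_tres                         # node/resource selection
PriorityType=priority/multifactor                   # job prioritization
ProctrackType=proctrack/cgroup                      # process tracking
TaskPlugin=task/affinity                            # task-to-CPU binding
JobAcctGatherType=jobacct_gather/cgroup             # job accounting gather
AccountingStorageType=accounting_storage/slurmdbd   # accounting storage via slurmdbd
MpiDefault=none                                     # default MPI plugin; srun --mpi can override
</pre>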

<p>The entities managed by these Slurm daemons, shown in Figure 2, include <b>nodes</b>,
the compute resource in Slurm; <b>partitions</b>, which group nodes into logical
sets; <b>jobs</b>, or allocations of resources assigned to a user for
a specified amount of time; and <b>job steps</b>, which are sets of (possibly
parallel) tasks within a job.
Partitions can be considered job queues, each of which has an assortment of
constraints such as job size limit, job time limit, users permitted to use it, etc.
Priority-ordered jobs are allocated nodes within a partition until the resources
(nodes, processors, memory, etc.) within that partition are exhausted. Once
a job is assigned a set of nodes, the user is able to initiate parallel work in
the form of job steps in any configuration within the allocation. For instance,
a single job step may be started that utilizes all nodes allocated to the job,
or several job steps may independently use a portion of the allocation.
Slurm provides resource management for the processors allocated to a job,
so that multiple job steps can be simultaneously submitted and queued until
there are available resources within the job's allocation.</p>
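
<p>As a concrete, hypothetical illustration of job steps, the batch script below
requests a four-node allocation and then launches steps inside it: first one step
spanning the whole allocation, then two smaller steps that run concurrently on
separate halves of it. The program names are placeholders.</p>
<pre>
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --time=30

# One job step that uses every node in the allocation:
srun --ntasks=4 --ntasks-per-node=1 ./app_whole

# Two job steps that each use half of the allocation and run concurrently:
srun --nodes=2 --ntasks=2 ./app_a &
srun --nodes=2 --ntasks=2 ./app_b &
wait
</pre>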

<div class="figure">
  <img src="entities.gif" width="550"><br>
  Figure 2. Slurm entities
</div>


<h2>Configurability</h2>
<p>Node state information monitored includes: count of processors, size of real memory, size
of temporary disk space, and state (UP, DOWN, etc.). Additional node information
includes weight (preference in being allocated work) and features (arbitrary information
such as processor speed or type).
Nodes are grouped into partitions, which may contain overlapping nodes, so they are
best thought of as job queues.
Partition information includes: name, list of associated nodes, state (UP or DOWN),
maximum job time limit, maximum node count per job, group access list,
priority (important if nodes are in multiple partitions) and shared node access policy
with optional over-subscription level for gang scheduling (e.g. YES, NO or FORCE:2).
Bit maps are used to represent nodes, so scheduling
decisions can be made by performing a small number of comparisons and a series
of fast bit map manipulations. A sample (partial) Slurm configuration file follows.</p>
<pre>
#
# Sample /etc/slurm.conf
#
SlurmctldHost=linux0001  # Primary server
SlurmctldHost=linux0002  # Backup server
#
AuthType=auth/munge
Epilog=/usr/local/slurm/sbin/epilog
PluginDir=/usr/local/slurm/lib
Prolog=/usr/local/slurm/sbin/prolog
SlurmctldPort=7002
SlurmctldTimeout=120
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=120
StateSaveLocation=/usr/local/slurm/slurm.state
TmpFS=/tmp
#
# Node Configurations
#
NodeName=DEFAULT CPUs=4 TmpDisk=16384 State=IDLE
NodeName=lx[0001-0002] State=DRAINED
NodeName=lx[0003-8000] RealMemory=2048 Weight=2
NodeName=lx[8001-9999] RealMemory=4096 Weight=6 Feature=video
#
# Partition Configurations
#
PartitionName=DEFAULT MaxTime=30 MaxNodes=2
PartitionName=login Nodes=lx[0001-0002] State=DOWN
PartitionName=debug Nodes=lx[0003-0030] State=UP Default=YES
PartitionName=class Nodes=lx[0031-0040] AllowGroups=students
PartitionName=DEFAULT MaxTime=UNLIMITED MaxNodes=4096
PartitionName=batch Nodes=lx[0041-9999]
</pre>
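
<p>Once a configuration like the one above is active, <b>scontrol</b> can be used
to inspect and, for administrators, modify the resulting state. A brief sketch
follows; the node and partition names refer to the sample file above.</p>
<pre>
# Display the configuration and state of a partition and of a node:
scontrol show partition debug
scontrol show node lx0003

# Administrators can change state at run time, e.g. drain a node for maintenance:
scontrol update NodeName=lx0003 State=DRAIN Reason="maintenance"
</pre>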

<p style="text-align:center;">Last modified 7 February 2020</p>

<!--#include virtual="footer.txt"-->
