.\" Copyright (c) 2000-2001 John H. Baldwin <jhb@FreeBSD.org>
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 3, 2000
.Dt SCHEDULER 9
.Os
.Sh NAME
.Nm curpriority_cmp ,
.Nm maybe_resched ,
.Nm resetpriority ,
.Nm roundrobin ,
.Nm roundrobin_interval ,
.Nm sched_setup ,
.Nm schedclock ,
.Nm schedcpu ,
.Nm setrunnable ,
.Nm updatepri
.Nd perform round-robin scheduling of runnable processes
.Sh SYNOPSIS
.In sys/param.h
.In sys/proc.h
.Ft int
.Fn curpriority_cmp "struct proc *p"
.Ft void
.Fn maybe_resched "struct thread *td"
.Ft void
.Fn propagate_priority "struct proc *p"
.Ft void
.Fn resetpriority "struct ksegrp *kg"
.Ft void
.Fn roundrobin "void *arg"
.Ft int
.Fn roundrobin_interval "void"
.Ft void
.Fn sched_setup "void *dummy"
.Ft void
.Fn schedclock "struct thread *td"
.Ft void
.Fn schedcpu "void *arg"
.Ft void
.Fn setrunnable "struct thread *td"
.Ft void
.Fn updatepri "struct thread *td"
.Sh DESCRIPTION
Each process has three different priorities stored in
.Vt "struct proc" :
.Va p_usrpri ,
.Va p_nativepri ,
and
.Va p_priority .
.Pp
The
.Va p_usrpri
member is the user priority of the process, calculated from the process's
estimated CPU time and nice level.
.Pp
The
.Va p_nativepri
member is the saved priority used by
.Fn propagate_priority .
When a process obtains a mutex, its priority is saved in
.Va p_nativepri .
While it holds the mutex, the process's priority may be bumped by another
process that blocks on the mutex.
When the process releases the mutex, its priority is restored to the
value saved in
.Va p_nativepri .
.Pp
The
.Va p_priority
member is the actual priority of the process and determines, for example,
which
.Xr runqueue 9
it is placed on.
.Pp
The
.Fn curpriority_cmp
function compares the cached priority of the currently running process with
process
.Fa p .
If the currently running process has a higher priority, then it will return
a value less than zero.
If the current process has a lower priority, then it will return a value
greater than zero.
If the current process has the same priority as
.Fa p ,
then
.Fn curpriority_cmp
will return zero.
The cached priority of the currently running process is updated when a process
resumes from
.Xr tsleep 9
or returns to userland in
.Fn userret
and is stored in the private variable
.Va curpriority .
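.Pp
The return convention can be illustrated with a minimal sketch.
It assumes that a numerically smaller value means a higher priority, and
every name prefixed with demo_ is a hypothetical stand-in rather than a kernel
definition.
.Bd -literal -offset indent
```c
/* Hypothetical stand-in for the priority field of struct proc. */
struct demo_proc {
	unsigned char p_priority; /* smaller value = higher priority (assumed) */
};

/* Stand-in for the private curpriority variable. */
static unsigned char demo_curpriority;

/*
 * Return < 0 if the running process has the higher priority,
 * > 0 if p has the higher priority, and 0 if they are equal.
 */
static int
demo_curpriority_cmp(const struct demo_proc *p)
{
	return ((int)demo_curpriority - (int)p->p_priority);
}
```
.Ed
Under that assumption, a caller can treat a positive return value as a hint
that
.Fa p
should run before the current process.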
.Pp
The
.Fn maybe_resched
function compares the priorities of the current thread and
.Fa td .
If
.Fa td
has a higher priority than the current thread, then a context switch is
needed, and
.Dv KEF_NEEDRESCHED
is set.
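.Pp
A minimal sketch of that check follows.
The structure layout, the flag value, and the choice to record the flag on
the current thread are assumptions made for illustration; smaller values are
again assumed to mean higher priority.
.Bd -literal -offset indent
```c
#define DEMO_NEEDRESCHED 0x1	/* hypothetical stand-in for KEF_NEEDRESCHED */

/* Hypothetical thread record. */
struct demo_thread {
	int priority;	/* smaller value = higher priority (assumed) */
	int flags;
};

/* Request a context switch if td should run before the current thread. */
static void
demo_maybe_resched(struct demo_thread *cur, const struct demo_thread *td)
{
	if (td->priority < cur->priority)
		cur->flags |= DEMO_NEEDRESCHED;
}
```
.Ed
The sketch only records the request; the switch itself would happen later,
when the flag is examined.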
.Pp
The
.Fn propagate_priority
function looks at the process that owns the mutex that
.Fa p
is blocked on.
That process's priority is bumped to the priority of
.Fa p
if needed.
If the process is currently running, then the function returns.
If the process is on a
.Xr runqueue 9 ,
then the process is moved to the appropriate
.Xr runqueue 9
for its new priority.
If the process is blocked on a mutex, its position in the list of
processes blocked on the mutex in question is updated to reflect its new
priority.
Then, the function repeats the procedure using the process that owns the
mutex just encountered.
Note that a process's priority is only bumped to the priority of the
original process
.Fa p ,
not to the priority of the previously encountered process.
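.Pp
The chain walk can be sketched as follows.
This is a simplified illustration, not the kernel code: it omits the
running-process check, the run queue move, and the re-sorting of blocked
lists, and it assumes that smaller values mean higher priority.
All demo_ names are hypothetical.
.Bd -literal -offset indent
```c
#include <stddef.h>

struct demo_mutex;

/* Hypothetical process record. */
struct demo_proc {
	int priority;			/* smaller = higher (assumed) */
	struct demo_mutex *blocked_on;	/* NULL if not blocked */
};

/* Hypothetical mutex record. */
struct demo_mutex {
	struct demo_proc *owner;	/* NULL if unowned */
};

/*
 * Lend the priority of p to each mutex owner along the chain,
 * stopping at the first owner that needs no bump.
 */
static void
demo_propagate_priority(const struct demo_proc *p)
{
	const int pri = p->priority;
	struct demo_mutex *m = p->blocked_on;

	while (m != NULL && m->owner != NULL) {
		struct demo_proc *owner = m->owner;

		if (owner->priority <= pri)
			break;		/* already high enough */
		owner->priority = pri;	/* always the priority of p */
		m = owner->blocked_on;	/* follow the chain */
	}
}
```
.Ed
Because the loop always assigns the priority of the original
.Fa p ,
an intermediate owner never passes on its own, possibly different, priority.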
.Pp
The
.Fn resetpriority
function recomputes the user priority of the ksegrp
.Fa kg
(stored in
.Va kg_user_pri )
and calls
.Fn maybe_resched
to force a reschedule of each thread in the group if needed.
.Pp
The
.Fn roundrobin
function is used as a
.Xr timeout 9
function to force a reschedule every
.Va sched_quantum
ticks.
.Pp
The
.Fn roundrobin_interval
function simply returns the number of clock ticks in between reschedules
triggered by
.Fn roundrobin .
Thus, all it does is return the current value of
.Va sched_quantum .
.Pp
The
.Fn sched_setup
function is a
.Xr SYSINIT 9
that is called to start the callout driven scheduler functions.
It just calls the
.Fn roundrobin
and
.Fn schedcpu
functions for the first time.
After the initial call, the two functions will propagate themselves by
registering their callout event again at the completion of the respective
function.
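.Pp
The self-rearming pattern can be sketched with a toy single-slot timer.
The timer, the quantum value, and all demo_ names are hypothetical; the
sketch does not use the kernel
.Xr timeout 9
interface.
.Bd -literal -offset indent
```c
#include <stddef.h>

/* Toy one-shot timer standing in for a kernel callout. */
struct demo_callout {
	void (*func)(void *);
	void *arg;
	int ticks_left;		/* 0 = not armed */
};

static struct demo_callout demo_rr_callout;
static int demo_sched_quantum = 10;	/* hypothetical quantum, in ticks */
static int demo_resched_requests;	/* counts forced reschedules */

/* Arm the timer to call func(arg) after the given number of ticks. */
static void
demo_callout_reset(struct demo_callout *c, int ticks,
    void (*func)(void *), void *arg)
{
	c->func = func;
	c->arg = arg;
	c->ticks_left = ticks;
}

/* Advance the clock by one tick and fire the timer if it expires. */
static void
demo_tick(struct demo_callout *c)
{
	if (c->func != NULL && c->ticks_left > 0 && --c->ticks_left == 0)
		c->func(c->arg);
}

/* Request a reschedule, then re-register for the next quantum. */
static void
demo_roundrobin(void *arg)
{
	demo_resched_requests++;
	demo_callout_reset(&demo_rr_callout, demo_sched_quantum,
	    demo_roundrobin, arg);
}

/* The first call starts the cycle; later calls come from the timer. */
static void
demo_sched_setup(void *dummy)
{
	demo_roundrobin(dummy);
}
```
.Ed
After the single call from the setup function, the handler keeps itself alive
by re-arming the timer at the end of every run.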
.Pp
The
.Fn schedclock
function is called by
.Fn statclock
to adjust the priority of the currently running thread's ksegrp.
It updates the group's estimated CPU time and then adjusts the priority via
.Fn resetpriority .
.Pp
The
.Fn schedcpu
function updates all process priorities.
First, it updates statistics that track how long processes have been in various
process states.
Second, it updates the estimated CPU time for each process such
that about 90% of the CPU usage is forgotten in (5 * load average) seconds.
For example, if the load average is 2.00,
then about 90% of the estimated CPU time for the process should be based
on the amount of CPU time the process has had in the last 10 seconds.
It then recomputes the priority of the process and moves it to the
appropriate
.Xr runqueue 9
if necessary.
Third, it updates the %CPU estimate used by utilities such as
.Xr ps 1
and
.Xr top 1
so that 95% of the CPU usage is forgotten in 60 seconds.
Once all process priorities have been updated,
.Fn schedcpu
calls
.Fn vmmeter
to update various other statistics including the load average.
Finally, it schedules itself to run again in
.Va hz
clock ticks.
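.Pp
The two decay rates can be checked numerically with the sketch below.
It assumes the classic 4.4BSD filter, in which the estimate is multiplied
once per second by (2 * load) / (2 * load + 1), and a per-second %CPU factor
of exp(-1/20).
This page does not state either formula, so both are assumptions, as are all
demo_ names.
.Bd -literal -offset indent
```c
/* Assumed per-second %CPU decay factor, exp(-1/20). */
#define DEMO_CCPU 0.95122942450071400909

/* Assumed per-second decay factor for the estimated CPU time. */
static double
demo_estcpu_factor(double loadavg)
{
	return ((2.0 * loadavg) / (2.0 * loadavg + 1.0));
}

/* Fraction left after applying factor once per second for seconds. */
static double
demo_remaining(double factor, int seconds)
{
	double r = 1.0;

	for (int i = 0; i < seconds; i++)
		r *= factor;
	return (r);
}
```
.Ed
With a load average of 2.00 the factor is 0.8, so after 10 seconds about
10.7% of the old estimate remains.
With the assumed %CPU factor, about 5% remains after 60 seconds.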
.Pp
The
.Fn setrunnable
function is used to change a process's state to be runnable.
The process is placed on a
.Xr runqueue 9
if needed, and the swapper process is woken up and told to swap the process in
if the process is swapped out.
If the process has been asleep for at least one run of
.Fn schedcpu ,
then
.Fn updatepri
is used to adjust the priority of the process.
.Pp
The
.Fn updatepri
function is used to adjust the priority of a process that has been asleep.
It retroactively decays the estimated CPU time of the process for each
.Fn schedcpu
event that the process was asleep.
Finally, it calls
.Fn resetpriority
to adjust the priority of the process.
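.Pp
The retroactive decay can be sketched as a loop that applies one decay step
per missed
.Fn schedcpu
run.
The integer load value, the decay step, and the demo_ name are assumptions
for illustration, and the final priority recomputation is left out.
.Bd -literal -offset indent
```c
/*
 * Decay estcpu once for each schedcpu run missed while asleep.
 * load is an integer load average (assumed); the decay step is the
 * assumed filter estcpu * (2 * load) / (2 * load + 1).
 */
static unsigned int
demo_updatepri(unsigned int estcpu, unsigned int slptime, unsigned int load)
{
	for (; slptime > 0 && estcpu > 0; slptime--)
		estcpu = estcpu * (2 * load) / (2 * load + 1);
	return (estcpu);
}
```
.Ed
The loop stops early once the estimate reaches zero, so a long sleep costs no
more work than a short one.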
.Sh SEE ALSO
.Xr mi_switch 9 ,
.Xr runqueue 9 ,
.Xr sleepqueue 9 ,
.Xr tsleep 9
.Sh BUGS
The
.Va curpriority
variable really should be per-CPU.
In addition,
.Fn maybe_resched
should compare the priority of
.Fa td
with that of each CPU, and then send an IPI to the processor with the lowest
priority to trigger a reschedule if needed.
.Pp
Priority propagation is broken and is thus disabled by default.
The
.Va p_nativepri
variable is only updated if a process does not obtain a sleep mutex on the
first try.
Also, if a process obtains more than one sleep mutex in this manner, and
had its priority bumped in between, then
.Va p_nativepri
will be clobbered.
276will be clobbered.
277