.\" Copyright (c) 2000-2001 John H. Baldwin <jhb@FreeBSD.org>
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.Dd November 3, 2000
.Dt SCHEDULER 9
.Os
.Sh NAME
.Nm curpriority_cmp ,
.Nm maybe_resched ,
.Nm propagate_priority ,
.Nm resetpriority ,
.Nm roundrobin ,
.Nm roundrobin_interval ,
.Nm sched_setup ,
.Nm schedclock ,
.Nm schedcpu ,
.Nm setrunnable ,
.Nm updatepri
.Nd perform round-robin scheduling of runnable processes
.Sh SYNOPSIS
.In sys/param.h
.In sys/proc.h
.Ft int
.Fn curpriority_cmp "struct proc *p"
.Ft void
.Fn maybe_resched "struct thread *td"
.Ft void
.Fn propagate_priority "struct proc *p"
.Ft void
.Fn resetpriority "struct ksegrp *kg"
.Ft void
.Fn roundrobin "void *arg"
.Ft int
.Fn roundrobin_interval "void"
.Ft void
.Fn sched_setup "void *dummy"
.Ft void
.Fn schedclock "struct thread *td"
.Ft void
.Fn schedcpu "void *arg"
.Ft void
.Fn setrunnable "struct thread *td"
.Ft void
.Fn updatepri "struct thread *td"
.Sh DESCRIPTION
Each process has three different priorities stored in
.Vt "struct proc" :
.Va p_usrpri ,
.Va p_nativepri ,
and
.Va p_priority .
.Pp
The
.Va p_usrpri
member is the user priority of the process, calculated from the process's
estimated CPU time and nice level.
.Pp
The
.Va p_nativepri
member is the saved priority used by
.Fn propagate_priority .
When a process obtains a mutex, its priority is saved in
.Va p_nativepri .
While it holds the mutex, the process's priority may be bumped by another
process that blocks on the mutex.
When the process releases the mutex, its priority is restored to the
priority saved in
.Va p_nativepri .
.Pp
The
.Va p_priority
member is the actual priority of the process and determines, for example,
which
.Xr runqueue 9
it is placed on.
.Pp
The
.Fn curpriority_cmp
function compares the cached priority of the currently running process with
that of process
.Fa p .
If the currently running process has a higher priority, then it will return
a value less than zero.
If the current process has a lower priority, then it will return a value
greater than zero.
If the current process has the same priority as
.Fa p ,
then
.Fn curpriority_cmp
will return zero.
The cached priority of the currently running process is updated when a process
resumes from
.Xr tsleep 9
or returns to userland in
.Fn userret
and is stored in the private variable
.Va curpriority .
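.Pp
The return convention can be sketched in user space as follows.
This is a minimal model and not the kernel code: it assumes the usual BSD
convention that a numerically lower value means a higher priority, and the
names
.Va model_curpriority
and
.Fn model_curpriority_cmp
are hypothetical.
```c
/* Hypothetical user-space model; not the kernel implementation. */

/* Cached priority of the running process (lower value = higher priority). */
static int model_curpriority;

/*
 * Return a value < 0 if the running process has the higher priority,
 * > 0 if it has the lower priority, and 0 if the priorities are equal.
 */
static int
model_curpriority_cmp(int p_priority)
{
	return (model_curpriority - p_priority);
}
```
With the cached priority set to 10, comparing against a process of priority
20 yields a negative value, and comparing against priority 5 yields a
positive one.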
.Pp
The
.Fn maybe_resched
function compares the priorities of the current thread and
.Fa td .
If
.Fa td
has a higher priority than the current thread, then a context switch is
needed, and
.Dv KEF_NEEDRESCHED
is set.
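.Pp
The check can be pictured with the following user-space sketch.
It is an illustration under stated assumptions, not the kernel code: the
structure, the flag value
.Dv MODEL_NEEDRESCHED ,
and the choice to record the flag on the current thread are all hypothetical,
and a numerically lower value is taken to mean a higher priority.
```c
/* Hypothetical user-space model; names and flag value are illustrative. */
#define	MODEL_NEEDRESCHED	0x1	/* stands in for KEF_NEEDRESCHED */

struct model_resched_thread {
	int priority;	/* lower value = higher priority */
	int flags;	/* pending scheduler requests */
};

/*
 * Request a context switch when td outranks the current thread.
 * Recording the flag on the current thread is an assumption of this model.
 */
static void
model_maybe_resched(struct model_resched_thread *cur,
    const struct model_resched_thread *td)
{
	if (td->priority < cur->priority)
		cur->flags |= MODEL_NEEDRESCHED;
}
```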
.Pp
The
.Fn propagate_priority
function looks at the process that owns the mutex that
.Fa p
is blocked on.
That process's priority is bumped to the priority of
.Fa p
if needed.
If the process is currently running, then the function returns.
If the process is on a
.Xr runqueue 9 ,
then the process is moved to the appropriate
.Xr runqueue 9
for its new priority.
If the process is blocked on a mutex, its position in the list of
processes blocked on the mutex in question is updated to reflect its new
priority.
Then, the function repeats the procedure using the process that owns the
mutex just encountered.
Note that a process's priority is only bumped to the priority of the
original process
.Fa p ,
not to the priority of the previously encountered process.
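.Pp
The chain walk described above can be sketched as follows.
This is a simplified user-space model, not the kernel implementation: all
type and function names are hypothetical, run queues and the ordering of
blocked lists are left out, and stopping early when an owner already has a
sufficient priority is an assumption of the sketch.
```c
#include <stddef.h>

/* Hypothetical user-space model; names are illustrative only. */
struct model_proc;

struct model_mutex {
	struct model_proc *owner;	/* process holding the mutex, or NULL */
};

struct model_proc {
	int priority;			/* lower value = higher priority */
	struct model_mutex *blocked_on;	/* mutex being waited for, or NULL */
};

/*
 * Walk the chain of mutex owners starting at the mutex p is blocked on.
 * Each owner with a lower priority than p is raised to p's priority,
 * never to the priority of an intermediate owner.
 */
static void
model_propagate_priority(struct model_proc *p)
{
	int pri = p->priority;
	struct model_mutex *m = p->blocked_on;

	while (m != NULL && m->owner != NULL) {
		struct model_proc *owner = m->owner;

		if (owner->priority <= pri)
			return;		/* already high enough; stop here */
		owner->priority = pri;
		m = owner->blocked_on;	/* continue with the next owner */
	}
}
```
For a chain in which process A (priority 10) blocks on a mutex held by B
(priority 50), which in turn blocks on a mutex held by C (priority 80), both
B and C end up at priority 10.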
.Pp
The
.Fn resetpriority
function recomputes the user priority of the ksegrp
.Fa kg
(stored in
.Va kg_user_pri )
and calls
.Fn maybe_resched
to force a reschedule of each thread in the group if needed.
.Pp
The
.Fn roundrobin
function is used as a
.Xr timeout 9
function to force a reschedule every
.Va sched_quantum
ticks.
.Pp
The
.Fn roundrobin_interval
function simply returns the number of clock ticks in between reschedules
triggered by
.Fn roundrobin .
Thus, all it does is return the current value of
.Va sched_quantum .
.Pp
The
.Fn sched_setup
function is a
.Xr SYSINIT 9
that is called to start the callout driven scheduler functions.
It just calls the
.Fn roundrobin
and
.Fn schedcpu
functions for the first time.
After the initial call, the two functions will propagate themselves by
registering their callout event again at the completion of the respective
function.
.Pp
The
.Fn schedclock
function is called by
.Fn statclock
to adjust the priority of the currently running thread's ksegrp.
It updates the group's estimated CPU time and then adjusts the priority via
.Fn resetpriority .
.Pp
The
.Fn schedcpu
function updates all process priorities.
First, it updates statistics that track how long processes have been in various
process states.
Secondly, it updates the estimated CPU time for each process such
that about 90% of the CPU usage is forgotten in 5 * load average seconds.
For example, if the load average is 2.00,
then at least 90% of the estimated CPU time for the process should be based
on the amount of CPU time the process has had in the last 10 seconds.
It then recomputes the priority of the process and moves it to the
appropriate
.Xr runqueue 9
if necessary.
Thirdly, it updates the %CPU estimate used by utilities such as
.Xr ps 1
and
.Xr top 1
so that 95% of the CPU usage is forgotten in 60 seconds.
Once all process priorities have been updated,
.Fn schedcpu
calls
.Fn vmmeter
to update various other statistics including the load average.
Finally, it schedules itself to run again in
.Va hz
clock ticks.
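.Pp
The decay of the estimated CPU time can be illustrated with a small
user-space calculation.
The sketch assumes the classic 4.4BSD filter, in which each pass multiplies
the estimate by 2 * load / (2 * load + 1); that formula is not stated in this
page and is an assumption here, the nice term is omitted, and the function
names are hypothetical.
```c
/* Hypothetical user-space model of the estimated-CPU decay. */

/*
 * One schedcpu-style pass: scale estcpu by 2*load / (2*load + 1).
 * The filter is assumed from the classic 4.4BSD scheduler.
 */
static double
model_decay_estcpu(double estcpu, double loadavg)
{
	return (estcpu * (2.0 * loadavg) / (2.0 * loadavg + 1.0));
}

/* Fraction of the original estimate remaining after the given passes. */
static double
model_remaining(double loadavg, int passes)
{
	double estcpu = 1.0;
	int i;

	for (i = 0; i < passes; i++)
		estcpu = model_decay_estcpu(estcpu, loadavg);
	return (estcpu);
}
```
With a load average of 2.00 the factor is 0.8, so ten one-second passes leave
roughly 11% of the original estimate, in line with the figure of about 90%
forgotten in 5 * load average seconds.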
.Pp
The
.Fn setrunnable
function is used to change a process's state to be runnable.
The process is placed on a
.Xr runqueue 9
if needed, and the swapper process is woken up and told to swap the process in
if the process is swapped out.
If the process has been asleep for at least one run of
.Fn schedcpu ,
then
.Fn updatepri
is used to adjust the priority of the process.
.Pp
The
.Fn updatepri
function is used to adjust the priority of a process that has been asleep.
It retroactively decays the estimated CPU time of the process for each
.Fn schedcpu
event that the process was asleep.
Finally, it calls
.Fn resetpriority
to adjust the priority of the process.
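.Pp
The retroactive decay can be sketched in user space as follows.
This is only an illustration: the structure and function names are
hypothetical, the per-pass factor 2 * load / (2 * load + 1) is assumed from
the classic 4.4BSD scheduler, and the final call to
.Fn resetpriority
is left out.
```c
/* Hypothetical user-space model; names are illustrative only. */
struct model_sleeper {
	double estcpu;	/* estimated CPU time */
	int slptime;	/* schedcpu passes missed while asleep */
};

/*
 * Apply one decay step for every schedcpu pass the sleeper missed.
 * The decay factor is an assumption borrowed from 4.4BSD.
 */
static void
model_updatepri(struct model_sleeper *s, double loadavg)
{
	double factor = (2.0 * loadavg) / (2.0 * loadavg + 1.0);

	while (s->slptime > 0) {
		s->estcpu *= factor;
		s->slptime--;
	}
	/* The kernel would recompute the priority at this point. */
}
```
For a load average of 0.5 the factor is 0.5, so an estimate of 100 that
missed three passes decays to 12.5.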
.Sh SEE ALSO
.Xr mi_switch 9 ,
.Xr runqueue 9 ,
.Xr sleepqueue 9 ,
.Xr tsleep 9
.Sh BUGS
The
.Va curpriority
variable really should be per-CPU.
In addition,
.Fn maybe_resched
should compare the priority of
.Fa td
with that of each CPU, and then send an IPI to the processor with the lowest
priority to trigger a reschedule if needed.
.Pp
Priority propagation is broken and is thus disabled by default.
The
.Va p_nativepri
variable is only updated if a process does not obtain a sleep mutex on the
first try.
Also, if a process obtains more than one sleep mutex in this manner, and
had its priority bumped in between, then
.Va p_nativepri
will be clobbered.
274