.\" Copyright (c) 2000-2001 John H. Baldwin <jhb@FreeBSD.org>
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE DEVELOPERS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL THE DEVELOPERS BE LIABLE FOR ANY DIRECT, INDIRECT,
.\" INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
.\" NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
.\" DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
.\" THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
.\" (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
.\" THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
.\"
.\" $FreeBSD$
.\"
.Dd November 3, 2000
.Dt SCHEDULER 9
.Os
.Sh NAME
.Nm curpriority_cmp ,
.Nm maybe_resched ,
.Nm propagate_priority ,
.Nm resetpriority ,
.Nm roundrobin ,
.Nm roundrobin_interval ,
.Nm sched_setup ,
.Nm schedclock ,
.Nm schedcpu ,
.Nm setrunnable ,
.Nm updatepri
.Nd perform round-robin scheduling of runnable processes
.Sh SYNOPSIS
.In sys/param.h
.In sys/proc.h
.Ft int
.Fn curpriority_cmp "struct proc *p"
.Ft void
.Fn maybe_resched "struct thread *td"
.Ft void
.Fn propagate_priority "struct proc *p"
.Ft void
.Fn resetpriority "struct ksegrp *kg"
.Ft void
.Fn roundrobin "void *arg"
.Ft int
.Fn roundrobin_interval "void"
.Ft void
.Fn sched_setup "void *dummy"
.Ft void
.Fn schedclock "struct thread *td"
.Ft void
.Fn schedcpu "void *arg"
.Ft void
.Fn setrunnable "struct thread *td"
.Ft void
.Fn updatepri "struct thread *td"
.Sh DESCRIPTION
Each process has three different priorities stored in
.Vt "struct proc" :
.Va p_usrpri ,
.Va p_nativepri ,
and
.Va p_priority .
.Pp
The
.Va p_usrpri
member is the user priority of the process, calculated from the process's
estimated CPU time and nice level.
.Pp
The
.Va p_nativepri
member is the saved priority used by
.Fn propagate_priority .
When a process obtains a mutex, its priority is saved in
.Va p_nativepri .
While it holds the mutex, the process's priority may be bumped by another
process that blocks on the mutex.
When the process releases the mutex, its priority is restored to the
priority saved in
.Va p_nativepri .
.Pp
The
.Va p_priority
member is the actual priority of the process and is used, for example, to
determine which
.Xr runqueue 9
it runs on.
.Pp
The
.Fn curpriority_cmp
function compares the cached priority of the currently running process with
process
.Fa p .
If the currently running process has a higher priority, then it will return
a value less than zero.
If the current process has a lower priority, then it will return a value
greater than zero.
If the current process has the same priority as
.Fa p ,
then
.Fn curpriority_cmp
will return zero.
The cached priority of the currently running process is updated when a process
resumes from
.Xr tsleep 9
or returns to userland in
.Fn userret
and is stored in the private variable
.Va curpriority .
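.Pp
The return convention above can be illustrated with a minimal user-space
sketch.
It is not the kernel implementation: the structure, the function name, and the
explicit
.Fa curpriority
parameter are hypothetical stand-ins, and the sketch assumes the usual
convention that a numerically lower value means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct proc. */
struct proc_sketch {
	int p_priority;		/* assumed: lower value = higher priority */
};

/*
 * Compare the cached priority of the running process with that of p.
 * Returns < 0 if the running process has the higher priority,
 * > 0 if it has the lower priority, and 0 if both are equal.
 */
int
priority_cmp_sketch(int curpriority, const struct proc_sketch *p)
{
	assert(p != NULL);
	return (curpriority - p->p_priority);
}
```

With a cached priority of 10, comparing against a process at priority 20
yields a negative value, and comparing against a process at priority 5 yields
a positive one.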
.Pp
The
.Fn maybe_resched
function compares the priorities of the current thread and
.Fa td .
If
.Fa td
has a higher priority than the current thread, then a context switch is
needed, and
.Dv KEF_NEEDRESCHED
is set.
.Pp
The
.Fn propagate_priority
function looks at the process that owns the mutex that
.Fa p
is blocked on.
That process's priority is bumped to the priority of
.Fa p
if needed.
If the process is currently running, then the function returns.
If the process is on a
.Xr runqueue 9 ,
then the process is moved to the appropriate
.Xr runqueue 9
for its new priority.
If the process is blocked on a mutex, its position in the list of
processes blocked on the mutex in question is updated to reflect its new
priority.
Then, the function repeats the procedure using the process that owns the
mutex just encountered.
Note that a process's priority is only bumped to the priority of the
original process
.Fa p ,
not to the priority of the previously encountered process.
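.Pp
The chain walk described above can be sketched as follows.
This is a simplified user-space model, not the kernel code: the structures
and all names are hypothetical, a lower value is assumed to mean a higher
priority, and the run queue move and the re-sorting of the blocked list are
left out.

```c
#include <assert.h>
#include <stddef.h>

struct mtx_sketch;

/* Hypothetical, simplified stand-in for struct proc. */
struct proc_sketch {
	int p_priority;			/* assumed: lower = higher priority */
	struct mtx_sketch *p_blocked;	/* mutex waited on, NULL if none */
};

/* Hypothetical, simplified stand-in for a sleep mutex. */
struct mtx_sketch {
	struct proc_sketch *mtx_owner;	/* NULL if the mutex is unowned */
};

/*
 * Lend the priority of the original process p to every owner along
 * the chain of mutexes, stopping at an owner that is not blocked or
 * that already runs at least as urgently as p.
 */
void
propagate_priority_sketch(struct proc_sketch *p)
{
	struct mtx_sketch *m;
	struct proc_sketch *owner;
	int pri;

	assert(p != NULL);
	pri = p->p_priority;		/* always the original priority */
	m = p->p_blocked;
	while (m != NULL && m->mtx_owner != NULL) {
		owner = m->mtx_owner;
		if (owner->p_priority <= pri)
			return;		/* nothing to lend */
		owner->p_priority = pri;
		m = owner->p_blocked;	/* NULL: owner can run, stop */
	}
}
```

Because the lent value is always the priority of the original process, an
owner further down the chain never ends up more urgent than
.Fa p
itself.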
.Pp
The
.Fn resetpriority
function recomputes the user priority of the ksegrp
.Fa kg
(stored in
.Va kg_user_pri )
and calls
.Fn maybe_resched
to force a reschedule of each thread in the group if needed.
.Pp
The
.Fn roundrobin
function is used as a
.Xr timeout 9
function to force a reschedule every
.Va sched_quantum
ticks.
.Pp
The
.Fn roundrobin_interval
function simply returns the number of clock ticks in between reschedules
triggered by
.Fn roundrobin .
Thus, all it does is return the current value of
.Va sched_quantum .
.Pp
The
.Fn sched_setup
function is a
.Xr SYSINIT 9
that is called to start the callout driven scheduler functions.
It just calls the
.Fn roundrobin
and
.Fn schedcpu
functions for the first time.
After the initial call, the two functions will propagate themselves by
registering their callout event again at the completion of the respective
function.
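.Pp
The self-rearming pattern can be modelled in user space with a simulated
clock.
The sketch below is only an illustration under stated assumptions: the
structure and every function name are hypothetical, and no real
.Xr timeout 9
interface is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one periodic callout. */
struct callout_sketch {
	long c_expires;		/* absolute tick of the next event */
	long c_interval;	/* re-arm period in ticks */
	long c_fired;		/* how often the handler has run */
};

/* Handler: do the periodic work, then register the next event. */
void
periodic_sketch(struct callout_sketch *c, long now)
{
	c->c_fired++;
	c->c_expires = now + c->c_interval;
}

/* Setup: the first, direct call is what arms the callout. */
void
setup_sketch(struct callout_sketch *c, long interval)
{
	assert(c != NULL && interval > 0);
	c->c_interval = interval;
	c->c_fired = 0;
	periodic_sketch(c, 0);
}

/* Advance the simulated clock tick by tick, firing expired events. */
void
run_ticks_sketch(struct callout_sketch *c, long nticks)
{
	for (long now = 1; now <= nticks; now++)
		if (now >= c->c_expires)
			periodic_sketch(c, now);
}
```

After setup with an interval of 10 ticks, running the clock for 100 ticks
invokes the handler once from setup and ten more times from the clock, without
anything outside the handler re-arming it.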
.Pp
The
.Fn schedclock
function is called by
.Fn statclock
to adjust the priority of the currently running thread's ksegrp.
It updates the group's estimated CPU time and then adjusts the priority via
.Fn resetpriority .
.Pp
The
.Fn schedcpu
function updates all process priorities.
First, it updates statistics that track how long processes have been in various
process states.
Secondly, it updates the estimated CPU time of each process such
that about 90% of the CPU usage is forgotten in (5 * load average) seconds.
For example, if the load average is 2.00,
then at least 90% of the estimated CPU time for the process should be based
on the amount of CPU time the process has had in the last 10 seconds.
It then recomputes the priority of the process and moves it to the
appropriate
.Xr runqueue 9
if necessary.
Thirdly, it updates the %CPU estimate used by utilities such as
.Xr ps 1
and
.Xr top 1
so that 95% of the CPU usage is forgotten in 60 seconds.
Once all process priorities have been updated,
.Fn schedcpu
calls
.Fn vmmeter
to update various other statistics including the load average.
Finally, it schedules itself to run again in
.Va hz
clock ticks.
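.Pp
The two decay rates can be checked numerically.
The sketch below assumes the classic 4.4BSD-style filter, in which the
estimate is multiplied once per second by (2 * load) / (2 * load + 1), and a
per-second %CPU factor of exp(-1/20); both are assumptions about the
implementation rather than something this page states, and all function names
are hypothetical.

```c
#include <assert.h>

/* Assumed per-second decay factor for the estimated CPU time. */
double
decay_factor_sketch(double loadavg)
{
	assert(loadavg >= 0.0);
	return ((2.0 * loadavg) / (2.0 * loadavg + 1.0));
}

/* Fraction of an old CPU time estimate left after a number of seconds. */
double
estcpu_remaining_sketch(double loadavg, int seconds)
{
	double f = decay_factor_sketch(loadavg);
	double r = 1.0;

	for (int i = 0; i < seconds; i++)
		r *= f;
	return (r);
}

/* Fraction of an old %CPU estimate left after a number of seconds. */
double
pctcpu_remaining_sketch(int seconds)
{
	const double ccpu = 0.95122942450071400909;	/* exp(-1/20) */
	double r = 1.0;

	for (int i = 0; i < seconds; i++)
		r *= ccpu;
	return (r);
}
```

With a load average of 2.00 the factor is 0.8, so after 10 seconds roughly
11% of the old estimate is left, which matches the "about 90%" figure above.
After 60 seconds, just under 5% of the old %CPU estimate is left.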
.Pp
The
.Fn setrunnable
function is used to change a process's state to be runnable.
The process is placed on a
.Xr runqueue 9
if needed, and the swapper process is woken up and told to swap the process in
if the process is swapped out.
If the process has been asleep for at least one run of
.Fn schedcpu ,
then
.Fn updatepri
is used to adjust the priority of the process.
.Pp
The
.Fn updatepri
function is used to adjust the priority of a process that has been asleep.
It retroactively decays the estimated CPU time of the process for each
.Fn schedcpu
event that the process was asleep.
Finally, it calls
.Fn resetpriority
to adjust the priority of the process.
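.Pp
The retroactive decay can be sketched as one decay step per missed
.Fn schedcpu
run.
This is an illustration only: the function name and parameters are
hypothetical, and the per-step factor (2 * load) / (2 * load + 1) is an
assumption about the filter rather than something stated on this page.

```c
#include <assert.h>

/*
 * Return the estimated CPU time of a process that slept through
 * `slptime` runs of the periodic decay, applying one decay step per
 * missed run and stopping early once the estimate drops below 1.
 */
unsigned
updatepri_estcpu_sketch(unsigned estcpu, unsigned slptime, double loadavg)
{
	double factor, cpu;

	assert(loadavg >= 0.0);
	factor = (2.0 * loadavg) / (2.0 * loadavg + 1.0);
	cpu = (double)estcpu;
	while (slptime > 0 && cpu >= 1.0) {
		cpu *= factor;
		slptime--;
	}
	return ((unsigned)cpu);		/* truncate toward zero */
}
```

For an estimate of 100 at a load average of 2.00, three missed runs leave 51,
and a long sleep leaves 0, so a process that has slept for a while wakes up
with a better priority once the new estimate is fed to
.Fn resetpriority .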
.Sh SEE ALSO
.Xr mi_switch 9 ,
.Xr runqueue 9 ,
.Xr sleepqueue 9 ,
.Xr tsleep 9
.Sh BUGS
The
.Va curpriority
variable really should be per-CPU.
In addition,
.Fn maybe_resched
should compare the priority of
.Fa td
with that of each CPU, and then send an IPI to the processor with the lowest
priority to trigger a reschedule if needed.
.Pp
Priority propagation is broken and is thus disabled by default.
The
.Va p_nativepri
variable is only updated if a process does not obtain a sleep mutex on the
first try.
Also, if a process obtains more than one sleep mutex in this manner, and
had its priority bumped in between, then
.Va p_nativepri
will be clobbered.