SCHED_EXT EXAMPLE SCHEDULERS
============================

# Introduction

This directory contains a number of example sched_ext schedulers. These
schedulers are meant to provide examples of different types of schedulers
that can be built using sched_ext, and illustrate how various features of
sched_ext can be used.

Some of the examples are performant, production-ready schedulers. That is, for
the correct workload and with the correct tuning, they may be deployed in a
production environment with acceptable or possibly even improved performance.
Others are just examples that, in practice, would not provide acceptable
performance (though they could be improved to get there).

This README describes these example schedulers, including the types of
workloads or scenarios they're designed to accommodate, and whether or not
they're production ready. For more details on any of these schedulers,
please see the header comment in their .bpf.c file.


# Compiling the examples

There are a few toolchain dependencies for compiling the example schedulers.

## Toolchain dependencies

1. clang >= 16.0.0

The schedulers are BPF programs, and therefore must be compiled with clang. The
gcc developers are actively working on a BPF backend as well, but it is still
missing some features, such as BTF type tags, which are necessary for using
kptrs.

2. pahole >= 1.25

You may need pahole in order to generate BTF from DWARF.

3. rust >= 1.70.0

The Rust schedulers use features present in the rust toolchain >= 1.70.0. You
should be able to use the stable build from rustup, but if that doesn't
work, try using the rustup nightly build.

There are other requirements as well, such as make, but these are the main /
non-trivial ones.

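If you want to confirm the minimum versions above before starting a build, a
small script along these lines may help. This is only a sketch: `version_ge`,
`first_version` and `check_tool` are hypothetical helpers that do not exist in
this directory, and the script assumes a `sort` that supports `-V` (as GNU
coreutils does) and that each tool prints its version in response to
`--version`.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of the sched_ext tree: compares installed
# tool versions against the minimums listed in this README.

# version_ge A B: succeeds if dotted version A >= B (relies on `sort -V`).
version_ge() {
	[ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# first_version: print the first dotted version number found on stdin.
first_version() {
	grep -oE '[0-9]+(\.[0-9]+)+' | head -n1
}

# check_tool NAME MINIMUM: report whether NAME is installed and new enough.
check_tool() {
	local name=$1 min=$2 found
	if ! command -v "$name" >/dev/null 2>&1; then
		echo "$name: not found (need >= $min)"
		return 1
	fi
	found=$("$name" --version 2>/dev/null | first_version)
	if version_ge "$found" "$min"; then
		echo "$name: $found (ok, need >= $min)"
	else
		echo "$name: ${found:-unknown} (too old, need >= $min)"
		return 1
	fi
}

# status ends up non-zero if any tool is missing or too old.
status=0
check_tool clang 16.0.0 || status=1
check_tool pahole 1.25 || status=1
check_tool rustc 1.70.0 || status=1
```

The script only reports what it finds; it does not install or select a
toolchain for you.
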
## Compiling the kernel

In order to run a sched_ext scheduler, you'll have to run a kernel compiled
with the patches in this repository, and with a minimum set of necessary
Kconfig options:

```
CONFIG_BPF=y
CONFIG_SCHED_CLASS_EXT=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_DEBUG_INFO_BTF=y
```

It's also recommended that you include the following Kconfig options:

```
CONFIG_BPF_JIT_ALWAYS_ON=y
CONFIG_BPF_JIT_DEFAULT_ON=y
CONFIG_PAHOLE_HAS_SPLIT_BTF=y
CONFIG_PAHOLE_HAS_BTF_TAG=y
```

There is a `Kconfig` file in this directory whose contents you can append to
your local `.config` file, as long as there are no conflicts with any existing
options in the file.

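As a quick sanity check, the required options above can be compared against a
kernel config file. The following is a minimal sketch, assuming the config
uses the usual `CONFIG_FOO=y` line format; `missing_kconfig` and
`required_opts` are hypothetical names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reports which of the required options listed above
# are not set to "y" in a kernel config file.

required_opts="CONFIG_BPF CONFIG_SCHED_CLASS_EXT CONFIG_BPF_SYSCALL CONFIG_BPF_JIT CONFIG_DEBUG_INFO_BTF"

# missing_kconfig FILE: print each required option that FILE does not set to y.
missing_kconfig() {
	local config=$1 opt
	for opt in $required_opts; do
		# -x matches the whole line, so CONFIG_BPF does not match
		# CONFIG_BPF_SYSCALL.
		grep -qx "${opt}=y" "$config" || echo "$opt"
	done
}

# Example: check the .config in the current directory, if there is one.
if [ -f .config ]; then
	missing_kconfig .config
fi
```

No output means every required option is enabled in the file that was
checked.
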
## Getting a vmlinux.h file

You may notice that most of the example schedulers include a "vmlinux.h" file.
This is a large, auto-generated header file that contains all of the types
defined in some vmlinux binary that was compiled with
[BTF](https://docs.kernel.org/bpf/btf.html) (i.e. with the BTF-related Kconfig
options specified above).

The header file is created using `bpftool`, by passing it a vmlinux binary
compiled with BTF as follows:

```bash
$ bpftool btf dump file /path/to/vmlinux format c > vmlinux.h
```

`bpftool` analyzes all of the BTF encodings in the binary, and produces a
header file that can be included by BPF programs to access those types. For
example, using vmlinux.h allows a scheduler to access fields defined directly
in vmlinux as follows:

```c
#include "vmlinux.h"
// vmlinux.h is also implicitly included by scx_common.bpf.h.
#include "scx_common.bpf.h"

/*
 * vmlinux.h provides definitions for struct task_struct and
 * struct scx_enable_args.
 */
void BPF_STRUCT_OPS(example_enable, struct task_struct *p,
		    struct scx_enable_args *args)
{
	bpf_printk("Task %s enabled in example scheduler", p->comm);
}

// vmlinux.h provides the definition for struct sched_ext_ops.
SEC(".struct_ops.link")
struct sched_ext_ops example_ops = {
	.enable	= (void *)example_enable,
	.name	= "example",
};
```

The scheduler build system will generate this vmlinux.h file as part of the
scheduler build pipeline. It looks for a vmlinux file in the following
order:

1. If the O= environment variable is defined, at `$O/vmlinux`
2. If the KBUILD_OUTPUT= environment variable is defined, at
   `$KBUILD_OUTPUT/vmlinux`
3. At `../../vmlinux` (i.e. at the root of the kernel tree where you're
   compiling the schedulers)
4. `/sys/kernel/btf/vmlinux`
5. `/boot/vmlinux-$(uname -r)`

In other words, if you have compiled a kernel in your local repo, its vmlinux
file will be used to generate vmlinux.h. Otherwise, it will be the vmlinux of
the kernel you're currently running on. This means that if you're running on a
kernel with sched_ext support, you may not need to compile a local kernel at
all.

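To see which vmlinux would be picked up on your machine, the lookup order
above can be mimicked in a few lines of shell. This is a sketch of the order
as described in this README, not the build system's actual implementation;
`find_vmlinux` is a hypothetical name, and the relative path assumes you run
it from this directory.

```bash
#!/usr/bin/env bash
# Sketch of the lookup order described above. find_vmlinux is a hypothetical
# helper; the real logic lives in the scheduler build system.

# find_vmlinux: print the first candidate that exists, or fail if none does.
find_vmlinux() {
	local candidate
	# The ${VAR:+...} forms contribute a candidate only when VAR is set
	# and non-empty.
	for candidate in \
		${O:+"$O/vmlinux"} \
		${KBUILD_OUTPUT:+"$KBUILD_OUTPUT/vmlinux"} \
		../../vmlinux \
		/sys/kernel/btf/vmlinux \
		"/boot/vmlinux-$(uname -r)"
	do
		if [ -e "$candidate" ]; then
			echo "$candidate"
			return 0
		fi
	done
	return 1
}

find_vmlinux || echo "no vmlinux found" >&2
```

If the path printed is `/sys/kernel/btf/vmlinux` or a file under `/boot`, the
generated vmlinux.h reflects the running kernel rather than a locally built
one.
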
### Aside on CO-RE

One of the cooler features of BPF is that it supports
[CO-RE](https://nakryiko.com/posts/bpf-core-reference-guide/) (Compile Once Run
Everywhere). This feature allows you to reference fields inside of structs with
types defined internal to the kernel, and not have to recompile if you load the
BPF program on a different kernel with the field at a different offset. In our
example above, we print out a task name with `p->comm`. CO-RE would perform
relocations for that access when the program is loaded to ensure that it's
referencing the correct offset for the currently running kernel.

## Compiling the schedulers

Once you have your toolchain set up, and a vmlinux that can be used to generate
a full vmlinux.h file, you can compile the schedulers using `make`:

```bash
$ make -j$(nproc)
```

# Example schedulers

This directory contains the following example schedulers. These schedulers are
for testing and demonstrating different aspects of sched_ext. While some may be
useful in limited scenarios, they are not intended to be practical.

For more scheduler implementations, tools and documentation, visit
https://github.com/sched-ext/scx.

## scx_simple

A simple scheduler that provides an example of a minimal sched_ext scheduler.
scx_simple can be run in either global weighted vtime mode, or FIFO mode.

Though very simple, in limited scenarios, this scheduler can perform reasonably
well on single-socket systems with a unified L3 cache.

## scx_qmap

Another simple, yet slightly more complex scheduler that provides an example of
a basic weighted FIFO queuing policy. It also provides examples of some common
useful BPF features, such as sleepable per-task storage allocation in the
`ops.prep_enable()` callback, and using the `BPF_MAP_TYPE_QUEUE` map type to
enqueue tasks. It also illustrates how core-sched support could be implemented.

## scx_central

A "central" scheduler where scheduling decisions are made from a single CPU.
This scheduler illustrates how scheduling decisions can be dispatched from a
single CPU, allowing other cores to run with infinite slices, without timer
ticks, and without having to incur the overhead of making scheduling decisions.

The approach demonstrated by this scheduler may be useful for any workload that
benefits from minimizing scheduling overhead and timer ticks. An example of
where this could be particularly useful is running VMs, where running with
infinite slices and no timer ticks allows the VM to avoid unnecessary expensive
vmexits.

## scx_flatcg

A flattened cgroup hierarchy scheduler. This scheduler implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a single
layer, by compounding the active weight share at each level. The effect of this
is a much more performant CPU controller, which does not need to descend down
cgroup trees in order to properly compute a cgroup's share.

Similar to scx_simple, in limited scenarios, this scheduler can perform
reasonably well on single-socket systems with a unified L3 cache and show
significantly lowered hierarchical scheduling overhead.


# Troubleshooting

There are a number of common issues that you may run into when building the
schedulers. We'll go over some of the common ones here.

## Build Failures

### Old version of clang

```
error: static assertion failed due to requirement 'SCX_DSQ_FLAG_BUILTIN': bpftool generated vmlinux.h is missing high bits for 64bit enums, upgrade clang and pahole
        _Static_assert(SCX_DSQ_FLAG_BUILTIN,
                       ^~~~~~~~~~~~~~~~~~~~
1 error generated.
```

This means you built the kernel or the schedulers with an older version of
clang than what's supported (i.e. older than 16.0.0). To remediate this:

1. Run `which clang` and `clang --version` to make sure you're using a
   sufficiently new version of clang.

2. Run `make fullclean` in the root path of the repository.

3. Rebuild the kernel, and then your example schedulers.

The schedulers are also cleaned if you invoke `make mrproper` in the root
directory of the tree.

### Stale kernel build / incomplete vmlinux.h file

As described above, you'll need a `vmlinux.h` file that was generated from a
vmlinux built with BTF, and with sched_ext support enabled. If you don't,
you'll see errors such as the following, which indicate that a type being
referenced in a scheduler is unknown:

```
/path/to/sched_ext/tools/sched_ext/user_exit_info.h:25:23: note: forward declaration of 'struct scx_exit_info'

const struct scx_exit_info *ei)

^
```

To resolve this, please follow the steps above in
[Getting a vmlinux.h file](#getting-a-vmlinuxh-file) to ensure your
schedulers are using a vmlinux.h file that includes the requisite types.

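One way to confirm that a generated header actually carries the sched_ext
types is to search it for full struct definitions rather than forward
declarations. The sketch below assumes the header was produced by `bpftool`
and that definitions start a line with `struct NAME {`; `has_struct` and
`check_vmlinux_h` are hypothetical helpers, and the two struct names are the
ones mentioned in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks that a generated vmlinux.h defines (not just
# forward-declares) the sched_ext types mentioned in this README.

# has_struct HEADER NAME: succeeds if HEADER has a line starting "struct NAME {".
has_struct() {
	grep -qE "^struct $2 \{" "$1"
}

# check_vmlinux_h HEADER: print each sched_ext struct missing from HEADER.
check_vmlinux_h() {
	local header=$1 name
	for name in sched_ext_ops scx_exit_info; do
		has_struct "$header" "$name" || echo "missing: struct $name"
	done
}

# Example: check a vmlinux.h in the current directory, if there is one.
if [ -f vmlinux.h ]; then
	check_vmlinux_h vmlinux.h
fi
```

Any `missing:` line suggests the header was generated from a vmlinux without
sched_ext support, which matches the failure described above.
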
## Misc

### llvm: [OFF]

You may see the following output when building the schedulers:

```
Auto-detecting system features:
... clang-bpf-co-re: [ on ]
... llvm: [ OFF ]
... libcap: [ on ]
... libbfd: [ on ]
```

Seeing `llvm: [ OFF ]` here is not an issue. You can safely ignore it.
