=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread
group leader: a process is deemed alive as long as it has any task belonging
to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).
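
As a rough sketch of such a request, the following builds the bytes of a query
for one pid or tgid, using the attribute types described under Interface. The
numeric constants are believed to mirror taskstats.h and the generic netlink
headers but should be checked against them. The family_id argument is an
assumption of this sketch: the TASKSTATS family id is assigned at run time and
has to be looked up through the generic netlink controller first, a step that
is omitted here.

```python
import struct

# Values believed to match the kernel headers; verify before relying on them.
NLM_F_REQUEST = 0x01
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2


def pack_attr(attr_type, payload):
    """Encode one netlink attribute, padded to a 4-byte boundary."""
    nla_len = 4 + len(payload)            # header + payload, padding excluded
    padding = b"\0" * ((-nla_len) % 4)
    return struct.pack("=HH", nla_len, attr_type) + payload + padding


def build_get_request(family_id, ident, per_process=False, seq=1):
    """Build a request for per-pid stats, or per-tgid stats if per_process."""
    attr_type = TASKSTATS_CMD_ATTR_TGID if per_process else TASKSTATS_CMD_ATTR_PID
    attr = pack_attr(attr_type, struct.pack("=I", ident))
    genl = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    total = 16 + len(genl) + len(attr)    # 16 bytes of nlmsghdr
    nlhdr = struct.pack("=IHHII", total, family_id, NLM_F_REQUEST, seq, 0)
    return nlhdr + genl + attr
```

The resulting bytes would be sent on a socket opened with AF_NETLINK and the
NETLINK_GENERIC protocol; getdelays.c shows the complete exchange in C.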

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; this is
explained in more detail below.
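
To illustrate the register command, the sketch below turns a set of cpu
numbers into the comma-separated range string described under Interface and
wraps it in the registration attribute. The attribute type values are believed
to match taskstats.h but should be verified there. Appending a terminating NUL
byte follows what getdelays.c does and is otherwise an assumption.

```python
import struct

# Believed to match taskstats.h; verify against the header.
TASKSTATS_CMD_ATTR_REGISTER_CPUMASK = 3
TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK = 4


def format_cpumask(cpus):
    """Turn an iterable of cpu numbers into a string such as "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu           # extend the current run
        else:
            ranges.append([cpu, cpu])     # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def cpumask_attr(cpus, register=True):
    """Encode the (de)registration attribute carrying the cpumask string."""
    attr_type = (TASKSTATS_CMD_ATTR_REGISTER_CPUMASK if register
                 else TASKSTATS_CMD_ATTR_DEREGISTER_CPUMASK)
    payload = format_cpumask(cpus).encode("ascii") + b"\0"
    nla_len = 4 + len(payload)            # padding is not counted in nla_len
    padding = b"\0" * ((-nla_len) % 4)
    return struct.pack("=HH", nla_len, attr_type) + payload + padding
```

Calling cpumask_attr with register=False produces the matching deregistration
attribute for the same set of cpus.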

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+
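
The framing above can be sketched as a small splitter that walks a receive
buffer and yields the generic netlink command together with the taskstats
payload of each message. It assumes the usual 16-byte nlmsghdr and 4-byte
genlmsghdr layouts and host byte order. It does not special-case netlink error
messages, which carry no genlmsghdr, so a real listener would need to check
the message type first.

```python
import struct

NLMSG_HDRLEN = 16   # sizeof(struct nlmsghdr), already a multiple of 4
GENL_HDRLEN = 4     # sizeof(struct genlmsghdr)


def split_messages(buf):
    """Yield (nlmsg_type, cmd, version, payload) for each message in buf."""
    offset = 0
    while offset + NLMSG_HDRLEN <= len(buf):
        length, msg_type, _flags, _seq, _port = struct.unpack_from(
            "=IHHII", buf, offset)
        if length < NLMSG_HDRLEN + GENL_HDRLEN or offset + length > len(buf):
            break                         # truncated or malformed message
        cmd, version, _reserved = struct.unpack_from(
            "=BBH", buf, offset + NLMSG_HDRLEN)
        start = offset + NLMSG_HDRLEN + GENL_HDRLEN
        yield msg_type, cmd, version, buf[start:offset + length]
        offset += (length + 3) & ~3       # next message starts 4-byte aligned
```

Because nlmsghdr is already a multiple of four bytes long, the Pad field in
the diagram is empty in this layout; only the gap between consecutive
messages needs rounding up.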


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, and contain a cpumask in
   the attribute payload. The cpumask is specified as an ascii string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus
   1,2,3,5,7,8 the cpumask would be "1-3,5,7-8". If userspace forgets to
   deregister interest in cpus before closing the listening socket, the kernel
   cleans up its interest set over time. However, for the sake of efficiency,
   an explicit deregistration is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that carries no data of its own
      but indicates that a pid/tgid will be followed by some stats.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose
      stats are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process
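
A possible way to walk such a payload is sketched below. It assumes, as
getdelays.c does when parsing, that the pid/tgid and stats attributes are
carried nested inside the AGGR attribute; the kernel source and taskstats.h
remain authoritative on the exact layout. The attribute type values are
believed to match taskstats.h, and struct taskstats is returned as raw bytes
rather than decoded.

```python
import struct

# Believed to match taskstats.h; verify against the header.
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_TGID = 2
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4
TASKSTATS_TYPE_AGGR_TGID = 5
NLA_TYPE_MASK = 0x3FFF   # strips the nested / byte-order flag bits


def iter_attrs(data):
    """Yield (type, payload) for each netlink attribute in data."""
    offset = 0
    while offset + 4 <= len(data):
        nla_len, nla_type = struct.unpack_from("=HH", data, offset)
        if nla_len < 4 or offset + nla_len > len(data):
            break                          # malformed attribute
        yield nla_type & NLA_TYPE_MASK, data[offset + 4:offset + nla_len]
        offset += (nla_len + 3) & ~3       # attributes are 4-byte aligned


def parse_taskstats_payload(payload):
    """Return a list of (kind, id, stats_bytes); kind is "pid" or "tgid"."""
    records = []
    for aggr_type, nested in iter_attrs(payload):
        if aggr_type == TASKSTATS_TYPE_AGGR_PID:
            kind, id_type = "pid", TASKSTATS_TYPE_PID
        elif aggr_type == TASKSTATS_TYPE_AGGR_TGID:
            kind, id_type = "tgid", TASKSTATS_TYPE_TGID
        else:
            continue                       # unknown attribute: ignore it
        ident, stats = None, None
        for attr_type, value in iter_attrs(nested):
            if attr_type == id_type:
                (ident,) = struct.unpack("=I", value[:4])
            elif attr_type == TASKSTATS_TYPE_STATS:
                stats = value
        records.append((kind, ident, stats))
    return records
```

A response to a command yields one record, while an exit message for the last
thread of a process yields a "pid" record followed by a "tgid" record.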


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats in addition to per-task stats within
the kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data is
also sent to userspace (along with the per-task data).

When a user queries per-tgid data, the statistics of all live threads in the
group are summed and added to the accumulated total for previously exited
threads of the same thread group.
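
The bookkeeping described above can be modelled in a few lines. This is a toy
userspace model for illustration only, not the kernel's data structures: the
class, its methods and the cpu_ns counter name are all hypothetical.

```python
from collections import Counter


class ThreadGroupStats:
    """Toy model: totals of exited threads plus the stats of live threads."""

    def __init__(self):
        self.exited_total = Counter()   # accumulated as each thread exits
        self.live = {}                  # tid -> Counter of current stats

    def update(self, tid, **stats):
        """Add to the running counters of a live thread."""
        self.live.setdefault(tid, Counter()).update(stats)

    def on_exit(self, tid):
        """Fold an exiting thread's stats into the process-wide total.

        Returns the per-task record and, for the last thread only, the
        per-process record as well.
        """
        final = self.live.pop(tid)
        self.exited_total.update(final)
        last_thread = not self.live
        return final, (Counter(self.exited_total) if last_thread else None)

    def query(self):
        """Per-tgid answer: live threads summed with the exited total."""
        total = Counter(self.exited_total)
        for stats in self.live.values():
            total.update(stats)
        return total
```

Only one extra accumulator per process is kept while threads come and go, and
the cost of summing live threads is paid only when a per-tgid query arrives.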

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.
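
For the first approach, a version-aware decoder might look like the sketch
below. It assumes that the version is the first 16-bit field of struct
taskstats. The field table is entirely hypothetical: the names, offsets and
introducing versions are invented for illustration, and the real ones must be
taken from taskstats.h.

```python
import struct

# HYPOTHETICAL layout for illustration only; real offsets, types and the
# version in which each field appeared must be taken from taskstats.h.
FIELDS = [
    # (name, first_version, offset, struct format)
    ("example_a", 1, 8, "=Q"),
    ("example_b", 2, 16, "=Q"),
]


def decode_stats(raw, understood_version):
    """Decode only fields that both the kernel and this program know about."""
    (kernel_version,) = struct.unpack_from("=H", raw, 0)
    usable = min(kernel_version, understood_version)
    result = {"version": kernel_version}
    for name, first_version, offset, fmt in FIELDS:
        fits = offset + struct.calcsize(fmt) <= len(raw)
        if first_version <= usable and fits:
            (result[name],) = struct.unpack_from(fmt, raw, offset)
    return result
```

An older program reading a newer struct simply stops at the fields it knows,
and a newer program reading an older struct skips fields the kernel did not
provide.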

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.
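
Both measures can be sketched with two small helpers. The function names are
hypothetical and the code is a minimal illustration rather than a tuned
listener: one helper requests a larger receive buffer and reports what the
kernel granted, the other divides the cpus among a number of listeners.

```python
import socket


def enlarge_receive_buffer(sock, size):
    """Ask for a larger receive buffer and return the size actually granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def split_cpus(cpus, listeners):
    """Divide cpus into at most `listeners` contiguous groups of similar size."""
    cpus = sorted(cpus)
    listeners = max(1, min(listeners, len(cpus)))
    base, extra = divmod(len(cpus), listeners)
    groups, start = [], 0
    for i in range(listeners):
        end = start + base + (1 if i < extra else 0)
        groups.append(cpus[start:end])
        start = end
    return groups
```

Each listener would register the cpumask for its own group and could pin
itself to the same cpus, for example with os.sched_setaffinity(0, group) on
Linux. The granted buffer size may differ from the requested one because the
kernel applies its own limits.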

If, despite these measures, userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
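
One way a listener might account for such losses is sketched below. The
function name and its max_messages parameter are hypothetical; the sketch only
counts overflows and keeps receiving, leaving it to the caller to decide how
to compensate for the missing records.

```python
import errno


def receive_exit_data(sock, handle, bufsize=65536, max_messages=None):
    """Receive until max_messages is reached; return the number of overflows."""
    overflows = 0
    received = 0
    while max_messages is None or received < max_messages:
        try:
            data = sock.recv(bufsize)
        except OSError as exc:
            if exc.errno != errno.ENOBUFS:
                raise                     # unrelated error: let it propagate
            overflows += 1                # some exit records were dropped
            continue
        handle(data)
        received += 1
    return overflows
```

A non-zero return value tells the caller that the statistics collected during
that period are incomplete.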