=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed for the following benefits:

- efficiently provide statistics during the lifetime of a task and on its exit
- unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread group
leader - a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
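
As a rough illustration of the query path described under Usage, the sketch
below packs a get command for a single pid. It is a minimal sketch and is not
taken from getdelays.c. The numeric constants reflect one reading of the
netlink and taskstats headers (taskstats.h remains authoritative),
``pack_attr`` and ``build_get_command`` are hypothetical helper names, and
``family_id`` is assumed to have been resolved beforehand through the generic
netlink controller, which is not shown here.

```python
import struct

# Assumed values, as read from <linux/netlink.h> and taskstats.h.
NLM_F_REQUEST = 0x01
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2

NLMSG_HDRLEN = 16   # u32 len, u16 type, u16 flags, u32 seq, u32 port id
NLA_HDRLEN = 4      # u16 len, u16 type


def pack_attr(attr_type, payload):
    """Pack one netlink attribute and pad it to a 4-byte boundary."""
    length = NLA_HDRLEN + len(payload)      # nla_len excludes the padding
    padding = (-length) % 4
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * padding


def build_get_command(family_id, attr_type, task_id, seq=1, port_id=0):
    """Build nlmsghdr + genlmsghdr + one u32 attribute (a pid or a tgid)."""
    attr = pack_attr(attr_type, struct.pack("=I", task_id))
    genl = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    total = NLMSG_HDRLEN + len(genl) + len(attr)
    nlh = struct.pack("=IHHII", total, family_id, NLM_F_REQUEST, seq, port_id)
    return nlh + genl + attr
```

A per-process query would pass ``TASKSTATS_CMD_ATTR_TGID`` instead of
``TASKSTATS_CMD_ATTR_PID``; the resulting bytes would then be written to a
NETLINK_GENERIC socket.
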
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
the task/process for which userspace wants statistics.

Commands to register/deregister interest in exit data from a set of cpus
consist of one attribute, of type
TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, containing a cpumask in the
attribute payload. The cpumask is specified as an ascii string of
comma-separated cpu ranges, e.g. to listen to exit data from cpus 1,2,3,5,7,8
the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
in cpus before closing the listening socket, the kernel cleans up its interest
set over time. However, for the sake of efficiency, an explicit deregistration
is advisable.

2. Response for a command: sent from the kernel in response to a userspace
command. The payload is a series of three attributes of type:

a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that contains no payload but
indicates that a pid/tgid will be followed by some stats.

b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
are being returned.

c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following type:

a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
b) TASKSTATS_TYPE_PID: contains exiting task's pid
c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process


per-tgid stats
--------------

Taskstats provides per-process stats, in
addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process, in addition to per-task stats, within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the process level data accumulated also
gets sent to userspace (along with the per-task data).

When a user queries to get per-tgid data, the stats of all live threads in
the group are summed and added to the accumulated total for previously exited
threads of the same thread group.

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

Despite these measures, if userspace receives ENOBUFS error messages
indicating overflow of receive buffers, it should take measures to handle the
loss of data.
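
The two measures above can be sketched as follows. This is an illustrative
sketch only: ``format_cpumask``, ``split_cpus`` and ``enlarge_rcvbuf`` are
hypothetical helper names, the even, contiguous split of cpus is just one
possible policy, and the socket passed to ``enlarge_rcvbuf`` is assumed to be
an already opened listener socket. The cpumask strings use the comma-separated
range format described in the Interface section.

```python
import socket


def format_cpumask(cpus):
    """Render cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu             # extend the current range
        else:
            ranges.append([cpu, cpu])       # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(cpus, listeners):
    """Return one cpumask string per listener, splitting cpus evenly."""
    cpus = sorted(set(cpus))
    listeners = max(1, min(listeners, len(cpus)))
    base, extra = divmod(len(cpus), listeners)
    masks, start = [], 0
    for i in range(listeners):
        size = base + (1 if i < extra else 0)
        masks.append(format_cpumask(cpus[start:start + size]))
        start += size
    return masks


def enlarge_rcvbuf(sock, nbytes):
    """Request a larger receive buffer and report what the kernel granted."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
```

For example, ``split_cpus(range(8), 4)`` yields four cpumask strings, one for
each listener to send in its own register command. The size returned by
``enlarge_rcvbuf`` may differ from the requested one, since the kernel applies
its own limits.
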