1Developer notes regarding trace(1), by David van Moolenbroek. 2 3 4OVERALL CODE STRUCTURE 5 6The general tracing engine is in trace.c. It passes IPC-level system call 7enter and leave events off to call.c, which handles IPC-level system call 8printing and passes off system calls to be interpreted by a service-specific 9system call handler whenever possible. All the service-specific code is in the 10service/ subdirectory, grouped by destination service. IOCTLs are a special 11case, which are handled in ioctl.c and passed on to driver-type-grouped IOCTL 12handlers in the ioctl/ subdirectory (this grouping is not strict). Some of the 13generated output goes through the formatting code in format.c, and all of it 14ends up in output.c. The remaining source files contain support code. 15 16 17ADDING A SYSTEM CALL HANDLER 18 19In principle, every system call stops the traced process twice: once when the 20system call is started (the call-enter event) and once when the system call 21returns (the call-leave event). The tracer uses the call-enter event to print 22the request being made, and the call-leave event to print the result of the 23call. The output format is supposed to mimic largely what the system call 24looks like from a C program, although with additional information where that 25makes sense. The general output format for system calls is: 26 27 name(parameters) = result 28 29..where "name" is the name of the system call, "parameters" is a list of system 30call parameters, and "result" is the result of the system call. If possible, 31the part up to and including the equals sign is printed from the call-enter 32event, and the result is printed from the call-leave event. However, many 33system calls actually pass a pointer to a block of memory that is filled with 34meaningful content as part of the system call. For that reason, it is also 35possible that the call-enter event stops printing somewhere inside the 36parameters block, and the call-leave event prints the rest of the parameters, 37as well as the equals sign and the result after it. The place in the printed 38system call where the call-enter printer stops and the call-leave printer is 39supposed to pick up again, is referred to as the "call split". 40 41The tracer has to a handler structure for every system call that can be made by 42a user program to any of the the MINIX3 services. This handler structure 43provides three elements: the name of the system call, an "out" function that 44handles printing of the call-enter part of the system call, and an "in" 45function that handles printing of the call-leave part of the system call. The 46"out" function is expected to print zero or more call parameters, and then 47return a call type, which indicates whether all parameters have been printed 48yet, or not. In fact, there are three call types, shown here with an example 49which has a "|" pipe symbol added to indicate the call split: 50 51 CT_DONE: write(5, "foo", 3) = |3 52 CT_NOTDONE: read(5, |"foo", 1024) = 3 53 CT_NORETURN: execve("foo", ["foo"], []")| = -1 [ENOENT] 54 55The CT_DONE call type indicates that the handler is done printing all the 56parameters during the call-enter event, and the call split will be after the 57equals sign. The CT_NOTDONE call type indicates that the handler is not done 58printing all parameters yet, thus yielding a call split in the middle of the 59parameters block (or even right after the opening parenthesis). The no-return 60(CT_NORETURN) call type is used for a small number of functions that do not 61return on success. Currently, these are the exit(), execve(), and sigreturn() 62system calls. For these calls, no result will be printed at all, unless such 63a call fails, in which case a failure result is printed after all. The call 64split is such that the entire parameters block is printed upon entering the 65call, but the equals sign and result are printed only if the call does return. 66 67Now more about the handler structure for the system call. First of all, each 68system call has a name, which must be a static string. It may be supplied 69either as a string, or as a function that returns a name string. The latter is 70for cases where one message-level system call is used to implement multiple 71C-level system calls (such as setitimer() and getitimer() both going through 72PM_ITIMER). The name function has the following prototype: 73 74 const char *svc_syscall_name(const message *m_out); 75 76..where "m_out" is a local copy of the request message, which the name function 77can use to decide what string to return for the system call. As a sidenote, 78in the future, the system call name will be used to implement call filtering. 79 80An "out" printer function has the following prototype: 81 82 int svc_syscall_out(struct trace_proc *proc, const message *m_out); 83 84Here, "proc" is a pointer to the process structure containing information about 85the process making the system call; proc->pid returns the process PID, but the 86function should not access any other fields of this structure directly. 87Instead, many of the output primitive and helper functions (which are all 88prefixed with "put_") take this pointer as part of the call. "m_out" is a 89local copy of the request message, and the printer may access its fields as it 90sees fit. 91 92The printer function should simply print parameters. The call name and the 93opening parenthesis are printed by the main output routine. 94 95All simple call parameters should be printed using the put_field() and 96put_value() functions. The former prints a parameter or field name as flat 97text; the latter is a printf-like interface to the former. By default, call 98paramaters are simply printed as "value", but if printing all names is enabled, 99call parameters are printed as "name=value". Thus, all parameters should be 100given a name, even if this name does not show up by default. Either way, these 101two functions take care of deciding whether to print the name, as well as of 102printing separators between the parameters. More about printing more complex 103parameters (such as structures) in a bit. 104 105The out printer function must return one of the three CT_ call type values. If 106it returns CT_DONE, the main output routine will immediately print the closing 107parenthesis and equals sign. If it returns CF_NORETURN, a closing parenthesis 108will be printed. If it return CF_NOTDONE, only a parameter field separator 109(that is, a comma and a space) will be printed--after all, it can be assumed 110that more parameters will be printed later. 111 112An "in" printer function has the following prototype: 113 114 void svc_syscall_in(struct trace_proc *proc, const message *m_out, 115 const message *m_in, int failed); 116 117Again, "proc" is the traced process of which its current system call has now 118returned. "m_out" is again the request message, guaranteed to be unchanged 119since the "out" call. "m_in" is the reply message from the service. "failed" 120is either 0 to indicate that the call appears to have succeeded, or PF_FAILED 121to indicate that the call definitely failed. If PF_FAILED is set, the call 122has failed either at the IPC level or at the system call level (or for another, 123less common reason). In that case, the contents of "m_in" may be garbage and 124"m_in" must not be used at all. 125 126For CF_NOTDONE type calls, the in printer function should first print the 127remaining parameters. Here especially, it is important to consider that the 128entire call may fail. In that case, the parameters of which the contents were 129still going to be printed may also contain garbage, since they were never 130filled. The expected behavior is to print such parameters as pointer or "&.." 131or something else to indicate that their actual contents are not valid. 132 133Either way, once a CF_NOTDONE type call function is done printing the remaining 134parameters, it must call put_equals(proc) to print the closing parenthesis of 135the call and the equals sign. CF_NORETURN calls must also use put_equals(proc) 136to print the equals sign. 137 138Then comes the result part. If the call failed, the in printer function *must* 139use put_result(proc) to print the failure result. This call not only takes 140care of converting negative error codes from m_in->m_type into "-1 [ECODE]" but 141also prints appropriate failure codes for IPC-level and other exceptional 142failures. Only if the system call did not fail, may the in printer function 143choose to not call put_result(proc), which on success simply prints 144m_in->m_type as an integer. Similarly, if the system call succeeded, the in 145printer function may print extended results after the primary result, generally 146in parentheses. For example, getpid() and getppid() share the same system call 147and thus the tracer prints both return values, one as the primary result of the 148actual call and one in parentheses with a clarifying name as extended result: 149 150 getpid() = 3 (ppid=1) 151 152It should now be clear that printing extended results makes no sense if the 153system call failed. 154 155Besidse put_equals and put_result, the following more or less generic support 156functions are available to print the various parts of the requests and replies. 157 158 put_field - output a parameter, structure field, and so on; this function 159 should be used for just about every actual value 160 put_value - printf-like version of put_field 161 put_text - output plain text; for call handlers, this should be used only to 162 to add things right after a put_field call, never on its own 163 put_fmt - printf-like version of put_text, should generally not be used 164 from call handlers at all 165 put_open - open a nested block of fields, surrounded by parentheses, 166 brackets, or something like that; this is used for structures, 167 arrays, and any other similar nontrivial case of nesting 168 put_close - close a previously opened block of fields; the nesting depth is 169 actually tracked (to keep per-level separators etc), so each 170 put_open call must have a corresponding put_close call 171 put_open_struct - perform several tasks necessary to start printing the 172 fields of a structure; note that this function may fail! 173 put_close_struct - end successful printing of a structure 174 put_ptr - print a pointer in the traced process 175 put_buf - print a buffer or string 176 put_flags - print a bitwise flags field 177 put_tail - helper function for printing the continuation part of an array 178 179Many of these support functions take a flags field which takes PF_-prefixed 180flags to modify the output they generate. The value of 'failed' in the in 181printer function may actually be passed (bitwise-OR'ed in) as the PF_FAILED 182flag to these support functions, and they will do the right thing. For 183example, a call to put_open_struct with the PF_FAILED flag will end up simply 184printing the pointer to the structure, and not allow printing of the contents 185of the structure. 186 187The above support functions are documented (at a basic level) within the code, 188but in many cases, it may be useful to look up how they are used in practice by 189the existing handlers. The same goes for various less clear cases; while there 190is basic support for printing structures, support for printing arrays must be 191coded fully by hand, as has been done for many places. A serious attempt has 192been made to make the output consistent across the board (mainly thanks to the 193output format of strace, on which the output of this tracer has been based, 194sometimes very strictly and sometimes more loosely, but that aside) so it is 195always advisable to follow the ways of the existing handlers. Also keep in 196mind that there are already printer functions for several generic structures, 197and these should be used whenever possible (e.g., see the put_fd() comment). 198 199Finally, the default_out and default_in functions may be used as printer 200functions for call with no parameters, and for functions which need no more 201than put_result() to print their system call result, respectively. 202 203 204ADDING AN IOCTL HANDLER 205 206There are many IOCTL requests, and many have their own associated data types. 207Like with system calls, the idea is to provide an actual implementation for any 208IOCTLs that can actually occur in the wild. This consists of printing the full 209IOCTL name, as well as its argument. First something about how handling IOCTLs 210is grouped into files in the ioctl subdirectory, then about the actual 211procedure the IOCTLs are handled. 212 213Grouping of IOCTL handling in the ioctl subdirectory is currently based on the 214IOCTLs' associated device type. This is not a performance optimization: for 215any given IOCTL, there is no way for the main IOCTL code (in ioctl.c) to know 216which group, if any, contains a handler for the IOCTL, so it simply queries all 217groups. The grouping is there only to keep down the size of individual source 218files, and as such not even strict: for example, networking IOCTLs are 219technically a subset of character IOCTLs, and kept separate only because there 220are so many of them. The point here is mainly that the separation is not at 221all set in stone. However, the svrctl group is an exception: svrctl(2) 222requests are very much like IOCTLs, and thus also treated as such, but they are 223in a different namespace. Thus, their handlers are in a separate file. 224 225As per the ioctl_table structure, each group has a function to return the name 226of an IOCTL it knows (typically <group>_ioctl_name), and a function to handle 227IOCTL arguments (typically <group>_ioctl_arg). Whenever an IOCTL system call 228is made, each group's name function is queried. This function has the 229following prototype: 230 231 const char *group_ioctl_name(unsigned long req); 232 233The "req" parameter contains the IOCTL request code. The function is to return 234a static non-NULL string if it knows the name for the request code, or NULL 235otherwise. If the function returns a non-NULL string, that name will be used 236for the IOCTL. In addition, if the IOCTL has an argument at all, i.e. it is 237not of the basic _IO() type, that group (and only that group!) will be queried 238about the IOCTL argument, by calling the group's IOCTL argument function. The 239IOCTL argument function has the following prototype: 240 241 int group_ioctl_arg(struct trace_proc *proc, unsigned long req, void *ptr, 242 int dir); 243 244For a single IOCTL, this function may be called up to three times. The first 245time, "ptr" will be NULL, and based on the same IOCTL request code "req", the 246function must return any bitwise combination of two flags: IF_OUT and IF_IN. 247 248The returned flags determine whether and how the IOCTL's argument will be 249printed: before and/or after performing the IOCTL system call. These two flags 250effectively correspond to the "write" and "read" argument directions of IOCTLs: 251IF_OUT indicates that the argument should be printed before the IOCTL request, 252and this is to be used only for IOCTLs of type _IOW() and _IOWR(). IF_IN 253indicates that the argument should be printed after the IOCTL request (but if 254it was successful only), and is to be used only for IOCTLs of type _IOR() and 255_IOWR(). 256 257The returned flag combination determines how the IOCTL is formatted. The 258following possible return values result in the following output formats, again 259with the "|" indicating the call split, "out" being the IOCTL argument contents 260printed before the IOCTL call, and "in" being the IOCTL argument printed after 261the IOCTL call: 262 263 0: ioctl(3, IOCFOO, &0xaddress) = |0 264 IF_OUT: ioctl(3, IOCFOO, {out}) = |0 265 IF_OUT|IF_IN: ioctl(3, IOCFOO, {out}) = |0 {in} 266 IF_IN: ioctl(3, IOCFOO, |{in}) = 0 267 268Both IF_ flags are optional, mainly because it is not always needed to print 269both sides for an _IOWR() request. However, using the wrong flag (e.g., IF_OUT 270for an _IOR() request, which simply makes no sense) will trigger an assert. 271Also, the function should basically never return 0 for an IOCTL it recognizes. 272Again, for IOCTLs of type _IO(), which have no argument, the argument function 273is not called at all. 274 275Now the important part. For each flag that is returned on the initial call to 276the argument function, the argument function will be called again, this time to 277perform actual printing of the argument. For these subsequent calls, "ptr" 278will point to the argument data which has been copied to the local address 279space, and "dir" will contain one of the returned flags (that is, either IF_OUT 280or IF_IN) to indicate whether the function is called before or after the IOCTL 281call. As should now be obvious, if the first call returned IF_OUT | IF_IN, the 282function will be called again with "dir" set to IF_OUT, and if the IOCTL call 283did not fail, once more (for the third time), now with "dir" set to IF_IN. 284 285For these calls with an actual "ptr" value and a direction, the function should 286indeed print the argument as appropriate, using "proc" as process pointer for 287use in calls to the printing functions. The general approach is to print non- 288structure arguments as single values with no field name, and structure 289arguments by printing its fields with their field names. The main code (in 290ioctl.c) ensures that the output is enclosed in curly brackets, thus making the 291output look like a structure anyway. 292 293For these subsequent calls, the argument function's return value should be 294IF_ALL if all parts of the IOCTL argument have been printed, or 0 otherwise. 295In the latter case, the main code will add a final ".." field to indicate to 296the user that not all parts of the argument have been printed, very much like 297the "all" parameter of put_close_struct. 298 299If no name can be found for the IOCTL request code, the argument will simply be 300printed as a pointer. The same happens in error cases, for example if copying 301in the IOCTL data resulted in an error. 302 303There is no support for dealing with multiple IOCTLs with the exact same 304request code--something that should not, but sadly does, occur in practice. 305For now, the preferred approach would be to implement only support for the 306IOCTL that is most likely to be found in practice, and possibly to put a horse 307head in the bed of whoever introduced the duplicate request code. 308 309 310INTERNALS: MULTIPROCESS OUTPUT AND PREEMPTION 311 312Things get interesting when multiple processes are traced at once. Due to the 313nature of process scheduling, system calls may end up being preempted between 314the call-enter and call-leave phases. This means that the output of a system 315call has to be suspended to give way to an event from another traced process. 316Such preemption may occur with literally all calls; not just "blocking" calls. 317 318The tracer goes through some lengths to aid the user in following the output in 319the light of preemtion. The most important aspect is that the output of the 320call-enter phase is recorded, so that in the case of preemption, the call-leave 321phase can start by replaying the record. As a result, the user gets to see the 322whole system call on a single line, instead of just the second half. Such 323system call resumptions are marked with a "*" in their prefix, to show that 324the call was not just entered. The output therefore looks like this: 325 326 2| syscall() = <..> 327 3| othercall() = 0 328 2|*syscall() = 0 329 330Signals that arrive during a call will cause a resumption of the call as well. 331As a result, a call may be resumed multiple times: 332 333 2| syscall() = <..> 334 3| othercall() = 0 335 2|*syscall() = ** SIGUSR1 ** ** SIGUSR2 ** <..> 336 3| othercall() = -1 [EBUSY] 337 2|*syscall() = ** SIGHUP ** <..> 338 3| othercall() = 0 339 2|*syscall() = 0 340 341This entire scenario shows one single system call from process 2. 342 343In the current implementation, the output that should be recorded and/or cause 344the "<..>" preemption marker, as well as the cases where the recorded text must 345be replayed, are marked by the code explicitly. Replay takes place in three 346cases: upon the call-leave event (obviously), upon receiving a signal (as shown 347above), and when it is required that a suspended no-return call is shown as 348completed before continuing with other output. The last case applies to exit() 349and execve(), and both are documented in the code quite extensively. Generally 350speaking, in all output lines where no recording or replay actions are 351performed, the recording will not be replayed but also not removed. This 352allows for intermediate lines for that process in the output. Practically 353speaking, future support for job control could even print when a process get 354stopped and continued, for that process, while preempting the output for the 355ongoing system call for that same process. 356 357It is possible that the output of the call-enter phase exhausts the recording 358buffer for its process. In this case, a new, shorter text is generated upon 359process resumption. There are many other aspects to proper output formatting 360in the light of preemption, but most of them should be documented as part of 361the code reasonably well. 362