1DataFlowSanitizer Design Document 2================================= 3 4This document sets out the design for DataFlowSanitizer, a general 5dynamic data flow analysis. Unlike other Sanitizer tools, this tool is 6not designed to detect a specific class of bugs on its own. Instead, 7it provides a generic dynamic data flow analysis framework to be used 8by clients to help detect application-specific issues within their 9own code. 10 11DataFlowSanitizer is a program instrumentation which can associate 12a number of taint labels with any data stored in any memory region 13accessible by the program. The analysis is dynamic, which means that 14it operates on a running program, and tracks how the labels propagate 15through that program. 16 17Use Cases 18--------- 19 20This instrumentation can be used as a tool to help monitor how data 21flows from a program's inputs (sources) to its outputs (sinks). 22This has applications from a privacy/security perspective in that 23one can audit how a sensitive data item is used within a program and 24ensure it isn't exiting the program anywhere it shouldn't be. 25 26Interface 27--------- 28 29A number of functions are provided which will attach taint labels to 30memory regions and extract the set of labels associated with a 31specific memory region. These functions are declared in the header 32file ``sanitizer/dfsan_interface.h``. 33 34.. code-block:: c 35 36 /// Sets the label for each address in [addr,addr+size) to \c label. 37 void dfsan_set_label(dfsan_label label, void *addr, size_t size); 38 39 /// Sets the label for each address in [addr,addr+size) to the union of the 40 /// current label for that address and \c label. 41 void dfsan_add_label(dfsan_label label, void *addr, size_t size); 42 43 /// Retrieves the label associated with the given data. 44 /// 45 /// The type of 'data' is arbitrary. The function accepts a value of any type, 46 /// which can be truncated or extended (implicitly or explicitly) as necessary. 47 /// The truncation/extension operations will preserve the label of the original 48 /// value. 49 dfsan_label dfsan_get_label(long data); 50 51 /// Retrieves the label associated with the data at the given address. 52 dfsan_label dfsan_read_label(const void *addr, size_t size); 53 54 /// Returns whether the given label contains the label elem. 55 int dfsan_has_label(dfsan_label label, dfsan_label elem); 56 57 /// Computes the union of \c l1 and \c l2, resulting in a union label. 58 dfsan_label dfsan_union(dfsan_label l1, dfsan_label l2); 59 60 /// Flushes the DFSan shadow, i.e. forgets about all labels currently associated 61 /// with the application memory. Use this call to start over the taint tracking 62 /// within the same process. 63 /// 64 /// Note: If another thread is working with tainted data during the flush, that 65 /// taint could still be written to shadow after the flush. 66 void dfsan_flush(void); 67 68The following functions are provided to check origin tracking status and results. 69 70.. code-block:: c 71 72 /// Retrieves the immediate origin associated with the given data. The returned 73 /// origin may point to another origin. 74 /// 75 /// The type of 'data' is arbitrary. The function accepts a value of any type, 76 /// which can be truncated or extended (implicitly or explicitly) as necessary. 77 /// The truncation/extension operations will preserve the label of the original 78 /// value. 79 dfsan_origin dfsan_get_origin(long data); 80 81 /// Retrieves the very first origin associated with the data at the given 82 /// address. 83 dfsan_origin dfsan_get_init_origin(const void *addr); 84 85 /// Prints the origin trace of the label at the address `addr` to stderr. It also 86 /// prints description at the beginning of the trace. If origin tracking is not 87 /// on, or the address is not labeled, it prints nothing. 88 void dfsan_print_origin_trace(const void *addr, const char *description); 89 90 /// Prints the origin trace of the label at the address `addr` to a pre-allocated 91 /// output buffer. If origin tracking is not on, or the address is` 92 /// not labeled, it prints nothing. 93 /// 94 /// `addr` is the tainted memory address whose origin we are printing. 95 /// `description` is a description printed at the beginning of the trace. 96 /// `out_buf` is the output buffer to write the results to. `out_buf_size` is 97 /// the size of `out_buf`. The function returns the number of symbols that 98 /// should have been written to `out_buf` (not including trailing null byte '\0'). 99 /// Thus, the string is truncated iff return value is not less than `out_buf_size`. 100 size_t dfsan_sprint_origin_trace(const void *addr, const char *description, 101 char *out_buf, size_t out_buf_size); 102 103 /// Returns the value of `-dfsan-track-origins`. 104 int dfsan_get_track_origins(void); 105 106The following functions are provided to register hooks called by custom wrappers. 107 108.. code-block:: c 109 110 /// Sets a callback to be invoked on calls to `write`. The callback is invoked 111 /// before the write is done. The write is not guaranteed to succeed when the 112 /// callback executes. Pass in NULL to remove any callback. 113 typedef void (*dfsan_write_callback_t)(int fd, const void *buf, size_t count); 114 void dfsan_set_write_callback(dfsan_write_callback_t labeled_write_callback); 115 116 /// Callbacks to be invoked on calls to `memcmp` or `strncmp`. 117 void dfsan_weak_hook_memcmp(void *caller_pc, const void *s1, const void *s2, 118 size_t n, dfsan_label s1_label, 119 dfsan_label s2_label, dfsan_label n_label); 120 void dfsan_weak_hook_strncmp(void *caller_pc, const char *s1, const char *s2, 121 size_t n, dfsan_label s1_label, 122 dfsan_label s2_label, dfsan_label n_label); 123 124Taint label representation 125-------------------------- 126 127We use an 8-bit unsigned integer for the representation of a 128label. The label identifier 0 is special, and means that the data item 129is unlabelled. This is optimizing for low CPU and code size overhead 130of the instrumentation. When a label union operation is requested at a 131join point (any arithmetic or logical operation with two or more 132operands, such as addition), we can simply OR the two labels in O(1). 133 134Users are responsible for managing the 8 integer labels (i.e., keeping 135track of what labels they have used so far, picking one that is yet 136unused, etc). 137 138Origin tracking trace representation 139------------------------------------ 140 141An origin tracking trace is a list of chains. Each chain has a stack trace 142where the DFSan runtime records a label propagation, and a pointer to its 143previous chain. The very first chain does not point to any chain. 144 145Every four 4-bytes aligned application bytes share a 4-byte origin trace ID. A 1464-byte origin trace ID contains a 4-bit depth and a 28-bit hash ID of a chain. 147 148A chain ID is calculated as a hash from a chain structure. A chain structure 149contains a stack ID and the previous chain ID. The chain head has a zero 150previous chain ID. A stack ID is a hash from a stack trace. The 4-bit depth 151limits the maximal length of a path. The environment variable ``origin_history_size`` 152can set the depth limit. Non-positive values mean unlimited. Its default value 153is 16. When reaching the limit, origin tracking ignores following propagation 154chains. 155 156The first chain of a trace starts by `dfsan_set_label` with non-zero labels. A 157new chain is appended at the end of a trace at stores or memory transfers when 158``-dfsan-track-origins`` is 1. Memory transfers include LLVM memory transfer 159instructions, glibc memcpy and memmove. When ``-dfsan-track-origins`` is 2, a 160new chain is also appended at loads. 161 162Other instructions do not create new chains, but simply propagate origin trace 163IDs. If an instruction has more than one operands with non-zero labels, the origin 164treace ID of the last operand with non-zero label is propagated to the result of 165the instruction. 166 167Memory layout and label management 168---------------------------------- 169 170The following is the memory layout for Linux/x86\_64: 171 172+---------------+---------------+--------------------+ 173| Start | End | Use | 174+===============+===============+====================+ 175| 0x700000000000|0x800000000000 | application 3 | 176+---------------+---------------+--------------------+ 177| 0x610000000000|0x700000000000 | unused | 178+---------------+---------------+--------------------+ 179| 0x600000000000|0x610000000000 | origin 1 | 180+---------------+---------------+--------------------+ 181| 0x510000000000|0x600000000000 | application 2 | 182+---------------+---------------+--------------------+ 183| 0x500000000000|0x510000000000 | shadow 1 | 184+---------------+---------------+--------------------+ 185| 0x400000000000|0x500000000000 | unused | 186+---------------+---------------+--------------------+ 187| 0x300000000000|0x400000000000 | origin 3 | 188+---------------+---------------+--------------------+ 189| 0x200000000000|0x300000000000 | shadow 3 | 190+---------------+---------------+--------------------+ 191| 0x110000000000|0x200000000000 | origin 2 | 192+---------------+---------------+--------------------+ 193| 0x100000000000|0x110000000000 | unused | 194+---------------+---------------+--------------------+ 195| 0x010000000000|0x100000000000 | shadow 2 | 196+---------------+---------------+--------------------+ 197| 0x000000000000|0x010000000000 | application 1 | 198+---------------+---------------+--------------------+ 199 200Each byte of application memory corresponds to a single byte of shadow 201memory, which is used to store its taint label. We map memory, shadow, and 202origin regions to each other with these masks and offsets: 203 204* shadow_addr = memory_addr ^ 0x500000000000 205 206* origin_addr = shadow_addr + 0x100000000000 207 208As for LLVM SSA registers, we have not found it necessary to associate a label 209with each byte or bit of data, as some other tools do. Instead, labels are 210associated directly with registers. Loads will result in a union of 211all shadow labels corresponding to bytes loaded, and stores will 212result in a copy of the label of the stored value to the shadow of all 213bytes stored to. 214 215Propagating labels through arguments 216------------------------------------ 217 218In order to propagate labels through function arguments and return values, 219DataFlowSanitizer changes the ABI of each function in the translation unit. 220There are currently two supported ABIs: 221 222* Args -- Argument and return value labels are passed through additional 223 arguments and by modifying the return type. 224 225* TLS -- Argument and return value labels are passed through TLS variables 226 ``__dfsan_arg_tls`` and ``__dfsan_retval_tls``. 227 228The main advantage of the TLS ABI is that it is more tolerant of ABI mismatches 229(TLS storage is not shared with any other form of storage, whereas extra 230arguments may be stored in registers which under the native ABI are not used 231for parameter passing and thus could contain arbitrary values). On the other 232hand the args ABI is more efficient and allows ABI mismatches to be more easily 233identified by checking for nonzero labels in nominally unlabelled programs. 234 235Implementing the ABI list 236------------------------- 237 238The `ABI list <DataFlowSanitizer.html#abi-list>`_ provides a list of functions 239which conform to the native ABI, each of which is callable from an instrumented 240program. This is implemented by replacing each reference to a native ABI 241function with a reference to a function which uses the instrumented ABI. 242Such functions are automatically-generated wrappers for the native functions. 243For example, given the ABI list example provided in the user manual, the 244following wrappers will be generated under the args ABI: 245 246.. code-block:: llvm 247 248 define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) { 249 entry: 250 %2 = call i8* @malloc(i64 %0) 251 %3 = insertvalue { i8*, i16 } undef, i8* %2, 0 252 %4 = insertvalue { i8*, i16 } %3, i16 0, 1 253 ret { i8*, i16 } %4 254 } 255 256 define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) { 257 entry: 258 %2 = call i32 @tolower(i32 %0) 259 %3 = insertvalue { i32, i16 } undef, i32 %2, 0 260 %4 = insertvalue { i32, i16 } %3, i16 %1, 1 261 ret { i32, i16 } %4 262 } 263 264 define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) { 265 entry: 266 %labelreturn = alloca i16 267 %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn) 268 %7 = load i16* %labelreturn 269 %8 = insertvalue { i8*, i16 } undef, i8* %6, 0 270 %9 = insertvalue { i8*, i16 } %8, i16 %7, 1 271 ret { i8*, i16 } %9 272 } 273 274As an optimization, direct calls to native ABI functions will call the 275native ABI function directly and the pass will compute the appropriate label 276internally. This has the advantage of reducing the number of union operations 277required when the return value label is known to be zero (i.e. ``discard`` 278functions, or ``functional`` functions with known unlabelled arguments). 279 280Checking ABI Consistency 281------------------------ 282 283DFSan changes the ABI of each function in the module. This makes it possible 284for a function with the native ABI to be called with the instrumented ABI, 285or vice versa, thus possibly invoking undefined behavior. A simple way 286of statically detecting instances of this problem is to append the suffix 287".dfsan" to the name of each instrumented-ABI function. 288 289This will not catch every such problem; in particular function pointers passed 290across the instrumented-native barrier cannot be used on the other side. 291These problems could potentially be caught dynamically. 292