=================
DataFlowSanitizer
=================

.. toctree::
   :hidden:

   DataFlowSanitizerDesign

.. contents::
   :local:

Introduction
============

DataFlowSanitizer is a generalised dynamic data flow analysis.

Unlike other Sanitizer tools, this tool is not designed to detect a
specific class of bugs on its own.  Instead, it provides a generic
dynamic data flow analysis framework to be used by clients to help
detect application-specific issues within their own code.

How to build libc++ with DFSan
==============================

DFSan requires either all of your code to be instrumented or for uninstrumented
functions to be listed as ``uninstrumented`` in the `ABI list`_.

If you'd like to have instrumented libc++ functions, then you need to build it
with DFSan instrumentation from source. Here is an example of how to build
libc++ and the libc++ ABI with data flow sanitizer instrumentation.

.. code-block:: console

  cd libcxx-build

  # An example using ninja
  cmake -GNinja path/to/llvm-project/llvm \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DLLVM_USE_SANITIZER="DataFlow" \
    -DLLVM_ENABLE_LIBCXX=ON \
    -DLLVM_ENABLE_PROJECTS="libcxx;libcxxabi"

  ninja cxx cxxabi

Note: Ensure you are building with a sufficiently new version of Clang.

Usage
=====

With no program changes, applying DataFlowSanitizer to a program
will not alter its behavior.  To use DataFlowSanitizer, the program
uses API functions to apply tags to data to cause it to be tracked, and to
check the tag of a specific data item.  DataFlowSanitizer manages
the propagation of tags through the program according to its data flow.

The APIs are defined in the header file ``sanitizer/dfsan_interface.h``.
For further information about each function, please refer to the header
file.

.. _ABI list:

ABI List
--------

DataFlowSanitizer uses a list of functions known as an ABI list to decide
whether a call to a specific function should use the operating system's native
ABI or whether it should use a variant of this ABI that also propagates labels
through function parameters and return values.  The ABI list file also controls
how labels are propagated in the former case.  DataFlowSanitizer comes with a
default ABI list which is intended to eventually cover the glibc library on
Linux, but it may become necessary for users to extend the ABI list in cases
where a particular library or function cannot be instrumented (e.g. because
it is implemented in assembly or another language which DataFlowSanitizer does
not support) or a function is called from a library or function which cannot
be instrumented.

DataFlowSanitizer's ABI list file is a :doc:`SanitizerSpecialCaseList`.
The pass treats every function in the ``uninstrumented`` category in the
ABI list file as conforming to the native ABI.  Unless the ABI list contains
additional categories for those functions, a call to one of those functions
will produce a warning message, as the labelling behavior of the function
is unknown.  The other supported categories are ``discard``, ``functional``
and ``custom``.

* ``discard`` -- To the extent that this function writes to (user-accessible)
  memory, it also updates labels in shadow memory (this condition is trivially
  satisfied for functions which do not write to user-accessible memory).  Its
  return value is unlabelled.
* ``functional`` -- Like ``discard``, except that the label of its return value
  is the union of the labels of its arguments.
* ``custom`` -- Instead of calling the function, a custom wrapper ``__dfsw_F``
  is called, where ``F`` is the name of the function.  This function may wrap
  the original function or provide its own implementation.  This category is
  generally used for uninstrumentable functions which write to user-accessible
  memory or which have more complex label propagation behavior.  The signature
  of ``__dfsw_F`` is based on that of ``F`` with each argument having a
  label of type ``dfsan_label`` appended to the argument list.  If ``F``
  is of non-void return type a final argument of type ``dfsan_label *``
  is appended to which the custom function can store the label for the
  return value.  For example:

.. code-block:: c++

  void f(int x);
  void __dfsw_f(int x, dfsan_label x_label);

  void *memcpy(void *dest, const void *src, size_t n);
  void *__dfsw_memcpy(void *dest, const void *src, size_t n,
                      dfsan_label dest_label, dfsan_label src_label,
                      dfsan_label n_label, dfsan_label *ret_label);

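As a rough illustration of the ``custom`` category, the sketch below shows what
a wrapper body might look like. The function ``add_ints``, its ABI list entry
and the ``demo_add_wrapper`` helper are hypothetical names invented for this
example, and ``dfsan_label`` is declared locally as a stand-in so that the
snippet compiles without DataFlowSanitizer; real code would include
``sanitizer/dfsan_interface.h`` instead. The wrapper follows the signature rule
described above and reports the union of its argument labels as the label of
the return value.

.. code-block:: c++

  #include <assert.h>
  #include <stdint.h>

  // Stand-in for the type from sanitizer/dfsan_interface.h (assumption: an
  // 8-bit label whose union is a bitwise OR).
  typedef uint8_t dfsan_label;

  // Hypothetical uninstrumented function, assumed to be listed as:
  //   fun:add_ints=uninstrumented
  //   fun:add_ints=custom
  extern "C" int add_ints(int a, int b) { return a + b; }

  // Custom wrapper: the original arguments, then one label per argument,
  // then a pointer through which the return value's label is reported.
  extern "C" int __dfsw_add_ints(int a, int b, dfsan_label a_label,
                                 dfsan_label b_label, dfsan_label *ret_label) {
    *ret_label = static_cast<dfsan_label>(a_label | b_label);
    return add_ints(a, b);
  }

  // Calls the wrapper directly with explicit labels, for illustration only.
  inline void demo_add_wrapper() {
    dfsan_label ret_label = 0;
    int sum = __dfsw_add_ints(2, 3, /*a_label=*/1, /*b_label=*/4, &ret_label);
    assert(sum == 5);
    assert(ret_label == 5);  // 1 | 4
  }

In an instrumented build the compiler, not the program, would emit the call to
``__dfsw_add_ints`` wherever ``add_ints`` is called.
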
If a function defined in the translation unit being compiled belongs to the
``uninstrumented`` category, it will be compiled so as to conform to the
native ABI.  Its arguments will be assumed to be unlabelled, but it will
propagate labels in shadow memory.

For example:

.. code-block:: none

  # main is called by the C runtime using the native ABI.
  fun:main=uninstrumented
  fun:main=discard

  # malloc only writes to its internal data structures, not user-accessible memory.
  fun:malloc=uninstrumented
  fun:malloc=discard

  # tolower is a pure function.
  fun:tolower=uninstrumented
  fun:tolower=functional

  # memcpy needs to copy the shadow from the source to the destination region.
  # This is done in a custom function.
  fun:memcpy=uninstrumented
  fun:memcpy=custom

For instrumented functions, the ABI list supports a ``force_zero_labels``
category, which will make all stores and return values set zero labels.
Functions should never be labelled with both ``force_zero_labels``
and ``uninstrumented`` or any of the uninstrumented wrapper kinds.

For example:

.. code-block:: none

  # e.g. void writes_data(char* out_buf, int out_buf_len) {...}
  # Applying force_zero_labels will force out_buf shadow to zero.
  fun:writes_data=force_zero_labels


Compilation Flags
-----------------

* ``-dfsan-abilist`` -- The additional ABI list files that control how shadow
  parameters are passed. File names are separated by commas.
* ``-dfsan-combine-pointer-labels-on-load`` -- Controls whether to include or
  ignore the labels of pointers in load instructions. Its default value is true.
  For example:

.. code-block:: c++

  v = *p;

If the flag is true, the label of ``v`` is the union of the label of ``p`` and
the label of ``*p``. If the flag is false, the label of ``v`` is the label of
just ``*p``.

* ``-dfsan-combine-pointer-labels-on-store`` -- Controls whether to include or
  ignore the labels of pointers in store instructions. Its default value is
  false. For example:

.. code-block:: c++

  *p = v;

If the flag is true, the label of ``*p`` is the union of the label of ``p`` and
the label of ``v``. If the flag is false, the label of ``*p`` is the label of
just ``v``.

* ``-dfsan-combine-offset-labels-on-gep`` -- Controls whether to propagate
  labels of offsets in GEP instructions. Its default value is true. For example:

.. code-block:: c++

  p += i;

If the flag is true, the label of ``p`` is the union of the label of ``p`` and
the label of ``i``. If the flag is false, the label of ``p`` is unchanged.

* ``-dfsan-track-select-control-flow`` -- Controls whether to track the control
  flow of select instructions. Its default value is true. For example:

.. code-block:: c++

  v = b ? v1 : v2;

If the flag is true, the label of ``v`` is the union of the labels of ``b``,
``v1`` and ``v2``.  If the flag is false, the label of ``v`` is the union of the
labels of just ``v1`` and ``v2``.

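The label rules for the four propagation flags above can be summarised in a
small model. The code below is only a sketch of the documented semantics, not
the instrumentation pass itself: the ``PropagationFlags`` struct, its field
names and the helper functions are hypothetical names introduced for this
example, and labels are modelled as 8-bit masks whose union is a bitwise OR.

.. code-block:: c++

  #include <assert.h>
  #include <stdint.h>

  typedef uint8_t Label;  // stand-in for dfsan_label

  // Hypothetical mirror of the flags above, with their documented defaults.
  struct PropagationFlags {
    bool combine_pointer_labels_on_load = true;
    bool combine_pointer_labels_on_store = false;
    bool combine_offset_labels_on_gep = true;
    bool track_select_control_flow = true;
  };

  inline Label label_union(Label a, Label b) {
    return static_cast<Label>(a | b);
  }

  // v = *p;  returns the label of v.
  inline Label load_label(const PropagationFlags &f, Label p, Label pointee) {
    return f.combine_pointer_labels_on_load ? label_union(p, pointee) : pointee;
  }

  // *p = v;  returns the label stored for *p.
  inline Label store_label(const PropagationFlags &f, Label p, Label v) {
    return f.combine_pointer_labels_on_store ? label_union(p, v) : v;
  }

  // p += i;  returns the new label of p.
  inline Label gep_label(const PropagationFlags &f, Label p, Label i) {
    return f.combine_offset_labels_on_gep ? label_union(p, i) : p;
  }

  // v = b ? v1 : v2;  returns the label of v.
  inline Label select_label(const PropagationFlags &f, Label b, Label v1,
                            Label v2) {
    Label values = label_union(v1, v2);
    return f.track_select_control_flow ? label_union(b, values) : values;
  }

  // Exercises the model with the default flag values.
  inline void demo_propagation_defaults() {
    PropagationFlags f;
    assert(load_label(f, 1, 2) == 3);       // pointer label included
    assert(store_label(f, 1, 2) == 2);      // pointer label ignored
    assert(gep_label(f, 1, 4) == 5);        // offset label included
    assert(select_label(f, 1, 2, 4) == 7);  // condition label included
  }
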
* ``-dfsan-event-callbacks`` -- An experimental feature that inserts callbacks for
  certain data events. Currently callbacks are only inserted for loads, stores,
  memory transfers (i.e. memcpy and memmove), and comparisons. Its default value
  is false. If this flag is set to true, a user must provide definitions for the
  following callback functions:

.. code-block:: c++

  void __dfsan_load_callback(dfsan_label Label, void* Addr);
  void __dfsan_store_callback(dfsan_label Label, void* Addr);
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len);
  void __dfsan_cmp_callback(dfsan_label CombinedLabel);

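One possible set of definitions for these callbacks simply counts events, as
sketched below. The ``EventCounts`` struct, the ``g_counts`` variable and the
``demo_event_counters`` helper are hypothetical, and ``dfsan_label`` is
declared locally as a stand-in for the type from
``sanitizer/dfsan_interface.h`` so that the snippet compiles on its own. The
callbacks are given C linkage on the assumption that instrumented code refers
to them by their unmangled names.

.. code-block:: c++

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  typedef uint8_t dfsan_label;  // stand-in; real code includes the DFSan header

  // Hypothetical bookkeeping for this example.
  struct EventCounts {
    unsigned long loads = 0;
    unsigned long stores = 0;
    unsigned long transfers = 0;
    unsigned long compares = 0;
  };
  static EventCounts g_counts;

  extern "C" {
  // Loads, stores and comparisons are counted only when a label is involved.
  void __dfsan_load_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_counts.loads;
  }
  void __dfsan_store_callback(dfsan_label Label, void *Addr) {
    (void)Addr;
    if (Label != 0) ++g_counts.stores;
  }
  // Every memory transfer is counted, whatever its labels.
  void __dfsan_mem_transfer_callback(dfsan_label *Start, size_t Len) {
    (void)Start;
    (void)Len;
    ++g_counts.transfers;
  }
  void __dfsan_cmp_callback(dfsan_label CombinedLabel) {
    if (CombinedLabel != 0) ++g_counts.compares;
  }
  }  // extern "C"

  // Invokes the callbacks by hand to show the bookkeeping.
  inline void demo_event_counters() {
    int x = 0;
    dfsan_label shadow[4] = {0, 0, 0, 0};
    __dfsan_load_callback(1, &x);   // labelled load: counted
    __dfsan_load_callback(0, &x);   // unlabelled load: ignored
    __dfsan_store_callback(2, &x);
    __dfsan_mem_transfer_callback(shadow, 4);
    __dfsan_cmp_callback(3);
    assert(g_counts.loads == 1 && g_counts.stores == 1);
    assert(g_counts.transfers == 1 && g_counts.compares == 1);
  }
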
* ``-dfsan-track-origins`` -- Controls how to track origins. When its value is
  0, the runtime does not track origins. When its value is 1, the runtime tracks
  origins at memory store operations. When its value is 2, the runtime tracks
  origins at memory load and store operations. Its default value is 0.

* ``-dfsan-instrument-with-call-threshold`` -- If a function being instrumented
  requires more than this number of origin stores, use callbacks instead of
  inline checks (-1 means never use callbacks). Its default value is 3500.

Environment Variables
---------------------

The following runtime options are passed through the ``DFSAN_OPTIONS``
environment variable.

* ``warn_unimplemented`` -- Whether to warn on unimplemented functions. Its
  default value is false.
* ``strict_data_dependencies`` -- Whether to propagate labels only when there is
  an explicit, obvious data dependency (e.g., when comparing strings, ignore the
  fact that the output of the comparison might be implicitly data-dependent on
  the content of the strings). This applies only to functions in the ``custom``
  category of the ABI list. Its default value is true.
* ``origin_history_size`` -- The limit of origin chain length. Non-positive values
  mean unlimited. Its default value is 16.
* ``origin_history_per_stack_limit`` -- The limit of an origin node's reference
  count. Non-positive values mean unlimited. Its default value is 20000.
* ``store_context_size`` -- The depth limit of origin tracking stack traces. Its
  default value is 20.
* ``zero_in_malloc`` -- Whether to zero shadow space of newly allocated memory.
  Its default value is true.
* ``zero_in_free`` -- Whether to zero shadow space of deallocated memory. Its
  default value is true.

Example
=======

DataFlowSanitizer supports up to 8 labels, to achieve low CPU and code
size overhead. Base labels are simply 8-bit unsigned integers that are
powers of 2 (i.e. 1, 2, 4, 8, ..., 128), and union labels are created
by ORing base labels.

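This encoding can be spelled out in a few lines. The helpers below are a
hypothetical model written for illustration and are not part of the
DataFlowSanitizer API: ``base_label`` builds the n-th base label,
``union_labels`` ORs two labels together, and ``has_label`` performs the same
kind of bitwise membership test that the program in this section relies on.

.. code-block:: c++

  #include <assert.h>
  #include <stdint.h>

  typedef uint8_t Label;  // stand-in for dfsan_label

  // n-th base label for n in [0, 8): 1, 2, 4, ..., 128.
  constexpr Label base_label(unsigned n) {
    return static_cast<Label>(1u << n);
  }

  // A union label is the bitwise OR of its members.
  constexpr Label union_labels(Label a, Label b) {
    return static_cast<Label>(a | b);
  }

  // True if every bit of elem is present in label.
  constexpr bool has_label(Label label, Label elem) {
    return (label & elem) == elem;
  }

  static_assert(base_label(7) == 128, "the eighth base label is 128");

  // Builds the union of the first two base labels and checks membership.
  inline void demo_label_encoding() {
    Label ij = union_labels(base_label(0), base_label(1));
    assert(ij == 3);
    assert(has_label(ij, base_label(0)));
    assert(!has_label(ij, base_label(2)));
  }
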
The following program demonstrates label propagation by checking that
the correct labels are propagated.

.. code-block:: c++

  #include <sanitizer/dfsan_interface.h>
  #include <assert.h>

  int main(void) {
    int i = 100;
    int j = 200;
    int k = 300;
    dfsan_label i_label = 1;
    dfsan_label j_label = 2;
    dfsan_label k_label = 4;
    dfsan_set_label(i_label, &i, sizeof(i));
    dfsan_set_label(j_label, &j, sizeof(j));
    dfsan_set_label(k_label, &k, sizeof(k));

    dfsan_label ij_label = dfsan_get_label(i + j);

    assert(ij_label & i_label);  // ij_label has i_label
    assert(ij_label & j_label);  // ij_label has j_label
    assert(!(ij_label & k_label));  // ij_label doesn't have k_label
    assert(ij_label == 3);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ij_label, i_label));
    assert(dfsan_has_label(ij_label, j_label));
    assert(!dfsan_has_label(ij_label, k_label));

    dfsan_label ijk_label = dfsan_get_label(i + j + k);

    assert(ijk_label & i_label);  // ijk_label has i_label
    assert(ijk_label & j_label);  // ijk_label has j_label
    assert(ijk_label & k_label);  // ijk_label has k_label
    assert(ijk_label == 7);  // Verifies all of the above

    // Or, equivalently:
    assert(dfsan_has_label(ijk_label, i_label));
    assert(dfsan_has_label(ijk_label, j_label));
    assert(dfsan_has_label(ijk_label, k_label));

    return 0;
  }

Origin Tracking
===============

DataFlowSanitizer can track origins of labeled values. This feature is enabled by
``-mllvm -dfsan-track-origins=1``. For example:

.. code-block:: console

    % cat test.cc
    #include <sanitizer/dfsan_interface.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
      int i = 0;
      dfsan_set_label(1, &i, sizeof(i));
      int j = i + 1;
      dfsan_print_origin_trace(&j, "A flow from i to j");
      return 0;
    }

    % clang++ -fsanitize=dataflow -mllvm -dfsan-track-origins=1 -fno-omit-frame-pointer -g -O2 test.cc
    % ./a.out
    Taint value 0x1 (at 0x7ffd42bf415c) origin tracking (A flow from i to j)
    Origin value: 0x13900001, Taint value was stored to memory at
      #0 0x55676db85a62 in main test.cc:7:7
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

    Origin value: 0x9e00001, Taint value was created at
      #0 0x55676db85a08 in main test.cc:6:3
      #1 0x7f0083611bbc in __libc_start_main libc-start.c:285

With ``-mllvm -dfsan-track-origins=1``, DataFlowSanitizer collects only the
intermediate stores a labeled value went through. Origin tracking slows down
program execution by a factor of 2x on top of the usual DataFlowSanitizer
slowdown and increases memory overhead by 1x. With
``-mllvm -dfsan-track-origins=2``, DataFlowSanitizer also collects the
intermediate loads a labeled value went through. This mode slows down program
execution by a factor of 4x.

Current status
==============

DataFlowSanitizer is a work in progress, currently under development for
x86\_64 Linux.

Design
======

Please refer to the :doc:`design document<DataFlowSanitizerDesign>`.