# Buffers and dataflow

This page describes the dataflow in Perfetto when recording traces: the
buffering stages involved, how to size the buffers and how to debug data
losses.

## Concepts

Tracing in Perfetto is an asynchronous multiple-writer single-reader pipeline.
In many ways, its architecture is very similar to modern GPUs' command
buffers.

The design principles of the tracing dataflow are:

* The tracing fastpath is based on direct writes into a shared memory buffer.
* Highly optimized for low-overhead writing. NOT optimized for low-latency
  reading.
* Trace data is eventually committed into the central trace buffer by the end
  of the trace or when explicit flush requests are issued via the IPC channel.
* Producers are untrusted and should not be able to see each other's trace
  data, as that would leak sensitive information.

In the general case, there are two types of buffers involved in a trace. When
pulling data from the Linux kernel's ftrace infrastructure, a third stage of
buffering (one buffer per CPU) is involved:

![Buffers](/docs/images/buffers.png)

#### Tracing service's central buffers

These buffers (yellow, in the picture above) are defined by the user in the
`buffers` section of the [trace config](config.md). In the simplest cases,
one tracing session = one buffer, regardless of the number of data sources and
producers.

This is the place where the tracing data is ultimately kept, while in memory,
whether it comes from the kernel ftrace infrastructure, from some other data
source in `traced_probes` or from another userspace process using the
[Perfetto SDK](/docs/instrumentation/tracing-sdk.md).
At the end of the trace (or during the trace, if in [streaming mode]) these
buffers are written into the output trace file.

These buffers can contain a mixture of trace packets coming from different data
sources and even different producer processes. What-goes-where is defined in the
[buffers mapping section](config.md#dynamic-buffer-mapping) of the trace config.
Because of this, the tracing buffers are not shared across processes, to avoid
cross-talk and information leaks across producer processes.

#### Shared memory buffers

Each producer process has one memory buffer shared 1:1 with the tracing service
(blue, in the picture above), regardless of the number of data sources it hosts.
This buffer is a temporary staging buffer and has two purposes:

1. Zero-copy on the writer path. This buffer allows direct serialization of the
   tracing data from the writer fastpath into a memory region directly readable
   by the tracing service.

2. Decoupling producer writes from tracing service reads. The tracing service
   has the job of moving trace packets from the shared memory buffer (blue) into
   the central buffer (yellow) as fast as it can.
   The shared memory buffer hides the scheduling and response latencies of the
   tracing service, allowing the producer to keep writing without losing data
   when the tracing service is temporarily blocked.

#### Ftrace buffer

When the `linux.ftrace` data source is enabled, the kernel will have its own
per-CPU buffers. These are unavoidable because the kernel cannot write directly
into user-space buffers. The `traced_probes` process periodically reads those
buffers, converts the data into binary protos and follows the same dataflow as
userspace tracing. These buffers need to be just large enough to hold the data
produced between two ftrace read cycles
(`TraceConfig.FtraceConfig.drain_period_ms`).

## Life of a trace packet

Here is a summary of the dataflow of trace packets across buffers.
Consider the case of a producer process hosting two data sources writing packets
at different rates, both targeting the same central buffer.

1. When each data source starts writing, it grabs a free page of the shared
   memory buffer and directly serializes proto-encoded tracing data onto it.

2. When a page of the shared memory buffer is filled, the producer sends an
   async IPC to the service, asking it to copy the shared memory page just
   written. Then, the producer grabs the next free page in the shared memory
   buffer and keeps writing.

3. When the service receives the IPC, it copies the shared memory page into
   the central buffer and marks the shared memory buffer page as free again.
   Data sources within the producer can reuse that page at this point.

4. When the tracing session ends, the service sends a `Flush` request to all
   data sources. In reaction to this, data sources commit all outstanding
   shared memory pages, even if not completely full. The service copies these
   pages into its central buffer.

![Dataflow animation](/docs/images/dataflow.svg)
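
For reference, this is roughly what step 1 looks like from the producer's side
when using the Perfetto SDK. This is a minimal sketch: `MyDataSource` and the
payload written into the packet are purely illustrative, but `Trace()` and
`NewTracePacket()` are the SDK's standard writer API, which serializes directly
into the shared memory buffer.

```c++
#include <perfetto.h>

// Hypothetical data source, used only to illustrate the writer fastpath.
// (Registration via MyDataSource::Register() is omitted for brevity.)
class MyDataSource : public perfetto::DataSource<MyDataSource> {
 public:
  void OnSetup(const SetupArgs&) override {}
  void OnStart(const StartArgs&) override {}
  void OnStop(const StopArgs&) override {}
};

PERFETTO_DECLARE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);
PERFETTO_DEFINE_DATA_SOURCE_STATIC_MEMBERS(MyDataSource);

void EmitEvent() {
  // The lambda runs only while the data source is enabled. The packet is
  // proto-encoded straight into a page of the shared memory buffer; when the
  // page fills up, the SDK issues the async commit IPC described above.
  MyDataSource::Trace([](MyDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_timestamp(42);  // Illustrative payload.
  });
}
```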

## Buffer sizing

#### Central buffer sizing

The math for sizing the central buffer is quite straightforward: in the default
case of tracing without `write_into_file` (when the trace file is written only
at the end of the trace), the buffer needs to hold all the data written by the
various data sources for the duration of the trace.

The total length of the trace will be `(buffer size) / (aggregated write rate)`.
If all producers write at a combined rate of 2 MB/s, a 16 MB buffer will hold
~8 seconds of tracing data.

The write rate is highly dependent on the data sources configured and on the
activity of the system. 1-2 MB/s is a typical figure on Android traces with
scheduler tracing, but it can easily go up by one or more orders of magnitude if
chattier data sources are enabled (e.g., syscall or pagefault tracing).

When using [streaming mode] the buffer needs to be able to hold the data
produced between two consecutive `file_write_period_ms` periods (default: 5s).
For instance, if `file_write_period_ms = 5000` and the write data rate is 2 MB/s,
the central buffer needs to be at least 5 * 2 = 10 MB to avoid data losses.

#### Shared memory buffer sizing

The sizing of the shared memory buffer depends on:

* The scheduling characteristics of the underlying system, i.e. for how long the
  tracing service can be blocked on the scheduler queues. This is a function of
  the kernel configuration and the nice-ness level of the `traced` process.
* The max write rate of all data sources within a producer process.

Suppose that a producer writes at a max rate of 8 MB/s. If `traced` gets
blocked for 10 ms, the shared memory buffer needs to be at least
8 MB/s * 0.01 s = 80 KB to avoid losses.

Empirical measurements suggest that on most Android systems a shared memory
buffer size of 128-512 KB is good enough.

The default shared memory buffer size is 256 KB. When using the Perfetto Client
Library, this value can be tweaked by setting
`TracingInitArgs.shmem_size_hint_kb`.
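
For example, when initializing the SDK in-process, the hint can be set as
follows (a minimal sketch; the backend selection is just an example):

```c++
#include <perfetto.h>

void InitPerfetto() {
  perfetto::TracingInitArgs args;
  args.backends |= perfetto::kSystemBackend;  // Talk to the system traced.
  // Ask for a 512 KB shared memory buffer instead of the 256 KB default.
  // This is only a hint: the service can round or clamp the value.
  args.shmem_size_hint_kb = 512;
  perfetto::Tracing::Initialize(args);
}
```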

WARNING: if a data source writes very large trace packets in a single batch,
either the shared memory buffer needs to be big enough to handle that or
`BufferExhaustedPolicy::kStall` must be employed.

For instance, consider a data source that emits a 2 MB screenshot every 10s.
Its (simplified) code would look like:
```c++
for (;;) {
  ScreenshotDataSource::Trace([](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_bitmap(Grab2MBScreenshot());
  });
  std::this_thread::sleep_for(std::chrono::seconds(10));
}
```

Its average write rate is 2 MB / 10 s = 200 KB/s. However, the data source will
create bursts of 2 MB back-to-back without yielding; it is limited only by the
tracing serialization overhead. In practice, it will write the 2 MB buffer at
O(GB/s). If the shared memory buffer is < 2 MB, the tracing service is unlikely
to catch up at that rate and data losses will be experienced.

In a case like this, the options are:

* Increase the size of the shared memory buffer in the producer that hosts the
  data source.
* Split the write into chunks spaced by some delay (see the sketch after the
  code block below).
* Adopt `BufferExhaustedPolicy::kStall` when defining the data source:

```c++
class ScreenshotDataSource : public perfetto::DataSource<ScreenshotDataSource> {
 public:
  constexpr static BufferExhaustedPolicy kBufferExhaustedPolicy =
      BufferExhaustedPolicy::kStall;
 ...
};
```
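
A sketch of the chunking approach (option 2) is shown below. It is purely
illustrative: `set_bitmap_chunk` is a hypothetical field, not part of the real
`TracePacket` proto, and the trace consumer would need to reassemble the
pieces.

```c++
// Hypothetical sketch: emit the screenshot in smaller pieces, yielding between
// writes so the service has a chance to drain the shared memory buffer.
constexpr size_t kChunkSize = 64 * 1024;
std::string bitmap = Grab2MBScreenshot();  // Assumed to return raw bytes.
for (size_t off = 0; off < bitmap.size(); off += kChunkSize) {
  ScreenshotDataSource::Trace([&](ScreenshotDataSource::TraceContext ctx) {
    auto packet = ctx.NewTracePacket();
    packet->set_bitmap_chunk(bitmap.substr(off, kChunkSize));  // Illustrative.
  });
  std::this_thread::sleep_for(std::chrono::milliseconds(2));
}
```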

## Debugging data losses

#### Ftrace kernel buffer losses

When using the Linux kernel ftrace data source, losses can occur in the
kernel -> userspace path if the `traced_probes` process gets blocked for too
long.

At the trace proto level, losses in this path are recorded:

* In the [`FtraceCpuStats`][FtraceCpuStats] messages, emitted both at the
  beginning and end of the trace. If the `overrun` field is non-zero, data has
  been lost.
* In the [`FtraceEventBundle.lost_events`][FtraceEventBundle] field. This allows
  precisely locating the point in the trace where data loss happened.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name like 'ftrace_cpu_overrun_end'
name                 idx                  severity             source value
-------------------- -------------------- -------------------- ------ ------
ftrace_cpu_overrun_e                    0 data_loss            trace       0
ftrace_cpu_overrun_e                    1 data_loss            trace       0
ftrace_cpu_overrun_e                    2 data_loss            trace       0
ftrace_cpu_overrun_e                    3 data_loss            trace       0
ftrace_cpu_overrun_e                    4 data_loss            trace       0
ftrace_cpu_overrun_e                    5 data_loss            trace       0
ftrace_cpu_overrun_e                    6 data_loss            trace       0
ftrace_cpu_overrun_e                    7 data_loss            trace       0
```

These losses can be mitigated either by increasing
[`TraceConfig.FtraceConfig.buffer_size_kb`][FtraceConfig]
or by decreasing
[`TraceConfig.FtraceConfig.drain_period_ms`][FtraceConfig].

#### Shared memory losses

Tracing data can be lost in the shared memory buffer due to write bursts while
`traced` is blocked.

At the trace proto level, losses in this path are recorded:

* In [`TraceStats.BufferStats.trace_writer_packet_loss`][BufferStats].
* In [`TracePacket.previous_packet_dropped`][TracePacket].
  Caveat: the very first packet emitted by every data source is also marked as
  `previous_packet_dropped=true`. This is because the service has no way to
  tell whether that was truly the first packet or whether everything before it
  was lost.

At the TraceProcessor SQL level, this data is available in the `stats` table:

```sql
> select * from stats where name = 'traced_buf_trace_writer_packet_loss'
name                 idx                  severity             source    value
-------------------- -------------------- -------------------- --------- -----
traced_buf_trace_wri                    0 data_loss            trace         0
```

#### Central buffer losses

Data losses in the central buffer can happen for two different reasons:

1. When using `fill_policy: RING_BUFFER`, older tracing data is overwritten as
   the ring buffer wraps.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_overwritten`][BufferStats].

2. When using `fill_policy: DISCARD`, newer tracing data committed after the
   buffer is full is dropped.
   These losses are recorded, at the trace proto level, in
   [`TraceStats.BufferStats.chunks_discarded`][BufferStats].

At the TraceProcessor SQL level, this data is available in the `stats` table,
one entry per central buffer:

```sql
> select * from stats where name = 'traced_buf_chunks_overwritten' or name = 'traced_buf_chunks_discarded'
name                 idx                  severity             source  value
-------------------- -------------------- -------------------- ------- -----
traced_buf_chunks_di                    0 info                 trace       0
traced_buf_chunks_ov                    0 data_loss            trace       0
```

Summary: the best way to detect and debug data losses is to use Trace Processor
and issue the query
`select * from stats where severity = 'data_loss' and value != 0`.

## Atomicity and ordering guarantees

A "writer sequence" is the sequence of trace packets emitted by a given
TraceWriter from a data source. In almost all cases, one data source maps to
one or more TraceWriters. Data sources that support writing from multiple
threads typically create one TraceWriter per thread.

* Trace packets written from the same sequence are emitted in the trace file in
  the same order they have been written.

* There is no ordering guarantee between packets written by different sequences.
  Sequences are, by design, concurrent and more than one linearization is
  possible. The service does NOT respect global timestamp ordering across
  different sequences. Even if two packets from two sequences were emitted in
  global timestamp order, the service can still emit them in the trace file in
  the opposite order.

* Trace packets are atomic. If a trace packet is emitted in the trace file, it
  is guaranteed to contain all the fields that the data source wrote. If a
  trace packet is large and spans several shared memory buffer pages, the
  service will save it in the trace file only if it can observe that all
  fragments have been committed without gaps.

* If a trace packet is lost (e.g. because of wrapping in the ring buffer
  or losses in the shared memory buffer), no further trace packets will be
  emitted for that sequence until the packets before the gap are dropped as
  well.
  In other words, if the tracing service ends up in a situation where it sees
  packets 1, 2, 5, 6 for a sequence, it will only emit 1, 2. If, however, new
  packets (e.g., 7, 8, 9) are written and they overwrite 1, 2, clearing the gap,
  the full sequence 5, 6, 7, 8, 9 will be emitted.
  This behavior, however, doesn't hold when using [streaming mode] because,
  in that case, the periodic reads consume the packets in the buffer and
  clear the gaps, allowing the sequence to restart.

## Incremental state in trace packets

In many cases trace packets are fully independent of each other and can be
processed and interpreted without further context.
In some cases, however, they carry _incremental state_ and behave similarly
to inter-frame video encoding techniques, where some frames require the keyframe
to be present to be meaningfully decoded.

Here are two concrete examples:

1. Ftrace scheduling slices and /proc/pid scans. Ftrace scheduling events are
   keyed by thread id. In most cases users want to map those events back to the
   parent process (the thread-group). To solve this, when both the
   `linux.ftrace` and the `linux.process_stats` data sources are enabled in a
   Perfetto trace, the latter captures process<>thread associations from
   the /proc pseudo-filesystem whenever a new thread-id is seen by ftrace.
   A typical trace in this case looks as follows:
   ```
    # From process_stats's /proc scanner.
    pid: 610; ppid: 1; cmdline: "/system/bin/surfaceflinger"

    # From ftrace
    timestamp: 95054961131912; sched_wakeup: pid: 610;     target_cpu: 2;
    timestamp: 95054977528943; sched_switch: prev_pid: 610 prev_prio: 98
   ```
   The /proc entry is emitted only once per process to avoid bloating the size
   of the trace. In the absence of data losses this is enough to reconstruct
   all scheduling events for that pid. If, however, the process_stats packet
   gets dropped in the ring buffer, there will be no way left to work out the
   process details for all the other ftrace events that refer to that PID.

2. The [Track Event library](/docs/instrumentation/track-events) in the Perfetto
   SDK makes extensive use of string interning. Most strings and descriptors
   (e.g. details about processes / threads) are emitted only once and later
   referred to using a monotonic ID. If the descriptor packet is lost, it is not
   possible to make full sense of those events.

Trace Processor has a built-in mechanism that detects loss of interning data and
skips ingesting packets that refer to missing interned strings or descriptors.

When using tracing in ring-buffer mode, these types of losses are very likely to
happen.

There are two mitigations for this:

1. Issuing periodic invalidations of the incremental state via
   [`TraceConfig.IncrementalStateConfig.clear_period_ms`][IncrStateConfig].
   This causes the data sources that make use of incremental state to
   periodically drop the interning / process mapping tables and re-emit the
   descriptors / strings on their next occurrence. This mitigates the problem
   quite well in the context of ring-buffer traces, as long as
   `clear_period_ms` is an order of magnitude shorter than the estimated
   duration of trace data held in the central trace buffer.

2. Recording the incremental state into a dedicated buffer (via
   `DataSourceConfig.target_buffer`). This technique is quite commonly used in
   the ftrace + process_stats example mentioned above: the process_stats
   packets are recorded into a dedicated buffer that is less likely to wrap
   (ftrace events are much more frequent than descriptors for new processes).

## Flushes and windowed trace importing

Another common problem experienced in traces that involve multiple data sources
is the non-synchronous nature of trace commits. As explained in the
[Life of a trace packet](#life-of-a-trace-packet) section above, trace data is
committed only when a full memory page of the shared memory buffer is filled (or
when the tracing session ends). In most cases, if data sources produce events
at a regular cadence, pages are filled quite quickly and events are committed
into the central buffers within seconds.

In some other cases, however, a data source can emit events only sporadically.
Imagine the case of a data source that emits events when the display is turned
on/off. Such an infrequent event might end up being staged in the shared memory
buffer for a very long time and can end up being committed into the trace buffer
hours after it happened.

Another scenario where this can happen is when using ftrace and a particular CPU
is idle most of the time or gets hot-unplugged (ftrace uses per-CPU buffers). In
this case a CPU might record little or no data for several minutes while the
other CPUs pump thousands of new trace events per second.

This causes two side effects that end up breaking user expectations or causing
bugs:

* The UI can show an abnormally long timeline with a huge gap in the middle.
  The packet ordering of events doesn't matter for the UI because events are
  sorted by timestamp at import time. The trace in this case will contain very
  recent events plus a handful of stale events that happened hours before. The
  UI, for correctness, will try to display all events, showing a handful of
  early events, followed by a huge temporal gap when nothing happened,
  followed by the stream of recent events.

* When recording long traces, Trace Processor can show import errors of the form
  "XXX event out-of-order". This is because, in order to limit the memory usage
  at import time, Trace Processor sorts events using a sliding window. If trace
  packets are too out-of-order (trace file order vs timestamp order), the
  sorting will fail and some packets will be dropped.

#### Mitigations

The best mitigation for this sort of problem is to specify a
[`flush_period_ms`][TraceConfig] in the trace config (10-30 seconds is usually
good enough for most cases), especially when recording long traces.

This causes the tracing service to issue periodic flush requests to data
sources. A flush request causes data sources to commit the shared memory
buffer pages into the central buffer, even if they are not completely full.
By default, a flush is issued only at the end of the trace.
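
When the tracing session is controlled programmatically via the Perfetto SDK,
an explicit flush can also be requested from the consumer side before reading
the trace. A minimal sketch, assuming a `perfetto::TracingSession` that has
already been set up and started elsewhere:

```c++
#include <perfetto.h>

// Ask all data sources to commit their outstanding shared memory pages, then
// stop the session. FlushBlocking() returns once data sources have acked the
// flush or the timeout expires.
void FlushAndStop(perfetto::TracingSession* session) {
  session->FlushBlocking(/*timeout_ms=*/5000);
  session->StopBlocking();
}
```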

For long traces recorded without `flush_period_ms`, another option is to
pass the `--full-sort` option to `trace_processor_shell` when importing the
trace. Doing so disables the windowed sorting at the cost of higher memory
usage (the trace file is fully buffered in memory before parsing).

[streaming mode]: /docs/concepts/config#long-traces
[TraceConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig
[FtraceConfig]: /docs/reference/trace-config-proto.autogen#FtraceConfig
[IncrStateConfig]: /docs/reference/trace-config-proto.autogen#TraceConfig.IncrementalStateConfig
[FtraceCpuStats]: /docs/reference/trace-packet-proto.autogen#FtraceCpuStats
[FtraceEventBundle]: /docs/reference/trace-packet-proto.autogen#FtraceEventBundle
[TracePacket]: /docs/reference/trace-packet-proto.autogen#TracePacket
[BufferStats]: /docs/reference/trace-packet-proto.autogen#TraceStats.BufferStats