<!--
Copyright 2017 The Crashpad Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Crashpad Overview Design

[TOC]

## Objective

Crashpad is a library for capturing, storing and transmitting postmortem crash
reports from a client to an upstream collection server. Crashpad aims to make it
possible for clients to capture process state at the time of crash with the best
possible fidelity and coverage, with the minimum of fuss.

Crashpad also provides a facility for clients to capture dumps of process state
on-demand for diagnostic purposes.

Crashpad additionally provides minimal facilities for clients to adorn their
crashes with application-specific metadata in the form of per-process key/value
pairs. More sophisticated clients are able to adorn crash reports further
through extensibility points that allow the embedder to augment the crash report
with application-specific metadata.

## Background

It’s an unfortunate truth that any large piece of software will contain bugs
that will cause it to occasionally crash. Even in the absence of bugs, software
incompatibilities can cause program instability.

Fixing bugs and incompatibilities in client software that ships to millions of
users around the world is a daunting task. User reports and manual reproduction
of crashes can work, but even given a user report, the problem is often not
readily reproducible. This can be for various reasons, such as system version or
third-party software incompatibility, or because the problem is caused by a race
of some sort. Users are also unlikely to report problems they encounter, and
user reports are often of poor quality, as unfortunately most users don’t have
experience with making good bug reports.

Automatic crash telemetry has been the best solution to the problem so far, as
it relieves users of the burden of manual reporting, while capturing the
hardware and software state at the time of crash.

TODO(siggi): examples of this?

Crash telemetry involves capturing postmortem crash dumps and transmitting them
to a backend collection server. On the server they can be stackwalked and
symbolized, and evaluated and aggregated in various ways. Stackwalking and
symbolizing the reports on an upstream server has several benefits over
performing these tasks on the client. High-fidelity stackwalking requires access
to bulky unwind data, and it may be desirable to not ship this to end users out
of concern for the application size. The process of symbolization requires
access to debugging symbols, which can be quite large, and the symbolization
process can consume considerable other resources. Transmitting un-stackwalked
and un-symbolized postmortem dumps to the collection server also allows deep
analysis of individual dumps, which is often necessary to resolve the bug
causing the crash.

Transmitting reports to the collection server allows aggregating crashes by
cause, which in turn allows assessing the importance of different crashes in
terms of the occurrence rate and e.g. the potential security impact.

A postmortem crash dump must contain the program state at the time of crash
with sufficient fidelity to allow diagnosing and fixing the problem. As the full
program state is usually too large to transmit to an upstream server, the
postmortem dump captures a heuristic subset of the full state.

The crashed program is in an indeterminate state and, in fact, has often crashed
because of corrupt global state, such as a corrupt heap. It’s therefore
important to generate crash reports with as little execution in the crashed
process as possible. Different operating systems vary in the facilities they
provide for this.

## Overview

Crashpad is a client-side library that focuses on capturing machine and program
state in a postmortem crash report, and transmitting this report to a backend
server, the “collection server”. The Crashpad library is embedded by the client
application. Conceptually, Crashpad breaks down into the handler and the client.
The handler runs in a separate process from the client or clients. It is
responsible for snapshotting the crashing client process’ state on a crash,
saving it to a crash dump, and transmitting the crash dump to an upstream
server. Clients register with the handler to allow it to capture and upload
their crashes.

### The Crashpad handler

The Crashpad handler is instantiated in a process supplied by the embedding
application. It allows clients to register themselves through some form of IPC
or, where operating system support is available, by using that support to have
crash notifications delivered to the handler. On crash, the handler snapshots
the crashed client process’ state, writes it to a postmortem dump in a database,
and may also transmit the dump to an upstream server if so configured.

The Crashpad handler can handle requests and generate crash dumps across
bitness, e.g. where the handler is a 64-bit process while the client is a 32-bit
process or vice versa. In the case of Windows, this is limited by the OS such
that a 32-bit handler can only generate crash dumps for 32-bit clients, but a
64-bit handler can acquire nearly all of the detail for a 32-bit process.

### The Crashpad client

The Crashpad client provides two main facilities:

1. Registration with the Crashpad handler.
2. Metadata communication to the Crashpad handler on crash.

A Crashpad embedder links the Crashpad client library into one or more
executables, whether a loadable library or a program file. The client process
then registers with the Crashpad handler through some mode of IPC or other
operating system-specific support.

On crash, metadata is communicated to the Crashpad handler via the CrashpadInfo
structure. Each client executable module linking the Crashpad client library
embeds a CrashpadInfo structure, which can be updated by the client with
whatever state the client wishes to record with a crash.

![Overview image](overview.png)

Here is an overview picture of the conceptual relationships between embedder (in
light blue), client modules (darker blue), and Crashpad (in green). Note that
multiple client modules can contain a CrashpadInfo structure, but only one
registration is necessary.

## Detailed Design

### Requirements

The purpose of Crashpad is to capture machine, OS and application state in
sufficient detail and fidelity to allow developers to diagnose and, where
possible, fix the issue causing the crash.

Each distinct crash report is assigned a globally unique ID, which allows users
to associate it with a user report, cite it in bug reports and so on.

It’s critical to safeguard the user’s privacy by ensuring that no crash report
is ever uploaded without user consent. Likewise it’s important to ensure that
Crashpad never captures or uploads reports from non-client processes.

### Concepts

* **Client ID**. A UUID tied to a single instance of a Crashpad database. When
  creating a crash report, the Crashpad handler includes the client ID stored
  in the database. This provides a means to determine how many individual end
  users are affected by a specific crash signature.

* **Crash ID**. A UUID representing a single crash report. Uploaded crash
  reports also receive a “server ID.” The Crashpad database indexes both the
  locally-generated and server-generated IDs.

* **Collection Server**. See [crash server documentation.](
  https://goto.google.com/crash-server-overview)

* **Client Process**. Any process that has registered with a Crashpad handler.

* **Handler Process**. A process hosting the Crashpad handler library. This may
  be a dedicated executable, or it may be hosted within a client executable
  with control passed to it based on special signaling under the client’s
  control, such as a command-line parameter.

* **CrashpadInfo**. A structure used by client modules to provide information to
  the handler.

* **Annotations**. Each CrashpadInfo structure points to a dictionary of
  {string, string} annotations that the client can use to communicate
  application state in the case of crash.

* **Database**. The Crashpad database contains persistent client settings as
  well as crash dumps pending upload.

TODO(siggi): moar concepts?

### Overview Picture

Here is a rough overview picture of the various Crashpad constructs, their
layering and intended use by clients.

![Layering image](layering.png)

Dark blue boxes are interfaces, light blue boxes are implementation. Gray is the
embedding client application. Note that, wherever possible, implementation that
is necessarily OS-specific exposes OS-agnostic interfaces to the rest of
Crashpad and the client.

### Registration

The particulars of how a client registers with the handler vary across
operating systems.

#### macOS

At registration time, the client designates a Mach port monitored by the
Crashpad handler as the EXC_CRASH exception port for the client. The port may be
acquired by launching a new handler process or by retrieving a service already
registered with the system. The registration is maintained by the kernel and is
inherited by subprocesses at creation time by default, so only the topmost
process of a process tree need register.

Crashpad provides a facility for a process to disassociate (unregister) from an
existing crash handler, which can be necessary when an older client spawns an
updated version.

#### Windows

There are two modes of registration on Windows. In both cases the handler is
advised of the address of a set of structures in the client process’ address
space. These structures include a pair of ExceptionInformation structs, one for
generating a postmortem dump for a crashing process, and another one for
generating a dump for a non-crashing process.

##### Normal registration

In the normal registration mode, the client connects to a named pipe by a
pre-arranged name. A registration request is written to the pipe. During
registration, the handler creates a set of events, duplicates them to the
registering client, then returns the handle values in the registration response.
This is a blocking process.

##### Initial Handler Creation

In order to avoid blocking client startup for the creation and initialization of
the handler, a different mode of registration can be used for the handler
creation. In this mode, the client creates a set of event handles and inherits
them into the newly created handler process. The handler process is advised of
the handle values and the location of the ExceptionInformation structures by way
of command line arguments in this mode.

#### Linux/Android

TODO(mmentovai): describe this. See this preliminary doc.

### Capturing Exceptions

The details of how Crashpad captures the exceptions leading to crashes vary
between operating systems.

#### macOS

On macOS, the operating system will notify the handler of client crashes via the
Mach port set as the client process’ exception port. As exceptions are
dispatched to the Mach port by the kernel, on macOS, exceptions can be handled
entirely from the Crashpad handler without the need to run any code in the
crashing process at the time of the exception.

#### Windows

On Windows, the OS dispatches exceptions in the context of the crashing thread.
To notify the handler of exceptions, the Crashpad client registers an
UnhandledExceptionFilter (UEF) in the client process. When an exception trickles
up to the UEF, it stores the exception information and the crashing thread’s ID
in the ExceptionInformation structure registered with the handler. It then sets
an event handle to signal the handler to go ahead and process the exception.

##### Caveats

* If the crashing thread’s stack is smashed when an exception occurs, the
  exception cannot be dispatched. In this case the OS will summarily terminate
  the process, without the handler having an opportunity to generate a crash
  report.
* If an exception is handled in the crashing thread, it will never propagate
  to the UEF, and thus a crash report won’t be generated. This happens a fair
  bit on Windows as system libraries will often dispatch callbacks under a
  structured exception handler. This occurs during window message dispatching
  on some system configurations, as well as during e.g. DLL entry point
  notifications.
* A growing number of conditions in the system and runtime exist where
  detected corruption or illegal calls result in summary termination of the
  process, in which case no crash report will be generated.

###### Out-Of-Process Exception Handling

There exists a mechanism in Windows Error Reporting (WER) that allows a client
process to register for handling client exceptions out of the crashing process.
Unfortunately this mechanism is difficult to use, and doesn’t provide coverage
for many of the caveats above. [Details
here.](https://crashpad.chromium.org/bug/133)

#### Linux/Android

TODO(mmentovai): describe this. See [this preliminary
doc.](https://goto.google.com/crashpad-android-dd)

### The CrashpadInfo structure

The CrashpadInfo structure is used to communicate information from the client to
the handler. Each executable module in a client process can contain a
CrashpadInfo structure. On a crash, the handler crawls all modules in the
crashing process to locate all CrashpadInfo structures present. The CrashpadInfo
structures are linked into a special, named section of the executable, where the
handler can readily find them.

The CrashpadInfo structure has a magic signature, and contains a size and a
version field. The intent is to allow backwards compatibility from older client
modules to newer handlers. It may also be necessary to provide forwards
compatibility from newer clients to older handlers, though this hasn’t occurred
yet.

The CrashpadInfo structure contains such properties as the cap for how much
memory to include in the crash dump, some tristate flags for controlling the
handler’s behavior, a pointer to an annotation dictionary and so on.

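As a minimal sketch of how a signature, size and version field can support this
kind of compatibility check, consider the following. The `InfoHeader` layout,
the constant values and the function names are assumptions made for
illustration; they are not the actual CrashpadInfo definition.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical, simplified header. The real CrashpadInfo layout is not
// reproduced here.
struct InfoHeader {
  uint32_t signature;  // Magic value identifying the structure.
  uint32_t size;       // Number of bytes the client module actually wrote.
  uint32_t version;    // Layout version the client module was built with.
};

// Assumed values, for illustration only.
constexpr uint32_t kSignature = 0x43506164;
constexpr uint32_t kMaxKnownVersion = 1;

// Copies a header out of |data| (bytes read from the client's named section)
// and reports whether this handler is able to interpret it.
bool ReadInfoHeader(const uint8_t* data, size_t length, InfoHeader* out) {
  if (length < sizeof(InfoHeader)) return false;  // Truncated section.
  InfoHeader header;
  std::memcpy(&header, data, sizeof(header));  // Avoids unaligned access.
  if (header.signature != kSignature) return false;
  if (header.size < sizeof(InfoHeader) || header.size > length) return false;
  if (header.version < 1 || header.version > kMaxKnownVersion) return false;
  *out = header;
  return true;
}

// A newer handler can treat any field that ends beyond |header.size| as
// absent, which is what lets it read structures written by older clients.
bool FieldPresent(const InfoHeader& header, size_t field_end_offset) {
  return header.size >= field_end_offset;
}
```

In this sketch the handler rejects anything it cannot positively identify,
which matches the general caution needed when reading data out of a crashed
process.
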
### Snapshot

Snapshot is a layer of interfaces that represent the machine and OS entities
that Crashpad cares about. Different concrete implementations of snapshot can
then be backed in different ways, e.g. by the in-memory representation of a
crashed process or by the contents of a minidump.

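The layering can be illustrated with a deliberately tiny sketch. The class
names, the two accessors and the text format parsed by `SerializedSnapshot` are
all hypothetical; the text format merely stands in for a stored dump and is not
the minidump format.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily reduced snapshot interface.
class ProcessSnapshotSketch {
 public:
  virtual ~ProcessSnapshotSketch() {}
  virtual uint64_t ProcessID() const = 0;
  virtual std::vector<std::string> ModuleNames() const = 0;
};

// Stands in for an implementation backed by a live (crashed) process.
class FixedSnapshot : public ProcessSnapshotSketch {
 public:
  FixedSnapshot(uint64_t pid, std::vector<std::string> modules)
      : pid_(pid), modules_(std::move(modules)) {}
  uint64_t ProcessID() const override { return pid_; }
  std::vector<std::string> ModuleNames() const override { return modules_; }

 private:
  uint64_t pid_;
  std::vector<std::string> modules_;
};

// Stands in for an implementation backed by previously stored data. The
// input is a toy format: the process ID on the first line, then one module
// name per line.
class SerializedSnapshot : public ProcessSnapshotSketch {
 public:
  explicit SerializedSnapshot(const std::string& text) : pid_(0) {
    std::istringstream stream(text);
    std::string line;
    if (std::getline(stream, line)) pid_ = std::stoull(line);
    while (std::getline(stream, line)) {
      if (!line.empty()) modules_.push_back(line);
    }
  }
  uint64_t ProcessID() const override { return pid_; }
  std::vector<std::string> ModuleNames() const override { return modules_; }

 private:
  uint64_t pid_;
  std::vector<std::string> modules_;
};

// A consumer written purely against the interface works with either backing.
std::string Describe(const ProcessSnapshotSketch& snapshot) {
  std::string text = "pid " + std::to_string(snapshot.ProcessID()) + ":";
  for (const std::string& name : snapshot.ModuleNames()) text += " " + name;
  return text;
}
```

Because `Describe` only sees the interface, the same consumer code can serve
both a freshly captured process and a dump that is read back later.
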
### Crash Dump Creation

To create a crash dump, a subset of the machine, OS and application state is
grabbed from the crashed process into an in-memory snapshot structure in the
handler process. Since the full application state is typically too large for
capturing to disk and transmitting to an upstream server, the snapshot contains
a heuristically selected subset of the full state.

The precise details of what’s captured vary between operating systems, but
generally include the following:

* The set of modules (executable, shared libraries) that are loaded into the
  crashing process.
* An enumeration of the threads running in the crashing process, including the
  register contents and the contents of stack memory of each thread.
* A selection of the OS-related state of the process, such as the command
  line, environment and so on.
* A selection of memory potentially referenced from registers and from stack.

To capture a crash dump, the crashing process is first suspended, then a
snapshot is created in the handler process. The snapshot includes the
CrashpadInfo structures of the modules loaded into the process, and their
contents are used to control the level of detail captured for the crash dump.

Once the snapshot has been constructed, it is then written to a minidump file,
which is added to the database. The process is un-suspended after the minidump
file has been written. In the case of a crash (as opposed to a client request to
produce a dump without crashing), it is then killed either by the operating
system or by the Crashpad handler.

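The ordering described above (suspend, snapshot, write, resume) lends itself to
a scoped guard, so that the client is resumed on every exit path. The following
is a sketch only: `ScopedSuspend` and `CaptureDumpSketch` are hypothetical
names, and the OS-specific work is replaced by entries in a log so that the
sequence is visible.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for OS-specific suspend and resume calls. The
// constructor "suspends" and the destructor "resumes", so the process is
// resumed even if dump creation bails out early.
class ScopedSuspend {
 public:
  explicit ScopedSuspend(std::vector<std::string>* log) : log_(log) {
    log_->push_back("suspend");
  }
  ~ScopedSuspend() { log_->push_back("resume"); }
  ScopedSuspend(const ScopedSuspend&) = delete;
  ScopedSuspend& operator=(const ScopedSuspend&) = delete;

 private:
  std::vector<std::string>* log_;
};

// Records the order of the steps instead of performing them.
std::vector<std::string> CaptureDumpSketch() {
  std::vector<std::string> log;
  {
    ScopedSuspend suspend(&log);
    log.push_back("snapshot");        // Includes reading CrashpadInfo.
    log.push_back("write minidump");  // The file goes into the database.
  }                                   // Leaving the scope resumes the client.
  return log;
}
```
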
In general the snapshotting process has to be very intimate with the operating
system it’s working with, so there will be a set of concrete implementation
classes, many deriving from the snapshot interfaces, doing this for each
operating system.

### Minidump

The minidump implementation is responsible for writing a snapshot to a
serialized on-disk file in the minidump format. The minidump implementation is
OS-agnostic, as it works on an OS-agnostic Snapshot interface.

TODO(siggi): Talk about two-phase writes and contents ordering here.

### Database

The Crashpad database contains persistent client settings, including a unique
crash client identifier and the upload-enabled bit. Note that the crash client
identifier is assigned by Crashpad, and is distinct from any identifiers the
client application uses to identify users, installs, machines or such, if any.
The expectation is that the client application will manage the user’s upload
consent, and inform Crashpad of changes in consent.

The unique client identifier is set at the time of database creation. It is then
recorded into every crash report collected by the handler and communicated to
the upstream server.

The database stores a configurable number of recorded crash dumps to a
configurable maximum aggregate size. For each crash dump it stores annotations
relating to whether the crash dump has been uploaded. For successfully
uploaded crash dumps it also stores their server-assigned ID.

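One way to enforce a count limit together with an aggregate size limit is
sketched below. The policy shown (keep the newest dumps, drop everything older
once either limit is reached) is an assumption made for the example, as are the
`StoredReport` type and the function name; the text above does not specify
which dumps are discarded.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record describing one stored crash dump.
struct StoredReport {
  std::string id;
  int64_t creation_time;  // Seconds since the epoch.
  uint64_t size_bytes;
};

// Returns the IDs of the dumps to delete so that at most |max_count| dumps
// totalling at most |max_total_bytes| remain. Newest dumps are kept first.
std::vector<std::string> SelectReportsToPrune(std::vector<StoredReport> reports,
                                              size_t max_count,
                                              uint64_t max_total_bytes) {
  std::sort(reports.begin(), reports.end(),
            [](const StoredReport& a, const StoredReport& b) {
              return a.creation_time > b.creation_time;  // Newest first.
            });
  std::vector<std::string> to_delete;
  size_t kept = 0;
  uint64_t kept_bytes = 0;
  bool full = false;
  for (const StoredReport& report : reports) {
    // |kept_bytes| never exceeds |max_total_bytes|, so the subtraction
    // below cannot wrap around.
    if (!full && kept < max_count &&
        report.size_bytes <= max_total_bytes - kept_bytes) {
      ++kept;
      kept_bytes += report.size_bytes;
    } else {
      full = true;  // Once a limit is hit, all older dumps are dropped.
      to_delete.push_back(report.id);
    }
  }
  return to_delete;
}
```
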
The database consists of a settings file, named "settings.dat" with binary
contents (see crashpad::Settings::Data for the file format), as well as a
directory containing the crash dumps. Additionally each crash dump is adorned
with properties relating to the state of the dump for upload and such. The
details of how these properties are stored vary between platforms.

#### macOS

The macOS implementation simply stores database properties on the minidump files
in filesystem extended attributes.

#### Windows

The Windows implementation stores database properties in a binary file named
“metadata” at the top level of the database directory.

### Report Format

Crash reports are recorded in the Windows minidump format with
extensions to support Crashpad additions, such as Annotations.

### Upload to collection server

#### Wire Format

For the time being, Crashpad uses the Breakpad wire protocol, which is
essentially a MIME multipart message communicated over HTTP(S). To support this,
the annotations from all the CrashpadInfo structures found in the crashing
process are merged to create the Breakpad “crash keys” as form data. The
postmortem minidump is then attached as an “application/octet-stream”
attachment with the name “upload_file_minidump”. The entirety of the request
body, including the minidump, can be gzip-compressed to reduce transmission time
and increase transmission reliability. Note that by convention there is a set of
“crash keys” that are used to communicate the product, version, client ID and
other relevant data about the client, to the server. Crashpad normally stores
these values in the minidump file itself, but retrieves them from the minidump
and supplies them as form data for compatibility with the Breakpad-style server.

This is a temporary compatibility measure to allow the current Breakpad-based
upstream server to handle Crashpad reports. In the fullness of time, the wire
protocol is expected to change to remove this redundant transmission and
processing of the Annotations.

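To make the shape of such a request body concrete, here is a minimal sketch
that merges per-module annotations and assembles a multipart form body. The
function names, the attachment filename, the boundary handling and the rule
that the first value wins on duplicate keys are all assumptions made for
illustration; the sketch also skips escaping of field names and gzip
compression.

```cpp
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::string> Annotations;

// Merges the annotation dictionaries found in each module. On duplicate keys
// the first value seen is kept (an assumed policy for this sketch).
Annotations MergeAnnotations(const std::vector<Annotations>& per_module) {
  Annotations merged;
  for (const Annotations& annotations : per_module) {
    merged.insert(annotations.begin(), annotations.end());
  }
  return merged;
}

// Builds a multipart/form-data body: one form field per crash key, followed
// by the minidump as an application/octet-stream attachment. The caller is
// assumed to pick a |boundary| that does not occur in the payload.
std::string BuildMultipartBody(const Annotations& crash_keys,
                               const std::string& minidump,
                               const std::string& boundary) {
  std::string body;
  for (const auto& entry : crash_keys) {
    body += "--" + boundary + "\r\n";
    body += "Content-Disposition: form-data; name=\"" + entry.first +
            "\"\r\n\r\n";
    body += entry.second + "\r\n";
  }
  body += "--" + boundary + "\r\n";
  body +=
      "Content-Disposition: form-data; name=\"upload_file_minidump\"; "
      "filename=\"minidump.dmp\"\r\n";
  body += "Content-Type: application/octet-stream\r\n\r\n";
  body += minidump + "\r\n";
  body += "--" + boundary + "--\r\n";  // Closing delimiter.
  return body;
}
```

The matching request would carry a `Content-Type: multipart/form-data` header
naming the same boundary string.
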
#### Transport

The embedding client controls the URL of the collection server by the command
line passed to the handler. The handler can upload crashes with HTTP or HTTPS,
depending on the client’s preference. It’s strongly suggested to use HTTPS
transport for crash uploads to protect the user’s privacy against
man-in-the-middle snoopers.

TODO(mmentovai): Certificate pinning.

#### Throttling & Retry Strategy

To protect the collection server from DDoS and to protect clients from
unreasonable data transfer demands, the handler implements a client-side
throttling strategy. At the moment, the strategy is very simplistic: it limits
uploads to one upload per hour, and failed uploads are aborted.

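A one-upload-per-hour rule reduces to a single time comparison, as in the
sketch below. The function name, the use of seconds since the epoch, and the
convention that a non-positive timestamp means "no upload has been attempted
yet" are assumptions for the example.

```cpp
#include <cstdint>

// One hour, in seconds.
constexpr int64_t kUploadIntervalSeconds = 60 * 60;

// Returns true if an upload may be attempted at time |now|, given the time
// of the most recent attempt. Both values are seconds since the epoch; a
// value of zero or less for |last_attempt_time| means "never" in this sketch.
bool ShouldAttemptUpload(int64_t now, int64_t last_attempt_time) {
  if (last_attempt_time <= 0) return true;
  return now - last_attempt_time >= kUploadIntervalSeconds;
}
```

Since there is no retry, a dump that is skipped or whose upload fails simply
stays un-uploaded in this model.
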
An experiment has been conducted to lift all throttling. Analysis of the
aggregate data this produced shows that multiple crashes within a short timespan
on the same client are nearly always due to the same cause. Therefore there is
very little loss of signal due to the throttling, though the ability to
reconstruct at least the full crash count is highly desirable.

The lack of retry is expected to [change
soon](https://crashpad.chromium.org/bug/23), as this creates blind spots for
client crashes that exclusively occur on e.g. network down events, during
suspend and resume and such.

### Extensibility

#### Client Extensibility

Clients are able to extend the generated crash reports in two ways by
manipulating their CrashpadInfo structure. The two extensibility points are:

1. Nominating a set of address ranges for inclusion in the crash report.
2. Adding user-defined minidump streams for inclusion in the crash report.

In both cases the CrashpadInfo structure has to be updated before a crash
occurs.

#### Embedder Extensibility

Additionally, embedders of the handler can provide "user stream data source"
instances to the handler's main function. Any time a minidump is written, these
instances get called.

Each data source may contribute a custom stream to the minidump, which can be
computed from e.g. system or application state relevant to the crash.

As a case in point, it can be handy to know whether the system was under memory
or other resource duress at the time of crash.

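The following sketch shows the general shape such a data source could take,
using the memory duress case as the example. The interface, the stream type
value and the payload format are hypothetical and simplified; they are not the
handler's actual extension API.

```cpp
#include <cstdint>
#include <memory>
#include <string>
#include <vector>

// Hypothetical custom stream: a type tag plus opaque bytes.
struct UserStream {
  uint32_t stream_type;
  std::string data;
};

// Hypothetical data source interface, called whenever a minidump is written.
class UserStreamSourceSketch {
 public:
  virtual ~UserStreamSourceSketch() {}
  // Returns true and fills |stream| if this source has data to contribute.
  virtual bool Produce(UserStream* stream) = 0;
};

// Example source that reports whether the system was under memory pressure.
// How the pressure state is determined is left to the embedder.
class MemoryPressureSource : public UserStreamSourceSketch {
 public:
  explicit MemoryPressureSource(bool under_pressure)
      : under_pressure_(under_pressure) {}
  bool Produce(UserStream* stream) override {
    stream->stream_type = 0x4D500001;  // Made-up stream type value.
    stream->data = under_pressure_ ? "memory_pressure=1" : "memory_pressure=0";
    return true;
  }

 private:
  bool under_pressure_;
};

// Asks every registered source for its stream.
std::vector<UserStream> CollectUserStreams(
    const std::vector<std::unique_ptr<UserStreamSourceSketch>>& sources) {
  std::vector<UserStream> streams;
  for (const auto& source : sources) {
    UserStream stream = {0, std::string()};
    if (source->Produce(&stream)) streams.push_back(stream);
  }
  return streams;
}
```
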
### Dependencies

Aside from system headers and APIs, when used outside of Chromium, Crashpad has
a dependency on “mini_chromium”, which is a subset of the Chromium base library.
This is to allow non-Chromium clients to use Crashpad, without taking a direct
dependency on the Chromium base, while allowing Chromium projects to use
Crashpad with minimum code duplication or hassle. When using Crashpad as part of
Chromium, Chromium’s own copy of the base library is used instead of
mini_chromium.

The downside to this is that mini_chromium must be kept up to date with
interface and implementation changes in Chromium base, for the subset of
functionality used by Crashpad.

## Caveats

TODO(anyone): You may need to describe what you did not do or why simpler
approaches don't work. Mention other things to watch out for (if any).

## Security Considerations

Crashpad may be used to capture the state of sandboxed processes and it writes
minidumps to disk. It may therefore straddle security boundaries, so it’s
important that Crashpad handle all data it reads out of the crashed process with
extreme care. The Crashpad handler takes care to access client address spaces
through specially-designed accessors that check pointer validity and enforce
accesses within prescribed bounds. The flow of information into the Crashpad
handler is exclusively one-way: Crashpad never communicates anything back to
its clients, aside from providing single-bit indications of completion.

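The kind of bounds enforcement described here can be sketched as follows. The
`BoundedMemoryReader` class is hypothetical: it models a readable region of
client memory as a base address plus a local copy of the bytes, and is not
Crashpad's accessor. What it illustrates is writing the range check so that an
untrusted address and size cannot overflow the arithmetic.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical model of one readable region of a client's address space.
class BoundedMemoryReader {
 public:
  BoundedMemoryReader(uint64_t base, std::vector<uint8_t> bytes)
      : base_(base), bytes_(std::move(bytes)) {}

  // Copies |size| bytes starting at client address |address| into |out|.
  // Returns false, without touching |out|, if any part of the requested
  // range falls outside the region.
  bool Read(uint64_t address, size_t size, void* out) const {
    if (address < base_) return false;
    const uint64_t offset = address - base_;
    if (offset > bytes_.size()) return false;
    // Compare against the remaining length rather than computing
    // offset + size, which could wrap around for hostile inputs.
    if (size > bytes_.size() - offset) return false;
    if (size != 0) std::memcpy(out, bytes_.data() + offset, size);
    return true;
  }

 private:
  uint64_t base_;
  std::vector<uint8_t> bytes_;
};
```
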
## Privacy Considerations

Crashpad may capture arbitrary contents from a crashed process’ memory,
including user IDs and passwords, credit card information, URLs and whatever
other content users have trusted the crashing program with. The client program
must acquire and honor the user’s consent to upload crash reports, and
appropriately manage the upload state in Crashpad’s database.

Crashpad must also be careful not to upload crashes for arbitrary processes on
the user’s system. To this end, Crashpad will never upload a report for a
process that hasn’t registered with the handler, but note that registrations are
inherited by child processes on some operating systems.
