<!--
Copyright 2017 The Crashpad Authors. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->

# Crashpad Overview Design

[TOC]

## Objective

Crashpad is a library for capturing, storing and transmitting postmortem crash
reports from a client to an upstream collection server. Crashpad aims to make it
possible for clients to capture process state at the time of crash with the best
possible fidelity and coverage, with the minimum of fuss.

Crashpad also provides a facility for clients to capture dumps of process state
on-demand for diagnostic purposes.

Crashpad additionally provides minimal facilities for clients to adorn their
crashes with application-specific metadata in the form of per-process key/value
pairs. More sophisticated clients are able to adorn crash reports further
through extensibility points that allow the embedder to augment the crash report
with application-specific metadata.

## Background

It’s an unfortunate truth that any large piece of software will contain bugs
that will cause it to occasionally crash. Even in the absence of bugs, software
incompatibilities can cause program instability.

Fixing bugs and incompatibilities in client software that ships to millions of
users around the world is a daunting task. User reports and manual reproduction
of crashes can work, but even given a user report, the problem is often
not readily reproducible.
This is for various reasons, such as system
version or third-party software incompatibility, or because the problem is due
to a race of some sort. Users are also unlikely to report problems they
encounter, and user reports are often of poor quality, as unfortunately most
users don’t have experience with making good bug reports.

Automatic crash telemetry has been the best solution to the problem so far, as
this relieves the burden of manual reporting from users, while capturing the
hardware and software state at the time of crash.

TODO(siggi): examples of this?

Crash telemetry involves capturing postmortem crash dumps and transmitting them
to a backend collection server. On the server they can be stackwalked and
symbolized, and evaluated and aggregated in various ways. Stackwalking and
symbolizing the reports on an upstream server has several benefits over
performing these tasks on the client. High-fidelity stackwalking requires access
to bulky unwind data, and it may be desirable to not ship this to end users out
of concern for the application size. The process of symbolization requires
access to debugging symbols, which can be quite large, and the symbolization
process can consume considerable other resources. Transmitting un-stackwalked
and un-symbolized postmortem dumps to the collection server also allows deep
analysis of individual dumps, which is often necessary to resolve the bug
causing the crash.

Transmitting reports to the collection server allows aggregating crashes by
cause, which in turn allows assessing the importance of different crashes in
terms of the occurrence rate and e.g. the potential security impact.

A postmortem crash dump must contain the program state at the time of crash
with sufficient fidelity to allow diagnosing and fixing the problem.
As the full
program state is usually too large to transmit to an upstream server, the
postmortem dump captures a heuristic subset of the full state.

The crashed program is in an indeterminate state and, in fact, has often crashed
because of corrupt global state - such as the heap. It’s therefore important to
generate crash reports with as little execution in the crashed process as
possible. Different operating systems vary in the facilities they provide for
this.

## Overview

Crashpad is a client-side library that focuses on capturing machine and program
state in a postmortem crash report, and transmitting this report to a backend
server - a “collection server”. The Crashpad library is embedded by the client
application. Conceptually, Crashpad breaks down into the handler and the client.
The handler runs in a separate process from the client or clients. It is
responsible for snapshotting the crashing client process’ state on a crash,
saving it to a crash dump, and transmitting the crash dump to an upstream
server. Clients register with the handler to allow it to capture and upload
their crashes.

### The Crashpad handler

The Crashpad handler is instantiated in a process supplied by the embedding
application. It provides a way for clients to register themselves through some
form of IPC or, where operating system support is available, by taking advantage
of such support to cause crash notifications to be delivered to the handler. On
crash, the handler snapshots the crashed client process’ state, writes it to a
postmortem dump in a database, and may also transmit the dump to an upstream
server if so configured.

The Crashpad handler is able to handle cross-bitted requests and generate crash
dumps across bitness, where e.g. the handler is a 64-bit process while the
client is a 32-bit process or vice versa.
In the case of Windows, this is
limited by the OS such that a 32-bit handler can only generate crash dumps for
32-bit clients, but a 64-bit handler can acquire nearly all of the detail for a
32-bit process.

### The Crashpad client

The Crashpad client provides two main facilities.
1. Registration with the Crashpad handler.
2. Metadata communication to the Crashpad handler on crash.

A Crashpad embedder links the Crashpad client library into one or more
executables, whether a loadable library or a program file. The client process
then registers with the Crashpad handler through some mode of IPC or other
operating system-specific support.

On crash, metadata is communicated to the Crashpad handler via the CrashpadInfo
structure. Each client executable module linking the Crashpad client library
embeds a CrashpadInfo structure, which can be updated by the client with
whatever state the client wishes to record with a crash.

![Overview image](overview.png)

Here is an overview picture of the conceptual relationships between embedder (in
light blue), client modules (darker blue), and Crashpad (in green). Note that
multiple client modules can contain a CrashpadInfo structure, but only one
registration is necessary.

## Detailed Design

### Requirements

The purpose of Crashpad is to capture machine, OS and application state in
sufficient detail and fidelity to allow developers to diagnose and, where
possible, fix the issue causing the crash.

Each distinct crash report is assigned a globally unique ID, so that users can
associate it with a user report, cite it in bug reports and so on.

It’s critical to safeguard the user’s privacy by ensuring that no crash report
is ever uploaded without user consent. Likewise it’s important to ensure that
Crashpad never captures or uploads reports from non-client processes.

### Concepts

* **Client ID**. A UUID tied to a single instance of a Crashpad database. When
  creating a crash report, the Crashpad handler includes the client ID stored
  in the database. This provides a means to determine how many individual end
  users are affected by a specific crash signature.

* **Crash ID**. A UUID representing a single crash report. Uploaded crash
  reports also receive a “server ID.” The Crashpad database indexes both the
  locally-generated and server-generated IDs.

* **Collection Server**. See [crash server documentation.](https://goto.google.com/crash-server-overview)

* **Client Process**. Any process that has registered with a Crashpad handler.

* **Handler Process**. A process hosting the Crashpad handler library. This may
  be a dedicated executable, or it may be hosted within a client executable
  with control passed to it based on special signaling under the client’s
  control, such as a command-line parameter.

* **CrashpadInfo**. A structure used by client modules to provide information to
  the handler.

* **Annotations**. Each CrashpadInfo structure points to a dictionary of
  {string, string} annotations that the client can use to communicate
  application state in the case of crash.

* **Database**. The Crashpad database contains persistent client settings as
  well as crash dumps pending upload.

TODO(siggi): moar concepts?

### Overview Picture

Here is a rough overview picture of the various Crashpad constructs, their
layering and intended use by clients.

![Layering image](layering.png)

Dark blue boxes are interfaces, light blue boxes are implementation. Gray is the
embedding client application. Note that wherever possible, implementation that
necessarily has to be OS-specific exposes OS-agnostic interfaces to the rest of
Crashpad and the client.

### Registration

The particulars of how a client registers with the handler vary across
operating systems.

#### macOS

At registration time, the client designates a Mach port monitored by the
Crashpad handler as the EXC_CRASH exception port for the client. The port may be
acquired by launching a new handler process or by retrieving a service already
registered with the system. The registration is maintained by the kernel and is
inherited by subprocesses at creation time by default, so only the topmost
process of a process tree need register.

Crashpad provides a facility for a process to disassociate (unregister) from an
existing crash handler, which can be necessary when an older client spawns an
updated version.

#### Windows

There are two modes of registration on Windows. In both cases the handler is
advised of the address of a set of structures in the client process’ address
space. These structures include a pair of ExceptionInformation structs, one for
generating a postmortem dump for a crashing process, and another one for
generating a dump for a non-crashing process.

##### Normal registration

In the normal registration mode, the client connects to a named pipe by a
pre-arranged name. A registration request is written to the pipe. During
registration, the handler creates a set of events, duplicates them to the
registering client, then returns the handle values in the registration response.
This is a blocking process.

##### Initial Handler Creation

In order to avoid blocking client startup for the creation and initialization of
the handler, a different mode of registration can be used for the handler
creation. In this mode, the client creates a set of event handles and inherits
them into the newly created handler process.
The handler process is advised of
the handle values and the location of the ExceptionInformation structures by way
of command line arguments in this mode.

#### Linux/Android

TODO(mmentovai): describe this. See this preliminary doc.

### Capturing Exceptions

The details of how Crashpad captures the exceptions leading to crashes vary
between operating systems.

#### macOS

On macOS, the operating system will notify the handler of client crashes via the
Mach port set as the client process’ exception port. As exceptions are
dispatched to the Mach port by the kernel, on macOS, exceptions can be handled
entirely from the Crashpad handler without the need to run any code in the
crashed process at the time of the exception.

#### Windows

On Windows, the OS dispatches exceptions in the context of the crashing thread.
To notify the handler of exceptions, the Crashpad client registers an
UnhandledExceptionFilter (UEF) in the client process. When an exception trickles
up to the UEF, it stores the exception information and the crashing thread’s ID
in the ExceptionInformation structure registered with the handler. It then sets
an event handle to signal the handler to go ahead and process the exception.

##### Caveats

* If the crashing thread’s stack is smashed when an exception occurs, the
  exception cannot be dispatched. In this case the OS will summarily terminate
  the process, without the handler having an opportunity to generate a crash
  report.
* If an exception is handled in the crashing thread, it will never propagate
  to the UEF, and thus a crash report won’t be generated. This happens a fair
  bit on Windows, as system libraries will often dispatch callbacks under a
  structured exception handler. This occurs during window message dispatching
  on some system configurations, as well as during e.g. DLL entry point
  notifications.
* A growing number of conditions in the system and runtime exist where
  detected corruption or illegal calls result in summary termination of the
  process, in which case no crash report will be generated.

###### Out-Of-Process Exception Handling

There exists a mechanism in Windows Error Reporting (WER) that allows a client
process to register for handling client exceptions out of the crashing process.
Unfortunately this mechanism is difficult to use, and doesn’t provide coverage
for many of the caveats above. [Details here.](https://crashpad.chromium.org/bug/133)

#### Linux/Android

TODO(mmentovai): describe this. See [this preliminary doc.](https://goto.google.com/crashpad-android-dd)

### The CrashpadInfo structure

The CrashpadInfo structure is used to communicate information from the client to
the handler. Each executable module in a client process can contain a
CrashpadInfo structure. On a crash, the handler crawls all modules in the
crashing process to locate all CrashpadInfo structures present. The CrashpadInfo
structures are linked into a special, named section of the executable, where the
handler can readily find them.

The CrashpadInfo structure has a magic signature, and contains a size and a
version field. The intent is to allow backwards compatibility from older client
modules to a newer handler. It may also be necessary to provide forwards
compatibility from newer clients to an older handler, though this hasn’t occurred
yet.

The CrashpadInfo structure contains such properties as the cap for how much
memory to include in the crash dump, some tristate flags for controlling the
handler’s behavior, a pointer to an annotation dictionary and so on.

### Snapshot

Snapshot is a layer of interfaces that represent the machine and OS entities
that Crashpad cares about.
Different concrete implementations of snapshot can
then be backed in different ways, such as from the in-memory representation of
a crashed process, or from the contents of a minidump.

### Crash Dump Creation

To create a crash dump, a subset of the machine, OS and application state is
grabbed from the crashed process into an in-memory snapshot structure in the
handler process. Since the full application state is typically too large for
capturing to disk and transmitting to an upstream server, the snapshot contains
a heuristically selected subset of the full state.

The precise details of what’s captured vary between operating systems, but
generally include the following:
* The set of modules (executable, shared libraries) that are loaded into the
  crashing process.
* An enumeration of the threads running in the crashing process, including the
  register contents and the contents of stack memory of each thread.
* A selection of the OS-related state of the process, such as the command
  line, environment and so on.
* A selection of memory potentially referenced from registers and from stack.

To capture a crash dump, the crashing process is first suspended, then a
snapshot is created in the handler process. The snapshot includes the
CrashpadInfo structures of the modules loaded into the process, and the contents
of those are used to control the level of detail captured for the crash dump.

Once the snapshot has been constructed, it is then written to a minidump file,
which is added to the database. The process is un-suspended after the minidump
file has been written. In the case of a crash (as opposed to a client request to
produce a dump without crashing), it is then either killed by the operating
system or the Crashpad handler.
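
The ordering of the capture steps above (suspend, snapshot, write, resume) can be illustrated with a small model. This is only a sketch in Python, not Crashpad's implementation, which is C++ and OS-specific. `FakeProcess`, `take_snapshot`, `capture_dump` and the use of JSON in place of the minidump format are all assumptions made for illustration.

```python
import json
import uuid
from dataclasses import dataclass, field


@dataclass
class FakeProcess:
    """Hypothetical stand-in for a registered client process."""
    modules: list
    threads: dict  # thread id -> register contents
    suspended: bool = False
    events: list = field(default_factory=list)

    def suspend(self):
        self.suspended = True
        self.events.append("suspend")

    def resume(self):
        self.suspended = False
        self.events.append("resume")


def take_snapshot(process):
    """Copy a subset of the process state into handler-owned memory."""
    assert process.suspended, "state must not change while it is being read"
    process.events.append("snapshot")
    return {
        "modules": list(process.modules),
        "threads": {tid: dict(regs) for tid, regs in process.threads.items()},
    }


def capture_dump(process, database):
    """Suspend, snapshot, write the dump, then resume, in that order."""
    process.suspend()
    try:
        snapshot = take_snapshot(process)
        report_id = str(uuid.uuid4())
        # A real handler serializes to the minidump format; JSON is a placeholder.
        database[report_id] = json.dumps(snapshot)
        process.events.append("write")
    finally:
        # The process is only un-suspended once the dump has been written.
        process.resume()
    return report_id
```

Keeping the process suspended until the write completes means the snapshot and the file on disk describe the same moment in time.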

In general the snapshotting process has to be very intimate with the operating
system it’s working with, so there will be a set of concrete implementation
classes, many deriving from the snapshot interfaces, doing this for each
operating system.

### Minidump

The minidump implementation is responsible for writing a snapshot to a
serialized on-disk file in the minidump format. The minidump implementation is
OS-agnostic, as it works on an OS-agnostic Snapshot interface.

TODO(siggi): Talk about two-phase writes and contents ordering here.

### Database

The Crashpad database contains persistent client settings, including a unique
crash client identifier and the upload-enabled bit. Note that the crash client
identifier is assigned by Crashpad, and is distinct from any identifiers the
client application uses to identify users, installs, machines or such - if any.
The expectation is that the client application will manage the user’s upload
consent, and inform Crashpad of changes in consent.

The unique client identifier is set at the time of database creation. It is then
recorded into every crash report collected by the handler and communicated to
the upstream server.

The database stores a configurable number of recorded crash dumps to a
configurable maximum aggregate size. For each crash dump it stores annotations
relating to whether the crash dumps have been uploaded. For successfully
uploaded crash dumps it also stores their server-assigned ID.

The database consists of a settings file, named "settings.dat", with binary
contents (see crashpad::Settings::Data for the file format), as well as a
directory containing the crash dumps. Additionally each crash dump is adorned
with properties relating to the state of the dump for upload and such. The
details of how these properties are stored vary between platforms.
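
The retention limits mentioned above (a maximum number of dumps and a maximum aggregate size) could be enforced along the following lines. This is a minimal sketch, not Crashpad's actual pruning logic. The `Report` record and `prune` function are hypothetical, and dropping the oldest reports first is an assumption that the text above does not state.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical record for one stored crash dump."""
    report_id: str
    created: float  # seconds since the epoch
    size: int       # bytes on disk
    uploaded: bool = False


def prune(reports, max_count, max_bytes):
    """Return (kept, dropped), walking reports from newest to oldest.

    Once either limit would be exceeded, that report and every older one is
    dropped. Oldest-first eviction is an assumption of this sketch.
    """
    kept, dropped = [], []
    total = 0
    full = False
    for report in sorted(reports, key=lambda r: r.created, reverse=True):
        if not full and len(kept) < max_count and total + report.size <= max_bytes:
            kept.append(report)
            total += report.size
        else:
            full = True
            dropped.append(report)
    return kept, dropped
```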

#### macOS

The macOS implementation simply stores database properties on the minidump files
in filesystem extended attributes.

#### Windows

The Windows implementation stores database properties in a binary file named
“metadata” at the top level of the database directory.

### Report Format

Crash reports are recorded in the Windows minidump format with
extensions to support Crashpad additions, such as Annotations.

### Upload to collection server

#### Wire Format

For the time being, Crashpad uses the Breakpad wire protocol, which is
essentially a MIME multipart message communicated over HTTP(S). To support this,
the annotations from all the CrashpadInfo structures found in the crashing
process are merged to create the Breakpad “crash keys” as form data. The
postmortem minidump is then attached as an “application/octet-stream”
attachment with the name “upload_file_minidump”. The entirety of the request
body, including the minidump, can be gzip-compressed to reduce transmission time
and increase transmission reliability. Note that by convention there is a set of
“crash keys” that are used to communicate the product, version, client ID and
other relevant data about the client to the server. Crashpad normally stores
these values in the minidump file itself, but retrieves them from the minidump
and supplies them as form data for compatibility with the Breakpad-style server.

This is a temporary compatibility measure to allow the current Breakpad-based
upstream server to handle Crashpad reports. In the fullness of time, the wire
protocol is expected to change to remove this redundant transmission and
processing of the Annotations.

#### Transport

The embedding client controls the URL of the collection server by the command
line passed to the handler.
The handler can upload crashes with HTTP or HTTPS,
depending on the client’s preference. It’s strongly suggested to use HTTPS
transport for crash uploads to protect the user’s privacy against
man-in-the-middle snoopers.

TODO(mmentovai): Certificate pinning.

#### Throttling & Retry Strategy

To protect the collection server from DDoS, and to protect clients from
unreasonable data transfer demands, the handler implements a client-side
throttling strategy. At the moment, the strategy is very simplistic: it limits
uploads to one per hour, and failed uploads are aborted rather than retried.

An experiment has been conducted to lift all throttling. Analysis on the
aggregate data this produced shows that multiple crashes within a short timespan
on the same client are nearly always due to the same cause. Therefore there is
very little loss of signal due to the throttling, though the ability to
reconstruct at least the full crash count is highly desirable.

The lack of retry is expected to [change soon](https://crashpad.chromium.org/bug/23),
as this creates blind spots for client crashes that exclusively occur on e.g.
network down events, during suspend and resume and such.

### Extensibility

#### Client Extensibility

Clients are able to extend the generated crash reports in two ways, by
manipulating their CrashpadInfo structure.
The two extensibility points are:
1. Nominating a set of address ranges for inclusion in the crash report.
2. Adding user-defined minidump streams for inclusion in the crash report.

In both cases the CrashpadInfo structure has to be updated before a crash
occurs.

#### Embedder Extensibility

Additionally, embedders of the handler can provide "user stream data source"
instances to the handler's main function. Any time a minidump is written, these
instances get called.
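
The calling pattern for these data sources can be modelled as below. This is a sketch only: Crashpad's real interface is C++, and the `DataSource` signature, the `collect_user_streams` helper, the example source and the stream type value `0x4D454D01` are hypothetical names and values chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A data source inspects the snapshot and returns (stream_type, payload),
# or None when it has nothing to contribute. Hypothetical signature.
DataSource = Callable[[dict], Optional[Tuple[int, bytes]]]


def collect_user_streams(snapshot: dict, sources: List[DataSource]) -> Dict[int, bytes]:
    """Call every registered data source once for the dump being written."""
    streams: Dict[int, bytes] = {}
    for source in sources:
        result = source(snapshot)
        if result is None:
            continue
        stream_type, payload = result
        if stream_type in streams:
            raise ValueError(f"duplicate stream type {stream_type:#x}")
        streams[stream_type] = payload
    return streams


def memory_pressure_source(snapshot: dict) -> Optional[Tuple[int, bytes]]:
    """Example source: records free memory at the time of the dump."""
    free_mb = snapshot.get("free_memory_mb")
    if free_mb is None:
        return None
    # The stream type is an arbitrary value picked for this sketch.
    return 0x4D454D01, f"free_mb={free_mb}".encode("ascii")
```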

Each data source may contribute a custom stream to the minidump, which can be
computed from e.g. system or application state relevant to the crash.

As a case in point, it can be handy to know whether the system was under memory
or other resource duress at the time of crash.

### Dependencies

Aside from system headers and APIs, when used outside of Chromium, Crashpad has
a dependency on “mini_chromium”, which is a subset of the Chromium base library.
This is to allow non-Chromium clients to use Crashpad, without taking a direct
dependency on the Chromium base, while allowing Chromium projects to use
Crashpad with minimum code duplication or hassle. When using Crashpad as part of
Chromium, Chromium’s own copy of the base library is used instead of
mini_chromium.

The downside to this is that mini_chromium must be kept up to date with
interface and implementation changes in Chromium base, for the subset of
functionality used by Crashpad.

## Caveats

TODO(anyone): You may need to describe what you did not do or why simpler
approaches don't work. Mention other things to watch out for (if any).

## Security Considerations

Crashpad may be used to capture the state of sandboxed processes and it writes
minidumps to disk. It may therefore straddle security boundaries, so it’s
important that Crashpad handle all data it reads out of the crashed process with
extreme care. The Crashpad handler takes care to access client address spaces
through specially-designed accessors that check pointer validity and enforce
accesses within prescribed bounds. The flow of information into the Crashpad
handler is exclusively one-way: Crashpad never communicates anything back to
its clients, aside from providing single-bit indications of completion.
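
The idea of an accessor that validates every read can be sketched as follows. This is a simplified model and not Crashpad's accessor classes. `ClientMemoryReader`, its `regions` map and the `max_size` cap are hypothetical and exist only to show the checks involved.

```python
class ClientMemoryReader:
    """Hypothetical bounds-checked view of a client's address space.

    `regions` maps a region's base address to its bytes, standing in for the
    readable ranges of the crashed process.
    """

    def __init__(self, regions):
        self._regions = dict(regions)

    def read(self, address, size, max_size=1 << 20):
        """Return `size` bytes at `address`, or None if the request is invalid."""
        # Reject nonsensical or oversized requests before touching any data.
        if size < 0 or size > max_size or address < 0:
            return None
        for base, data in self._regions.items():
            offset = address - base
            # The whole range must lie inside a single known region.
            if 0 <= offset and offset + size <= len(data):
                return data[offset:offset + size]
        return None
```

Returning `None` instead of raising lets the caller treat untrusted pointers from the crashed process as ordinary, expected failures.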

## Privacy Considerations

Crashpad may capture arbitrary contents from the crashed process’ memory,
including user IDs and passwords, credit card information, URLs and whatever
other content users have trusted the crashing program with. The client program
must acquire and honor the user’s consent to upload crash reports, and
appropriately manage the upload state in Crashpad’s database.

Crashpad must also be careful not to upload crashes for arbitrary processes on
the user’s system. To this end, Crashpad will never upload a report for a
process that hasn’t registered with the handler, but note that registrations are
inherited by child processes on some operating systems.
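
As an illustration of how consent gates uploads, the check below combines the upload-enabled bit from the database settings with the one-upload-per-hour throttle described under Throttling & Retry Strategy. It is a sketch under stated assumptions: the `Settings` fields, the `may_upload` function and the default of uploads being disabled are hypothetical, not Crashpad's actual settings layout.

```python
from dataclasses import dataclass
from typing import Optional

# One upload per hour, per the Throttling & Retry Strategy section.
UPLOAD_INTERVAL_SECONDS = 60 * 60


@dataclass
class Settings:
    """Hypothetical model of the persistent settings in the database."""
    uploads_enabled: bool = False  # assumption: no consent until the client says so
    last_upload_time: Optional[float] = None


def may_upload(settings: Settings, now: float) -> bool:
    """True only with user consent and outside the throttling window."""
    if not settings.uploads_enabled:
        return False
    if settings.last_upload_time is None:
        return True
    return now - settings.last_upload_time >= UPLOAD_INTERVAL_SECONDS
```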