gRPC Connectivity Semantics and API
===================================

This document describes the connectivity semantics for gRPC channels and the
corresponding impact on RPCs. We then discuss an API.

States of Connectivity
----------------------

gRPC Channels provide the abstraction over which clients can communicate with
servers. The client-side channel object can be constructed using little more
than a DNS name. Channels encapsulate a range of functionality including name
resolution, establishing a TCP connection (with retries and backoff) and TLS
handshakes. Channels can also handle errors on established connections and
reconnect, or in the case of HTTP/2 GOAWAY, re-resolve the name and reconnect.

To hide the details of all this activity from the user of the gRPC API (i.e.,
application code) while exposing meaningful information about the state of a
channel, we use a state machine with five states, defined below:

CONNECTING: The channel is trying to establish a connection and is waiting to
make progress on one of the steps involved in name resolution, TCP connection
establishment or TLS handshake. This may be used as the initial state for
channels upon creation.

READY: The channel has successfully established a connection all the way through
TLS handshake (or equivalent) and protocol-level (HTTP/2, etc.) handshaking, and
all subsequent attempts to communicate have succeeded (or are pending without
any known failure).

TRANSIENT_FAILURE: There has been some transient failure (such as a TCP 3-way
handshake timing out or a socket error). Channels in this state will eventually
switch to the CONNECTING state and try to establish a connection again. Since
retries are done with exponential backoff, channels that fail to connect will
start out spending very little time in this state, but as the attempts fail
repeatedly, the channel will spend increasingly large amounts of time in this
state.
For many non-fatal failures (e.g., TCP connection attempts timing out
because the server is not yet available), the channel may spend increasingly
large amounts of time in this state.

IDLE: This is the state where the channel is not even trying to create a
connection because of a lack of new or pending RPCs. New RPCs MAY be created
in this state. Any attempt to start an RPC on the channel will push the channel
out of this state to CONNECTING. When there has been no RPC activity on a channel
for a specified IDLE_TIMEOUT, i.e., no new or pending (active) RPCs for this
period, channels that are READY or CONNECTING switch to IDLE. Additionally,
channels that receive a GOAWAY when there are no active or pending RPCs should
also switch to IDLE to avoid connection overload at servers that are attempting
to shed connections. We will use a default IDLE_TIMEOUT of 300 seconds (5 minutes).

SHUTDOWN: This channel has started shutting down. Any new RPCs should fail
immediately. Pending RPCs may continue running until the application cancels them.
Channels may enter this state either because the application explicitly requested
a shutdown or because a non-recoverable error has happened during attempts to
connect or communicate. (As of 6/12/2015, there are no known errors (while
connecting or communicating) that are classified as non-recoverable.) Channels
that enter this state never leave this state.

The following table lists the legal transitions from one state to another and
corresponding reasons. Empty cells denote disallowed transitions.

<table style='border: 1px solid black'>
  <tr>
    <th>From/To</th>
    <th>CONNECTING</th>
    <th>READY</th>
    <th>TRANSIENT_FAILURE</th>
    <th>IDLE</th>
    <th>SHUTDOWN</th>
  </tr>
  <tr>
    <th>CONNECTING</th>
    <td>Incremental progress during connection establishment</td>
    <td>All steps needed to establish a connection succeeded</td>
    <td>Any failure in any of the steps needed to establish connection</td>
    <td>No RPC activity on channel for IDLE_TIMEOUT</td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>READY</th>
    <td></td>
    <td>Incremental successful communication on established channel.</td>
    <td>Any failure encountered while expecting successful communication on
        established channel.</td>
    <td>No RPC activity on channel for IDLE_TIMEOUT <br>OR<br>upon receiving a GOAWAY while there are no pending RPCs.</td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>TRANSIENT_FAILURE</th>
    <td>Wait time required to implement (exponential) backoff is over.</td>
    <td></td>
    <td></td>
    <td></td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>IDLE</th>
    <td>Any new RPC activity on the channel</td>
    <td></td>
    <td></td>
    <td></td>
    <td>Shutdown triggered by application.</td>
  </tr>
  <tr>
    <th>SHUTDOWN</th>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>


Channel State API
-----------------

All gRPC libraries will expose a channel-level API method to poll the current
state of a channel. In C++, this method is called GetState and returns an enum
for one of the five legal states. It also accepts a boolean `try_to_connect` to
transition to CONNECTING if the channel is currently IDLE. The boolean should
act as if an RPC occurred, so it should also reset IDLE_TIMEOUT.

```cpp
grpc_connectivity_state GetState(bool try_to_connect);
```

All libraries should also expose an API that enables the application (user of
the gRPC API) to be notified when the channel state changes. Since state
changes can be rapid and race with any such notification, the notification
should just inform the user that some state change has happened, leaving it to
the user to poll the channel for the current state.

The synchronous version of this API is:

```cpp
bool WaitForStateChange(grpc_connectivity_state source_state, gpr_timespec deadline);
```

which returns `true` when the state is something other than the
`source_state` and `false` if the deadline expires. Asynchronous- and futures-based
APIs should have a corresponding method that allows the application to be
notified when the state of a channel changes.

Note that a notification is delivered every time there is a transition from any
state to any *other* state. On the other hand, the rules for legal state
transitions require a transition from CONNECTING to TRANSIENT_FAILURE and back
to CONNECTING for every recoverable failure, even if the corresponding
exponential backoff requires no wait before retry. The combined effect is that
the application may receive state change notifications that appear spurious.
For example, an application waiting for state changes on a channel that is
CONNECTING may receive a state change notification but find the channel in the
same CONNECTING state on polling for the current state, because the channel may
have spent an infinitesimally small amount of time in the TRANSIENT_FAILURE
state.
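
The legal transitions and the seemingly spurious notifications described above
can be modeled in a few lines. The following is a minimal sketch and not gRPC
code: `State`, `kLegal`, `IsLegalTransition`, `ModelChannel` and
`NotificationsAcrossOneRetry` are hypothetical names invented for illustration,
and starting the model in IDLE is an assumption (CONNECTING may also be used as
the initial state).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of the five connectivity states; not the gRPC enum.
enum class State { kConnecting, kReady, kTransientFailure, kIdle, kShutdown };

constexpr std::size_t kNumStates = 5;

// kLegal[from][to] mirrors the transition table. The two self-transitions
// (CONNECTING and READY) stand for incremental progress within a state.
constexpr bool kLegal[kNumStates][kNumStates] = {
    //                      CONNECTING READY  T_FAILURE IDLE   SHUTDOWN
    /* CONNECTING        */ {true,     true,  true,     true,  true},
    /* READY             */ {false,    true,  true,     true,  true},
    /* TRANSIENT_FAILURE */ {true,     false, false,    false, true},
    /* IDLE              */ {true,     false, false,    false, true},
    /* SHUTDOWN          */ {false,    false, false,    false, false},
};

constexpr bool IsLegalTransition(State from, State to) {
  return kLegal[static_cast<std::size_t>(from)][static_cast<std::size_t>(to)];
}

// SHUTDOWN is terminal: a channel that enters it never leaves.
static_assert(!IsLegalTransition(State::kShutdown, State::kConnecting),
              "SHUTDOWN must have no outgoing transitions");

// Toy channel: counts one notification per transition to a different state.
class ModelChannel {
 public:
  State state() const { return state_; }
  int notifications() const { return notifications_; }

  // Applies the transition if the table allows it; returns false otherwise.
  bool Transition(State to) {
    if (!IsLegalTransition(state_, to)) return false;
    if (to != state_) ++notifications_;
    state_ = to;
    return true;
  }

 private:
  State state_ = State::kIdle;  // assumption: the model starts in IDLE
  int notifications_ = 0;
};

// Moves the channel to CONNECTING, then replays one failed attempt
// (CONNECTING -> TRANSIENT_FAILURE -> CONNECTING). Returns the number of
// notifications fired after CONNECTING was first reached.
int NotificationsAcrossOneRetry(ModelChannel& ch) {
  bool ok = ch.Transition(State::kConnecting);  // new RPC activity
  assert(ok);
  const int before = ch.notifications();
  ok = ch.Transition(State::kTransientFailure) &&  // attempt failed
       ch.Transition(State::kConnecting);          // backoff wait is over
  assert(ok);
  (void)ok;  // silence unused-variable warnings when NDEBUG is defined
  return ch.notifications() - before;
}
```

In this model a single retry cycle fires two notifications, yet the channel is
in CONNECTING both before and after it. This is why an application should treat
a notification only as a hint to poll the current state again.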