# Go gRPC Interceptors for Prometheus monitoring

[![Travis Build](https://travis-ci.org/grpc-ecosystem/go-grpc-prometheus.svg)](https://travis-ci.org/grpc-ecosystem/go-grpc-prometheus)
[![Go Report Card](https://goreportcard.com/badge/github.com/grpc-ecosystem/go-grpc-prometheus)](http://goreportcard.com/report/grpc-ecosystem/go-grpc-prometheus)
[![GoDoc](http://img.shields.io/badge/GoDoc-Reference-blue.svg)](https://godoc.org/github.com/grpc-ecosystem/go-grpc-prometheus)
[![SourceGraph](https://sourcegraph.com/github.com/grpc-ecosystem/go-grpc-prometheus/-/badge.svg)](https://sourcegraph.com/github.com/grpc-ecosystem/go-grpc-prometheus/?badge)
[![codecov](https://codecov.io/gh/grpc-ecosystem/go-grpc-prometheus/branch/master/graph/badge.svg)](https://codecov.io/gh/grpc-ecosystem/go-grpc-prometheus)
[![Apache 2.0 License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

[Prometheus](https://prometheus.io/) monitoring for your [gRPC Go](https://github.com/grpc/grpc-go) servers and clients.

A sister implementation for [gRPC Java](https://github.com/grpc/grpc-java) (same metrics, same semantics) is in [grpc-ecosystem/java-grpc-prometheus](https://github.com/grpc-ecosystem/java-grpc-prometheus).

## Interceptors

[gRPC Go](https://github.com/grpc/grpc-go) recently acquired support for Interceptors, i.e. middleware that is executed
by a gRPC Server before the request is passed on to the user's application logic. Interceptors are a perfect way to implement
common patterns: auth, logging and... monitoring.

To use Interceptors in chains, please see [`go-grpc-middleware`](https://github.com/mwitkow/go-grpc-middleware).

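For illustration (not part of this repository), here is a minimal chaining sketch, assuming the `ChainUnaryServer`/`ChainStreamServer` helpers from `go-grpc-middleware`; `myAuthUnary` and `myAuthStream` are hypothetical interceptors of your own:

```go
import (
    grpc_middleware "github.com/mwitkow/go-grpc-middleware"
    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)
...
    // Monitoring first, then your own interceptors (hypothetical examples).
    myServer := grpc.NewServer(
        grpc.UnaryInterceptor(grpc_middleware.ChainUnaryServer(
            grpc_prometheus.UnaryServerInterceptor,
            myAuthUnary,
        )),
        grpc.StreamInterceptor(grpc_middleware.ChainStreamServer(
            grpc_prometheus.StreamServerInterceptor,
            myAuthStream,
        )),
    )
...
```
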
## Usage

There are two types of interceptors: client-side and server-side. This package provides monitoring Interceptors for both.

### Server-side

```go
import "github.com/grpc-ecosystem/go-grpc-prometheus"
...
    // Initialize your gRPC server's interceptor.
    myServer := grpc.NewServer(
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
        grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
    )
    // Register your gRPC service implementations.
    myservice.RegisterMyServiceServer(myServer, &myServiceImpl{})
    // After all your registrations, make sure all of the Prometheus metrics are initialized.
    grpc_prometheus.Register(myServer)
    // Register the Prometheus metrics handler.
    http.Handle("/metrics", promhttp.Handler())
...
```
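
The snippet above is a fragment; a self-contained sketch of the same setup, with the imports it relies on, might look as follows (the listen addresses and the commented-out service registration are illustrative assumptions):

```go
package main

import (
    "log"
    "net"
    "net/http"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
)

func main() {
    // gRPC server with the monitoring interceptors installed.
    myServer := grpc.NewServer(
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
        grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
    )
    // myservice.RegisterMyServiceServer(myServer, &myServiceImpl{}) // your registrations go here

    // Initialize all metrics for the registered services with zero values.
    grpc_prometheus.Register(myServer)

    // Serve the Prometheus metrics endpoint on a separate HTTP port (assumed :9092).
    http.Handle("/metrics", promhttp.Handler())
    go func() { log.Fatal(http.ListenAndServe(":9092", nil)) }()

    // Serve gRPC on an assumed :9090.
    lis, err := net.Listen("tcp", ":9090")
    if err != nil {
        log.Fatalf("failed to listen: %v", err)
    }
    log.Fatal(myServer.Serve(lis))
}
```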

### Client-side

```go
import "github.com/grpc-ecosystem/go-grpc-prometheus"
...
    clientConn, err = grpc.Dial(
        address,
        grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
        grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
    )
    client = pb_testproto.NewTestServiceClient(clientConn)
    resp, err := client.PingEmpty(s.ctx, &myservice.Request{Msg: "hello"})
...
```
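
Again, a self-contained sketch with its imports might look like this (the target address, the insecure credentials and the commented-out generated client are illustrative assumptions):

```go
package main

import (
    "log"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)

func main() {
    // Dial with both monitoring interceptors installed.
    clientConn, err := grpc.Dial(
        "localhost:9090",    // assumed server address
        grpc.WithInsecure(), // assumption: plaintext for the example
        grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
        grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
    )
    if err != nil {
        log.Fatalf("failed to dial: %v", err)
    }
    defer clientConn.Close()

    // Use the connection with any generated client, e.g.:
    // client := pb_testproto.NewTestServiceClient(clientConn)
    // resp, err := client.PingEmpty(ctx, &pb_testproto.Empty{})
}
```

Note that for the client metrics to be scraped, the client process also needs to expose a Prometheus handler (e.g. `promhttp.Handler()`), just like the server above.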

# Metrics

## Labels

All server-side metrics start with `grpc_server` as the Prometheus subsystem name, and all client-side metrics start with `grpc_client`. The two are mirror concepts of each other,
and all metrics carry the same rich labels:

  * `grpc_service` - the [gRPC service](http://www.grpc.io/docs/#defining-a-service) name, which is the combination of the protobuf `package` and
    the `service` section name. E.g. for `package = mwitkow.testproto` and
    `service TestService` the label will be `grpc_service="mwitkow.testproto.TestService"`
  * `grpc_method` - the name of the method called on the gRPC service. E.g.
    `grpc_method="Ping"`
  * `grpc_type` - the gRPC [type of request](http://www.grpc.io/docs/guides/concepts.html#rpc-life-cycle).
    Differentiating between the types is important, especially for latency measurements.

     - `unary` is a single-request, single-response RPC
     - `client_stream` is a multi-request, single-response RPC
     - `server_stream` is a single-request, multi-response RPC
     - `bidi_stream` is a multi-request, multi-response RPC


Additionally, for completed RPCs the following label is used:

  * `grpc_code` - the human-readable [gRPC status code](https://github.com/grpc/grpc-go/blob/master/codes/codes.go).
    The full list of statuses is too long to reproduce here, but some common ones are:

      - `OK` - the RPC was successful
      - `InvalidArgument` - the RPC contained bad values
      - `Internal` - a server-side error not disclosed to the clients

## Counters

The counters and their up-to-date documentation are defined in [server_reporter.go](server_reporter.go) and [client_reporter.go](client_reporter.go),
and their values are exposed by the respective Prometheus handler (usually `/metrics`).

For the purpose of this documentation we will only discuss the `grpc_server` metrics. The `grpc_client` ones are their mirror concepts.

For simplicity, let's assume we're tracking a single server-side RPC call of [`mwitkow.testproto.TestService`](examples/testproto/test.proto),
calling the method `PingList`. The call succeeds and returns 20 messages in the stream.

First, immediately after the server receives the call it will increment the
`grpc_server_started_total` counter and start the handling time clock (if histograms are enabled).

```jsoniq
grpc_server_started_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

Then the user logic gets invoked. It receives one message from the client containing the request
(it's a `server_stream`):

```jsoniq
grpc_server_msg_received_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

The user logic may return an error, or send multiple messages back to the client. In this case, on
each of the 20 messages sent back, a counter will be incremented:

```jsoniq
grpc_server_msg_sent_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 20
```

After the call completes, its status (`OK` or another [gRPC status code](https://github.com/grpc/grpc-go/blob/master/codes/codes.go))
and the relevant call labels increment the `grpc_server_handled_total` counter.

```jsoniq
grpc_server_handled_total{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

## Histograms

[Prometheus histograms](https://prometheus.io/docs/concepts/metric_types/#histogram) are a great way
to measure latency distributions of your RPCs. However, since it is bad practice to have metrics
of [high cardinality](https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels),
the latency monitoring metrics are disabled by default. To enable them, call the following
in your server initialization code:

```go
grpc_prometheus.EnableHandlingTimeHistogram()
```
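
If the default buckets don't match your latency profile, `EnableHandlingTimeHistogram` accepts histogram options (see [metric_options.go](metric_options.go)). A minimal sketch, assuming the `WithHistogramBuckets` option and illustrative bucket values:

```go
// Sketch: enable handling-time histograms with custom buckets
// (the bucket values are illustrative assumptions).
grpc_prometheus.EnableHandlingTimeHistogram(
    grpc_prometheus.WithHistogramBuckets([]float64{0.001, 0.01, 0.1, 0.5, 1, 5}),
)
```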

After the call completes, its handling time will be recorded in a [Prometheus histogram](https://prometheus.io/docs/concepts/metric_types/#histogram)
variable `grpc_server_handling_seconds`. The histogram variable contains three sub-metrics:

 * `grpc_server_handling_seconds_count` - the count of all completed RPCs by status and method
 * `grpc_server_handling_seconds_sum` - cumulative time of RPCs by status and method, useful for
   calculating average handling times
 * `grpc_server_handling_seconds_bucket` - contains the counts of RPCs by status and method in respective
   handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see [here](https://prometheus.io/docs/practices/histograms/))

The counter values will look as follows:

```jsoniq
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.005"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.01"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.025"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.05"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.25"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="2.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="10"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="+Inf"} 1
grpc_server_handling_seconds_sum{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 0.0003866430000000001
grpc_server_handling_seconds_count{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```


## Useful query examples

The Prometheus philosophy is to provide raw metrics to the monitoring system and
let the aggregations be handled there. The verbosity of the above metrics makes that
flexibility possible. Here are a couple of useful monitoring queries:


### request inbound rate
```jsoniq
sum(rate(grpc_server_started_total{job="foo"}[1m])) by (grpc_service)
```
For `job="foo"` (a common label used to differentiate between Prometheus monitoring targets), calculate the
rate of requests per second (over a 1-minute window) for each gRPC `grpc_service` that the job serves. Please note
how `grpc_method` is omitted here: all methods of a given gRPC service are summed together.

### unary request error rate
```jsoniq
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
```
For `job="foo"`, calculate the per-`grpc_service` rate of `unary` (1:1) RPCs that failed, i.e. the
ones that didn't finish with an `OK` code.

### unary request error percentage
```jsoniq
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="unary"}[1m])) by (grpc_service)
 * 100.0
```
For `job="foo"`, calculate the percentage of failed requests by service. It's easy to notice that
this is a combination of the two examples above. This is an example of a query you would like to
[alert on](https://prometheus.io/docs/alerting/rules/) in your system for SLA violations, e.g.
"no more than 1% of requests should fail".

### average response stream size
```jsoniq
sum(rate(grpc_server_msg_sent_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
```
For `job="foo"`, this is the per-`grpc_service` `10m` average of messages returned for all `server_stream`
RPCs. This allows you to track the stream sizes returned by your system, e.g. to notice when clients
start sending "wide" queries that return many more messages than expected.
Note the divisor is the number of started RPCs, in order to account for in-flight requests.

### 99%-tile latency of unary requests
```jsoniq
histogram_quantile(0.99,
  sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary"}[5m])) by (grpc_service,le)
)
```
For `job="foo"`, this returns a 99th-percentile [quantile estimation](https://prometheus.io/docs/practices/histograms/#quantiles)
of the handling time of RPCs per service. Please note the `5m` rate: the quantile
estimation takes samples from a rolling `5m` window. When combined with other quantiles
(e.g. 50%, 90%), this query gives you tremendous insight into the responsiveness of your system
(e.g. the impact of caching).

### percentage of slow unary queries (>250ms)
```jsoniq
100.0 - (
sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary",le="0.25"}[5m])) by (grpc_service)
 /
sum(rate(grpc_server_handling_seconds_count{job="foo",grpc_type="unary"}[5m])) by (grpc_service)
) * 100.0
```
For `job="foo"`, calculate the per-`grpc_service` percentage of requests that took longer than `0.25`
seconds. This query is relatively complex, since the Prometheus aggregations use `le` (less than or equal)
buckets, meaning that counting the fraction of "fast" requests is easier; simple arithmetic then gives the slow fraction.
This is an example of a query you would like to alert on in your system for SLA violations,
e.g. "less than 1% of requests are slower than 250ms".


## Status

This code has been used since August 2015 as the basis for monitoring of *production* gRPC microservices at [Improbable](https://improbable.io).

## License

`go-grpc-prometheus` is released under the Apache 2.0 license. See the [LICENSE](LICENSE) file for details.