| Name | Date | Size |
|------|------|------|
| examples/ | 02-Oct-2019 | - |
| packages/grpcstatus/ | 02-Oct-2019 | - |
| scripts/ | 02-Oct-2019 | - |
| .gitignore | 02-Oct-2019 | 3.2 KiB |
| .travis.yml | 02-Oct-2019 | 611 |
| CHANGELOG.md | 02-Oct-2019 | 1.1 KiB |
| LICENSE | 02-Oct-2019 | 11.1 KiB |
| README.md | 02-Oct-2019 | 13.5 KiB |
| client.go | 02-Oct-2019 | 2.4 KiB |
| client_metrics.go | 02-Oct-2019 | 8.7 KiB |
| client_reporter.go | 02-Oct-2019 | 2.2 KiB |
| client_test.go | 02-Oct-2019 | 6.6 KiB |
| go.mod | 02-Oct-2019 | 336 |
| go.sum | 02-Oct-2019 | 4.7 KiB |
| makefile | 02-Oct-2019 | 204 |
| metric_options.go | 02-Oct-2019 | 1.1 KiB |
| server.go | 02-Oct-2019 | 1.9 KiB |
| server_metrics.go | 02-Oct-2019 | 7.3 KiB |
| server_reporter.go | 02-Oct-2019 | 1.4 KiB |
| server_test.go | 02-Oct-2019 | 12.8 KiB |
| util.go | 02-Oct-2019 | 1.3 KiB |

README.md

# Go gRPC Interceptors for Prometheus monitoring

[![Travis Build](https://travis-ci.org/grpc-ecosystem/go-grpc-prometheus.svg)](https://travis-ci.org/grpc-ecosystem/go-grpc-prometheus)
[![Go Report Card](https://goreportcard.com/badge/github.com/grpc-ecosystem/go-grpc-prometheus)](http://goreportcard.com/report/grpc-ecosystem/go-grpc-prometheus)
[![GoDoc](http://img.shields.io/badge/GoDoc-Reference-blue.svg)](https://godoc.org/github.com/grpc-ecosystem/go-grpc-prometheus)
[![SourceGraph](https://sourcegraph.com/github.com/grpc-ecosystem/go-grpc-prometheus/-/badge.svg)](https://sourcegraph.com/github.com/grpc-ecosystem/go-grpc-prometheus/?badge)
[![codecov](https://codecov.io/gh/grpc-ecosystem/go-grpc-prometheus/branch/master/graph/badge.svg)](https://codecov.io/gh/grpc-ecosystem/go-grpc-prometheus)
[![Slack](https://img.shields.io/badge/join%20slack-%23go--grpc--prometheus-brightgreen.svg)](https://join.slack.com/t/improbable-eng/shared_invite/enQtMzQ1ODcyMzQ5MjM4LWY5ZWZmNGM2ODc5MmViNmQ3ZTA3ZTY3NzQwOTBlMTkzZmIxZTIxODk0OWU3YjZhNWVlNDU3MDlkZGViZjhkMjc)
[![Apache 2.0 License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)

[Prometheus](https://prometheus.io/) monitoring for your [gRPC Go](https://github.com/grpc/grpc-go) servers and clients.

A sister implementation for [gRPC Java](https://github.com/grpc/grpc-java) (same metrics, same semantics) is in [grpc-ecosystem/java-grpc-prometheus](https://github.com/grpc-ecosystem/java-grpc-prometheus).

## Interceptors

[gRPC Go](https://github.com/grpc/grpc-go) recently acquired support for Interceptors, i.e. middleware that is executed
by a gRPC Server before the request is passed on to the user's application logic. It is a perfect way to implement
common patterns: auth, logging and... monitoring.

To use Interceptors in chains, please see [`go-grpc-middleware`](https://github.com/mwitkow/go-grpc-middleware), as in the sketch below.

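A minimal chaining sketch, assuming the go-grpc-middleware v1 `ChainUnaryServer` / `ChainStreamServer` helpers and a hypothetical `myAuthInterceptor` of your own:

```go
import (
    grpc_middleware "github.com/grpc-ecosystem/go-grpc-middleware"
    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)
...
    // The Prometheus interceptors become one link in each chain. The first
    // interceptor is the outermost, so putting it first times everything
    // downstream of it.
    myServer := grpc.NewServer(
        grpc.UnaryInterceptor(grpc_middleware.ChainUnaryServer(
            grpc_prometheus.UnaryServerInterceptor,
            myAuthInterceptor, // hypothetical interceptor of your own
        )),
        grpc.StreamInterceptor(grpc_middleware.ChainStreamServer(
            grpc_prometheus.StreamServerInterceptor,
        )),
    )
...
```
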
This library requires Go 1.9 or later.

## Usage

There are two types of interceptors: client-side and server-side. This package provides monitoring Interceptors for both.

### Server-side

```go
import (
    "net/http"

    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "google.golang.org/grpc"
)
...
    // Initialize your gRPC server's interceptors.
    myServer := grpc.NewServer(
        grpc.StreamInterceptor(grpc_prometheus.StreamServerInterceptor),
        grpc.UnaryInterceptor(grpc_prometheus.UnaryServerInterceptor),
    )
    // Register your gRPC service implementations.
    myservice.RegisterMyServiceServer(myServer, &myServiceImpl{})
    // After all your registrations, make sure all of the Prometheus metrics are initialized.
    grpc_prometheus.Register(myServer)
    // Register the Prometheus metrics handler.
    http.Handle("/metrics", promhttp.Handler())
...
```

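If nothing else serves HTTP traffic in your process, a minimal sketch for exposing the handler on its own listener (the `:9092` address is an arbitrary example; `net/http` and `log` imports are assumed):

```go
    // Serve the /metrics handler registered above on a dedicated listener.
    go func() {
        if err := http.ListenAndServe(":9092", nil); err != nil {
            log.Fatal(err)
        }
    }()
```
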
### Client-side

```go
import (
    grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
    "google.golang.org/grpc"
)
...
    clientConn, err := grpc.Dial(
        address,
        grpc.WithUnaryInterceptor(grpc_prometheus.UnaryClientInterceptor),
        grpc.WithStreamInterceptor(grpc_prometheus.StreamClientInterceptor),
    )
    client := pb_testproto.NewTestServiceClient(clientConn)
    resp, err := client.PingEmpty(ctx, &pb_testproto.Empty{})
...
```

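Client-side latency histograms are likewise disabled by default to keep label cardinality low (see the Histograms section below). A sketch for turning them on, assuming the `EnableClientHandlingTimeHistogram` helper in [client.go](client.go):

```go
    // Enable client-side handling-time histograms; call once during
    // client initialization, before issuing RPCs.
    grpc_prometheus.EnableClientHandlingTimeHistogram()
```
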
# Metrics

## Labels

All server-side metrics start with `grpc_server` as the Prometheus subsystem name; all client-side metrics start with `grpc_client`. The two are mirror concepts, and all methods
carry the same rich labels:

  * `grpc_service` - the [gRPC service](http://www.grpc.io/docs/#defining-a-service) name, which is the combination of the protobuf `package` and
    the `service` name. E.g. for `package = mwitkow.testproto` and
    `service TestService` the label will be `grpc_service="mwitkow.testproto.TestService"`
  * `grpc_method` - the name of the method called on the gRPC service. E.g.
    `grpc_method="Ping"`
  * `grpc_type` - the gRPC [type of request](http://www.grpc.io/docs/guides/concepts.html#rpc-life-cycle).
    Differentiating between these is important especially for latency measurements.

     - `unary` is a single-request, single-response RPC
     - `client_stream` is a multi-request, single-response RPC
     - `server_stream` is a single-request, multi-response RPC
     - `bidi_stream` is a multi-request, multi-response RPC

Additionally, for completed RPCs the following label is used:

  * `grpc_code` - the human-readable [gRPC status code](https://github.com/grpc/grpc-go/blob/master/codes/codes.go).
    The list of all statuses is too long to reproduce here, but some common ones are:

      - `OK` - the RPC was successful
      - `InvalidArgument` - the RPC contained bad values
      - `Internal` - a server-side error not disclosed to the clients

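The code label comes from the status your handler returns. A hedged sketch using the standard `google.golang.org/grpc/status` and `codes` packages (`myServiceImpl` and the `myservice` types are the README's placeholders; `Response` is hypothetical):

```go
import (
    "context"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// SayHello is a hypothetical unary handler. Returning this error causes the
// completed RPC to be counted with grpc_code="InvalidArgument".
func (s *myServiceImpl) SayHello(ctx context.Context, req *myservice.Request) (*myservice.Response, error) {
    if req.Msg == "" {
        return nil, status.Error(codes.InvalidArgument, "msg must not be empty")
    }
    return &myservice.Response{Msg: "hello"}, nil
}
```
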
## Counters

The counters and their up-to-date documentation are in [server_reporter.go](server_reporter.go) and [client_reporter.go](client_reporter.go),
and their values are exposed by the respective Prometheus handler (usually `/metrics`).

For the purposes of this documentation we will only discuss the `grpc_server` metrics; the `grpc_client` ones are mirror concepts.

For simplicity, let's assume we're tracking a single server-side RPC call of [`mwitkow.testproto.TestService`](examples/testproto/test.proto),
calling the method `PingList`. The call succeeds and returns 20 messages in the stream.

First, immediately after the server receives the call, it increments
`grpc_server_started_total` and starts the handling-time clock (if histograms are enabled).

```jsoniq
grpc_server_started_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

Then the user logic gets invoked. It receives one message from the client containing the request
(it's a `server_stream`):

```jsoniq
grpc_server_msg_received_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

The user logic may return an error, or send multiple messages back to the client. In this case, on
each of the 20 messages sent back, a counter will be incremented:

```jsoniq
grpc_server_msg_sent_total{grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 20
```

After the call completes, its status (`OK` or other [gRPC status code](https://github.com/grpc/grpc-go/blob/master/codes/codes.go))
and the relevant call labels increment the `grpc_server_handled_total` counter.

```jsoniq
grpc_server_handled_total{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

## Histograms

[Prometheus histograms](https://prometheus.io/docs/concepts/metric_types/#histogram) are a great way
to measure latency distributions of your RPCs. However, since it is bad practice to have metrics
of [high cardinality](https://prometheus.io/docs/practices/instrumentation/#do-not-overuse-labels),
the latency monitoring metrics are disabled by default. To enable them, call the following
in your server initialization code:

```go
grpc_prometheus.EnableHandlingTimeHistogram()
```

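The default buckets can be overridden via the histogram options in [metric_options.go](metric_options.go); a sketch with arbitrary example boundaries, assuming the `WithHistogramBuckets` option:

```go
// Bucket boundaries are arbitrary example values, in seconds.
grpc_prometheus.EnableHandlingTimeHistogram(
    grpc_prometheus.WithHistogramBuckets([]float64{0.001, 0.01, 0.1, 0.3, 1, 5}),
)
```
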
After the call completes, its handling time will be recorded in a [Prometheus histogram](https://prometheus.io/docs/concepts/metric_types/#histogram)
named `grpc_server_handling_seconds`. The histogram exposes three sub-metrics:

 * `grpc_server_handling_seconds_count` - the count of all completed RPCs by status and method
 * `grpc_server_handling_seconds_sum` - cumulative time of RPCs by status and method, useful for
   calculating average handling times
 * `grpc_server_handling_seconds_bucket` - contains the counts of RPCs by status and method in respective
   handling-time buckets. These buckets can be used by Prometheus to estimate SLAs (see [here](https://prometheus.io/docs/practices/histograms/))

The counter values will look as follows:

```jsoniq
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.005"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.01"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.025"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.05"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.25"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="0.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="1"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="2.5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="5"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="10"} 1
grpc_server_handling_seconds_bucket{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream",le="+Inf"} 1
grpc_server_handling_seconds_sum{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 0.0003866430000000001
grpc_server_handling_seconds_count{grpc_code="OK",grpc_method="PingList",grpc_service="mwitkow.testproto.TestService",grpc_type="server_stream"} 1
```

## Useful query examples

The Prometheus philosophy is to provide raw metrics to the monitoring system and
let the aggregations be handled there. The verbosity of the metrics above makes that
flexibility possible. Here are a couple of useful monitoring queries:

### request inbound rate
```jsoniq
sum(rate(grpc_server_started_total{job="foo"}[1m])) by (grpc_service)
```
For `job="foo"` (a common label used to differentiate between Prometheus monitoring targets), calculate the
rate of requests per second (1-minute window) for each gRPC `grpc_service` that the job exposes. Note
that `grpc_method` is omitted here: all methods of a given gRPC service are summed together.

### unary request error rate
```jsoniq
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
```
For `job="foo"`, calculate the per-`grpc_service` rate of `unary` (1:1) RPCs that failed, i.e. the
ones that didn't finish with an `OK` code.

### unary request error percentage
```jsoniq
sum(rate(grpc_server_handled_total{job="foo",grpc_type="unary",grpc_code!="OK"}[1m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="unary"}[1m])) by (grpc_service)
 * 100.0
```
For `job="foo"`, calculate the percentage of failed requests by service. Notice that
this is simply a combination of the two examples above. This is the kind of query you would want to
[alert on](https://prometheus.io/docs/alerting/rules/) for SLA violations, e.g.
"no more than 1% of requests should fail".

### average response stream size
```jsoniq
sum(rate(grpc_server_msg_sent_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
 /
sum(rate(grpc_server_started_total{job="foo",grpc_type="server_stream"}[10m])) by (grpc_service)
```
For `job="foo"`, calculate the `grpc_service`-wide `10m` average number of messages returned per
`server_stream` RPC. This allows you to track the stream sizes returned by your system, e.g. to notice
when clients start sending "wide" queries that return large numbers of messages.
Note the divisor is the number of started RPCs, in order to account for in-flight requests.

### 99%-tile latency of unary requests
```jsoniq
histogram_quantile(0.99,
  sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary"}[5m])) by (grpc_service,le)
)
```
For `job="foo"`, return a 99th-percentile [quantile estimation](https://prometheus.io/docs/practices/histograms/#quantiles)
of the handling time of RPCs per service. Note the `5m` rate: the quantile
estimation takes samples from a rolling `5m` window. When combined with other quantiles
(e.g. 50%, 90%), this query gives you tremendous insight into the responsiveness of your system
(e.g. the impact of caching).

### percentage of slow unary queries (>250ms)
```jsoniq
100.0 - (
sum(rate(grpc_server_handling_seconds_bucket{job="foo",grpc_type="unary",le="0.25"}[5m])) by (grpc_service)
 /
sum(rate(grpc_server_handling_seconds_count{job="foo",grpc_type="unary"}[5m])) by (grpc_service)
) * 100.0
```
For `job="foo"`, calculate the per-`grpc_service` percentage of slow requests that took longer than `0.25`
seconds. The query looks slightly backwards because Prometheus histogram buckets are cumulative `le`
(less-than-or-equal) buckets, so it is easier to count the "fast" requests and subtract their share from 100%.
This is another query you might alert on for SLA violations,
e.g. "less than 1% of requests are slower than 250ms".

## Status

This code has been used since August 2015 as the basis for monitoring of *production* gRPC microservices at [Improbable](https://improbable.io).

## License

`go-grpc-prometheus` is released under the Apache 2.0 license. See the [LICENSE](LICENSE) file for details.
251