# Service Discovery

This directory contains the service discovery (SD) component of Prometheus.

## Design of a Prometheus SD

There are many requests to add new SDs to Prometheus. This section looks at
what makes a good SD and covers some of the common implementation issues.

### Does this make sense as an SD?

The first question to ask is whether it makes sense to add this particular
SD. An SD mechanism should be reasonably well established, and at a minimum in
use across multiple organisations. It should allow discovery of machines
and/or services running somewhere. When exactly an SD is popular enough to
justify being added to Prometheus natively is an open question.

It should not be a brand new SD mechanism, or a variant of an established
mechanism. We want to integrate Prometheus with the SD that's already there in
your infrastructure, not invent yet more ways to do service discovery. We also
do not add mechanisms to work around users lacking service discovery and/or
configuration management infrastructure.

SDs that merely discover other applications running the same software (e.g.
talk to one Kafka or Cassandra server to find the others) are not service
discovery. In that case the SD you should be looking at is whatever decides
that a machine is going to be a Kafka server, likely a machine database or
configuration management system.

If something is particularly custom or unusual, `file_sd` is the generic
mechanism provided for users to hook in. Generally with Prometheus we offer a
single generic mechanism for things with infinite variations, rather than
trying to support everything natively (see also: the Alertmanager webhook,
remote read, remote write, and the node exporter textfile collector). For
example, anything that would involve talking to a relational database should
use `file_sd` instead.

For configuration management systems like Chef, while they do have a
database/API that would in principle make sense to talk to for service
discovery, the idiomatic approach is to use Chef's templating facilities to
write out a file for use with `file_sd`.

### Mapping from SD to Prometheus

The general principle with SD is to extract all the potentially useful
information we can out of the SD, and let the user choose what they need of it
using
[relabelling](https://prometheus.io/docs/operating/configuration/#<relabel_config>).
This information is generally termed metadata.

Metadata is exposed as a set of key/value pairs (labels) per target. The keys
follow the pattern `__meta_<sdname>_<key>`, and there should also be an
`__address__` label with the host:port of the target (preferably an IP address
to avoid DNS lookups). No other labelnames should be exposed.
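
As an illustration, a single discovered target might be assembled as a label
set roughly like this (a sketch only; `mysd` and its metadata labels are
invented names):

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

func main() {
	// One discovered target: the mandatory __address__ label plus
	// SD-specific metadata labels. "mysd" is a hypothetical SD name.
	target := model.LabelSet{
		model.AddressLabel:       "10.1.2.3:9100",
		"__meta_mysd_node_name":  "node-1",
		"__meta_mysd_datacenter": "eu-west",
	}
	fmt.Println(target)
}
```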

It is very common for initial pull requests for new SDs to include hardcoded
assumptions that make sense for the author's setup. SDs should be generic;
any customisation should be handled via relabelling. There should be basically
no business logic, filtering, or transformations of the data from the SD beyond
that which is needed to fit it into the metadata data model.

Arrays (e.g. a list of tags) should be converted to a single label with the
array values joined with a comma. Also prefix and suffix the value with a
comma. So for example the array `[a, b, c]` would become `,a,b,c,`. As
relabelling regexes are fully anchored, this makes it easier to write correct
regexes against (`.*,a,.*` works no matter where `a` appears in the list). The
canonical example of this is `__meta_consul_tags`.
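
A minimal sketch of that joining convention (the helper name and tag values
are invented for illustration):

```go
package main

import (
	"fmt"
	"strings"
)

// joinTags flattens a list of tag values into a single label value that is
// separated, prefixed and suffixed with commas, so that an anchored
// relabelling regex such as `.*,a,.*` matches wherever "a" appears.
func joinTags(tags []string) string {
	return "," + strings.Join(tags, ",") + ","
}

func main() {
	fmt.Println(joinTags([]string{"a", "b", "c"})) // prints ,a,b,c,
}
```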

Maps, hashes and other forms of key/value pairs should all be prefixed and
exposed as labels. For example, for EC2 tags there would be
`__meta_ec2_tag_Description=mydescription` for the Description tag. Labelnames
may only contain `[_a-zA-Z0-9]`; sanitize other characters by replacing them
with underscores as needed.
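
One way to do that sanitisation (a sketch; the helper and the example tag key
are illustrative):

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidLabelCharRE matches every character that is not allowed in a
// Prometheus label name.
var invalidLabelCharRE = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitizeLabelName replaces disallowed characters with underscores.
func sanitizeLabelName(name string) string {
	return invalidLabelCharRE.ReplaceAllString(name, "_")
}

func main() {
	// e.g. an EC2 tag key containing colons.
	fmt.Println("__meta_ec2_tag_" + sanitizeLabelName("aws:cloudformation:stack-name"))
	// Output: __meta_ec2_tag_aws_cloudformation_stack_name
}
```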

For targets with multiple potential ports, you can a) expose them as a list,
b) if they're named, expose them as a map, or c) expose them each as their own
target. Kubernetes SD takes the target-per-port approach. a) and b) can be
combined.
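
For example, a) and b) combined might look like this (the `mysd` label names
and the ports are invented):

```go
package main

import (
	"fmt"
	"sort"
	"strings"

	"github.com/prometheus/common/model"
)

func main() {
	// Hypothetical named ports discovered for one target.
	ports := map[string]string{"http": "8080", "metrics": "9090"}

	labels := model.LabelSet{}
	var list []string
	for name, port := range ports {
		// b) one label per named port.
		labels[model.LabelName("__meta_mysd_port_"+name)] = model.LabelValue(port)
		list = append(list, port)
	}
	sort.Strings(list)
	// a) all ports as a single comma-separated, comma-wrapped list.
	labels["__meta_mysd_ports"] = model.LabelValue("," + strings.Join(list, ",") + ",")

	fmt.Println(labels)
}
```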

For machine-like SDs (OpenStack, EC2, Kubernetes to some extent) there may
be multiple network interfaces for a target. Thus far reporting the details
of only the first/primary network interface has sufficed.

### Other implementation considerations

SDs are intended to dump all possible targets. For example, the expected use of
EC2 service discovery is to take the entire region's worth of EC2 instances it
provides and do everything needed in one `scrape_config`. For large deployments
where you are only interested in a small proportion of the returned targets,
this may cause performance issues. If this occurs it is acceptable to also
offer filtering via whatever mechanisms the SD exposes. For EC2 that would be
the `Filter` option on `DescribeInstances`. Keep in mind that this is a
performance optimisation; it should be possible to do the same filtering using
relabelling alone. As with SD generally, we do not invent new ways to filter
targets (that is what relabelling is for), merely offer up whatever
functionality the SD itself offers.
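
For instance, passing a user-configured filter straight through to EC2 might
look roughly like this (a sketch using the aws-sdk-go client; the region and
the `tag:Environment` filter are purely illustrative):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ec2.New(sess)

	// Hand the filter to the API as-is rather than inventing a new
	// filtering language on top of it.
	input := &ec2.DescribeInstancesInput{
		Filters: []*ec2.Filter{
			{
				Name:   aws.String("tag:Environment"),
				Values: []*string{aws.String("production")},
			},
		},
	}

	out, err := svc.DescribeInstances(input)
	if err != nil {
		log.Fatal(err)
	}
	for _, r := range out.Reservations {
		for _, inst := range r.Instances {
			fmt.Println(aws.StringValue(inst.PrivateIpAddress))
		}
	}
}
```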

It is a general rule with Prometheus that all configuration comes from the
configuration file. While the libraries you use to talk to the SD may also
offer other mechanisms for providing configuration/authentication under the
covers (EC2's use of environment variables being a prime example), using your SD
mechanism should not require this. Put another way, your SD implementation
should not read environment variables or files to obtain configuration.
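
In practice that means everything the SD needs, including any credentials, is
declared in its configuration block and parsed from the Prometheus
configuration file. A sketch of such a block for a hypothetical "mysd"
mechanism (the package and field names are invented):

```go
package mysd // hypothetical package name

import "github.com/prometheus/common/model"

// MySDConfig is an illustrative configuration block for a hypothetical
// "mysd" mechanism. Everything the SD needs, including any credentials,
// is declared here and parsed from the Prometheus configuration file.
type MySDConfig struct {
	Server          string         `yaml:"server"`
	Token           string         `yaml:"token,omitempty"`
	RefreshInterval model.Duration `yaml:"refresh_interval,omitempty"`
}
```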

Some SD mechanisms have rate limits that make them challenging to use. As an
example we have unfortunately had to reject Amazon ECS service discovery due to
the rate limits being so low that it would not be usable for anything beyond
small setups.

If a system offers multiple distinct types of SD, select which is in use with a
configuration option rather than returning them all from one mega SD that
requires relabelling to select just the one you want. So far we have only seen
this with Kubernetes. Whether a single SD with a selector or multiple distinct
SDs makes more sense is an open question.

If there is a failure while talking to or processing data from the SD, abort
rather than returning partial data. It is better to work from stale targets
than from partial or incorrect metadata.

The information obtained from service discovery is not considered sensitive
security-wise. Do not return secrets in metadata; anyone with access to
the Prometheus server will be able to see them.

## Writing an SD mechanism

### The SD interface

A Service Discovery (SD) mechanism has to discover targets and provide them to Prometheus. We expect similar targets to be grouped together, in the form of a [`TargetGroup`](https://godoc.org/github.com/prometheus/prometheus/config#TargetGroup). The SD mechanism sends the targets down to Prometheus as a list of `TargetGroup`s.

An SD mechanism has to implement the `TargetProvider` interface:
```go
type TargetProvider interface {
	Run(ctx context.Context, up chan<- []*config.TargetGroup)
}
```

Prometheus will call the `Run()` method on a provider to initialise the discovery mechanism. The mechanism will then send *all* the `TargetGroup`s into the channel. From then on, the mechanism watches for changes and sends only changed and new `TargetGroup`s down the channel.

For example, if a discovery mechanism retrieves the following groups:

```
[]*config.TargetGroup{
  {
    Targets: []model.LabelSet{
       {
          "__instance__": "10.11.150.1:7870",
          "hostname": "demo-target-1",
          "test": "simple-test",
       },
       {
          "__instance__": "10.11.150.4:7870",
          "hostname": "demo-target-2",
          "test": "simple-test",
       },
    },
    Labels: model.LabelSet{
      "job": "mysql",
    },
    Source: "file1",
  },
  {
    Targets: []model.LabelSet{
       {
          "__instance__": "10.11.122.11:6001",
          "hostname": "demo-postgres-1",
          "test": "simple-test",
       },
       {
          "__instance__": "10.11.122.15:6001",
          "hostname": "demo-postgres-2",
          "test": "simple-test",
       },
    },
    Labels: model.LabelSet{
      "job": "postgres",
    },
    Source: "file2",
  },
}
```

Here there are two `TargetGroup`s: one group with source `file1` and another with `file2`. The grouping is implementation specific and could even be one target per group, but every target group sent by an SD instance must have a `Source` that is unique across all the `TargetGroup`s of that SD instance.

In this case, both the `TargetGroup`s are sent down the channel the first time `Run()` is called. For an update, we then send the whole _changed_ `TargetGroup` down the channel. That is, if the target with `hostname: demo-postgres-2` goes away, we send:
```
&config.TargetGroup{
  Targets: []model.LabelSet{
     {
        "__instance__": "10.11.122.11:6001",
        "hostname": "demo-postgres-1",
        "test": "simple-test",
     },
  },
  Labels: model.LabelSet{
    "job": "postgres",
  },
  Source: "file2",
}
```
down the channel.

If all the targets in a group go away, we need to send the target group with empty `Targets` down the channel. That is, if all targets with `job: postgres` go away, we send:
```
&config.TargetGroup{
  Targets: nil,
  Source: "file2",
}
```
down the channel.
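
Putting this together, a minimal provider might look roughly like the sketch
below. It simply polls and resends complete groups on an interval; a
watch-based mechanism would instead send only the changed groups. The package
name, the refresh interval and the `fetchGroups` helper are all invented for
illustration:

```go
// Package mysd sketches a hypothetical SD mechanism.
package mysd

import (
	"context"
	"log"
	"time"

	"github.com/prometheus/prometheus/config"
)

// Discovery periodically fetches all target groups from the (imaginary)
// backend and sends them down the channel provided by Prometheus.
type Discovery struct {
	interval time.Duration
}

// Run implements the TargetProvider interface.
func (d *Discovery) Run(ctx context.Context, up chan<- []*config.TargetGroup) {
	ticker := time.NewTicker(d.interval)
	defer ticker.Stop()

	for {
		groups, err := d.fetchGroups(ctx)
		if err != nil {
			// On failure, send nothing: working from stale targets is
			// better than sending partial or incorrect metadata.
			log.Println("refresh failed:", err)
		} else {
			select {
			case up <- groups:
			case <-ctx.Done():
				return
			}
		}

		select {
		case <-ticker.C:
		case <-ctx.Done():
			return
		}
	}
}

// fetchGroups stands in for talking to the SD backend and converting its
// answer into TargetGroups, each with a Source unique to this SD instance.
func (d *Discovery) fetchGroups(ctx context.Context) ([]*config.TargetGroup, error) {
	// ... query the backend and build the groups here ...
	return nil, nil
}
```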

<!-- TODO: Add best-practices -->