1# Ceph plugin
Provides a native Zabbix solution for monitoring Ceph clusters (a distributed storage system). It can monitor several
Ceph instances simultaneously, remote or local to the Zabbix agent.
It works best in conjunction with the official
[Ceph template](https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/ceph_agent2).
You can extend it or create your own template for your specific needs.
7
8## Requirements
9* Zabbix Agent 2
10* Go >= 1.13 (required only to build from source)
11
12## Supported versions
13* Ceph, version 14+
14
15## Installation
* Configure the Ceph RESTful module according to the [documentation](https://docs.ceph.com/en/latest/mgr/restful/).
* Make sure a RESTful API endpoint is available for connection.
18
19## Configuration
20The Zabbix agent 2 configuration file is used to configure plugins.
21
**Plugins.Ceph.InsecureSkipVerify** — Controls whether an HTTP client verifies the
server's certificate chain and host name. If set to true, TLS accepts any certificate presented by
the server and any host name in that certificate; in this mode, TLS is susceptible to man-in-the-middle attacks.
**This should be used only for testing.**
26*Default value:* false
27*Limits:* false | true
28
**Plugins.Ceph.Timeout** — The maximum time, in seconds, to wait for a request to complete. The timeout includes
connection time, any redirects, and reading the response body.
31*Default value:* equals the global Timeout configuration parameter.
32*Limits:* 1-30
33
**Plugins.Ceph.KeepAlive** — The time, in seconds, to wait before closing unused connections.
35*Default value:* 300 sec.
36*Limits:* 60-900
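
For example, these plugin parameters could be set in the Zabbix agent 2 configuration file as follows (the values are
illustrative and stay within the documented limits):

    Plugins.Ceph.InsecureSkipVerify=false
    Plugins.Ceph.Timeout=10
    Plugins.Ceph.KeepAlive=300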
37
38### Configuring connection
39A connection can be configured using either keys' parameters or named sessions.
40
41*Notes*:
42* It is not possible to mix configuration using named sessions and keys' parameters simultaneously.
* You can leave any connection parameter empty; a hard-coded default value will be used in that case.
* Embedded URI credentials (userinfo) are forbidden and will be ignored, so you cannot pass credentials like this:
45
46      ceph.ping[https://user:apikey@127.0.0.1] — WRONG
47
48  The correct way is:
49
50      ceph.ping[https://127.0.0.1,user,apikey]
51
* The only supported network scheme for a URI is "https".
53Examples of valid URIs:
54    - https://127.0.0.1:8003
55    - https://localhost
56    - localhost
57
58#### Using keys' parameters
The common parameters for all keys are: [ConnString][,User][,ApiKey],
where ConnString can be either a URI or a session name.
ConnString is treated as a URI if no session with the given name is found.
If you use ConnString as a session name, just skip the rest of the connection parameters.
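
For example, an explicit connection can be passed directly in a key (the user and API key below are placeholders for
your RESTful module credentials):

    ceph.ping[https://127.0.0.1:8003,<user>,<apikey>]

If the first parameter is a session name, the remaining parameters can be omitted (see the named sessions example
below).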
63
64#### Using named sessions
Named sessions allow you to define specific parameters for each Ceph instance. Currently, only three parameters are
supported: Uri, User and ApiKey. Storing credentials in a named session is somewhat more secure than passing them in
item keys or macros.
67
For example, suppose you have two Ceph clusters: "Prod" and "Test".
Add the following options to the agent configuration file:
70
71    Plugins.Ceph.Sessions.Prod.Uri=https://192.168.1.1:8003
72    Plugins.Ceph.Sessions.Prod.User=<UserForProd>
73    Plugins.Ceph.Sessions.Prod.ApiKey=<ApiKeyForProd>
74
75    Plugins.Ceph.Sessions.Test.Uri=https://192.168.0.1:8003
76    Plugins.Ceph.Sessions.Test.User=<UserForTest>
77    Plugins.Ceph.Sessions.Test.ApiKey=<ApiKeyForTest>
78
Then you can use these names as the first parameter (ConnString) in keys instead of URIs, e.g.:
80
81    ceph.ping[Prod]
82    ceph.ping[Test]
83
*Note*: session names are case-sensitive.
85
86## Supported keys
**ceph.df.details[\<commonParams\>]** — Returns information about the cluster’s data usage and distribution among pools.
Uses data provided by the "df detail" command.
89*Output sample:*
90```json
91{
92    "pools": {
93        "device_health_metrics": {
94            "percent_used": 0,
95            "objects": 0,
96            "bytes_used": 0,
97            "rd_ops": 0,
98            "rd_bytes": 0,
99            "wr_ops": 0,
100            "wr_bytes": 0,
101            "stored_raw": 0,
102            "max_avail": 1390035968
103        },
104        "new_pool": {
105            "percent_used": 0,
106            "objects": 0,
107            "bytes_used": 0,
108            "rd_ops": 0,
109            "rd_bytes": 0,
110            "wr_ops": 0,
111            "wr_bytes": 0,
112            "stored_raw": 0,
113            "max_avail": 695039808
114        },
115        "test_zabbix": {
116            "percent_used": 0,
117            "objects": 4,
118            "bytes_used": 786432,
119            "rd_ops": 0,
120            "rd_bytes": 0,
121            "wr_ops": 4,
122            "wr_bytes": 24576,
123            "stored_raw": 66618,
124            "max_avail": 1390035968
125        },
126        "zabbix": {
127            "percent_used": 0,
128            "objects": 0,
129            "bytes_used": 0,
130            "rd_ops": 0,
131            "rd_bytes": 0,
132            "wr_ops": 0,
133            "wr_bytes": 0,
134            "stored_raw": 0,
135            "max_avail": 1390035968
136        }
137    },
138    "rd_ops": 0,
139    "rd_bytes": 0,
140    "wr_ops": 4,
141    "wr_bytes": 24576,
142    "num_pools": 4,
143    "total_bytes": 12872318976,
144    "total_avail_bytes": 6898843648,
145    "total_used_bytes": 2752249856,
146    "total_objects": 4
147}
148```
149
**ceph.osd.stats[\<commonParams\>]** — Returns aggregated and per-OSD statistics.
Uses data provided by the "pg dump" command.
152*Output sample:*
153```json
154{
155    "osd_latency_apply": {
156        "min": 0,
157        "max": 0,
158        "avg": 0
159    },
160    "osd_latency_commit": {
161        "min": 0,
162        "max": 0,
163        "avg": 0
164    },
165    "osd_fill": {
166        "min": 47,
167        "max": 47,
168        "avg": 47
169    },
170    "osd_pgs": {
171        "min": 65,
172        "max": 65,
173        "avg": 65
174    },
175    "osds": {
176        "0": {
177            "osd_latency_apply": 0,
178            "osd_latency_commit": 0,
179            "num_pgs": 65,
180            "osd_fill": 47
181        },
182        "1": {
183            "osd_latency_apply": 0,
184            "osd_latency_commit": 0,
185            "num_pgs": 65,
186            "osd_fill": 47
187        },
188        "2": {
189            "osd_latency_apply": 0,
190            "osd_latency_commit": 0,
191            "num_pgs": 65,
192            "osd_fill": 47
193        }
194    }
195}
196```
197
**ceph.osd.discovery[\<commonParams\>]** — Returns a list of discovered OSDs in LLD format.
Can be used in conjunction with "ceph.osd.stats" and "ceph.osd.dump" in order to create per-OSD items.
Uses data provided by the "osd crush tree" command.
201*Output sample:*
202```json
203[
204  {
205    "{#OSDNAME}": "0",
206    "{#CLASS}": "hdd",
207    "{#HOST}": "node1"
208  },
209  {
210    "{#OSDNAME}": "1",
211    "{#CLASS}": "hdd",
212    "{#HOST}": "node2"
213  },
214  {
215    "{#OSDNAME}": "2",
216    "{#CLASS}": "hdd",
217    "{#HOST}": "node3"
218  }
219]
220```
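
One typical way to use these discovery macros (a sketch, not a prescribed template design) is to create item prototypes
as dependent items on a "ceph.osd.stats" master item and extract a single OSD's value with a JSONPath preprocessing
step, for example:

    $.osds['{#OSDNAME}'].osd_fill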
221
222**ceph.osd.dump[\<commonParams\>]** — Returns usage thresholds and statuses of OSDs.
Uses data provided by the "osd dump" command.
224*Output sample:*
225```json
226{
227    "osd_backfillfull_ratio": 0.9,
228    "osd_full_ratio": 0.95,
229    "osd_nearfull_ratio": 0.85,
230    "num_pg_temp": 65,
231    "osds": {
232        "0": {
233            "in": 1,
234            "up": 1
235        },
236        "1": {
237            "in": 1,
238            "up": 1
239        },
240        "2": {
241            "in": 1,
242            "up": 1
243        }
244    }
245}
246```
247
**ceph.ping[\<commonParams\>]** — Tests whether a connection to the cluster is alive.
Uses data provided by the "health" command.
*Returns:*
- "1" if the connection is alive.
- "0" if the connection is broken (any error is present, including AUTH and configuration issues).
253
**ceph.pool.discovery[\<commonParams\>]** — Returns a list of discovered pools in LLD format.
Can be used in conjunction with "ceph.df.details" in order to create per-pool items.
Uses data provided by the "osd dump" and "osd crush rule dump" commands.
257*Output sample:*
258```json
259[
260    {
261        "{#POOLNAME}": "device_health_metrics",
262        "{#CRUSHRULE}": "default"
263    },
264    {
265        "{#POOLNAME}": "test_zabbix",
266        "{#CRUSHRULE}": "default"
267    },
268    {
269        "{#POOLNAME}": "zabbix",
270        "{#CRUSHRULE}": "default"
271    },
272    {
273        "{#POOLNAME}": "new_pool",
274        "{#CRUSHRULE}": "newbucket"
275    }
276]
277```
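
Similarly, {#POOLNAME} can be combined with the "ceph.df.details" output: an item prototype defined as a dependent item
could extract a single pool's metric with a JSONPath preprocessing step such as:

    $.pools['{#POOLNAME}'].bytes_used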
278
**ceph.status[\<commonParams\>]** — Returns the overall cluster status.
Uses data provided by the "status" command.
281*Output sample:*
282```json
283{
284    "overall_status": 2,
285    "num_mon": 3,
286    "num_osd": 3,
287    "num_osd_in": 2,
288    "num_osd_up": 1,
289    "num_pg": 66,
290    "pg_states": {
291        "activating": 0,
292        "active": 0,
293        "backfill_toofull": 0,
294        "backfill_unfound": 0,
295        "backfill_wait": 0,
296        "backfilling": 0,
297        "clean": 0,
298        "creating": 0,
299        "deep": 0,
300        "degraded": 36,
301        "down": 0,
302        "forced_backfill": 0,
303        "forced_recovery": 0,
304        "incomplete": 0,
305        "inconsistent": 0,
306        "laggy": 0,
307        "peered": 65,
308        "peering": 0,
309        "recovering": 0,
310        "recovery_toofull": 0,
311        "recovery_unfound": 1,
312        "recovery_wait": 0,
313        "remapped": 0,
314        "repair": 0,
315        "scrubbing": 0,
316        "snaptrim": 0,
317        "snaptrim_error": 0,
318        "snaptrim_wait": 0,
319        "stale": 0,
320        "undersized": 65,
321        "unknown": 1,
322        "wait": 0
323    },
324    "min_mon_release_name": "octopus"
325}
326```
327
328## Troubleshooting
The plugin writes to the Zabbix agent's log. You can increase the agent's debug level if you need more details about
what is happening.
331
If you get the error "x509: cannot validate certificate for x.x.x.x because it doesn't contain any IP SANs",
you probably need to set the InsecureSkipVerify option to "true" or use a certificate that is signed by your
organization’s certificate authority.
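
For example, while testing you could raise the agent's debug level and disable certificate verification in the agent
configuration file (DebugLevel is the agent's global logging parameter; do not leave InsecureSkipVerify enabled in
production):

    DebugLevel=4
    Plugins.Ceph.InsecureSkipVerify=true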
335