# Ceph plugin
Provides a native Zabbix solution for monitoring Ceph clusters (distributed storage system). It can monitor several
Ceph instances simultaneously, remote or local to the Zabbix Agent.
Best for use in conjunction with the official
[Ceph template](https://git.zabbix.com/projects/ZBX/repos/zabbix/browse/templates/app/ceph_agent2).
You can extend it or create your own template for your specific needs.

## Requirements
* Zabbix Agent 2
* Go >= 1.13 (required only to build from source)

## Supported versions
* Ceph, version 14+

## Installation
* Configure the Ceph RESTful Module according to the [documentation](https://docs.ceph.com/en/latest/mgr/restful/).
* Make sure a RESTful API endpoint is available for connection.

## Configuration
The Zabbix agent 2 configuration file is used to configure plugins.

**Plugins.Ceph.InsecureSkipVerify** — Controls whether an HTTP client verifies the
server's certificate chain and host name. If InsecureSkipVerify is true, TLS accepts any certificate presented by
the server and any host name in that certificate. In this mode, TLS is susceptible to man-in-the-middle attacks.
**This should be used only for testing.**
*Default value:* false
*Limits:* false | true

**Plugins.Ceph.Timeout** — The maximum time in seconds to wait for a request to complete. The timeout includes
connection time, any redirects, and reading the response body.
*Default value:* equals the global Timeout configuration parameter.
*Limits:* 1-30

**Plugins.Ceph.KeepAlive** — The time in seconds to wait before closing unused connections.
*Default value:* 300 sec.
*Limits:* 60-900
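Putting these together, a plugin section in the agent configuration file might look like the following sketch (the
values are illustrative, not recommendations):

    Plugins.Ceph.Timeout=10
    Plugins.Ceph.KeepAlive=300
    Plugins.Ceph.InsecureSkipVerify=false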
### Configuring connection
A connection can be configured using either keys' parameters or named sessions.

*Notes*:
* It is not possible to mix configuration using named sessions and keys' parameters simultaneously.
* You can leave any connection parameter empty; a hard-coded default value will be used in that case.
* Embedded URI credentials (userinfo) are forbidden and will be ignored. So, you cannot pass credentials this way:

      ceph.ping[https://user:apikey@127.0.0.1] — WRONG

  The correct way is:

      ceph.ping[https://127.0.0.1,user,apikey]

* The only supported network scheme for a URI is "https".
  Examples of valid URIs:
    - https://127.0.0.1:8003
    - https://localhost
    - localhost

#### Using keys' parameters
The common parameters for all keys are: [ConnString][,User][,ApiKey]
Where ConnString can be either a URI or a session name.
ConnString will be treated as a URI if no session with the given name is found.
If you use ConnString as a session name, just skip the rest of the connection parameters.
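For example (the user name and API key below are placeholders):

    ceph.ping[https://127.0.0.1:8003,zabbix,<ApiKey>]
    ceph.ping[Prod]

The first key passes the connection parameters directly; the second refers to the named session "Prod" as defined in
the next section.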
#### Using named sessions
Named sessions allow you to define specific parameters for each Ceph instance. Currently, there are only three supported
parameters: Uri, User and ApiKey. It is a somewhat more secure way to store credentials compared to item keys or macros.

E.g., suppose you have two Ceph clusters: "Prod" and "Test".
You should add the following options to the agent configuration file:

    Plugins.Ceph.Sessions.Prod.Uri=https://192.168.1.1:8003
    Plugins.Ceph.Sessions.Prod.User=<UserForProd>
    Plugins.Ceph.Sessions.Prod.ApiKey=<ApiKeyForProd>

    Plugins.Ceph.Sessions.Test.Uri=https://192.168.0.1:8003
    Plugins.Ceph.Sessions.Test.User=<UserForTest>
    Plugins.Ceph.Sessions.Test.ApiKey=<ApiKeyForTest>

Then you will be able to use these names as the first parameter (ConnString) in keys instead of URIs, e.g.:

    ceph.ping[Prod]
    ceph.ping[Test]

*Note*: session names are case-sensitive.

## Supported keys
**ceph.df.details[\<commonParams\>]** — Returns information about the cluster's data usage and distribution among pools.
Uses data provided by the "df detail" command.
*Output sample:*
```json
{
    "pools": {
        "device_health_metrics": {
            "percent_used": 0,
            "objects": 0,
            "bytes_used": 0,
            "rd_ops": 0,
            "rd_bytes": 0,
            "wr_ops": 0,
            "wr_bytes": 0,
            "stored_raw": 0,
            "max_avail": 1390035968
        },
        "new_pool": {
            "percent_used": 0,
            "objects": 0,
            "bytes_used": 0,
            "rd_ops": 0,
            "rd_bytes": 0,
            "wr_ops": 0,
            "wr_bytes": 0,
            "stored_raw": 0,
            "max_avail": 695039808
        },
        "test_zabbix": {
            "percent_used": 0,
            "objects": 4,
            "bytes_used": 786432,
            "rd_ops": 0,
            "rd_bytes": 0,
            "wr_ops": 4,
            "wr_bytes": 24576,
            "stored_raw": 66618,
            "max_avail": 1390035968
        },
        "zabbix": {
            "percent_used": 0,
            "objects": 0,
            "bytes_used": 0,
            "rd_ops": 0,
            "rd_bytes": 0,
            "wr_ops": 0,
            "wr_bytes": 0,
            "stored_raw": 0,
            "max_avail": 1390035968
        }
    },
    "rd_ops": 0,
    "rd_bytes": 0,
    "wr_ops": 4,
    "wr_bytes": 24576,
    "num_pools": 4,
    "total_bytes": 12872318976,
    "total_avail_bytes": 6898843648,
    "total_used_bytes": 2752249856,
    "total_objects": 4
}
```

**ceph.osd.stats[\<commonParams\>]** — Returns aggregated and per-OSD statistics.
Uses data provided by the "pg dump" command.
*Output sample:*
```json
{
    "osd_latency_apply": {
        "min": 0,
        "max": 0,
        "avg": 0
    },
    "osd_latency_commit": {
        "min": 0,
        "max": 0,
        "avg": 0
    },
    "osd_fill": {
        "min": 47,
        "max": 47,
        "avg": 47
    },
    "osd_pgs": {
        "min": 65,
        "max": 65,
        "avg": 65
    },
    "osds": {
        "0": {
            "osd_latency_apply": 0,
            "osd_latency_commit": 0,
            "num_pgs": 65,
            "osd_fill": 47
        },
        "1": {
            "osd_latency_apply": 0,
            "osd_latency_commit": 0,
            "num_pgs": 65,
            "osd_fill": 47
        },
        "2": {
            "osd_latency_apply": 0,
            "osd_latency_commit": 0,
            "num_pgs": 65,
            "osd_fill": 47
        }
    }
}
```

**ceph.osd.discovery[\<commonParams\>]** — Returns a list of discovered OSDs in LLD format.
Can be used in conjunction with "ceph.osd.stats" and "ceph.osd.dump" in order to create per-OSD items.
Uses data provided by the "osd crush tree" command.
*Output sample:*
```json
[
    {
        "{#OSDNAME}": "0",
        "{#CLASS}": "hdd",
        "{#HOST}": "node1"
    },
    {
        "{#OSDNAME}": "1",
        "{#CLASS}": "hdd",
        "{#HOST}": "node2"
    },
    {
        "{#OSDNAME}": "2",
        "{#CLASS}": "hdd",
        "{#HOST}": "node3"
    }
]
```

**ceph.osd.dump[\<commonParams\>]** — Returns usage thresholds and statuses of OSDs.
Uses data provided by the "osd dump" command.
*Output sample:*
```json
{
    "osd_backfillfull_ratio": 0.9,
    "osd_full_ratio": 0.95,
    "osd_nearfull_ratio": 0.85,
    "num_pg_temp": 65,
    "osds": {
        "0": {
            "in": 1,
            "up": 1
        },
        "1": {
            "in": 1,
            "up": 1
        },
        "2": {
            "in": 1,
            "up": 1
        }
    }
}
```

**ceph.ping[\<commonParams\>]** — Tests whether a connection is alive.
Uses data provided by the "health" command.
*Returns:*
- "1" if a connection is alive.
- "0" if a connection is broken (if any error occurs, including authentication and configuration issues).

**ceph.pool.discovery[\<commonParams\>]** — Returns a list of discovered pools in LLD format.
Can be used in conjunction with "ceph.df.details" in order to create per-pool items.
Uses data provided by the "osd dump" and "osd crush rule dump" commands.
*Output sample:*
```json
[
    {
        "{#POOLNAME}": "device_health_metrics",
        "{#CRUSHRULE}": "default"
    },
    {
        "{#POOLNAME}": "test_zabbix",
        "{#CRUSHRULE}": "default"
    },
    {
        "{#POOLNAME}": "zabbix",
        "{#CRUSHRULE}": "default"
    },
    {
        "{#POOLNAME}": "new_pool",
        "{#CRUSHRULE}": "newbucket"
    }
]
```

**ceph.status[\<commonParams\>]** — Returns the overall cluster status.
Uses data provided by the "status" command.
*Output sample:*
```json
{
    "overall_status": 2,
    "num_mon": 3,
    "num_osd": 3,
    "num_osd_in": 2,
    "num_osd_up": 1,
    "num_pg": 66,
    "pg_states": {
        "activating": 0,
        "active": 0,
        "backfill_toofull": 0,
        "backfill_unfound": 0,
        "backfill_wait": 0,
        "backfilling": 0,
        "clean": 0,
        "creating": 0,
        "deep": 0,
        "degraded": 36,
        "down": 0,
        "forced_backfill": 0,
        "forced_recovery": 0,
        "incomplete": 0,
        "inconsistent": 0,
        "laggy": 0,
        "peered": 65,
        "peering": 0,
        "recovering": 0,
        "recovery_toofull": 0,
        "recovery_unfound": 1,
        "recovery_wait": 0,
        "remapped": 0,
        "repair": 0,
        "scrubbing": 0,
        "snaptrim": 0,
        "snaptrim_error": 0,
        "snaptrim_wait": 0,
        "stale": 0,
        "undersized": 65,
        "unknown": 1,
        "wait": 0
    },
    "min_mon_release_name": "octopus"
}
```

## Troubleshooting
The plugin uses the Zabbix agent's logs. You can increase the debug level of Zabbix Agent 2 if you need more details about
what is happening.

If you get the error "x509: cannot validate certificate for x.x.x.x because it doesn't contain any IP SANs",
you probably need to set the InsecureSkipVerify option to "true" or use a certificate signed by your
organization's certificate authority.
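For example, both adjustments can be made in the agent configuration file; this sketch uses the agent's standard
DebugLevel parameter (4 enables debug output, 5 extended debug), and enables InsecureSkipVerify for testing only:

    DebugLevel=4
    Plugins.Ceph.InsecureSkipVerify=true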