# Icinga 2 Troubleshooting <a id="troubleshooting"></a>

## Required Information <a id="troubleshooting-information-required"></a>

Please provide every detail that may help reproduce and understand your issue.
Whether you ask on the [community channels](https://community.icinga.com) or
create an issue at [GitHub](https://github.com/Icinga), make sure
that others can follow your explanations. If necessary, draw a picture and attach it for
better illustration. This is especially helpful if you are troubleshooting a distributed
setup.

We've seen many community questions and compiled this list. Please add your own
findings and details.

* Describe the expected behavior in your own words.
* Describe the actual behavior in one or two sentences.
* Provide general information such as:
    * How was Icinga 2 installed (and from which repository, if any), and which distribution are you using?
    * `icinga2 --version`
    * `icinga2 feature list`
    * `icinga2 daemon -C`
    * [Icinga Web 2](https://icinga.com/products/icinga-web-2/) version (screenshot from System - About)
    * [Icinga Web 2 modules](https://icinga.com/products/icinga-web-2-modules/) e.g. the Icinga Director (optional)
* Configuration insights:
    * Provide complete configuration snippets explaining your problem in detail.
    * Your [icinga2.conf](04-configuration.md#icinga2-conf) file.
    * If you run multiple Icinga 2 instances, the [zones.conf](04-configuration.md#zones-conf) file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes.
* Logs:
    * Relevant output from your main and [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) in `/var/log/icinga2`. Please add step-by-step explanations with timestamps if required.
    * The newest Icinga 2 crash log, if relevant, located in `/var/log/icinga2/crash`.
* Additional details:
    * If the check command failed, what's the output of your manual plugin tests?
    * In case of [debugging](21-development.md#development) Icinga 2, the full backtraces and outputs.

## Analyze your Environment <a id="troubleshooting-analyze-environment"></a>

There are many components involved on a server running Icinga 2. When you
analyze a problem, keep in mind that basic system administration knowledge
is also key to identifying bottlenecks and issues.

> **Tip**
>
> [Monitor Icinga 2](08-advanced-topics.md#monitoring-icinga) and use the hints for further analysis.

* Analyze the system's performance and identify bottlenecks and issues.
* Collect details about all applications (e.g. Icinga 2, MySQL, Apache, Graphite, Elastic, etc.).
* If data is exchanged via network (e.g. a central MySQL cluster), make sure to monitor the bandwidth capabilities too.
* Add graphs from Grafana or Graphite as screenshots to your issue description.

Install tools which help you to do so. Opinions differ; let us know if you have any additions here!

### Analyse your Linux/Unix Environment <a id="troubleshooting-analyze-environment-linux"></a>

[htop](https://hisham.hm/htop/) is a better replacement for `top` and helps to analyze processes
interactively.

```bash
yum install htop
apt-get install htop
```

If you are, for example, experiencing performance issues, open `htop` and take a screenshot.
Add it to your question and/or bug report.

Analyse disk I/O performance in Grafana, take a screenshot and obfuscate any sensitive details.
Attach it when posting a question to the community channels.

The [sysstat](https://github.com/sysstat/sysstat) package provides a number of tools to
analyze the performance on Linux. On FreeBSD you could use `systat` for example.
70 71```bash 72yum install sysstat 73apt-get install sysstat 74``` 75 76Example for `vmstat` (summary of memory, processes, etc.): 77 78```bash 79# summary 80vmstat -s 81# print timestamps, format in MB, stats every 1 second, 5 times 82vmstat -t -S M 1 5 83``` 84 85Example for `iostat`: 86 87```bash 88watch -n 1 iostat 89``` 90 91Example for `sar`: 92 93```bash 94sar # cpu 95sar -r # ram 96sar -q # load avg 97sar -b # I/O 98``` 99 100`sysstat` also provides the `iostat` binary. On FreeBSD you could use `systat` for example. 101 102If you are missing checks and metrics found in your analysis, add them to your monitoring! 103 104### Analyze your Windows Environment <a id="troubleshooting-analyze-environment-windows"></a> 105 106A good tip for Windows are the tools found inside the [Sysinternals Suite](https://technet.microsoft.com/en-us/sysinternals/bb842062.aspx). 107 108You can also start `perfmon` and analyze specific performance counters. 109Keep notes which could be important for your monitoring, and add service 110checks later on. 111 112> **Tip** 113> 114> Use an administrative Powershell to gain more insights. 115 116``` 117cd C:\ProgramData\icinga2\var\log\icinga2 118 119Get-Content .\icinga2.log -tail 10 -wait 120``` 121 122## Enable Debug Output <a id="troubleshooting-enable-debug-output"></a> 123 124### Enable Debug Output on Linux/Unix <a id="troubleshooting-enable-debug-output-linux"></a> 125 126Enable the `debuglog` feature: 127 128```bash 129icinga2 feature enable debuglog 130service icinga2 restart 131``` 132 133The debug log file can be found in `/var/log/icinga2/debug.log`. 134 135You can tail the log files with an administrative shell: 136 137```bash 138cd /var/log/icinga2 139tail -f debug.log 140``` 141 142Alternatively you may run Icinga 2 in the foreground with debugging enabled. Specify the console 143log severity as an additional parameter argument to `-x`. 

```bash
/usr/sbin/icinga2 daemon -x notice
```

The [log severity](09-object-types.md#objecttype-filelogger) can be one of `critical`, `warning`, `information`, `notice`
and `debug`.

### Enable Debug Output on Windows <a id="troubleshooting-enable-debug-output-windows"></a>

Open a Powershell with administrative privileges and enable the debug log feature.

```
C:\> cd C:\Program Files\ICINGA2\sbin

C:\Program Files\ICINGA2\sbin> .\icinga2.exe feature enable debuglog
```

Ensure that the Icinga 2 service already writes the main log into `C:\ProgramData\icinga2\var\log\icinga2`.
Restart the Icinga 2 service in an administrative Powershell and open the newly created `debug.log` file.

```
C:\> Restart-Service icinga2

C:\> Get-Service icinga2
```

You can tail the log files with an administrative Powershell:

```
C:\> cd C:\ProgramData\icinga2\var\log\icinga2

C:\ProgramData\icinga2\var\log\icinga2> Get-Content .\debug.log -tail 10 -wait
```

## Configuration Troubleshooting <a id="troubleshooting-configuration"></a>

### List Configuration Objects <a id="troubleshooting-list-configuration-objects"></a>

The `icinga2 object list` CLI command can be used to list all configuration objects and their
attributes. The tool also shows where each of the attributes was modified.

> **Tip**
>
> Use the Icinga 2 API to access [config objects at runtime](12-icinga2-api.md#icinga2-api-config-objects) directly.

That way you can also identify which objects have been created from your [apply rules](17-language-reference.md#apply).

```
# icinga2 object list

Object 'localhost!ssh' of type 'Service':
  * __name = 'localhost!ssh'
  * check_command = 'ssh'
    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 5:3-5:23
  * check_interval = 60
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 24:3-24:21
  * host_name = 'localhost'
    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 4:3-4:25
  * max_check_attempts = 3
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 23:3-23:24
  * name = 'ssh'
  * retry_interval = 30
    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 25:3-25:22
  * templates = [ 'ssh', 'generic-service' ]
    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 1:0-7:1
    % += modified in '/etc/icinga2/conf.d/templates.conf', lines 22:1-26:1
  * type = 'Service'
  * vars
    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19
    * sla = '24x7'
      % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19

[...]
218``` 219 220On Windows, use an administrative Powershell: 221 222``` 223C:\> cd C:\Program Files\ICINGA2\sbin 224 225C:\Program Files\ICINGA2\sbin> .\icinga2.exe object list 226``` 227 228You can also filter by name and type: 229 230``` 231# icinga2 object list --name *ssh* --type Service 232Object 'localhost!ssh' of type 'Service': 233 * __name = 'localhost!ssh' 234 * check_command = 'ssh' 235 % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 5:3-5:23 236 * check_interval = 60 237 % = modified in '/etc/icinga2/conf.d/templates.conf', lines 24:3-24:21 238 * host_name = 'localhost' 239 % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 4:3-4:25 240 * max_check_attempts = 3 241 % = modified in '/etc/icinga2/conf.d/templates.conf', lines 23:3-23:24 242 * name = 'ssh' 243 * retry_interval = 30 244 % = modified in '/etc/icinga2/conf.d/templates.conf', lines 25:3-25:22 245 * templates = [ 'ssh', 'generic-service' ] 246 % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 1:0-7:1 247 % += modified in '/etc/icinga2/conf.d/templates.conf', lines 22:1-26:1 248 * type = 'Service' 249 * vars 250 % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19 251 * sla = '24x7' 252 % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19 253 254Found 1 Service objects. 255 256[2014-10-15 14:27:19 +0200] information/cli: Parsed 175 objects. 257``` 258 259Runtime modifications via the [REST API](12-icinga2-api.md#icinga2-api-config-objects) 260are not immediately updated. Furthermore there is a known issue with 261[group assign expressions](17-language-reference.md#group-assign) which are not reflected in the host object output. 262You need to restart Icinga 2 in order to update the `icinga2.debug` cache file. 

### Apply rules do not match <a id="apply-rules-do-not-match"></a>

You can analyze apply rules and matching objects by using the [script debugger](20-script-debugger.md#script-debugger).

### Where are the check command definitions? <a id="check-command-definitions"></a>

Icinga 2 features a number of built-in [check command definitions](10-icinga-template-library.md#icinga-template-library) which are
included with

```
include <itl>
include <plugins>
```

in the [icinga2.conf](04-configuration.md#icinga2-conf) configuration file. These files are not considered
configuration files and will be overridden on upgrade, so please send modifications as proposed patches upstream.
The default include path is set to `/usr/share/icinga2/includes` with the constant `IncludeConfDir`.

You should add your own command definitions to a new file in `conf.d/` called `commands.conf`
or similar.

### Configuration is ignored <a id="configuration-ignored"></a>

* Make sure that the line(s) are not [commented out](17-language-reference.md#comments) (starting with `//` or `#`, or
encapsulated by `/* ... */`).
* Is the configuration file included in [icinga2.conf](04-configuration.md#icinga2-conf)?

Run the [configuration validation](11-cli-commands.md#config-validation) and add `notice` as log severity.
Search for the file which should be included, e.g. using the `grep` CLI command.

```bash
icinga2 daemon -C -x notice | grep command
```

### Configuration attributes are inherited from <a id="configuration-attribute-inheritance"></a>

Icinga 2 allows you to import templates using the [import](17-language-reference.md#template-imports) keyword. If these templates
contain additional attributes, your objects will automatically inherit them. You can override
or modify these attributes in the current object.
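
As a minimal sketch of this mechanism (the object and attribute values below are made up for illustration), an attribute set in the object wins over the value inherited from the template:

```
template Service "generic-service" {
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 30s
}

object Service "ssh" {
  import "generic-service"

  host_name = "localhost"
  check_command = "ssh"
  check_interval = 1m // overrides the 5m inherited from the template
}
```

Here `max_check_attempts` and `retry_interval` are inherited unchanged, while `check_interval` is overridden in the object.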

The [object list](15-troubleshooting.md#troubleshooting-list-configuration-objects) CLI command allows you to verify the attribute origin.

### Configuration Value with Single Dollar Sign <a id="configuration-value-dollar-sign"></a>

In case your configuration validation fails with a missing closing dollar sign error message, you
did not properly escape a single dollar sign to prevent its interpretation as a [runtime macro](03-monitoring-basics.md#runtime-macros).

```
critical/config: Error: Validation failed for Object 'ping4' (Type: 'Service') at /etc/icinga2/zones.d/global-templates/windows.conf:24: Closing $ not found in macro format string 'top-syntax=${list}'.
```

Correct the custom variable value to

```
"top-syntax=$${list}"
```


## Checks Troubleshooting <a id="troubleshooting-checks"></a>

### Executed Command for Checks <a id="checks-executed-command"></a>

* Use the Icinga 2 API to [query](12-icinga2-api.md#icinga2-api-config-objects-query) host/service objects
for their check result containing the executed shell command.
* Use the Icinga 2 [console CLI command](11-cli-commands.md#cli-command-console)
to fetch the checkable object, its check result and the executed shell command.
* Alternatively enable the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) and look for the executed command.

Example for a service object query using a [regex match](18-library-reference.md#global-functions-regex)
on the name:

```
$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' \
-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^http" }, "attrs": [ "__name", "last_check_result" ], "pretty": true }'
{
    "results": [
        {
            "attrs": {
                "__name": "example.localdomain!http",
                "last_check_result": {
                    "active": true,
                    "check_source": "example.localdomain",
                    "command": [
                        "/usr/local/sbin/check_http",
                        "-I",
                        "127.0.0.1",
                        "-u",
                        "/"
                    ],

  ...

                }
            },
            "joins": {},
            "meta": {},
            "name": "example.localdomain!http",
            "type": "Service"
        }
    ]
}
```

Alternatively when using the Director, navigate into the Service Detail View
in Icinga Web and pick `Inspect` to query the details.

Example for using the `icinga2 console` CLI command evaluation functionality:

```
$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' \
--eval 'get_service("example.localdomain", "http").last_check_result.command' | python -m json.tool
[
    "/usr/local/sbin/check_http",
    "-I",
    "127.0.0.1",
    "-u",
    "/"
]
```

Example for searching the debug log:

```bash
icinga2 feature enable debuglog
systemctl restart icinga2
tail -f /var/log/icinga2/debug.log | grep "notice/Process"
```


### Checks are not executed <a id="checks-not-executed"></a>

* First off, decide whether the checks are executed locally, or remotely in a distributed setup.

If the master does not receive check results from the satellite, move your analysis to the satellite
and verify why the checks are not executed there.

* Check the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) to see if the check command gets executed.
* Verify that failed dependencies do not prevent command execution.
* Make sure that the plugin is executable by the Icinga 2 user (run a manual test).
* Make sure the [checker](11-cli-commands.md#enable-features) feature is enabled.
* Use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live check result streams.

Test a plugin as the icinga user.

```bash
sudo -u icinga /usr/lib/nagios/plugins/check_ping -4 -H 127.0.0.1 -c 5000,100% -w 3000,80%
```

> **Note**
>
> **Never test plugins as root, but as the icinga daemon user.** The environment and permissions differ.
>
> Also, the daemon user **does not** spawn a terminal shell (Bash, etc.) so it won't read anything from `.bashrc`
> and variants. The Icinga daemon only relies on sysconfig environment variables being set.


Enable the checker feature.

```
# icinga2 feature enable checker
The feature 'checker' is already enabled.
```

Fetch all check result events matching the `event.service` name `random`:

```bash
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
 'https://localhost:5665/v1/events?queue=debugchecks&types=CheckResult&filter=match%28%22random*%22,event.service%29'
```


### Analyze Check Source <a id="checks-check-source"></a>

Sometimes checks are not executed on the remote host, but on the master and so on.
This can lead to unwanted results or NOT-OK states.

The `check_source` attribute is the best indication of where a check command
was actually executed. This could be a satellite with synced configuration
or a client acting as a remote command bridge -- both will return the check source
showing where the plugin was called.

Example for retrieving the check source from all `disk` services using a
[regex match](18-library-reference.md#global-functions-regex) on the name:

```
$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' \
-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^disk" }, "attrs": [ "__name", "last_check_result" ], "pretty": true }'
{
    "results": [
        {
            "attrs": {
                "__name": "icinga2-agent1.localdomain!disk",
                "last_check_result": {
                    "active": true,
                    "check_source": "icinga2-agent1.localdomain",

  ...

                }
            },
            "joins": {},
            "meta": {},
            "name": "icinga2-agent1.localdomain!disk",
            "type": "Service"
        }
    ]
}
```

Alternatively when using the Director, navigate into the Service Detail View
in Icinga Web and pick `Inspect` to query the details.

Example with the debug console:

```
$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' \
--eval 'get_service("icinga2-agent1.localdomain", "disk").last_check_result.check_source' | python -m json.tool

"icinga2-agent1.localdomain"
```


### NSClient++ Check Errors with nscp-local <a id="nsclient-check-errors-nscp-local"></a>

The [nscp-local](10-icinga-template-library.md#nscp-check-local) CheckCommand object definitions call the local `nscp.exe` command.
If a Windows client service check fails to find the `nscp.exe` command, the log output would look like this:

```
Command ".\nscp.exe" "client" "-a" "drive=d" "-a" "show-all" "-b" "-q" "check_drivesize" failed to execute: 2, "The system cannot find the file specified."
```

or

```
Command ".
scp.exe" "client" "-a" "drive=d" "-a" "show-all" "-b" "-q" "check_drivesize" failed to execute: 2, "The system cannot find the file specified."
500``` 501 502The above actually prints `.\\nscp.exe` where the escaped `\n` character gets interpreted as new line. 503 504Both errors lead to the assumption that the `NscpPath` constant is empty or set to a `.` character. 505This could mean the following: 506 507* The command is **not executed on the Windows client**. Check the [check_source](15-troubleshooting.md#checks-check-source) attribute from the check result. 508* You are using an outdated NSClient++ version (0.3.x or 0.4.x) which is not compatible with Icinga 2. 509* You are using a custom NSClient++ installer which does not register the correct GUID for NSClient++ 510 511More troubleshooting: 512 513Retrieve the `NscpPath` constant on your Windows client: 514 515``` 516C:\Program Files\ICINGA2\sbin\icinga2.exe variable get NscpPath 517``` 518 519If the variable is returned empty, manually test how Icinga 2 would resolve 520its path (this can be found inside the ITL): 521 522``` 523C:\Program Files\ICINGA2\sbin\icinga2.exe console --eval "dirname(msi_get_component_path(\"{5C45463A-4AE9-4325-96DB-6E239C034F93}\"))" 524``` 525 526If this command does not return anything, NSClient++ is not properly installed. 527Verify that inside the `Programs and Features` (`appwiz.cpl`) control panel. 528 529You can run the bundled NSClient++ installer from the Icinga 2 Windows package. 530The msi package is located in `C:\Program Files\ICINGA2\sbin`. 531 532The bundled NSClient++ version has properly been tested with Icinga 2. Keep that 533in mind when using a different package. 534 535 536### Check Thresholds Not Applied <a id="check-thresholds-not-applied"></a> 537 538This could happen with [clients as command endpoint execution](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint). 

If you have for example a client host `icinga2-agent1.localdomain`
and a service `disk` check defined on the master, the warning and
critical thresholds are sometimes not applied and unwanted notification
alerts are raised.

This happens because the client itself includes a host object with
its `NodeName` and a basic set of checks in the [conf.d](04-configuration.md#conf-d)
directory, i.e. `disk` with the default thresholds.

Clients which have the `checker` feature enabled will attempt
to execute checks for local services and send their results
back to the master.

If you now have the same host and service objects on the
master, you will receive wrong check results from the client.

Solution:

* Disable the `checker` feature on clients: `icinga2 feature disable checker`.
* Remove the inclusion of [conf.d](04-configuration.md#conf-d) as suggested in the [client setup docs](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint).

### Check Fork Errors <a id="check-fork-errors"></a>

Newer versions of systemd on Linux limit spawned processes for
services.

* v227 introduces the `TasksMax` setting for units which allows you to specify the spawned process limit.
* v228 adds `DefaultTasksMax` in the global `systemd-system.conf` with a default setting of 512 processes.
* v231 changes the default value to 15%.

This can cause problems with Icinga 2 in large environments with many
commands executed in parallel, starting with systemd v228. Some distributions
may also have changed the defaults.

The error message could look like this:

```
2017-01-12T11:55:40.742685+01:00 icinga2-master1 kernel: [65567.582895] cgroup: fork rejected by pids controller in /system.slice/icinga2.service
```

In order to solve the problem, increase `TasksMax` for the icinga2 service
(or the global `DefaultTasksMax`), or set it to `infinity`.

```bash
mkdir /etc/systemd/system/icinga2.service.d
cat >/etc/systemd/system/icinga2.service.d/limits.conf <<EOF
[Service]
TasksMax=infinity
EOF

systemctl daemon-reload
systemctl restart icinga2
```

An example is available inside the GitHub repository in [etc/initsystem](https://github.com/Icinga/icinga2/tree/master/etc/initsystem).

External Resources:

* [Fork limit for cgroups](https://lwn.net/Articles/663873/)
* [systemd changelog](https://github.com/systemd/systemd/blob/master/NEWS)
* [Icinga 2 upstream issue](https://github.com/Icinga/icinga2/issues/5611)
* [systemd upstream discussion](https://github.com/systemd/systemd/issues/3211)

### Systemd Watchdog <a id="check-systemd-watchdog"></a>

Usually Icinga 2 is a mission-critical part of the infrastructure and should be
online at all times. In case of a recoverable crash (e.g. OOM) you may want to
restart Icinga 2 automatically. With systemd it is as easy as overriding some
settings of the Icinga 2 systemd service by creating
`/etc/systemd/system/icinga2.service.d/override.conf` with the following
content:

```
[Service]
Restart=always
RestartSec=1
StartLimitInterval=10
StartLimitBurst=3
```

Using the watchdog can also help with monitoring Icinga 2. To activate and use it, add the following to the override:

```
WatchdogSec=30s
```

This way systemd will kill Icinga 2 if it does not notify for over 30 seconds. A timeout of less than 10 seconds is not
recommended. When the watchdog is activated, `Restart=` can be set to `watchdog` to restart Icinga 2 in the case of a
watchdog timeout.

Run `systemctl daemon-reload && systemctl restart icinga2` to apply the changes.
Now systemd will always try to restart Icinga 2 (except if you run
`systemctl stop icinga2`).
After three failures in ten seconds it will stop
trying because you probably have a problem that requires manual intervention.

### Late Check Results <a id="late-check-results"></a>

[Icinga Web 2](https://icinga.com/products/icinga-web-2/) provides
a dashboard overview for `overdue checks`.

The REST API provides the [status](12-icinga2-api.md#icinga2-api-status) URL endpoint with some generic metrics
on Icinga and its features.

```bash
curl -k -s -u root:icinga 'https://localhost:5665/v1/status?pretty=1' | less
```

You can also calculate late check results via the REST API:

* Fetch the `last_check` timestamp from each object.
* Compare the timestamp with the current time minus a multiple of `check_interval` (increase the multiplier, e.g. to five times `check_interval`, to see which results are really late).

You can use the [icinga2 console](11-cli-commands.md#cli-command-console) to connect to the instance, fetch all data
and calculate the differences. More information can be found in [this blogpost](https://icinga.com/2016/08/11/analyse-icinga-2-problems-using-the-console-api/).
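
The comparison in the steps above is simple epoch arithmetic. A sketch in plain shell, where the sample values are hypothetical and would normally come from the API response (`attrs.last_check`, `attrs.check_interval`):

```shell
# A check result counts as late when more than multiplier * check_interval
# seconds have passed since last_check. Sample values are hypothetical.
last_check=1491314979   # attrs.last_check (epoch seconds)
check_interval=300      # attrs.check_interval (seconds)
multiplier=2            # raise to e.g. 5 to catch only really late results

now=$(date +%s)
deadline=$(( last_check + multiplier * check_interval ))

if [ "$now" -gt "$deadline" ]; then
  echo "late"
else
  echo "on time"
fi
```

With this sample timestamp (from 2017) the script prints `late`. The console examples in this chapter perform the same comparison for all services at once.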

```
# ICINGA2_API_USERNAME=root ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://localhost:5665/'

<1> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res

[ [ "10807-host!10807-service", "2016-06-10 15:54:55 +0200" ], [ "mbmif.int.netways.de!disk /", "2016-01-26 16:32:29 +0100" ] ]
```

Or if you are just interested in numbers, call [len](18-library-reference.md#array-len) on the result array `res`:

```
<2> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()

2.000000
```

If you need to analyze that problem multiple times, just add the current formatted timestamp
and repeat the commands.

```
<23> => DateTime(get_time()).to_string()

"2017-04-04 16:09:39 +0200"

<24> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()

8287.000000
```

More details about the Icinga 2 DSL and its possibilities can be
found in the [language](17-language-reference.md#language-reference) and [library](18-library-reference.md#library-reference) reference chapters.

### Late Check Results in Distributed Environments <a id="late-check-results-distributed"></a>

When it comes to a distributed HA setup, each node is responsible for a load-balanced amount of checks.
Host and Service objects provide the attribute `paused`. If this is set to `false`, the current node
actively attempts to schedule and execute checks. Otherwise the node does not feel responsible.

```
<3> => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
{
  @false = 2.000000
  @true = 1.000000
}
```

You may ask why this analysis is important. Fair enough: if the numbers are not inverted in a HA zone
with two members, this may give a hint that the cluster nodes are in a split-brain scenario, or that you've
found a bug in the cluster.


If you are running a cluster setup where the master/satellite executes checks on the client via
[top down command endpoint](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint) mode,
you might want to know which zones are affected.

This analysis assumes that clients which are not connected have the string `connected` in their
service check result output and their state is `UNKNOWN`.

```
<4> => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } }; for (k => v in res) { res[k] = len(v.unique()) }; res

{
  Asia = 31.000000
  Europe = 214.000000
  USA = 207.000000
}
```

The result set shows the configured zones and their affected hosts in a unique list. The output just prints the numbers,
but you can adjust this by omitting the `len()` call inside the for loop.

## Notifications Troubleshooting <a id="troubleshooting-notifications"></a>

### Notifications are not sent <a id="troubleshooting-notifications-not-sent"></a>

* Check the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) to see if a notification is triggered.
* If yes, verify that all conditions are satisfied.
* Are any errors on the notification command execution logged?

Please make sure to add these details with your own description
to any question or issue posted to the community channels.

Verify the following configuration:

* Is the host/service `enable_notifications` attribute set, and if so, to which value?
* Do the [notification](09-object-types.md#objecttype-notification) attributes `states`, `types`, `period` match the notification conditions?
* Do the [user](09-object-types.md#objecttype-user) attributes `states`, `types`, `period` match the notification conditions?
* Are there any notification `begin` and `end` times configured?
* Make sure the [notification](11-cli-commands.md#enable-features) feature is enabled.
* Does the referenced NotificationCommand work when executed as the Icinga user on the shell?

If notifications are to be sent via mail, make sure that the mail program specified inside the
[NotificationCommand object](09-object-types.md#objecttype-notificationcommand) exists.
The name and location depend on the distribution, so the preconfigured setting might have to be
changed on your system.


Examples:

```
# icinga2 feature enable notification
The feature 'notification' is already enabled.
```

```bash
icinga2 feature enable debuglog
systemctl restart icinga2

grep Notification /var/log/icinga2/debug.log > /root/analyze_notification_problem.log
```

You can use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live notification streams:

```bash
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'
```


### Analyze Notification Result <a id="troubleshooting-notifications-result"></a>

> **Note**
>
> This feature is available since v2.11 and requires all endpoints
> to be updated.

Notifications inside a HA enabled zone are balanced between the endpoints,
just like checks.

Sometimes notifications may fail, and by looking into the (debug) logs
of both masters alone, you cannot correlate this correctly.

The `last_notification_result` runtime attribute is stored and synced for Notification
objects and can be queried via the REST API.

Example for retrieving the notification object and result from all `disk` services using a
[regex match](18-library-reference.md#global-functions-regex) on the name:

```
$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/notifications' \
-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^disk" }, "attrs": [ "__name", "last_notification_result" ], "pretty": true }'
{
    "results": [

        {
            "attrs": {
                "last_notification_result": {
                    "active": true,
                    "command": [
                        "/etc/icinga2/scripts/mail-service-notification.sh",
                        "-4",
                        "",
                        "-6",
                        "",
                        "-b",
                        "",
                        "-c",
                        "",
                        "-d",
                        "2019-08-02 10:54:16 +0200",
                        "-e",
                        "disk",
                        "-l",
                        "icinga2-agent1.localdomain",
                        "-n",
                        "icinga2-agent1.localdomain",
                        "-o",
                        "DISK OK - free space: / 38108 MB (90.84% inode=100%);",
                        "-r",
                        "user@localdomain",
                        "-s",
                        "OK",
                        "-t",
                        "RECOVERY",
                        "-u",
                        "disk"
                    ],
                    "execution_end": 1564736056.186217,
                    "execution_endpoint": "icinga2-master1.localdomain",
                    "execution_start": 1564736056.132323,
                    "exit_status": 0.0,
                    "output": "",
                    "type": "NotificationResult"
                }
            },
            "joins": {},
            "meta": {},
            "name": "icinga2-agent1.localdomain!disk!mail-service-notification",
            "type": "Notification"
        }

...

    ]
}
```

Example with the debug console:

```
$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' --eval 'get_object(Notification, "icinga2-agent1.localdomain!disk!mail-service-notification").last_notification_result.execution_endpoint' | jq

"icinga2-agent1.localdomain"
```

Whenever a notification command fails to execute, you can fetch its output as well.


## Feature Troubleshooting <a id="troubleshooting-features"></a>

### Feature is not working <a id="feature-not-working"></a>

* Make sure that the feature configuration is enabled by symlinking from `features-available/`
to `features-enabled/` and that the latter is included in [icinga2.conf](04-configuration.md#icinga2-conf).
* Are the feature attributes set correctly according to the documentation?
* Any errors in the logs?

Look up the [object type](09-object-types.md#object-types) for the required feature and verify it is enabled:

```bash
icinga2 object list --type <feature object type>
```

Example for the `graphite` feature:

```bash
icinga2 object list --type GraphiteWriter
```

Look into the log and check whether the feature logs anything specific for this matter.

```bash
grep GraphiteWriter /var/log/icinga2/icinga2.log
```

## REST API Troubleshooting <a id="troubleshooting-api"></a>

In order to analyse errors in API requests, you can explicitly enable the [verbose parameter](12-icinga2-api.md#icinga2-api-parameters-global).

```
$ curl -k -s -u root:icinga -H 'Accept: application/json' -X DELETE 'https://localhost:5665/v1/objects/hosts/example-cmdb?pretty=1&verbose=1'
{
    "diagnostic_information": "Error: Object does not exist.\n\n ....",
    "error": 404.0,
    "status": "No objects found."
}
```

### REST API Troubleshooting: No Objects Found <a id="troubleshooting-api-no-objects-found"></a>

Please note that the `404` status with no objects being found can also originate
from missing or too strict object permissions for the authenticated user.

This is a security feature to prevent object name guessing. If this were not the
case, restricted users would be able to get a list of names of your objects just by
trying every character combination.

In order to analyse and fix the problem, please check the following:

- Use an administrative account with full permissions to check whether the objects actually exist.
- Verify the permissions on the affected ApiUser object and fix them.

### Missing Runtime Objects (Hosts, Downtimes, etc.) <a id="troubleshooting-api-missing-runtime-objects"></a>

Runtime objects are stored in the internal config packages shared with
the REST API config packages. Each host, downtime, comment, service, etc. created
via the REST API is stored in the `_api` package.

This includes downtimes and comments, which were sometimes stored in the wrong
directory path, because the active-stage file was empty/truncated/unreadable at
this point.

Wrong:

```
/var/lib/icinga2/api/packages/_api//conf.d/downtimes/1234-5678-9012-3456.conf
```

Correct:

```
/var/lib/icinga2/api/packages/_api/dbe0bef8-c72c-4cc9-9779-da7c4527c5b2/conf.d/downtimes/1234-5678-9012-3456.conf
```

At creation time, the object lives in memory but its storage is broken. Upon restart,
it is missing and e.g. a missing downtime will re-enable unwanted notifications.

`dbe0bef8-c72c-4cc9-9779-da7c4527c5b2` is the active stage name which wasn't correctly
read by the Icinga daemon. This information is stored in `/var/lib/icinga2/api/packages/_api/active-stage`.

2.11 now limits the direct active-stage file access (this is hidden from the user)
and caches active stages for packages in memory.

It also tries to repair the broken package, and logs a new message:

```
systemctl restart icinga2

tail -f /var/log/icinga2/icinga2.log

[2019-05-10 12:27:15 +0200] information/ConfigObjectUtility: Repairing config package '_api' with stage 'dbe0bef8-c72c-4cc9-9779-da7c4527c5b2'.
```

If this does not happen, you can manually fix the broken config package and mark a deployed
stage as active again. Carefully follow these steps, and create a backup first.

Navigate into the API package prefix.

```bash
cd /var/lib/icinga2/api/packages
```

Change into the broken package directory and list all directories and files
ordered by latest changes.

```
cd _api
ls -lahtr

drwx------ 4 michi wheel 128B Mar 27 14:39 ..
-rw-r--r-- 1 michi wheel 25B Mar 27 14:39 include.conf
-rw-r--r-- 1 michi wheel 405B Mar 27 14:39 active.conf
drwx------ 7 michi wheel 224B Mar 27 15:01 dbe0bef8-c72c-4cc9-9779-da7c4527c5b2
drwx------ 5 michi wheel 160B Apr 26 12:47 .
```

As you can see, the `active-stage` file is missing. When it is there, verify that its content
is set to the stage directory as follows.

If you have more than one stage directory here, pick the latest modified
directory. Copy the directory name `dbe0bef8-c72c-4cc9-9779-da7c4527c5b2` and
add it into a new file `active-stage`. This can be done like this:

```bash
echo "dbe0bef8-c72c-4cc9-9779-da7c4527c5b2" > active-stage
```

`active.conf` needs to contain the correct active stage too; set it like this.
Note: This is deep down in the code, use with care!

```bash
sed -i 's/ActiveStages\["_api"\] = .*/ActiveStages\["_api"\] = "dbe0bef8-c72c-4cc9-9779-da7c4527c5b2"/g' /var/lib/icinga2/api/packages/_api/active.conf
```

Restart Icinga 2.

```bash
systemctl restart icinga2
```


> **Note**
>
> The internal `_api` config package structure may change in the future. Do not modify
> things in there manually or with scripts unless guided here or asked by a developer.


## Certificate Troubleshooting <a id="troubleshooting-certificate"></a>

Tools for analysing certificates and TLS connections:

- `openssl` binary on Linux/Unix, `openssl.exe` on Windows ([download](https://slproweb.com/products/Win32OpenSSL.html))
- `sslscan` tool, available [here](https://github.com/rbsec/sslscan) (Linux/Windows)

Note: You can also execute sslscan on Windows using PowerShell.


### Certificate Verification <a id="troubleshooting-certificate-verification"></a>

Whenever the TLS handshake fails when a client connects to the cluster or the REST API,
make sure to verify the certificates in use.

Print the CA and client certificate and ensure that the following attributes are set:

* Version must be 3.
* Serial number is a hex-encoded string.
* Issuer should be your certificate authority (defaults to `Icinga CA` for all certificates generated by CLI commands and automated signing requests).
* Validity: The certificate must not be expired.
* Subject with the common name (CN) matches the client endpoint name and its FQDN.
* v3 extensions must set the basic constraint for `CA:TRUE` (ca.crt) or `CA:FALSE` (client certificate).
* Subject Alternative Name is set to the resolvable DNS name (required for REST API and browsers).

Navigate into the local certificate store:

```bash
cd /var/lib/icinga2/certs/
```

Make sure to verify the agent's certificate and its stored `ca.crt` in `/var/lib/icinga2/certs` and ensure that
all instances (master, satellite, agent) are signed by the **same CA**.
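One quick way to compare CAs is by fingerprint. The sketch below prints the SHA-256 fingerprint
of a certificate file; run it on the master and on the agent, and the two lines must be identical
(the path is the default certificate store mentioned above):

```bash
# Print the SHA-256 fingerprint of a certificate file for comparison.
ca_fingerprint() {
  openssl x509 -noout -fingerprint -sha256 -in "$1"
}

f=/var/lib/icinga2/certs/ca.crt
if [ -f "$f" ]; then
  ca_fingerprint "$f"
else
  echo "no ca.crt at $f on this node"
fi
```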

Compare the `ca.crt` file from the agent node with your master's `ca.crt` file.


Since 2.12, you can use the built-in CLI command `pki verify` to perform TLS certificate validation tasks.

> **Hint**
>
> The CLI command uses exit codes aligned to the [Plugin API specification](05-service-monitoring.md#service-monitoring-plugin-api).
> Run the commands followed by `echo $?` to see the exit code.

These CLI commands can also be used on Windows agents without requiring the OpenSSL binary.

#### Print TLS Certificate <a id="troubleshooting-certificate-verification-print"></a>

Pass the certificate file to the `--cert` CLI command parameter to print its details.
This prints a shorter version of `openssl x509 -in <file> -text`.

```
$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt

information/cli: Printing certificate 'icinga2-agent2.localdomain.crt'

 Version: 3
 Subject: CN = icinga2-agent2.localdomain
 Issuer: CN = Icinga CA
 Valid From: Feb 14 11:29:36 2020 GMT
 Valid Until: Feb 10 11:29:36 2035 GMT
 Serial: 12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names: icinga2-agent2.localdomain
 Fingerprint: 40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
```

You can also print the `ca.crt` certificate without any further checks using the `--cert` parameter.

#### Print and Verify CA Certificate <a id="troubleshooting-certificate-verification-print-verify-ca"></a>

The `--cacert` CLI parameter allows you to check whether the given certificate file is a public CA certificate.

```
$ icinga2 pki verify --cacert ca.crt

information/cli: Checking whether certificate 'ca.crt' is a valid CA certificate.

 Version: 3
 Subject: CN = Icinga CA
 Issuer: CN = Icinga CA
 Valid From: Jul 31 12:26:08 2019 GMT
 Valid Until: Jul 27 12:26:08 2034 GMT
 Serial: 89:fe:d6:12:66:25:3a:c5:07:c1:eb:d4:e6:f2:df:ca:13:6e:dc:e7

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names:
 Fingerprint: 9A 11 29 A8 A3 89 F8 56 30 1A E4 0A B2 6B 28 46 07 F0 14 17 BD 19 A4 FC BD 41 40 B5 1A 8F BF 20

information/cli: OK: CA certificate file 'ca.crt' was verified successfully.
```

In case you pass a wrong certificate, an error is shown and the exit code is `2` (Critical).

```
$ icinga2 pki verify --cacert icinga2-agent2.localdomain.crt

information/cli: Checking whether certificate 'icinga2-agent2.localdomain.crt' is a valid CA certificate.

 Version: 3
 Subject: CN = icinga2-agent2.localdomain
 Issuer: CN = Icinga CA
 Valid From: Feb 14 11:29:36 2020 GMT
 Valid Until: Feb 10 11:29:36 2035 GMT
 Serial: 12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names: icinga2-agent2.localdomain
 Fingerprint: 40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C

critical/cli: CRITICAL: The file 'icinga2-agent2.localdomain.crt' does not seem to be a CA certificate file.
```

#### Verify Certificate is signed by CA Certificate <a id="troubleshooting-certificate-verification-signed-by-ca"></a>

Pass the certificate file to the `--cert` CLI parameter, and the `ca.crt` file to the `--cacert` parameter.
Common troubleshooting scenarios involve self-signed certificates and untrusted agents resulting in disconnects.

```
$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cacert ca.crt

information/cli: Verifying certificate 'icinga2-agent2.localdomain.crt'

 Version: 3
 Subject: CN = icinga2-agent2.localdomain
 Issuer: CN = Icinga CA
 Valid From: Feb 14 11:29:36 2020 GMT
 Valid Until: Feb 10 11:29:36 2035 GMT
 Serial: 12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names: icinga2-agent2.localdomain
 Fingerprint: 40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C

information/cli: with CA certificate 'ca.crt'.

 Version: 3
 Subject: CN = Icinga CA
 Issuer: CN = Icinga CA
 Valid From: Jul 31 12:26:08 2019 GMT
 Valid Until: Jul 27 12:26:08 2034 GMT
 Serial: 89:fe:d6:12:66:25:3a:c5:07:c1:eb:d4:e6:f2:df:ca:13:6e:dc:e7

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names:
 Fingerprint: 9A 11 29 A8 A3 89 F8 56 30 1A E4 0A B2 6B 28 46 07 F0 14 17 BD 19 A4 FC BD 41 40 B5 1A 8F BF 20

information/cli: OK: Certificate with CN 'icinga2-agent2.localdomain' is signed by CA.
```

#### Verify Certificate matches Common Name (CN) <a id="troubleshooting-certificate-verification-common-name-match"></a>

This verifies the common name inside the certificate against a given string parameter.
Typical troubleshooting involves upper/lowercase CNs (Windows).

```
$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cn icinga2-agent2.localdomain

information/cli: Verifying common name (CN) 'icinga2-agent2.localdomain' in certificate 'icinga2-agent2.localdomain.crt'.

 Version: 3
 Subject: CN = icinga2-agent2.localdomain
 Issuer: CN = Icinga CA
 Valid From: Feb 14 11:29:36 2020 GMT
 Valid Until: Feb 10 11:29:36 2035 GMT
 Serial: 12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names: icinga2-agent2.localdomain
 Fingerprint: 40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C

information/cli: OK: CN 'icinga2-agent2.localdomain' matches certificate CN 'icinga2-agent2.localdomain'.
```

In the example below, the certificate uses an upper case CN.

```
$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cn icinga2-agent2.localdomain

information/cli: Verifying common name (CN) 'icinga2-agent2.localdomain' in certificate 'icinga2-agent2.localdomain.crt'.

 Version: 3
 Subject: CN = ICINGA2-agent2.localdomain
 Issuer: CN = Icinga CA
 Valid From: Feb 14 11:29:36 2020 GMT
 Valid Until: Feb 10 11:29:36 2035 GMT
 Serial: 12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0

 Signature Algorithm: sha256WithRSAEncryption
 Subject Alt Names: ICINGA2-agent2.localdomain
 Fingerprint: 40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C

critical/cli: CRITICAL: CN 'icinga2-agent2.localdomain' does NOT match certificate CN 'ICINGA2-agent2.localdomain'.
```



### Certificate Signing <a id="troubleshooting-certificate-signing"></a>

Icinga offers two methods:

* [CSR Auto-Signing](06-distributed-monitoring.md#distributed-monitoring-setup-csr-auto-signing), which uses a client (an agent or a satellite) ticket generated on the master as the trust identifier.
* [On-Demand CSR Signing](06-distributed-monitoring.md#distributed-monitoring-setup-on-demand-csr-signing), which lets you sign pending certificate requests on the master.

Whenever a signed certificate is not received on the requesting clients, make sure to check the following:

* The ticket was valid and the master's log shows nothing to the contrary (CSR Auto-Signing only)
* If the agent/satellite is directly connected to the CA master, check whether the master actually has performance problems processing the request. If the connection is closed without a certificate response, analyse the master's health. It is also advised to upgrade to v2.11, where network stack problems have been fixed.
* If you're using a 3+ level cluster, check whether the satellite really forwarded the CSR signing request and the master processed it.

Other common errors:

* The generated ticket is invalid. The client receives this error message, and the master logs a warning message.
* The [api](09-object-types.md#objecttype-apilistener) feature does not have the `ticket_salt` attribute set to the `TicketSalt` constant generated by the CLI wizards.

In case you are using On-Demand CSR Signing, `icinga2 ca list` on the master only lists
pending requests since v2.11. Add `--all` to also see signed requests. Keep in mind that
old requests are automatically purged after one week.


### TLS Handshake: Ciphers <a id="troubleshooting-certificate-handshake-ciphers"></a>

Starting with v2.11, the default configured ciphers have been hardened to modern
standards. This includes TLS v1.2 as the minimum protocol version.

In case the TLS handshake fails with `no shared cipher`, first analyse whether both
instances support the same ciphers.
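A simple first step is to list what each side offers locally. The command below prints the
locally available cipher suite names; run it on both endpoints and compare the lists. The
cipher string `HIGH:!aNULL` is just an example filter, not Icinga's default:

```bash
# List locally available cipher suite names, one per line, sorted.
openssl ciphers -v 'HIGH:!aNULL' | awk '{ print $1 }' | sort
```

A TLS handshake needs at least one cipher suite present on both sides.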

#### Client connects to Server <a id="troubleshooting-certificate-handshake-ciphers-client"></a>

Connect using `openssl s_client` and try to reproduce the connection problem.

> **Important**
>
> The endpoint with the server role **accepting** the connection picks the preferred
> cipher. E.g. when a satellite connects to the master, the master chooses the cipher.
>
> Keep this in mind when simulating the client role connecting to a server with
> CLI tools such as `openssl s_client`.


`openssl s_client` tells you about the supported and shared cipher suites
on the remote server. `openssl ciphers` lists locally available ciphers.

```
$ openssl s_client -connect 192.168.33.5:5665
...

---
SSL handshake has read 2899 bytes and written 786 bytes
---
New, TLSv1/SSLv3, Cipher is AES256-GCM-SHA384
Server public key is 4096 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : AES256-GCM-SHA384

...
```

You can specifically use one cipher or a list with the `-cipher` parameter:

```bash
openssl s_client -connect 192.168.33.5:5665 -cipher 'ECDHE-RSA-AES256-GCM-SHA384'
```

In order to fully simulate a connecting client, provide the certificates too:

```bash
CERTPATH='/var/lib/icinga2/certs'
HOSTNAME='icinga2.vagrant.demo.icinga.com'
openssl s_client -connect 192.168.33.5:5665 -cert "${CERTPATH}/${HOSTNAME}.crt" -key "${CERTPATH}/${HOSTNAME}.key" -CAfile "${CERTPATH}/ca.crt" -cipher 'ECDHE-RSA-AES256-GCM-SHA384'
```

In case you need to change the default cipher list,
set the [cipher_list](09-object-types.md#objecttype-apilistener) attribute
in the `api` feature configuration accordingly.
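For example, in `/etc/icinga2/features-available/api.conf`. The cipher string shown here
is purely illustrative, not a recommendation; `cipher_list` is documented in the
[ApiListener](09-object-types.md#objecttype-apilistener) object type:

```
object ApiListener "api" {
  cipher_list = "ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256"
}
```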

Beware of using insecure ciphers; this may become a
security risk in your organisation.

#### Server Accepts Client <a id="troubleshooting-certificate-handshake-ciphers-server"></a>

If the master node does not actively connect to the satellite/agent node(s), but instead
the child node actively connects, you can still simulate a TLS handshake.

Use `openssl s_server` instead of `openssl s_client` on the master during the connection
attempt. Note that `s_server` listens, so it takes `-accept` with a port instead of `-connect`,
and it needs the local certificates:

```bash
openssl s_server -accept 5665 -cert /var/lib/icinga2/certs/icinga2-master1.localdomain.crt -key /var/lib/icinga2/certs/icinga2-master1.localdomain.key -CAfile /var/lib/icinga2/certs/ca.crt
```

Since the server role chooses the preferred cipher suite in Icinga,
you can test-drive the "agent connects to master" mode here, provided that
the TCP connection is not blocked by the firewall.


#### Cipher Scan Tools <a id="troubleshooting-certificate-handshake-ciphers-scantools"></a>

You can also use other tools to test the available cipher suites; this is what SSL Labs and
similar services provide for TLS enabled websites as well. [This post](https://superuser.com/questions/109213/how-do-i-list-the-ssl-tls-cipher-suites-a-particular-website-offers)
highlights some tools and scripts such as [sslscan](https://github.com/rbsec/sslscan) or [testssl.sh](https://github.com/drwetter/testssl.sh/).

Example for sslscan on macOS against a Debian 10 Buster instance
running v2.11:

```
$ brew install sslscan

$ sslscan 192.168.33.22:5665
Version: 1.11.13-static
OpenSSL 1.0.2f 28 Jan 2016

Connected to 192.168.33.22

Testing SSL server 192.168.33.22 on port 5665 using SNI name 192.168.33.22

  TLS Fallback SCSV:
Server supports TLS Fallback SCSV

  TLS renegotiation:
Session renegotiation not supported

  TLS Compression:
Compression disabled

  Heartbleed:
TLS 1.2 not vulnerable to heartbleed
TLS 1.1 not vulnerable to heartbleed
TLS 1.0 not vulnerable to heartbleed

  Supported Server Cipher(s):
Preferred TLSv1.2 256 bits ECDHE-RSA-AES256-GCM-SHA384 Curve P-256 DHE 256
Accepted  TLSv1.2 128 bits ECDHE-RSA-AES128-GCM-SHA256 Curve P-256 DHE 256
Accepted  TLSv1.2 256 bits ECDHE-RSA-AES256-SHA384     Curve P-256 DHE 256
Accepted  TLSv1.2 128 bits ECDHE-RSA-AES128-SHA256     Curve P-256 DHE 256

  SSL Certificate:
Signature Algorithm: sha256WithRSAEncryption
RSA Key Strength:    4096

Subject:  icinga2-debian10.vagrant.demo.icinga.com
Altnames: DNS:icinga2-debian10.vagrant.demo.icinga.com
Issuer:   Icinga CA

Not valid before: Jul 12 07:39:55 2019 GMT
Not valid after:  Jul  8 07:39:55 2034 GMT
```

## Distributed Troubleshooting <a id="troubleshooting-cluster"></a>

This applies to any Icinga 2 node in a [distributed monitoring setup](06-distributed-monitoring.md#distributed-monitoring-scenarios).

You should configure the [cluster health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks) if you haven't
done so already.

> **Note**
>
> Some problems just exist due to wrong file permissions or applied packet filters. Make
> sure to check these first.

### Cluster Troubleshooting Connection Errors <a id="troubleshooting-cluster-connection-errors"></a>

General connection errors could be one of the following problems:

* Incorrect network configuration
* Packet loss
* Firewall rules preventing traffic

Use tools like `netstat`, `tcpdump`, `nmap`, etc. to make sure that the cluster communication
works (default port is `5665`).

```bash
tcpdump -n port 5665 -i any

netstat -tulpen | grep icinga

nmap icinga2-agent1.localdomain
```

### Cluster Troubleshooting TLS Errors <a id="troubleshooting-cluster-tls-errors"></a>

If the cluster communication fails with TLS/SSL error messages, make sure to check
the following:

* File permissions on the TLS certificate files
* Does the used CA match for all cluster endpoints?
  * Verify that the `Issuer` is your trusted CA
  * Verify that the `Subject` contains your endpoint's common name (CN)
  * Check the validity of the certificate itself

Try to manually connect from `icinga2-agent1.localdomain` to the master node `icinga2-master1.localdomain`:

```
$ openssl s_client -CAfile /var/lib/icinga2/certs/ca.crt -cert /var/lib/icinga2/certs/icinga2-agent1.localdomain.crt -key /var/lib/icinga2/certs/icinga2-agent1.localdomain.key -connect icinga2-master1.localdomain:5665

CONNECTED(00000003)
---
...
```

If the connection attempt fails or your CA does not match, [verify the certificates](15-troubleshooting.md#troubleshooting-certificate-verification).
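An offline check whether the agent certificate chains to the local `ca.crt` can be done with
`openssl verify`. A minimal sketch, assuming the default certificate store and the agent
endpoint name used in the examples above:

```bash
# Offline check: does the agent certificate chain to the local ca.crt?
certs=/var/lib/icinga2/certs
if [ -f "$certs/ca.crt" ]; then
  openssl verify -CAfile "$certs/ca.crt" "$certs/icinga2-agent1.localdomain.crt"
else
  echo "no certificates found in $certs"
fi
```

A mismatching CA prints a verification error instead of `OK`.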


#### Cluster Troubleshooting Unauthenticated Clients <a id="troubleshooting-cluster-unauthenticated-clients"></a>

Unauthenticated nodes are able to connect. This is required for agent/satellite setups.

Master:

```
[2015-07-13 18:29:25 +0200] information/ApiListener: New client connection for identity 'icinga2-agent1.localdomain' (unauthenticated)
```

Agent as command execution bridge:

```
[2015-07-13 18:29:26 +1000] notice/ClusterEvents: Discarding 'execute command' message from 'icinga2-master1.localdomain': Invalid endpoint origin (client not allowed).
```

If these messages do not go away, make sure to [verify the master and agent certificates](15-troubleshooting.md#troubleshooting-certificate-verification).


### Cluster Troubleshooting Message Errors <a id="troubleshooting-cluster-message-errors"></a>

When the network connection is broken or gone, the Icinga 2 instances will be disconnected.
If the connection can't be re-established between endpoints in the same HA zone,
they remain in split-brain mode and history may differ.

Although the Icinga 2 cluster protocol stores historical events in a [replay log](15-troubleshooting.md#troubleshooting-cluster-replay-log)
for later synchronisation, you should make sure to check why the network connection failed.

Make sure to set up [cluster health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks)
to monitor the connectivity of all endpoints and zones.


### Cluster Troubleshooting Command Endpoint Errors <a id="troubleshooting-cluster-command-endpoint-errors"></a>

Command endpoints can be used [for agents](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint)
as well as inside a [High-Availability cluster](06-distributed-monitoring.md#distributed-monitoring-scenarios).

There is no CLI command for manually executing the check, but you can verify
the following (e.g. by invoking a forced check from the web interface):

* `/var/log/icinga2/icinga2.log` shows connection and execution errors.
  * The ApiListener is not enabled to [accept commands](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint). This is visible as `UNKNOWN` check result output.
  * The `CheckCommand` definition is not found on the remote client. This is visible as `UNKNOWN` check result output.
  * The referenced check plugin is not found on the remote agent.
  * Runtime warnings and errors, e.g. unresolved runtime macros or configuration problems.
* Specific error messages are also populated into `UNKNOWN` check results, including a detailed error message in their output.
* Verify the [check source](15-troubleshooting.md#checks-check-source). This is populated by the node executing the check. You can see it in Icinga Web's detail view or by querying the REST API for this checkable object.

Additional tasks:

* More verbose logs are found inside the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output).

* Use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live check result streams.

Fetch all check result events matching the `event.service` name `remote-client`:

```bash
curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/events?queue=debugcommandendpoint&types=CheckResult&filter=match%28%22remote-client*%22,event.service%29'
```


#### Agent Hosts with Command Endpoint require a Zone <a id="troubleshooting-cluster-command-endpoint-errors-agent-hosts-command-endpoint-zone"></a>

2.11 fixes bugs where agent host checks would never be scheduled on
the master. One requirement is that the checkable host/service
is put into a zone.

By default, the Director puts the agent host in `zones.d/master`
and you're good to go. If you manually manage the configuration,
the config compiler now throws an error when `command_endpoint`
is set but no `zone` is defined.

In case you previously managed the configuration outside of `zones.d`,
follow the instructions below.

The most convenient way, e.g. when managing the objects in `conf.d`,
is to move them into the `master` zone.

First, verify the name of your endpoint's zone. The CLI wizards
use `master` by default.

```
vim /etc/icinga2/zones.conf

object Zone "master" {
  ...
}
```

Then create a new directory in `zones.d` called `master`, if it does not exist yet.

```bash
mkdir -p /etc/icinga2/zones.d/master
```

Now move the directory tree from `conf.d` into the `master` zone.

```bash
mv /etc/icinga2/conf.d/* /etc/icinga2/zones.d/master/
```

Validate the configuration and reload Icinga.

```bash
icinga2 daemon -C
systemctl restart icinga2
```

Another method is to specify the `zone` attribute manually, but since
this may lead to other unwanted "not checked" scenarios, we don't
recommend this for your production environment.

### Cluster Troubleshooting Config Sync <a id="troubleshooting-cluster-config-sync"></a>

In order to troubleshoot this, remember these key facts about the config sync:

* Within a config master zone, only one configuration master is allowed to have its config in `/etc/icinga2/zones.d`.
  * The config master copies the zone configuration from `/etc/icinga2/zones.d` to `/var/lib/icinga2/api/zones`. This storage is the same for all cluster endpoints, and the source for all config syncs.
  * The config master puts the `.authoritative` marker on these zone files locally. This ensures that it doesn't receive config updates from other endpoints. If you have copied the content from `/var/lib/icinga2/api/zones` to another node, make sure to remove these marker files there.
* During startup, the master validates the entire configuration and only syncs valid configuration to other zone endpoints.

Satellites/Agents < 2.11 store the received configuration directly in `/var/lib/icinga2/api/zones`, then validate it and reload the daemon.
Satellites/Agents >= 2.11 put the received configuration into the staging directory `/var/lib/icinga2/api/zones-stage` first, and only copy it to the production directory `/var/lib/icinga2/api/zones` once the validation was successful.

The configuration sync logs the operations during startup with the `information` severity level. Received zone configuration is also logged.

Typical errors are:

* The api feature doesn't [accept config](06-distributed-monitoring.md#distributed-monitoring-top-down-config-sync). This is logged into `/var/log/icinga2/icinga2.log`.
* The received configuration zone is not configured in [zones.conf](04-configuration.md#zones-conf) and Icinga denies it. This is logged into `/var/log/icinga2/icinga2.log`.
* The satellite/agent has local configuration in `/etc/icinga2/zones.d` and thinks it is authoritative for this zone. It then denies the received update. Purge the content from `/etc/icinga2/zones.d` and `/var/lib/icinga2/api/zones/*`, and restart Icinga to fix this.

#### New configuration does not trigger a reload <a id="troubleshooting-cluster-config-sync-no-reload"></a>

The debug/notice log dumps the calculated checksums for all files and the comparison. Analyse this to troubleshoot further.
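To pull just those comparison lines out of the debug log, a grep along these lines can help.
The log path is the default; the pattern matches the `config change` and `checksum` messages:

```bash
# Extract the checksum comparison messages from the debug log.
log=/var/log/icinga2/debug.log
if [ -f "$log" ]; then
  grep -E 'config change|checksum' "$log"
else
  echo "no debug log at $log (enable the debuglog feature first)"
fi
```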
A complete sync for the `director-global` global zone can look like this:

```
[2019-08-01 09:20:25 +0200] notice/JsonRpcConnection: Received 'config::Update' message from 'icinga2-master1.localdomain'
[2019-08-01 09:20:25 +0200] information/ApiListener: Applying config update from endpoint 'icinga2-master1.localdomain' of zone 'master'.
[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/.checksums'.
[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/.timestamp'.
[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/director/001-director-basics.conf'.
[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/director/host_templates.conf'.
[2019-08-01 09:20:25 +0200] information/ApiListener: Received configuration for zone 'director-global' from endpoint 'icinga2-master1.localdomain'. Comparing the checksums.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking for config change between stage and production. Old (4): '{"/.checksums":"c4dd1237e36dcad9142f4d9a81324a7cae7d01543a672299b8c1bb08b629b7d1","/.timestamp":"f21c0e6551328812d9f5176e5e31f390de0d431d09800a85385630727b404d83","/director/001-director-basics.conf":"f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f","/director/host_templates.conf":"831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc"}' vs. new (4): '{"/.checksums":"c4dd1237e36dcad9142f4d9a81324a7cae7d01543a672299b8c1bb08b629b7d1","/.timestamp":"f21c0e6551328812d9f5176e5e31f390de0d431d09800a85385630727b404d83","/director/001-director-basics.conf":"f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f","/director/host_templates.conf":"831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc"}'.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring old internal file '/.checksums'.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring old internal file '/.timestamp'.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/001-director-basics.conf for old checksum: f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/host_templates.conf for old checksum: 831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring new internal file '/.checksums'.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring new internal file '/.timestamp'.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/001-director-basics.conf for new checksum: f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/host_templates.conf for new checksum: 831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc.
[2019-08-01 09:20:25 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/001-director-basics.conf' for zone 'director-global'.
[2019-08-01 09:20:25 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/host_templates.conf' for zone 'director-global'.
[2019-08-01 09:20:25 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/director-global' (2209 Bytes).

...

[2019-08-01 09:20:25 +0200] information/ApiListener: Received configuration updates (4) from endpoint 'icinga2-master1.localdomain' are different to production, triggering validation and reload.
[2019-08-01 09:20:25 +0200] notice/Process: Running command '/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '--close-stdio' '-e' '/var/log/icinga2/error.log' '--validate' '--define' 'System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/': PID 4532
[2019-08-01 09:20:25 +0200] notice/Process: PID 4532 ('/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '--close-stdio' '-e' '/var/log/icinga2/error.log' '--validate' '--define' 'System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/') terminated with exit code 0
[2019-08-01 09:20:25 +0200] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.

...

[2019-08-01 09:20:26 +0200] notice/Application: Got reload command, forwarding to umbrella process (PID 4236)
```

In case the received configuration updates are equal to what is running in production, a different message is logged and the validation/reload is skipped.

```
[2020-02-05 15:18:19 +0200] information/ApiListener: Received configuration updates (4) from endpoint 'icinga2-master1.localdomain' are equal to production, skipping validation and reload.
```


#### Syncing Binary Files is Denied <a id="troubleshooting-cluster-config-sync-binary-denied"></a>

The config sync is built for syncing text configuration files, wrapped into JSON-RPC messages.
Some users have started to use it as a binary file sync instead of using tools built for this purpose:
rsync, git, Puppet, Ansible, etc.

Starting with 2.11, such attempts are prohibited and logged.

```
[2019-08-02 16:03:19 +0200] critical/ApiListener: Ignoring file '/etc/icinga2/zones.d/global-templates/forbidden.exe' for cluster config sync: Does not contain valid UTF8. Binary files are not supported.
Context:
  (0) Creating config update for file '/etc/icinga2/zones.d/global-templates/forbidden.exe'
  (1) Activating object 'api' of type 'ApiListener'
```

In order to solve this problem, remove the mentioned files from `zones.d` and use an alternate way
of syncing plugin binaries to your satellites and agents.


#### Zones in Zones doesn't work <a id="troubleshooting-cluster-config-zones-in-zones"></a>

The cluster config sync only includes configuration put into `/etc/icinga2/zones.d`
if the corresponding zone is configured outside, in `/etc/icinga2/zones.conf`.

If you, for example, create a "zone inception" by defining the
`satellite` zone in `zones.d/master`, the config compiler does not
re-run and does not include this zone's config recursively from `zones.d/satellite`.
Since v2.11, the config compiler only includes directories for which a
zone has been configured. Previously it would also include renamed old zones,
broken zones, etc.; these long-standing bugs are now fixed.

A more concrete example: masters and satellites always need to know the zone hierarchy outside of the synced `zones.d` configuration.

**Doesn't work**

```
vim /etc/icinga2/zones.conf

object Zone "master" {
  endpoints = [ "icinga2-master1.localdomain", "icinga2-master2.localdomain" ]
}
```

```
vim /etc/icinga2/zones.d/master/satellite-zones.conf

object Zone "satellite" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
}
```

```
vim /etc/icinga2/zones.d/satellite/satellite-hosts.conf

object Host "agent" { ... }
```

The `agent` host object will never reach the satellite, since the master does not have
the `satellite` zone configured outside of `zones.d`.


**Works**

Each instance needs to know this, and know about the endpoints first:

```
vim /etc/icinga2/zones.conf

object Endpoint "icinga2-master1.localdomain" { ... }
object Endpoint "icinga2-master2.localdomain" { ... }

object Endpoint "icinga2-satellite1.localdomain" { ... }
object Endpoint "icinga2-satellite2.localdomain" { ... }
```

Then the zone hierarchy is required, both for the trust relationship and for the config sync inclusion.

```
vim /etc/icinga2/zones.conf

object Zone "master" {
  endpoints = [ "icinga2-master1.localdomain", "icinga2-master2.localdomain" ]
}

object Zone "satellite" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
  parent = "master"
}
```

Once done, you can start deploying actual monitoring objects into the satellite zone.
```
vim /etc/icinga2/zones.d/satellite/satellite-hosts.conf

object Host "agent" { ... }
```

That's also explained and described in the [documentation](06-distributed-monitoring.md#distributed-monitoring-scenarios-master-satellite-agents).

One exception: for `command_endpoint` agents, e.g. hosts created in the Director
with Agent -> yes, there is no config sync in place for the agent's own zone.
Therefore it is valid to just sync their zones via the config sync.

#### Director Changes

The following procedure restores the Zone/Endpoint objects as config objects outside of `zones.d`
in your master's/satellites' zones.conf, while rendering them as external objects in the Director.

[Example](06-distributed-monitoring.md#distributed-monitoring-scenarios-master-satellite-agents)
for a 3 level setup with the masters and satellites knowing about the zone hierarchy,
defined outside in [zones.conf](04-configuration.md#zones-conf):

```
object Endpoint "icinga-master1.localdomain" {
  // Define the 'host' attribute to control the connection direction on each instance
}

object Endpoint "icinga-master2.localdomain" {
  //...
}

object Endpoint "icinga-satellite1.localdomain" {
  //...
}

object Endpoint "icinga-satellite2.localdomain" {
  //...
}

//--------------
// Zone hierarchy with endpoints. Required for the trust relationship, and so that the cluster
// config sync knows which zone directory defined in zones.d needs to be synced to which endpoint.
// That's no different to what is explained in the docs as the basic zone trust hierarchy,
// and is intentionally managed outside in zones.conf there.

object Zone "master" {
  endpoints = [ "icinga-master1.localdomain", "icinga-master2.localdomain" ]
}

object Zone "satellite" {
  endpoints = [ "icinga-satellite1.localdomain", "icinga-satellite2.localdomain" ]
  parent = "master" // trust
}
```

Prepare the above configuration on all affected nodes; the satellites are likely up-to-date already.
Then continue with the steps below.

> * Backup your database, just to be on the safe side.
> * Create all non-external Zone/Endpoint objects on all related Icinga master/satellite nodes (manually in your local zones.conf).
> * While doing so, please do NOT restart Icinga and do not trigger deployments.
> * Change the type in the Director DB:
>
> ```sql
> UPDATE icinga_zone SET object_type = 'external_object' WHERE object_type = 'object';
> UPDATE icinga_endpoint SET object_type = 'external_object' WHERE object_type = 'object';
> ```
>
> * Render and deploy a new configuration in the Director. It will state that there are no changes. Ignore it, deploy anyway.
>
> That's it. All nodes should automatically restart, triggered by the deployed configuration via the cluster protocol.


### Cluster Troubleshooting Overdue Check Results <a id="troubleshooting-cluster-check-results"></a>

If your master does not receive check results (or any other events) from the child zones
(satellites, clients, etc.), make sure to check whether the client sending in events
is allowed to do so.

> **Tip**
>
> General troubleshooting hints on late check results are documented [here](15-troubleshooting.md#late-check-results).

The [distributed monitoring conventions](06-distributed-monitoring.md#distributed-monitoring-conventions)
apply. So, if there's a mismatch between your client node's endpoint name and its provided
certificate's CN, the master will deny all events.
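Such a mismatch can be checked directly on the agent with `openssl` — a sketch, assuming the default certificate location `/var/lib/icinga2/certs/<fqdn>.crt`; the `check_cn` helper is illustration only:

```bash
# Compare a certificate's CN against the expected endpoint name.
# check_cn is a hypothetical helper; on an agent the certificate
# usually lives in /var/lib/icinga2/certs/<fqdn>.crt.
check_cn() {
    cert="$1"; endpoint="$2"
    cn=$(openssl x509 -in "$cert" -noout -subject | sed 's/.*CN[ ]*=[ ]*//')
    if [ "$cn" = "$endpoint" ]; then
        echo "OK: CN '$cn' matches the endpoint name"
    else
        echo "MISMATCH: CN '$cn' != endpoint '$endpoint'" >&2
        return 1
    fi
}

# Demo with a throwaway self-signed certificate:
dir=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -keyout "$dir/key.pem" \
    -out "$dir/cert.pem" -days 1 -subj "/CN=icinga2-agent1.localdomain" 2>/dev/null
check_cn "$dir/cert.pem" "icinga2-agent1.localdomain"
```

On a real agent, compare the extracted CN against the `object Endpoint` name in `zones.conf`; if they differ, the master will discard the agent's events as shown in the log messages below.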
> **Tip**
>
> [Icinga Web 2](02-installation.md#setting-up-icingaweb2) provides a dashboard view
> for overdue check results.

Enable the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) on the master
for more verbose insights.

If the client cannot authenticate, it's a more general [problem](15-troubleshooting.md#troubleshooting-cluster-unauthenticated-clients).

The client's endpoint is not configured on, nor trusted by the master node:

```
Discarding 'check result' message from 'icinga2-agent1.localdomain': Invalid endpoint origin (client not allowed).
```

The check result message sent by the client does not belong to the zone the checkable object is
in on the master:

```
Discarding 'check result' message from 'icinga2-agent1.localdomain': Unauthorized access.
```


### Cluster Troubleshooting Replay Log <a id="troubleshooting-cluster-replay-log"></a>

If your `/var/lib/icinga2/api/log` directory grows, it generally means that your cluster
cannot replay the log after connection loss and re-establishment. A master node, for example,
stores all events for disconnected endpoints in the same and child zones.

Check the following:

* Are all clients connected? (Check e.g. with the [cluster health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks).)
* Check your [connection](15-troubleshooting.md#troubleshooting-cluster-connection-errors) in general.
* Does the log replay work, i.e. are all events processed and does the directory get cleared up over time?
* Decrease the `log_duration` attribute value for that specific [endpoint](09-object-types.md#objecttype-endpoint).

The cluster health checks also measure the `slave_lag` metric. Use this data to correlate
graphs with other events (e.g. disk I/O, network problems, etc.).
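To see whether the replay log is actually being drained, you can sample the directory size twice and compare — a sketch with a hypothetical `watch_growth` helper; on a real node, point it at `/var/lib/icinga2/api/log` and use a longer interval such as 60 seconds:

```bash
# Sample a directory's size twice and report whether it grew.
# watch_growth and dir_kb are hypothetical helpers for watching the
# replay log directory, /var/lib/icinga2/api/log on a master/satellite.
dir_kb() { du -sk "$1" | awk '{ print $1 }'; }

watch_growth() {
    dir="$1"; interval="${2:-60}"
    before=$(dir_kb "$dir")
    sleep "$interval"
    after=$(dir_kb "$dir")
    echo "replay log: ${before}K -> ${after}K"
    if [ "$after" -gt "$before" ]; then
        echo "WARNING: $dir is still growing"
    fi
}

# Demo against a throwaway directory (use the real log path in production):
d=$(mktemp -d)
dd if=/dev/zero of="$d/replay.log" bs=1024 count=8 2>/dev/null
watch_growth "$d" 1
```

A steadily growing size over several samples, combined with the `slave_lag` metric, usually points at a disconnected or misconfigured endpoint rather than a temporary backlog.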
### Cluster Troubleshooting: Windows Agents <a id="troubleshooting-cluster-windows-agents"></a>


#### Windows Service Exe Path <a id="troubleshooting-cluster-windows-agents-service-exe-path"></a>

Icinga agents can be installed either as an x86 or x64 package. If you enable features, or wonder why
logs are not written, the first step is to analyse which binary path the `icinga2` Windows service is using.

Start a new administrative PowerShell and ensure that the `icinga2` service is running.

```
C:\Program Files\ICINGA2\sbin> net start icinga2
```

Use the `Get-WmiObject` cmdlet to extract the Windows service and its path name.

```
C:\Program Files\ICINGA2\sbin> Get-WmiObject win32_service | ?{$_.Name -like '*icinga*'} | select Name, DisplayName, State, PathName

Name    DisplayName State   PathName
----    ----------- -----   --------
icinga2 Icinga 2    Running "C:\Program Files\ICINGA2\sbin\icinga2.exe" --scm "daemon"
```

If you have used the `icinga2.exe` from a different path to enable e.g. the `debuglog` feature,
navigate into `C:\Program Files\ICINGA2\sbin\` and use the correct exe to control the feature set.


#### Windows Agents consuming 100% CPU <a id="troubleshooting-cluster-windows-agents-cpu"></a>

> **Note**
>
> The network stack was rewritten in 2.11. This fixes several hanging connections and threads
> on older Windows agents and master/satellite nodes. Prior to testing the steps below, plan an upgrade.

Icinga 2 requires the `NodeName` [constant](17-language-reference.md#constants) in various places at runtime.
This includes loading the TLS certificates, setting the proper check source,
and so on.

Typically, the Windows setup wizard and also the CLI commands populate the [constants.conf](04-configuration.md#constants-conf)
file with the auto-detected or user-provided FQDN/Common Name.
If this constant is not set during startup, Icinga tries to resolve the
FQDN; if that fails, it falls back to the hostname. If everything fails, it logs
an error and sets the constant to `localhost`. This results in undefined behaviour
if ignored by the admin.

Querying DNS servers which are not reachable is CPU consuming. It may look as if Icinga
is running lots of checks, etc., while it actually is still starting up.

In order to fix this, edit the `constants.conf` file and populate
the `NodeName` constant with the FQDN. Ensure that this is the same value
as the local endpoint object's name.

```
const NodeName = "windows-agent1.domain.com"
```


#### Windows blocking Icinga 2 with ephemeral port range <a id="troubleshooting-cluster-windows-agents-ephemeral-port-range"></a>

When you see a message like this in your Windows agent logs:

```
critical/TcpSocket: Invalid socket: 10055, "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full."
```

then Windows is blocking Icinga 2 and, as a result, no more TCP connections can be handled.

Depending on the version, patch level and installed applications, Windows changes its
range of [ephemeral ports](https://en.wikipedia.org/wiki/Ephemeral_port#Range).

In order to solve this, raise the `MaxUserPort` value in the registry.

```
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

Value Name: MaxUserPort
Value Type: DWORD
Value data: 65534
```

More details can be found in [this blogpost](https://www.netways.de/blog/2019/01/24/windows-blocking-icinga-2-with-ephemeral-port-range/)
and this [MS help entry](https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t).