1# Icinga 2 Troubleshooting <a id="troubleshooting"></a>
2
3## Required Information <a id="troubleshooting-information-required"></a>
4
Please provide any details that may help to reproduce and understand your issue.
6Whether you ask on the [community channels](https://community.icinga.com) or you
7create an issue at [GitHub](https://github.com/Icinga), make sure
8that others can follow your explanations. If necessary, draw a picture and attach it for
9better illustration. This is especially helpful if you are troubleshooting a distributed
10setup.
11
We have come across many community questions and compiled this list. Please add your own
findings and details.
14
15* Describe the expected behavior in your own words.
16* Describe the actual behavior in one or two sentences.
* Provide general information such as:
	* How Icinga 2 was installed (including the repository used) and which distribution you are using
19	* `icinga2 --version`
20	* `icinga2 feature list`
21	* `icinga2 daemon -C`
22	* [Icinga Web 2](https://icinga.com/products/icinga-web-2/) version (screenshot from System - About)
23	* [Icinga Web 2 modules](https://icinga.com/products/icinga-web-2-modules/) e.g. the Icinga Director (optional)
24* Configuration insights:
25	* Provide complete configuration snippets explaining your problem in detail
26	* Your [icinga2.conf](04-configuration.md#icinga2-conf) file
27	* If you run multiple Icinga 2 instances, the [zones.conf](04-configuration.md#zones-conf) file (or `icinga2 object list --type Endpoint` and `icinga2 object list --type Zone`) from all affected nodes.
28* Logs
29	* Relevant output from your main and [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) in `/var/log/icinga2`. Please add step-by-step explanations with timestamps if required.
30	* The newest Icinga 2 crash log if relevant, located in `/var/log/icinga2/crash`
31* Additional details
32	* If the check command failed, what's the output of your manual plugin tests?
33	* In case of [debugging](21-development.md#development) Icinga 2, the full back traces and outputs
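
The general information from the list above can be gathered in one go. The following is a minimal sketch, assuming a Linux/Unix host with `icinga2` in the `PATH`. The function name `collect_icinga2_info` and the report layout are made up for illustration and are not part of Icinga 2.

```bash
# Hypothetical helper: collect basic Icinga 2 details into one text file.
# The function name and the report layout are illustrative assumptions.
collect_icinga2_info() {
  local outfile="${1:-icinga2-info.txt}"
  local cmd

  : > "$outfile"   # create or truncate the report file

  for cmd in "icinga2 --version" "icinga2 feature list" "icinga2 daemon -C"; do
    printf '### %s\n' "$cmd" >> "$outfile"
    # $cmd is split into words on purpose; record failures instead of aborting.
    if ! $cmd >> "$outfile" 2>&1; then
      printf '(command failed or is not available)\n' >> "$outfile"
    fi
    printf '\n' >> "$outfile"
  done
}
```

Call it with `collect_icinga2_info /tmp/icinga2-info.txt` and review the file for sensitive details before attaching it to a question or issue.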
34
35## Analyze your Environment <a id="troubleshooting-analyze-environment"></a>
36
37There are many components involved on a server running Icinga 2. When you
38analyze a problem, keep in mind that basic system administration knowledge
is also key to identifying bottlenecks and issues.
40
41> **Tip**
42>
43> [Monitor Icinga 2](08-advanced-topics.md#monitoring-icinga) and use the hints for further analysis.
44
* Analyze the system's performance and identify bottlenecks and issues.
46* Collect details about all applications (e.g. Icinga 2, MySQL, Apache, Graphite, Elastic, etc.).
* If data is exchanged via network (e.g. central MySQL cluster), monitor the available bandwidth too.
* Add graphs from Grafana or Graphite as screenshots to your issue description.
49
50Install tools which help you to do so. Opinions differ, let us know if you have any additions here!
51
### Analyze your Linux/Unix Environment <a id="troubleshooting-analyze-environment-linux"></a>
53
54[htop](https://hisham.hm/htop/) is a better replacement for `top` and helps to analyze processes
55interactively.
56
57```bash
58yum install htop
59apt-get install htop
60```
61
62If you are for example experiencing performance issues, open `htop` and take a screenshot.
63Add it to your question and/or bug report.
64
65Analyse disk I/O performance in Grafana, take a screenshot and obfuscate any sensitive details.
66Attach it when posting a question to the community channels.
67
68The [sysstat](https://github.com/sysstat/sysstat) package provides a number of tools to
69analyze the performance on Linux. On FreeBSD you could use `systat` for example.
70
71```bash
72yum install sysstat
73apt-get install sysstat
74```
75
76Example for `vmstat` (summary of memory, processes, etc.):
77
78```bash
79# summary
80vmstat -s
81# print timestamps, format in MB, stats every 1 second, 5 times
82vmstat -t -S M 1 5
83```
84
85Example for `iostat`:
86
87```bash
88watch -n 1 iostat
89```
90
91Example for `sar`:
92
93```bash
94sar # cpu
95sar -r # ram
96sar -q # load avg
97sar -b # I/O
98```
99
`sysstat` also provides the `iostat` binary.
101
102If you are missing checks and metrics found in your analysis, add them to your monitoring!
103
104### Analyze your Windows Environment <a id="troubleshooting-analyze-environment-windows"></a>
105
The tools found inside the [Sysinternals Suite](https://technet.microsoft.com/en-us/sysinternals/bb842062.aspx) are a good starting point on Windows.
107
108You can also start `perfmon` and analyze specific performance counters.
Take notes on anything which could be important for your monitoring, and add service
checks later on.
111
112> **Tip**
113>
114> Use an administrative Powershell to gain more insights.
115
116```
117cd C:\ProgramData\icinga2\var\log\icinga2
118
119Get-Content .\icinga2.log -tail 10 -wait
120```
121
122## Enable Debug Output <a id="troubleshooting-enable-debug-output"></a>
123
124### Enable Debug Output on Linux/Unix <a id="troubleshooting-enable-debug-output-linux"></a>
125
126Enable the `debuglog` feature:
127
128```bash
129icinga2 feature enable debuglog
130service icinga2 restart
131```
132
133The debug log file can be found in `/var/log/icinga2/debug.log`.
134
135You can tail the log files with an administrative shell:
136
137```bash
138cd /var/log/icinga2
139tail -f debug.log
140```
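
The debug log grows quickly, so it often helps to filter it by severity. The following is a minimal sketch, assuming that log lines follow the `[timestamp] severity/component: message` layout which is visible in the CLI output samples in this chapter. The helper name `filter_severity` is hypothetical.

```bash
# Hypothetical helper: keep only log lines of the given severity.
# Assumes lines look like "[2014-10-15 14:27:19 +0200] information/cli: ...".
filter_severity() {
  local severity="$1"
  grep -E "^\[[^]]+\] ${severity}/"
}

# Example: show the latest warnings from the debug log, if it is readable.
if [ -r /var/log/icinga2/debug.log ]; then
  filter_severity warning < /var/log/icinga2/debug.log | tail -n 20
fi
```

The function reads from standard input, so it can also be combined with `tail -f`.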
141
Alternatively, you may run Icinga 2 in the foreground with debugging enabled. Specify the console
log severity as an argument to `-x`.
144
145```bash
146/usr/sbin/icinga2 daemon -x notice
147```
148
149The [log severity](09-object-types.md#objecttype-filelogger) can be one of `critical`, `warning`, `information`, `notice`
150and `debug`.
151
152### Enable Debug Output on Windows <a id="troubleshooting-enable-debug-output-windows"></a>
153
154Open a Powershell with administrative privileges and enable the debug log feature.
155
156```
157C:\> cd C:\Program Files\ICINGA2\sbin
158
159C:\Program Files\ICINGA2\sbin> .\icinga2.exe feature enable debuglog
160```
161
162Ensure that the Icinga 2 service already writes the main log into `C:\ProgramData\icinga2\var\log\icinga2`.
163Restart the Icinga 2 service in an administrative Powershell and open the newly created `debug.log` file.
164
165```
166C:\> Restart-Service icinga2
167
168C:\> Get-Service icinga2
169```
170
171You can tail the log files with an administrative Powershell:
172
173```
174C:\> cd C:\ProgramData\icinga2\var\log\icinga2
175
176C:\ProgramData\icinga2\var\log\icinga2> Get-Content .\debug.log -tail 10 -wait
177```
178
179## Configuration Troubleshooting <a id="troubleshooting-configuration"></a>
180
181### List Configuration Objects <a id="troubleshooting-list-configuration-objects"></a>
182
183The `icinga2 object list` CLI command can be used to list all configuration objects and their
184attributes. The tool also shows where each of the attributes was modified.
185
186> **Tip**
187>
188> Use the Icinga 2 API to access [config objects at runtime](12-icinga2-api.md#icinga2-api-config-objects) directly.
189
190That way you can also identify which objects have been created from your [apply rules](17-language-reference.md#apply).
191
192```
193# icinga2 object list
194
195Object 'localhost!ssh' of type 'Service':
196  * __name = 'localhost!ssh'
197  * check_command = 'ssh'
198    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 5:3-5:23
199  * check_interval = 60
200    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 24:3-24:21
201  * host_name = 'localhost'
202    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 4:3-4:25
203  * max_check_attempts = 3
204    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 23:3-23:24
205  * name = 'ssh'
206  * retry_interval = 30
207    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 25:3-25:22
208  * templates = [ 'ssh', 'generic-service' ]
209    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 1:0-7:1
210    % += modified in '/etc/icinga2/conf.d/templates.conf', lines 22:1-26:1
211  * type = 'Service'
212  * vars
213    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19
214    * sla = '24x7'
215      % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19
216
217[...]
218```
219
220On Windows, use an administrative Powershell:
221
222```
223C:\> cd C:\Program Files\ICINGA2\sbin
224
225C:\Program Files\ICINGA2\sbin> .\icinga2.exe object list
226```
227
228You can also filter by name and type:
229
230```
231# icinga2 object list --name *ssh* --type Service
232Object 'localhost!ssh' of type 'Service':
233  * __name = 'localhost!ssh'
234  * check_command = 'ssh'
235    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 5:3-5:23
236  * check_interval = 60
237    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 24:3-24:21
238  * host_name = 'localhost'
239    % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 4:3-4:25
240  * max_check_attempts = 3
241    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 23:3-23:24
242  * name = 'ssh'
243  * retry_interval = 30
244    % = modified in '/etc/icinga2/conf.d/templates.conf', lines 25:3-25:22
245  * templates = [ 'ssh', 'generic-service' ]
246    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 1:0-7:1
247    % += modified in '/etc/icinga2/conf.d/templates.conf', lines 22:1-26:1
248  * type = 'Service'
249  * vars
250    % += modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19
251    * sla = '24x7'
252      % = modified in '/etc/icinga2/conf.d/hosts/localhost/ssh.conf', lines 6:3-6:19
253
254Found 1 Service objects.
255
256[2014-10-15 14:27:19 +0200] information/cli: Parsed 175 objects.
257```
258
Runtime modifications via the [REST API](12-icinga2-api.md#icinga2-api-config-objects)
are not immediately reflected in this output. Furthermore, there is a known issue with
[group assign expressions](17-language-reference.md#group-assign) which are not reflected in the host object output.
You need to restart Icinga 2 in order to update the `icinga2.debug` cache file.
263
264### Apply rules do not match <a id="apply-rules-do-not-match"></a>
265
266You can analyze apply rules and matching objects by using the [script debugger](20-script-debugger.md#script-debugger).
267
268### Where are the check command definitions? <a id="check-command-definitions"></a>
269
270Icinga 2 features a number of built-in [check command definitions](10-icinga-template-library.md#icinga-template-library) which are
271included with
272
273```
274include <itl>
275include <plugins>
276```
277
278in the [icinga2.conf](04-configuration.md#icinga2-conf) configuration file. These files are not considered
configuration files and will be overwritten on upgrade, so please send modifications as proposed patches upstream.
280The default include path is set to `/usr/share/icinga2/includes` with the constant `IncludeConfDir`.
281
282You should add your own command definitions to a new file in `conf.d/` called `commands.conf`
283or similar.
284
285### Configuration is ignored <a id="configuration-ignored"></a>
286
287* Make sure that the line(s) are not [commented out](17-language-reference.md#comments) (starting with `//` or `#`, or
288encapsulated by `/* ... */`).
289* Is the configuration file included in [icinga2.conf](04-configuration.md#icinga2-conf)?
290
Run the [configuration validation](11-cli-commands.md#config-validation) and add `notice` as log severity.
Search for the file which should be included, e.g. using the `grep` CLI command.
293
294```bash
295icinga2 daemon -C -x notice | grep command
296```
297
298### Configuration attributes are inherited from <a id="configuration-attribute-inheritance"></a>
299
300Icinga 2 allows you to import templates using the [import](17-language-reference.md#template-imports) keyword. If these templates
301contain additional attributes, your objects will automatically inherit them. You can override
302or modify these attributes in the current object.
303
304The [object list](15-troubleshooting.md#troubleshooting-list-configuration-objects) CLI command allows you to verify the attribute origin.
305
306### Configuration Value with Single Dollar Sign <a id="configuration-value-dollar-sign"></a>
307
In case your configuration validation fails with a missing closing dollar sign error message, you
did not properly escape a single dollar sign, so it is treated as the start of a [runtime macro](03-monitoring-basics.md#runtime-macros).
310
311```
312critical/config: Error: Validation failed for Object 'ping4' (Type: 'Service') at /etc/icinga2/zones.d/global-templates/windows.conf:24: Closing $ not found in macro format string 'top-syntax=${list}'.
313```
314
315Correct the custom variable value to
316
317```
318"top-syntax=$${list}"
319```
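
To spot such values before running the validation, you can count the dollar signs in a string. The sketch below is a rough heuristic and not how Icinga 2 parses macros: it assumes that `$$` is the escape sequence shown above and that macros are written as `$name$`, so every valid value contains an even number of dollar signs. The function name is hypothetical.

```bash
# Hypothetical helper: succeeds if a value contains an odd number of `$`
# characters, which means one of them is neither escaped nor closed.
has_unpaired_dollar() {
  local value="$1"
  local count

  # Keep only the `$` characters and count them.
  count="$(printf '%s' "$value" | tr -cd '$' | wc -c)"

  [ $((count % 2)) -eq 1 ]
}

has_unpaired_dollar 'top-syntax=${list}'  && echo "unescaped dollar sign found"
has_unpaired_dollar 'top-syntax=$${list}' || echo "value looks fine"
```

An even count does not prove that the value is what you intended, it only rules out this specific validation error.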
320
321
322## Checks Troubleshooting <a id="troubleshooting-checks"></a>
323
324### Executed Command for Checks <a id="checks-executed-command"></a>
325
326* Use the Icinga 2 API to [query](12-icinga2-api.md#icinga2-api-config-objects-query) host/service objects
327for their check result containing the executed shell command.
328* Use the Icinga 2 [console cli command](11-cli-commands.md#cli-command-console)
329to fetch the checkable object, its check result and the executed shell command.
330* Alternatively enable the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) and look for the executed command.
331
332Example for a service object query using a [regex match](18-library-reference.md#global-functions-regex)
333on the name:
334
335```
336$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' \
337-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^http" }, "attrs": [ "__name", "last_check_result" ], "pretty": true }'
338{
339    "results": [
340        {
341            "attrs": {
342                "__name": "example.localdomain!http",
343                "last_check_result": {
344                    "active": true,
345                    "check_source": "example.localdomain",
346                    "command": [
347                        "/usr/local/sbin/check_http",
348                        "-I",
349                        "127.0.0.1",
350                        "-u",
351                        "/"
352                    ],
353
354  ...
355
356                }
357            },
358            "joins": {},
359            "meta": {},
360            "name": "example.localdomain!http",
361            "type": "Service"
362        }
363    ]
364}
365```
366
367Alternatively when using the Director, navigate into the Service Detail View
368in Icinga Web and pick `Inspect` to query the details.
369
370Example for using the `icinga2 console` CLI command evaluation functionality:
371
372```
373$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' \
374--eval 'get_service("example.localdomain", "http").last_check_result.command' | python -m json.tool
375[
376    "/usr/local/sbin/check_http",
377    "-I",
378    "127.0.0.1",
379    "-u",
380    "/"
381]
382```
383
384Example for searching the debug log:
385
386```bash
387icinga2 feature enable debuglog
388systemctl restart icinga2
389tail -f /var/log/icinga2/debug.log | grep "notice/Process"
390```
391
392
393### Checks are not executed <a id="checks-not-executed"></a>
394
* First off, determine whether the checks are executed locally or remotely in a distributed setup.
396
397If the master does not receive check results from the satellite, move your analysis to the satellite
398and verify why the checks are not executed there.
399
400* Check the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) to see if the check command gets executed.
401* Verify that failed dependencies do not prevent command execution.
402* Make sure that the plugin is executable by the Icinga 2 user (run a manual test).
403* Make sure the [checker](11-cli-commands.md#enable-features) feature is enabled.
404* Use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live check result streams.
405
Test a plugin as the `icinga` user:
407
408```bash
409sudo -u icinga /usr/lib/nagios/plugins/check_ping -4 -H 127.0.0.1 -c 5000,100% -w 3000,80%
410```
411
412> **Note**
413>
> **Never test plugins as root; always use the Icinga daemon user.** The environment and permissions differ.
415>
416> Also, the daemon user **does not** spawn a terminal shell (Bash, etc.) so it won't read anything from .bashrc
417> and variants. The Icinga daemon only relies on sysconfig environment variables being set.
418
419
420Enable the checker feature.
421
422```
423# icinga2 feature enable checker
424The feature 'checker' is already enabled.
425```
426
427Fetch all check result events matching the `event.service` name `random`:
428
429```bash
430curl -k -s -u root:icinga -H 'Accept: application/json' -X POST \
431 'https://localhost:5665/v1/events?queue=debugchecks&types=CheckResult&filter=match%28%22random*%22,event.service%29'
432```
433
434
435### Analyze Check Source <a id="checks-check-source"></a>
436
Sometimes checks are not executed on the remote host but on the master instead.
This can lead to unwanted results or NOT-OK states.

The `check_source` attribute is the best indication of where a check command
was actually executed. This could be a satellite with synced configuration
or a client acting as remote command bridge. Both report the node where
the plugin was called as the check source.
444
445Example for retrieving the check source from all `disk` services using a
446[regex match](18-library-reference.md#global-functions-regex) on the name:
447
448```
449$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' \
450-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^disk" }, "attrs": [ "__name", "last_check_result" ], "pretty": true }'
451{
452    "results": [
453        {
454            "attrs": {
455                "__name": "icinga2-agent1.localdomain!disk",
456                "last_check_result": {
457                    "active": true,
458                    "check_source": "icinga2-agent1.localdomain",
459
460  ...
461
462                }
463            },
464            "joins": {},
465            "meta": {},
466            "name": "icinga2-agent1.localdomain!disk",
467            "type": "Service"
468        }
469    ]
470}
471```
472
473Alternatively when using the Director, navigate into the Service Detail View
474in Icinga Web and pick `Inspect` to query the details.
475
476Example with the debug console:
477
478```
479$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' \
480--eval 'get_service("icinga2-agent1.localdomain", "disk").last_check_result.check_source' | python -m json.tool
481
482"icinga2-agent1.localdomain"
483```
484
485
486### NSClient++ Check Errors with nscp-local <a id="nsclient-check-errors-nscp-local"></a>
487
488The [nscp-local](10-icinga-template-library.md#nscp-check-local) CheckCommand object definitions call the local `nscp.exe` command.
489If a Windows client service check fails to find the `nscp.exe` command, the log output would look like this:
490
491```
492Command ".\nscp.exe" "client" "-a" "drive=d" "-a" "show-all" "-b" "-q" "check_drivesize" failed to execute: 2, "The system cannot find the file specified."
493```
494
495or
496
497```
498Command ".
499scp.exe" "client" "-a" "drive=d" "-a" "show-all" "-b" "-q" "check_drivesize" failed to execute: 2, "The system cannot find the file specified."
500```
501
502The above actually prints `.\\nscp.exe` where the escaped `\n` character gets interpreted as new line.
503
504Both errors lead to the assumption that the `NscpPath` constant is empty or set to a `.` character.
505This could mean the following:
506
507* The command is **not executed on the Windows client**. Check the [check_source](15-troubleshooting.md#checks-check-source) attribute from the check result.
508* You are using an outdated NSClient++ version (0.3.x or 0.4.x) which is not compatible with Icinga 2.
509* You are using a custom NSClient++ installer which does not register the correct GUID for NSClient++
510
511More troubleshooting:
512
513Retrieve the `NscpPath` constant on your Windows client:
514
515```
516C:\Program Files\ICINGA2\sbin\icinga2.exe variable get NscpPath
517```
518
519If the variable is returned empty, manually test how Icinga 2 would resolve
520its path (this can be found inside the ITL):
521
522```
523C:\Program Files\ICINGA2\sbin\icinga2.exe console --eval "dirname(msi_get_component_path(\"{5C45463A-4AE9-4325-96DB-6E239C034F93}\"))"
524```
525
526If this command does not return anything, NSClient++ is not properly installed.
527Verify that inside the `Programs and Features` (`appwiz.cpl`) control panel.
528
529You can run the bundled NSClient++ installer from the Icinga 2 Windows package.
530The msi package is located in `C:\Program Files\ICINGA2\sbin`.
531
The bundled NSClient++ version has been properly tested with Icinga 2. Keep that
in mind when using a different package.
534
535
536### Check Thresholds Not Applied <a id="check-thresholds-not-applied"></a>
537
538This could happen with [clients as command endpoint execution](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint).
539
If you have, for example, a client host `icinga2-agent1.localdomain`
and a service `disk` check defined on the master, the warning and
critical thresholds are sometimes not applied and unwanted notification
alerts are raised.
544
545This happens because the client itself includes a host object with
546its `NodeName` and a basic set of checks in the [conf.d](04-configuration.md#conf-d)
547directory, i.e. `disk` with the default thresholds.
548
549Clients which have the `checker` feature enabled will attempt
550to execute checks for local services and send their results
551back to the master.
552
553If you now have the same host and service objects on the
554master you will receive wrong check results from the client.
555
556Solution:
557
558* Disable the `checker` feature on clients: `icinga2 feature disable checker`.
559* Remove the inclusion of [conf.d](04-configuration.md#conf-d) as suggested in the [client setup docs](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint).
560
561### Check Fork Errors <a id="check-fork-errors"></a>
562
563Newer versions of systemd on Linux limit spawned processes for
564services.
565
* v227 introduces the `TasksMax` setting for units, which limits the number of spawned processes.
567* v228 adds `DefaultTasksMax` in the global `systemd-system.conf` with a default setting of 512 processes.
568* v231 changes the default value to 15%
569
570This can cause problems with Icinga 2 in large environments with many
571commands executed in parallel starting with systemd v228. Some distributions
572also may have changed the defaults.
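
To find out whether a host is affected, check the systemd version and the task limit of the Icinga 2 unit. The following is a minimal sketch, assuming `systemctl` is available and the unit is called `icinga2.service`. Both helper names are hypothetical, and the version threshold only encodes the list above.

```bash
# Hypothetical helper: succeeds if the given systemd version applies a
# default task limit to services (v228 and later, per the list above).
systemd_limits_tasks() {
  local version="$1"
  [ "$version" -ge 228 ]
}

# Hypothetical helper: print the task limit and the current task count
# of the Icinga 2 unit. Not called automatically, run it on the affected host.
show_icinga2_task_limit() {
  local version
  # The first line of `systemctl --version` starts with "systemd <version>".
  version="$(systemctl --version | awk 'NR == 1 { print $2 }')"
  if systemd_limits_tasks "$version"; then
    systemctl show icinga2.service -p TasksMax -p TasksCurrent
  else
    echo "systemd $version does not apply a default task limit"
  fi
}
```

If `TasksCurrent` is close to `TasksMax` while many checks run in parallel, the limit is a likely cause of fork errors.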
573
574The error message could look like this:
575
576```
5772017-01-12T11:55:40.742685+01:00 icinga2-master1 kernel: [65567.582895] cgroup: fork rejected by pids controller in /system.slice/icinga2.service
578```
579
In order to solve the problem, increase the `TasksMax` value for the Icinga 2 unit
or set it to `infinity`. `DefaultTasksMax` is the corresponding global setting and
is not valid inside the `[Service]` section of a unit.

```bash
mkdir -p /etc/systemd/system/icinga2.service.d
cat >/etc/systemd/system/icinga2.service.d/limits.conf <<EOF
[Service]
TasksMax=infinity
EOF

systemctl daemon-reload
systemctl restart icinga2
```
593
594An example is available inside the GitHub repository in [etc/initsystem](https://github.com/Icinga/icinga2/tree/master/etc/initsystem).
595
596External Resources:
597
598* [Fork limit for cgroups](https://lwn.net/Articles/663873/)
599* [systemd changelog](https://github.com/systemd/systemd/blob/master/NEWS)
600* [Icinga 2 upstream issue](https://github.com/Icinga/icinga2/issues/5611)
601* [systemd upstream discussion](https://github.com/systemd/systemd/issues/3211)
602
603### Systemd Watchdog <a id="check-systemd-watchdog"></a>
604
605Usually Icinga 2 is a mission critical part of infrastructure and should be
606online at all times. In case of a recoverable crash (e.g. OOM) you may want to
607restart Icinga 2 automatically. With systemd it is as easy as overriding some
608settings of the Icinga 2 systemd service by creating
609`/etc/systemd/system/icinga2.service.d/override.conf` with the following
610content:
611
612```
613[Service]
614Restart=always
615RestartSec=1
616StartLimitInterval=10
617StartLimitBurst=3
618```
619
The systemd watchdog can also help with monitoring Icinga 2. To activate it, add the following to the override:
621
622```
623WatchdogSec=30s
624```
625
This way systemd will kill Icinga 2 if it does not send a watchdog notification for more than 30 seconds. A timeout of less than 10 seconds is not
recommended. When the watchdog is activated, `Restart=` can be set to `on-watchdog` to restart Icinga 2 in the case of a
watchdog timeout.
629
630Run `systemctl daemon-reload && systemctl restart icinga2` to apply the changes.
631Now systemd will always try to restart Icinga 2 (except if you run
632`systemctl stop icinga2`). After three failures in ten seconds it will stop
633trying because you probably have a problem that requires manual intervention.
634
635### Late Check Results <a id="late-check-results"></a>
636
637[Icinga Web 2](https://icinga.com/products/icinga-web-2/) provides
638a dashboard overview for `overdue checks`.
639
640The REST API provides the [status](12-icinga2-api.md#icinga2-api-status) URL endpoint with some generic metrics
641on Icinga and its features.
642
643```bash
644curl -k -s -u root:icinga 'https://localhost:5665/v1/status?pretty=1' | less
645```
646
647You can also calculate late check results via the REST API:
648
649* Fetch the `last_check` timestamp from each object
* Compare the timestamp with the current time minus a multiple of `check_interval`. Adjust the multiplier (for example, five times `check_interval`) to see which results are really late.
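
The comparison boils down to simple arithmetic on Unix timestamps. The sketch below uses the same condition as the console filter `s.last_check < get_time() - 2 * s.check_interval`, with the multiplier made configurable. The function name `is_check_late` and its argument order are assumptions for illustration.

```bash
# Hypothetical helper: succeeds if a check result is considered late.
# Arguments: last_check (Unix timestamp), check_interval (seconds),
#            factor (number of intervals to tolerate), now (Unix timestamp).
is_check_late() {
  local last_check="$1" check_interval="$2" factor="$3" now="$4"
  # awk is used because timestamps may contain fractions of a second.
  awk -v l="$last_check" -v i="$check_interval" -v f="$factor" -v n="$now" \
    'BEGIN { exit !(l < n - f * i) }'
}

# A service last checked 200 seconds ago with a 60 second interval is late
# when two intervals are tolerated (200 > 2 * 60).
now="$(date +%s)"
is_check_late "$((now - 200))" 60 2 "$now" && echo "late"
```

Feed it with the `last_check` and `check_interval` attributes fetched from the REST API to build a list of late objects outside of the console.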
651
You can use the [icinga2 console](11-cli-commands.md#cli-command-console) to connect to the instance, fetch all data
and calculate the differences. More information can be found in [this blog post](https://icinga.com/2016/08/11/analyse-icinga-2-problems-using-the-console-api/).
654
655```
656# ICINGA2_API_USERNAME=root ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://localhost:5665/'
657
658<1> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res
659
660[ [ "10807-host!10807-service", "2016-06-10 15:54:55 +0200" ], [ "mbmif.int.netways.de!disk /", "2016-01-26 16:32:29 +0100" ] ]
661```
662
663Or if you are just interested in numbers, call [len](18-library-reference.md#array-len) on the result array `res`:
664
665```
666<2> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
667
6682.000000
669```
670
671If you need to analyze that problem multiple times, just add the current formatted timestamp
672and repeat the commands.
673
674```
675<23> => DateTime(get_time()).to_string()
676
677"2017-04-04 16:09:39 +0200"
678
679<24> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
680
6818287.000000
682```
683
684More details about the Icinga 2 DSL and its possibilities can be
685found in the [language](17-language-reference.md#language-reference) and [library](18-library-reference.md#library-reference) reference chapters.
686
687### Late Check Results in Distributed Environments <a id="late-check-results-distributed"></a>
688
In a distributed HA setup, each node is responsible for a load-balanced share of the checks.
Host and Service objects provide the attribute `paused`. If this is set to `false`, the current node
actively attempts to schedule and execute checks. Otherwise, the node is not responsible for the object.
692
693```
694<3> => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
695{
696  @false = 2.000000
697  @true = 1.000000
698}
699```
700
Why is this analysis important? If the numbers are not inverted in an HA zone
with two members, this may hint that the cluster nodes are in a split-brain scenario, or that you have
found a bug in the cluster.
704
705
706If you are running a cluster setup where the master/satellite executes checks on the client via
707[top down command endpoint](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint) mode,
708you might want to know which zones are affected.
709
This analysis assumes that services on clients which are not connected have the string `connected` in their
check result output and are in the `UNKNOWN` state.
712
713```
714<4> => var res = {}; for (s in get_objects(Service)) { if (s.state==3) { if (match("*connected*", s.last_check_result.output)) { res[s.zone] += [s.host_name] } } };  for (k => v in res) { res[k] = len(v.unique()) }; res
715
716{
717  Asia = 31.000000
718  Europe = 214.000000
719  USA = 207.000000
720}
721```
722
The result set shows the configured zones and the number of unique affected hosts in each zone. To print the
host names instead of the numbers, omit the `len()` call inside the for loop.
725
726## Notifications Troubleshooting <a id="troubleshooting-notifications"></a>
727
728### Notifications are not sent <a id="troubleshooting-notifications-not-sent"></a>
729
730* Check the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) to see if a notification is triggered.
731* If yes, verify that all conditions are satisfied.
732* Are any errors on the notification command execution logged?
733
Please add these details along with your own description
to any question or issue posted to the community channels.
736
737Verify the following configuration:
738
739* Is the host/service `enable_notifications` attribute set, and if so, to which value?
740* Do the [notification](09-object-types.md#objecttype-notification) attributes `states`, `types`, `period` match the notification conditions?
741* Do the [user](09-object-types.md#objecttype-user) attributes `states`, `types`, `period` match the notification conditions?
742* Are there any notification `begin` and `end` times configured?
743* Make sure the [notification](11-cli-commands.md#enable-features) feature is enabled.
744* Does the referenced NotificationCommand work when executed as Icinga user on the shell?
745
746If notifications are to be sent via mail, make sure that the mail program specified inside the
747[NotificationCommand object](09-object-types.md#objecttype-notificationcommand) exists.
The name and location depend on the distribution, so the preconfigured setting might have to be
changed on your system.
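
A quick way to verify this is to look up the binary in the `PATH`. The following sketch assumes that your notification script calls `mail` or `mailx`. Check the script for the actual command name. The helper name `first_available` is hypothetical.

```bash
# Hypothetical helper: print the path of the first command found in PATH.
first_available() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  return 1
}

# `mail` and `mailx` are assumptions; adjust them to your notification script.
first_available mail mailx || echo "no mail program found in PATH"
```

Run the lookup in the environment of the Icinga user, since its `PATH` may differ from the one in your interactive shell.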
750
751
752Examples:
753
754```
755# icinga2 feature enable notification
756The feature 'notification' is already enabled.
757```
758
759```bash
760icinga2 feature enable debuglog
761systemctl restart icinga2
762
763grep Notification /var/log/icinga2/debug.log > /root/analyze_notification_problem.log
764```
765
766You can use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live notification streams:
767
768```bash
769curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/events?queue=debugnotifications&types=Notification'
770```
771
772
773### Analyze Notification Result <a id="troubleshooting-notifications-result"></a>
774
775> **Note**
776>
> This feature is available since v2.11 and requires all endpoints
> to be updated.
779
780Notifications inside a HA enabled zone are balanced between the endpoints,
781just like checks.
782
Sometimes notifications fail, and looking into the (debug) logs
on both masters is not sufficient to correlate the failure correctly.
785
786The `last_notification_result` runtime attribute is stored and synced for Notification
787objects and can be queried via REST API.
788
789Example for retrieving the notification object and result from all `disk` services using a
790[regex match](18-library-reference.md#global-functions-regex) on the name:
791
792```
793$ curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/notifications' \
794-d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^disk" }, "attrs": [ "__name", "last_notification_result" ], "pretty": true }'
795{
796    "results": [
797
798        {
799            "attrs": {
800                "last_notification_result": {
801                    "active": true,
802                    "command": [
803                        "/etc/icinga2/scripts/mail-service-notification.sh",
804                        "-4",
805                        "",
806                        "-6",
807                        "",
808                        "-b",
809                        "",
810                        "-c",
811                        "",
812                        "-d",
813                        "2019-08-02 10:54:16 +0200",
814                        "-e",
815                        "disk",
816                        "-l",
817                        "icinga2-agent1.localdomain",
818                        "-n",
819                        "icinga2-agent1.localdomain",
820                        "-o",
821                        "DISK OK - free space: / 38108 MB (90.84% inode=100%);",
822                        "-r",
823                        "user@localdomain",
824                        "-s",
825                        "OK",
826                        "-t",
827                        "RECOVERY",
828                        "-u",
829                        "disk"
830                    ],
831                    "execution_end": 1564736056.186217,
832                    "execution_endpoint": "icinga2-master1.localdomain",
833                    "execution_start": 1564736056.132323,
834                    "exit_status": 0.0,
835                    "output": "",
836                    "type": "NotificationResult"
837                }
838            },
839            "joins": {},
840            "meta": {},
841            "name": "icinga2-agent1.localdomain!disk!mail-service-notification",
842            "type": "Notification"
843        }
844
845...
846
847    ]
848}
849```
850
851Example with the debug console:
852
853```
854$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/' --eval 'get_object(Notification, "icinga2-agent1.localdomain!disk!mail-service-notification").last_notification_result.execution_endpoint' | jq
855
"icinga2-master1.localdomain"
857```
858
859Whenever a notification command failed to execute, you can fetch the output as well.
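
One way to list only the failed notification commands is to filter the REST API response by `exit_status`. The following is a minimal sketch, assuming `jq` is installed (it is not part of Icinga 2). The function name `failed_notifications` is hypothetical, and it expects the JSON response of the `/v1/objects/notifications` query on standard input, with `last_notification_result` included in `attrs`.

```bash
# Hypothetical helper: read a /v1/objects/notifications response on stdin and
# print name, execution endpoint and output (tab-separated) for every
# notification whose last command exited with a non-zero status.
failed_notifications() {
    jq -r '.results[]
        | select(.attrs.last_notification_result != null)
        | select(.attrs.last_notification_result.exit_status != 0)
        | [ .name,
            .attrs.last_notification_result.execution_endpoint,
            .attrs.last_notification_result.output ]
        | @tsv'
}

# Usage (commented out, requires a running API and valid credentials):
# curl -k -s -u root:icinga -H 'Accept: application/json' \
#   'https://localhost:5665/v1/objects/notifications' | failed_notifications
```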
860
861
862## Feature Troubleshooting <a id="troubleshooting-features"></a>
863
864### Feature is not working <a id="feature-not-working"></a>
865
* Make sure that the feature configuration is enabled by symlinking from `features-available/`
to `features-enabled/` and that the latter is included in [icinga2.conf](04-configuration.md#icinga2-conf).
* Are the feature attributes set correctly according to the documentation?
* Are there any errors in the logs?
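
The symlink check from the first item can be scripted. This is a minimal sketch under the assumption of a Linux installation with the configuration in `/etc/icinga2` and one `<feature>.conf` symlink per enabled feature; the function name `feature_enabled` is hypothetical.

```bash
# Hypothetical helper: succeed if a feature is enabled via a valid symlink.
# $1: feature name, e.g. graphite
# $2: configuration directory, defaults to /etc/icinga2 (assumption)
feature_enabled() {
    local feature="$1"
    local conf_dir="${2:-/etc/icinga2}"
    local link="$conf_dir/features-enabled/$feature.conf"
    # -L: the path is a symlink; -e: the link target exists (not dangling)
    [ -L "$link" ] && [ -e "$link" ]
}

# Usage (commented out):
# feature_enabled graphite && echo "graphite is enabled"
```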
870
871Look up the [object type](09-object-types.md#object-types) for the required feature and verify it is enabled:
872
873```bash
874icinga2 object list --type <feature object type>
875```
876
877Example for the `graphite` feature:
878
879```bash
880icinga2 object list --type GraphiteWriter
881```
882
883Look into the log and check whether the feature logs anything specific for this matter.
884
885```bash
886grep GraphiteWriter /var/log/icinga2/icinga2.log
887```
888
889## REST API Troubleshooting <a id="troubleshooting-api"></a>
890
891In order to analyse errors on API requests, you can explicitly enable the [verbose parameter](12-icinga2-api.md#icinga2-api-parameters-global).
892
893```
894$ curl -k -s -u root:icinga -H 'Accept: application/json' -X DELETE 'https://localhost:5665/v1/objects/hosts/example-cmdb?pretty=1&verbose=1'
895{
896    "diagnostic_information": "Error: Object does not exist.\n\n ....",
897    "error": 404.0,
898    "status": "No objects found."
899}
900```
901
902### REST API Troubleshooting: No Objects Found <a id="troubleshooting-api-no-objects-found"></a>
903
904Please note that the `404` status with no objects being found can also originate
905from missing or too strict object permissions for the authenticated user.
906
This is a security feature to prevent object name guessing. If this were not the
case, restricted users would be able to get a list of names of your objects just by
trying every character combination.
910
911In order to analyse and fix the problem, please check the following:
912
- Use an administrative account with full permissions to check whether the objects are actually there.
- Verify the permissions on the affected ApiUser object and fix them.
915
916### Missing Runtime Objects (Hosts, Downtimes, etc.) <a id="troubleshooting-api-missing-runtime-objects"></a>
917
918Runtime objects consume the internal config packages shared with
919the REST API config packages. Each host, downtime, comment, service, etc. created
920via the REST API is stored in the `_api` package.
921
This includes downtimes and comments, which were sometimes stored in the wrong
directory path, because the active-stage file was empty/truncated/unreadable at
this point.
925
926Wrong:
927
928```
929/var/lib/icinga2/api/packages/_api//conf.d/downtimes/1234-5678-9012-3456.conf
930```
931
932Correct:
933
934```
935/var/lib/icinga2/api/packages/_api/dbe0bef8-c72c-4cc9-9779-da7c4527c5b2/conf.d/downtimes/1234-5678-9012-3456.conf
936```
937
938At creation time, the object lives in memory but its storage is broken. Upon restart,
939it is missing and e.g. a missing downtime will re-enable unwanted notifications.
940
`dbe0bef8-c72c-4cc9-9779-da7c4527c5b2` is the active stage name which wasn't correctly
read by the Icinga daemon. This information is stored in `/var/lib/icinga2/api/packages/_api/active-stage`.

v2.11 limits the direct active-stage file access (this is hidden from the user),
and caches active stages for packages in-memory.
946
947It also tries to repair the broken package, and logs a new message:
948
949```
950systemctl restart icinga2
951
952tail -f /var/log/icinga2/icinga2.log
953
954[2019-05-10 12:27:15 +0200] information/ConfigObjectUtility: Repairing config package '_api' with stage 'dbe0bef8-c72c-4cc9-9779-da7c4527c5b2'.
955```
956
If this does not happen, you can manually fix the broken config package and mark a deployed stage as active
again. Create a backup first, then carefully follow these steps:
959
960Navigate into the API package prefix.
961
962```bash
963cd /var/lib/icinga2/api/packages
964```
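
Since the following steps modify files in place, the package directory can be archived first. This is a minimal sketch; the function name `backup_api_package` and the default target directory `/root` are assumptions for illustration.

```bash
# Hypothetical helper: archive a config package directory before editing it.
# $1: package directory, defaults to the _api package
# $2: directory to store the archive in, defaults to /root (assumption)
# Prints the path of the created archive on success.
backup_api_package() {
    local pkg_dir="${1:-/var/lib/icinga2/api/packages/_api}"
    local dest_dir="${2:-/root}"
    local archive
    archive="$dest_dir/$(basename "$pkg_dir")-$(date +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$archive" -C "$(dirname "$pkg_dir")" "$(basename "$pkg_dir")" \
        && printf '%s\n' "$archive"
}

# Usage (commented out):
# backup_api_package
```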
965
966Change into the broken package directory and list all directories and files
967ordered by latest changes.
968
969```
970cd _api
971ls -lahtr
972
973drwx------  4 michi  wheel   128B Mar 27 14:39 ..
974-rw-r--r--  1 michi  wheel    25B Mar 27 14:39 include.conf
975-rw-r--r--  1 michi  wheel   405B Mar 27 14:39 active.conf
976drwx------  7 michi  wheel   224B Mar 27 15:01 dbe0bef8-c72c-4cc9-9779-da7c4527c5b2
977drwx------  5 michi  wheel   160B Apr 26 12:47 .
978```
979
As you can see, the `active-stage` file is missing. If it exists, verify that its content
is set to the name of the stage directory.

If you have more than one stage directory here, pick the most recently modified
directory. Copy the directory name `dbe0bef8-c72c-4cc9-9779-da7c4527c5b2` and
add it into a new file `active-stage`. This can be done like this:
986
987```bash
988echo "dbe0bef8-c72c-4cc9-9779-da7c4527c5b2" > active-stage
989```
990
`active.conf` needs to reference the correct active stage too. Set it again
like this. Note: This is deep down in the code, use with care!
993
994```bash
995sed -i 's/ActiveStages\["_api"\] = .*/ActiveStages\["_api"\] = "dbe0bef8-c72c-4cc9-9779-da7c4527c5b2"/g' /var/lib/icinga2/api/packages/_api/active.conf
996```
997
998Restart Icinga 2.
999
1000```bash
1001systemctl restart icinga2
1002```
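
To confirm that the package is consistent, a read-only check can compare the `active-stage` content with the existing stage directories. This is a minimal sketch; the function name `check_active_stage` is hypothetical, and the exit codes (0 for OK, 2 for CRITICAL) are a design choice borrowed from the plugin convention, not something Icinga 2 provides.

```bash
# Hypothetical helper: check that active-stage names an existing stage directory.
# $1: package directory, defaults to the _api package
# Returns 0 (OK) or 2 (CRITICAL).
check_active_stage() {
    local pkg_dir="${1:-/var/lib/icinga2/api/packages/_api}"
    local stage_file="$pkg_dir/active-stage"
    local stage

    if [ ! -s "$stage_file" ]; then
        echo "CRITICAL: '$stage_file' is missing or empty."
        return 2
    fi

    # Strip whitespace and the trailing newline around the stage name.
    stage="$(tr -d '[:space:]' < "$stage_file")"

    if [ -n "$stage" ] && [ -d "$pkg_dir/$stage" ]; then
        echo "OK: active stage '$stage' exists."
        return 0
    fi

    echo "CRITICAL: active stage '$stage' has no directory in '$pkg_dir'."
    return 2
}
```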
1003
1004
1005> **Note**
1006>
1007> The internal `_api` config package structure may change in the future. Do not modify
1008> things in there manually or with scripts unless guided here or asked by a developer.
1009
1010
1011## Certificate Troubleshooting <a id="troubleshooting-certificate"></a>
1012
1013Tools for analysing certificates and TLS connections:
1014
1015- `openssl` binary on Linux/Unix, `openssl.exe` on Windows ([download](https://slproweb.com/products/Win32OpenSSL.html))
1016- `sslscan` tool, available [here](https://github.com/rbsec/sslscan) (Linux/Windows)
1017
Note: You can also execute sslscan on Windows using PowerShell.
1019
1020
1021### Certificate Verification <a id="troubleshooting-certificate-verification"></a>
1022
Whenever the TLS handshake fails when a client connects to the cluster or the REST API,
make sure to verify the certificates in use.
1025
1026Print the CA and client certificate and ensure that the following attributes are set:
1027
1028* Version must be 3.
1029* Serial number is a hex-encoded string.
1030* Issuer should be your certificate authority (defaults to `Icinga CA` for all certificates generated by CLI commands and automated signing requests).
1031* Validity: The certificate must not be expired.
1032* Subject with the common name (CN) matches the client endpoint name and its FQDN.
1033* v3 extensions must set the basic constraint for `CA:TRUE` (ca.crt) or `CA:FALSE` (client certificate).
1034* Subject Alternative Name is set to the resolvable DNS name (required for REST API and browsers).
1035
1036Navigate into the local certificate store:
1037
1038```bash
1039cd /var/lib/icinga2/certs/
1040```
1041
1042Make sure to verify the agents' certificate and its stored `ca.crt` in `/var/lib/icinga2/certs` and ensure that
1043all instances (master, satellite, agent) are signed by the **same CA**.
1044
Compare the `ca.crt` file from the agent node with your master's `ca.crt` file.
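
One way to compare two `ca.crt` files is by their SHA-256 fingerprints, which ignores differences in file formatting. This is a minimal sketch, assuming the `openssl` binary is available and that the agent's `ca.crt` has been copied to the master; the function names and the path `/tmp/agent-ca.crt` are hypothetical.

```bash
# Hypothetical helper: print the SHA-256 fingerprint of a certificate file.
cert_fingerprint() {
    openssl x509 -noout -fingerprint -sha256 -in "$1"
}

# Hypothetical helper: succeed if both files contain the same certificate.
# Returns 0 if equal, 1 if different, 2 if a file could not be read.
same_certificate() {
    local fp_a fp_b
    fp_a="$(cert_fingerprint "$1")" || return 2
    fp_b="$(cert_fingerprint "$2")" || return 2
    [ "$fp_a" = "$fp_b" ]
}

# Usage (commented out, /tmp/agent-ca.crt is a copy of the agent's ca.crt):
# same_certificate /var/lib/icinga2/certs/ca.crt /tmp/agent-ca.crt && echo "same CA"
```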
1046
1047
1048Since 2.12, you can use the built-in CLI command `pki verify` to perform TLS certificate validation tasks.
1049
1050> **Hint**
1051>
1052> The CLI command uses exit codes aligned to the [Plugin API specification](05-service-monitoring.md#service-monitoring-plugin-api).
> Run the commands followed by `echo $?` to see the exit code.
1054
1055These CLI commands can be used on Windows agents too without requiring the OpenSSL binary.
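
Because the exit codes follow the plugin convention, they can be translated into state names in scripts. This is a minimal sketch; the function name `plugin_state_name` is hypothetical.

```bash
# Hypothetical helper: map a plugin-style exit code to its state name.
plugin_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "UNKNOWN (unexpected exit code $1)" ;;
    esac
}

# Usage (commented out, requires Icinga 2 v2.12 or later):
# icinga2 pki verify --cacert ca.crt >/dev/null; plugin_state_name $?
```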
1056
1057#### Print TLS Certificate <a id="troubleshooting-certificate-verification-print"></a>
1058
1059Pass the certificate file to the `--cert` CLI command parameter to print its details.
1060This prints a shorter version of `openssl x509 -in <file> -text`.
1061
1062```
1063$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt
1064
1065information/cli: Printing certificate 'icinga2-agent2.localdomain.crt'
1066
1067 Version:             3
1068 Subject:             CN = icinga2-agent2.localdomain
1069 Issuer:              CN = Icinga CA
1070 Valid From:          Feb 14 11:29:36 2020 GMT
1071 Valid Until:         Feb 10 11:29:36 2035 GMT
1072 Serial:              12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0
1073
1074 Signature Algorithm: sha256WithRSAEncryption
1075 Subject Alt Names:   icinga2-agent2.localdomain
1076 Fingerprint:         40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
1077```
1078
1079You can also print the `ca.crt` certificate without any further checks using the `--cert` parameter.
1080
1081#### Print and Verify CA Certificate <a id="troubleshooting-certificate-verification-print-verify-ca"></a>
1082
The `--cacert` CLI parameter allows you to check whether the given certificate file is a public CA certificate.
1084
1085```
1086$ icinga2 pki verify --cacert ca.crt
1087
1088information/cli: Checking whether certificate 'ca.crt' is a valid CA certificate.
1089
1090 Version:             3
1091 Subject:             CN = Icinga CA
1092 Issuer:              CN = Icinga CA
1093 Valid From:          Jul 31 12:26:08 2019 GMT
1094 Valid Until:         Jul 27 12:26:08 2034 GMT
1095 Serial:              89:fe:d6:12:66:25:3a:c5:07:c1:eb:d4:e6:f2:df:ca:13:6e:dc:e7
1096
1097 Signature Algorithm: sha256WithRSAEncryption
1098 Subject Alt Names:
1099 Fingerprint:         9A 11 29 A8 A3 89 F8 56 30 1A E4 0A B2 6B 28 46 07 F0 14 17 BD 19 A4 FC BD 41 40 B5 1A 8F BF 20
1100
1101information/cli: OK: CA certificate file 'ca.crt' was verified successfully.
1102```
1103
1104In case you pass a wrong certificate, an error is shown and the exit code is `2` (Critical).
1105
1106```
1107$ icinga2 pki verify --cacert icinga2-agent2.localdomain.crt
1108
1109information/cli: Checking whether certificate 'icinga2-agent2.localdomain.crt' is a valid CA certificate.
1110
1111 Version:             3
1112 Subject:             CN = icinga2-agent2.localdomain
1113 Issuer:              CN = Icinga CA
1114 Valid From:          Feb 14 11:29:36 2020 GMT
1115 Valid Until:         Feb 10 11:29:36 2035 GMT
1116 Serial:              12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0
1117
1118 Signature Algorithm: sha256WithRSAEncryption
1119 Subject Alt Names:   icinga2-agent2.localdomain
1120 Fingerprint:         40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
1121
1122critical/cli: CRITICAL: The file 'icinga2-agent2.localdomain.crt' does not seem to be a CA certificate file.
1123```
1124
1125#### Verify Certificate is signed by CA Certificate <a id="troubleshooting-certificate-verification-signed-by-ca"></a>
1126
1127Pass the certificate file to the `--cert` CLI parameter, and the `ca.crt` file to the `--cacert` parameter.
1128Common troubleshooting scenarios involve self-signed certificates and untrusted agents resulting in disconnects.
1129
1130```
1131$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cacert ca.crt
1132
1133information/cli: Verifying certificate 'icinga2-agent2.localdomain.crt'
1134
1135 Version:             3
1136 Subject:             CN = icinga2-agent2.localdomain
1137 Issuer:              CN = Icinga CA
1138 Valid From:          Feb 14 11:29:36 2020 GMT
1139 Valid Until:         Feb 10 11:29:36 2035 GMT
1140 Serial:              12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0
1141
1142 Signature Algorithm: sha256WithRSAEncryption
1143 Subject Alt Names:   icinga2-agent2.localdomain
1144 Fingerprint:         40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
1145
1146information/cli:  with CA certificate 'ca.crt'.
1147
1148 Version:             3
1149 Subject:             CN = Icinga CA
1150 Issuer:              CN = Icinga CA
1151 Valid From:          Jul 31 12:26:08 2019 GMT
1152 Valid Until:         Jul 27 12:26:08 2034 GMT
1153 Serial:              89:fe:d6:12:66:25:3a:c5:07:c1:eb:d4:e6:f2:df:ca:13:6e:dc:e7
1154
1155 Signature Algorithm: sha256WithRSAEncryption
1156 Subject Alt Names:
1157 Fingerprint:         9A 11 29 A8 A3 89 F8 56 30 1A E4 0A B2 6B 28 46 07 F0 14 17 BD 19 A4 FC BD 41 40 B5 1A 8F BF 20
1158
1159information/cli: OK: Certificate with CN 'icinga2-agent2.localdomain' is signed by CA.
1160```
1161
1162#### Verify Certificate matches Common Name (CN) <a id="troubleshooting-certificate-verification-common-name-match"></a>
1163
This allows you to verify the common name inside the certificate against a given string parameter.
Typical troubleshooting scenarios involve upper/lower case CNs (Windows).
1166
1167```
1168$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cn icinga2-agent2.localdomain
1169
1170information/cli: Verifying common name (CN) 'icinga2-agent2.localdomain in certificate 'icinga2-agent2.localdomain.crt'.
1171
1172 Version:             3
1173 Subject:             CN = icinga2-agent2.localdomain
1174 Issuer:              CN = Icinga CA
1175 Valid From:          Feb 14 11:29:36 2020 GMT
1176 Valid Until:         Feb 10 11:29:36 2035 GMT
1177 Serial:              12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0
1178
1179 Signature Algorithm: sha256WithRSAEncryption
1180 Subject Alt Names:   icinga2-agent2.localdomain
1181 Fingerprint:         40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
1182
1183information/cli: OK: CN 'icinga2-agent2.localdomain' matches certificate CN 'icinga2-agent2.localdomain'.
1184```
1185
1186In the example below, the certificate uses an upper case CN.
1187
1188```
1189$ icinga2 pki verify --cert icinga2-agent2.localdomain.crt --cn icinga2-agent2.localdomain
1190
1191information/cli: Verifying common name (CN) 'icinga2-agent2.localdomain in certificate 'icinga2-agent2.localdomain.crt'.
1192
1193 Version:             3
1194 Subject:             CN = ICINGA2-agent2.localdomain
1195 Issuer:              CN = Icinga CA
1196 Valid From:          Feb 14 11:29:36 2020 GMT
1197 Valid Until:         Feb 10 11:29:36 2035 GMT
1198 Serial:              12:fe:a6:22:f5:e3:db:a2:95:8e:92:b2:af:1a:e3:01:44:c4:70:e0
1199
1200 Signature Algorithm: sha256WithRSAEncryption
1201 Subject Alt Names:   ICINGA2-agent2.localdomain
1202 Fingerprint:         40 98 A0 77 58 4F CA D1 05 AC 18 53 D7 52 8D D7 9C 7F 5A 23 B4 AF 63 A4 92 9D DC FF 89 EF F1 4C
1203
critical/cli: CRITICAL: CN 'icinga2-agent2.localdomain' does NOT match certificate CN 'ICINGA2-agent2.localdomain'.
1205```
1206
1207
1208
1209### Certificate Signing <a id="troubleshooting-certificate-signing"></a>
1210
1211Icinga offers two methods:
1212
1213* [CSR Auto-Signing](06-distributed-monitoring.md#distributed-monitoring-setup-csr-auto-signing) which uses a client (an agent or a satellite) ticket generated on the master as trust identifier.
* [On-Demand CSR Signing](06-distributed-monitoring.md#distributed-monitoring-setup-on-demand-csr-signing) which allows you to sign pending certificate requests on the master.

Whenever a signed certificate is not received on the requesting client, check the following:
1217
1218* The ticket was valid and the master's log shows nothing different (CSR Auto-Signing only)
* If the agent/satellite is directly connected to the CA master, check whether the master has performance problems which prevent it from processing the request. If the connection is closed without a certificate response, analyse the master's health. It is also advised to upgrade to v2.11 where network stack problems have been fixed.
1220* If you're using a 3+ level cluster, check whether the satellite really forwarded the CSR signing request and the master processed it.
1221
1222Other common errors:
1223
* The generated ticket is invalid. The client receives this error message, and the master logs a warning message.
1225* The [api](09-object-types.md#objecttype-apilistener) feature does not have the `ticket_salt` attribute set to the generated `TicketSalt` constant by the CLI wizards.
1226
1227In case you are using On-Demand CSR Signing, `icinga2 ca list` on the master only lists
1228pending requests since v2.11. Add `--all` to also see signed requests. Keep in mind that
1229old requests are purged after 1 week automatically.
1230
1231
1232### TLS Handshake: Ciphers <a id="troubleshooting-certificate-handshake-ciphers"></a>
1233
1234Starting with v2.11, the default configured ciphers have been hardened to modern
1235standards. This includes TLS v1.2 as minimum protocol version too.
1236
1237In case the TLS handshake fails with `no shared cipher`, first analyse whether both
1238instances support the same ciphers.
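
If you have the cipher lists of both instances at hand, their intersection can be computed with a few lines of shell. This is a minimal sketch; the function name `shared_ciphers` is hypothetical, and it only handles plain colon-separated cipher names, not OpenSSL selector expressions such as `!aNULL`.

```bash
# Hypothetical helper: print the cipher names contained in both lists.
# $1, $2: colon-separated cipher lists (plain names only)
shared_ciphers() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r cipher; do
        [ -n "$cipher" ] || continue
        # Wrap the second list in colons so only whole names match.
        case ":$2:" in
            *":$cipher:"*) printf '%s\n' "$cipher" ;;
        esac
    done
}

# Usage (commented out): compare the locally available ciphers with a list
# taken from the remote instance's configuration.
# shared_ciphers "$(openssl ciphers)" 'ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256'
```

An empty result would suggest that the handshake has no cipher to agree on.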
1239
1240#### Client connects to Server <a id="troubleshooting-certificate-handshake-ciphers-client"></a>
1241
1242Connect using `openssl s_client` and try to reproduce the connection problem.
1243
1244> **Important**
1245>
1246> The endpoint with the server role **accepting** the connection picks the preferred
1247> cipher. E.g. when a satellite connects to the master, the master chooses the cipher.
1248>
> Keep this in mind when simulating the client role connecting to a server with
> CLI tools such as `openssl s_client`.
1251
1252
1253`openssl s_client` tells you about the supported and shared cipher suites
1254on the remote server. `openssl ciphers` lists locally available ciphers.
1255
1256```
1257$ openssl s_client -connect 192.168.33.5:5665
1258...
1259
1260---
1261SSL handshake has read 2899 bytes and written 786 bytes
1262---
1263New, TLSv1/SSLv3, Cipher is AES256-GCM-SHA384
1264Server public key is 4096 bit
1265Secure Renegotiation IS supported
1266Compression: NONE
1267Expansion: NONE
1268No ALPN negotiated
1269SSL-Session:
1270    Protocol  : TLSv1.2
1271    Cipher    : AES256-GCM-SHA384
1272
1273...
1274```
1275
1276You can specifically use one cipher or a list with the `-cipher` parameter:
1277
1278```bash
1279openssl s_client -connect 192.168.33.5:5665 -cipher 'ECDHE-RSA-AES256-GCM-SHA384'
1280```
1281
1282In order to fully simulate a connecting client, provide the certificates too:
1283
1284```bash
1285CERTPATH='/var/lib/icinga2/certs'
1286HOSTNAME='icinga2.vagrant.demo.icinga.com'
1287openssl s_client -connect 192.168.33.5:5665 -cert "${CERTPATH}/${HOSTNAME}.crt" -key "${CERTPATH}/${HOSTNAME}.key" -CAfile "${CERTPATH}/ca.crt" -cipher 'ECDHE-RSA-AES256-GCM-SHA384'
1288```
1289
In case you need to change the default cipher list,
set the [cipher_list](09-object-types.md#objecttype-apilistener) attribute
in the `api` feature configuration accordingly.

Beware of using insecure ciphers, as this may become a
security risk in your organisation.
1296
1297#### Server Accepts Client <a id="troubleshooting-certificate-handshake-ciphers-server"></a>
1298
If the master node does not actively connect to the satellite/agent node(s), but instead
the child node actively connects, you can still simulate a TLS handshake.
1301
1302Use `openssl s_server` instead of `openssl s_client` on the master during the connection
1303attempt.
1304
1305```bash
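# Listen on port 5665 (must be free, e.g. stop Icinga 2 first); adjust the certificate paths to your master.
openssl s_server -accept 5665 -cert /var/lib/icinga2/certs/icinga2-master1.localdomain.crt -key /var/lib/icinga2/certs/icinga2-master1.localdomain.key -CAfile /var/lib/icinga2/certs/ca.crt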
1307```
1308
1309Since the server role chooses the preferred cipher suite in Icinga,
1310you can test-drive the "agent connects to master" mode here, granted that
1311the TCP connection is not blocked by the firewall.
1312
1313
1314#### Cipher Scan Tools <a id="troubleshooting-certificate-handshake-ciphers-scantools"></a>
1315
You can also use different tools to test the available cipher suites. This is what SSL Labs, etc.
provide for TLS enabled websites as well. [This post](https://superuser.com/questions/109213/how-do-i-list-the-ssl-tls-cipher-suites-a-particular-website-offers)
highlights some tools and scripts such as [sslscan](https://github.com/rbsec/sslscan) or [testssl.sh](https://github.com/drwetter/testssl.sh/).
1319
1320Example for sslscan on macOS against a Debian 10 Buster instance
1321running v2.11:
1322
1323```
1324$ brew install sslscan
1325
1326$ sslscan 192.168.33.22:5665
1327Version: 1.11.13-static
1328OpenSSL 1.0.2f  28 Jan 2016
1329
1330Connected to 192.168.33.22
1331
1332Testing SSL server 192.168.33.22 on port 5665 using SNI name 192.168.33.22
1333
1334  TLS Fallback SCSV:
1335Server supports TLS Fallback SCSV
1336
1337  TLS renegotiation:
1338Session renegotiation not supported
1339
1340  TLS Compression:
1341Compression disabled
1342
1343  Heartbleed:
1344TLS 1.2 not vulnerable to heartbleed
1345TLS 1.1 not vulnerable to heartbleed
1346TLS 1.0 not vulnerable to heartbleed
1347
1348  Supported Server Cipher(s):
1349Preferred TLSv1.2  256 bits  ECDHE-RSA-AES256-GCM-SHA384   Curve P-256 DHE 256
1350Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-GCM-SHA256   Curve P-256 DHE 256
1351Accepted  TLSv1.2  256 bits  ECDHE-RSA-AES256-SHA384       Curve P-256 DHE 256
1352Accepted  TLSv1.2  128 bits  ECDHE-RSA-AES128-SHA256       Curve P-256 DHE 256
1353
1354  SSL Certificate:
1355Signature Algorithm: sha256WithRSAEncryption
1356RSA Key Strength:    4096
1357
1358Subject:  icinga2-debian10.vagrant.demo.icinga.com
1359Altnames: DNS:icinga2-debian10.vagrant.demo.icinga.com
1360Issuer:   Icinga CA
1361
1362Not valid before: Jul 12 07:39:55 2019 GMT
1363Not valid after:  Jul  8 07:39:55 2034 GMT
1364```
1365
1366## Distributed Troubleshooting <a id="troubleshooting-cluster"></a>
1367
1368This applies to any Icinga 2 node in a [distributed monitoring setup](06-distributed-monitoring.md#distributed-monitoring-scenarios).
1369
1370You should configure the [cluster health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks) if you haven't
1371done so already.
1372
1373> **Note**
1374>
1375> Some problems just exist due to wrong file permissions or applied packet filters. Make
1376> sure to check these in the first place.
1377
1378### Cluster Troubleshooting Connection Errors <a id="troubleshooting-cluster-connection-errors"></a>
1379
1380General connection errors could be one of the following problems:
1381
1382* Incorrect network configuration
1383* Packet loss
1384* Firewall rules preventing traffic
1385
1386Use tools like `netstat`, `tcpdump`, `nmap`, etc. to make sure that the cluster communication
1387works (default port is `5665`).
1388
1389```bash
1390tcpdump -n port 5665 -i any
1391
1392netstat -tulpen | grep icinga
1393
1394nmap icinga2-agent1.localdomain
1395```
1396
1397### Cluster Troubleshooting TLS Errors <a id="troubleshooting-cluster-tls-errors"></a>
1398
If the cluster communication fails with TLS/SSL error messages, make sure to check
the following:
1401
1402* File permissions on the TLS certificate files
1403* Does the used CA match for all cluster endpoints?
1404  * Verify the `Issuer` being your trusted CA
1405  * Verify the `Subject` containing your endpoint's common name (CN)
1406  * Check the validity of the certificate itself
1407
1408Try to manually connect from `icinga2-agent1.localdomain` to the master node `icinga2-master1.localdomain`:
1409
1410```
1411$ openssl s_client -CAfile /var/lib/icinga2/certs/ca.crt -cert /var/lib/icinga2/certs/icinga2-agent1.localdomain.crt -key /var/lib/icinga2/certs/icinga2-agent1.localdomain.key -connect icinga2-master1.localdomain:5665
1412
1413CONNECTED(00000003)
1414---
1415...
1416```
1417
1418If the connection attempt fails or your CA does not match, [verify the certificates](15-troubleshooting.md#troubleshooting-certificate-verification).
1419
1420
1421#### Cluster Troubleshooting Unauthenticated Clients <a id="troubleshooting-cluster-unauthenticated-clients"></a>
1422
1423Unauthenticated nodes are able to connect. This is required for agent/satellite setups.
1424
1425Master:
1426
1427```
1428[2015-07-13 18:29:25 +0200] information/ApiListener: New client connection for identity 'icinga2-agent1.localdomain' (unauthenticated)
1429```
1430
1431Agent as command execution bridge:
1432
1433```
1434[2015-07-13 18:29:26 +1000] notice/ClusterEvents: Discarding 'execute command' message from 'icinga2-master1.localdomain': Invalid endpoint origin (client not allowed).
1435```
1436
1437If these messages do not go away, make sure to [verify the master and agent certificates](15-troubleshooting.md#troubleshooting-certificate-verification).
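
To see which identities keep connecting without authentication, the master's log can be summarised. This is a minimal sketch, assuming the log message format shown above and the default log path; the function name `unauthenticated_clients` is hypothetical.

```bash
# Hypothetical helper: count unauthenticated client connections per identity.
# $1: log file, defaults to the main log (assumption)
# Output: one line per identity, prefixed with the number of occurrences.
unauthenticated_clients() {
    local logfile="${1:-/var/log/icinga2/icinga2.log}"
    sed -n "s/.*New client connection for identity '\([^']*\)' (unauthenticated).*/\1/p" "$logfile" \
        | sort | uniq -c | sort -rn
}

# Usage (commented out):
# unauthenticated_clients
```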
1438
1439
1440### Cluster Troubleshooting Message Errors <a id="troubleshooting-cluster-message-errors"></a>
1441
When the network connection is broken or gone, the Icinga 2 instances will be disconnected.
If the connection can't be re-established between endpoints in the same HA zone,
they remain in split-brain mode and their history may differ.
1445
1446Although the Icinga 2 cluster protocol stores historical events in a [replay log](15-troubleshooting.md#troubleshooting-cluster-replay-log)
1447for later synchronisation, you should make sure to check why the network connection failed.
1448
Make sure to set up [cluster health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks)
to monitor the connectivity of all endpoints and zones.
1451
1452
1453### Cluster Troubleshooting Command Endpoint Errors <a id="troubleshooting-cluster-command-endpoint-errors"></a>
1454
1455Command endpoints can be used [for agents](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint)
as well as inside a [High-Availability cluster](06-distributed-monitoring.md#distributed-monitoring-scenarios).
1457
1458There is no CLI command for manually executing the check, but you can verify
1459the following (e.g. by invoking a forced check from the web interface):
1460
1461* `/var/log/icinga2/icinga2.log` shows connection and execution errors.
    * The ApiListener is not enabled to [accept commands](06-distributed-monitoring.md#distributed-monitoring-top-down-command-endpoint). This is visible as `UNKNOWN` check result output.
    * `CheckCommand` definition not found on the remote client. This is visible as `UNKNOWN` check result output.
    * Referenced check plugin not found on the remote agent.
    * Runtime warnings and errors, e.g. unresolved runtime macros or configuration problems.
1466* Specific error messages are also populated into `UNKNOWN` check results including a detailed error message in their output.
1467* Verify the [check source](15-troubleshooting.md#checks-check-source). This is populated by the node executing the check. You can see that in Icinga Web's detail view or by querying the REST API for this checkable object.
1468
1469Additional tasks:
1470
1471* More verbose logs are found inside the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output).
1472
1473* Use the Icinga 2 API [event streams](12-icinga2-api.md#icinga2-api-event-streams) to receive live check result streams.
1474
1475Fetch all check result events matching the `event.service` name `remote-client`:
1476
1477```bash
1478curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/events?queue=debugcommandendpoint&types=CheckResult&filter=match%28%22remote-client*%22,event.service%29'
1479```
1480
1481
1482#### Agent Hosts with Command Endpoint require a Zone <a id="troubleshooting-cluster-command-endpoint-errors-agent-hosts-command-endpoint-zone"></a>
1483
v2.11 fixes bugs where agent host checks would never be scheduled on
the master. One requirement is that the checkable host/service
is put into a zone.
1487
1488By default, the Director puts the agent host in `zones.d/master`
1489and you're good to go. If you manually manage the configuration,
1490the config compiler now throws an error with `command_endpoint`
1491being set but no `zone` defined.
1492
In case you previously managed the configuration outside of `zones.d`,
follow the instructions below.

If you manage the objects in `conf.d`, for example, the most convenient way
is to move them into the `master` zone.
1498
1499First, verify the name of your endpoint's zone. The CLI wizards
1500use `master` by default.
1501
1502```
1503vim /etc/icinga2/zones.conf
1504
1505object Zone "master" {
1506  ...
1507}
1508```
1509
Then create a new directory in `zones.d` called `master`, if it does not exist yet.
1511
1512```bash
1513mkdir -p /etc/icinga2/zones.d/master
1514```
1515
1516Now move the directory tree from `conf.d` into the `master` zone.
1517
1518```bash
mv /etc/icinga2/conf.d/* /etc/icinga2/zones.d/master/
1520```
1521
Validate the configuration and restart Icinga.
1523
1524```bash
1525icinga2 daemon -C
1526systemctl restart icinga2
1527```
1528
Another method is to specify the `zone` attribute manually. Since
this may lead to other unwanted "not checked" scenarios, we don't
recommend it for your production environment.
1532
1533### Cluster Troubleshooting Config Sync <a id="troubleshooting-cluster-config-sync"></a>
1534
To troubleshoot the config sync, keep its key principles in mind:
1536
1537* Within a config master zone, only one configuration master is allowed to have its config in `/etc/icinga2/zones.d`.
1538    * The config master copies the zone configuration from `/etc/icinga2/zones.d` to `/var/lib/icinga2/api/zones`. This storage is the same for all cluster endpoints, and the source for all config syncs.
    * The config master puts the `.authoritative` marker on these zone files locally. This ensures that it doesn't receive config updates from other endpoints. If you have copied the content from `/var/lib/icinga2/api/zones` to another node, make sure to remove these markers there.
1540* During startup, the master validates the entire configuration and only syncs valid configuration to other zone endpoints.
1541
1542Satellites/Agents < 2.11 store the received configuration directly in `/var/lib/icinga2/api/zones`, validating it and reloading the daemon.
1543Satellites/Agents >= 2.11 put the received configuration into the staging directory `/var/lib/icinga2/api/zones-stage` first, and will only copy this to the production directory `/var/lib/icinga2/api/zones` once the validation was successful.
1544
1545The configuration sync logs the operations during startup with the `information` severity level. Received zone configuration is also logged.
1546
1547Typical errors are:
1548
* The api feature doesn't [accept config](06-distributed-monitoring.md#distributed-monitoring-top-down-config-sync). This is logged into `/var/log/icinga2/icinga2.log`.
* The received configuration zone is not configured in [zones.conf](04-configuration.md#zones-conf) and Icinga denies it. This is logged into `/var/log/icinga2/icinga2.log`.
* The satellite/agent has local configuration in `/etc/icinga2/zones.d` and thinks it is authoritative for this zone. It then denies the received update. Purge the content from `/etc/icinga2/zones.d` and `/var/lib/icinga2/api/zones/*`, then restart Icinga to fix this.
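
The first error in the list above can be ruled out quickly on the satellite/agent. The following is a minimal sketch, assuming the api feature is configured in `/etc/icinga2/features-enabled/api.conf` (adjust the path if your setup differs). The function name `accepts_config` is made up for this example and is not part of Icinga 2.

```bash
# Hypothetical helper: succeeds if the given api feature file contains
# an uncommented `accept_config = true` line.
accepts_config() {
  local api_conf="$1"
  grep -Eq '^[[:space:]]*accept_config[[:space:]]*=[[:space:]]*true' "$api_conf"
}

# Example usage (the path is an assumption, see above):
# if accepts_config /etc/icinga2/features-enabled/api.conf; then
#   echo "config sync is accepted"
# else
#   echo "accept_config is not enabled"
# fi
```

This only inspects one file as text. If the attribute is set elsewhere or through a constant, `icinga2 object list --type ApiListener` is the more reliable source.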
1552
1553#### New configuration does not trigger a reload <a id="troubleshooting-cluster-config-sync-no-reload"></a>
1554
1555The debug/notice log dumps the calculated checksums for all files and the comparison. Analyse this to troubleshoot further.
1556
1557A complete sync for the `director-global` global zone can look like this:
1558
1559```
1560[2019-08-01 09:20:25 +0200] notice/JsonRpcConnection: Received 'config::Update' message from 'icinga2-master1.localdomain'
1561[2019-08-01 09:20:25 +0200] information/ApiListener: Applying config update from endpoint 'icinga2-master1.localdomain' of zone 'master'.
1562[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/.checksums'.
1563[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/.timestamp'.
1564[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/director/001-director-basics.conf'.
1565[2019-08-01 09:20:25 +0200] notice/ApiListener: Creating config update for file '/var/lib/icinga2/api/zones/director-global/director/host_templates.conf'.
1566[2019-08-01 09:20:25 +0200] information/ApiListener: Received configuration for zone 'director-global' from endpoint 'icinga2-master1.localdomain'. Comparing the checksums.
[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking for config change between stage and production. Old (4): '{"/.checksums":"c4dd1237e36dcad9142f4d9a81324a7cae7d01543a672299b8c1bb08b629b7d1","/.timestamp":"f21c0e6551328812d9f5176e5e31f390de0d431d09800a85385630727b404d83","/director/001-director-basics.conf":"f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f","/director/host_templates.conf":"831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc"}' vs. new (4): '{"/.checksums":"c4dd1237e36dcad9142f4d9a81324a7cae7d01543a672299b8c1bb08b629b7d1","/.timestamp":"f21c0e6551328812d9f5176e5e31f390de0d431d09800a85385630727b404d83","/director/001-director-basics.conf":"f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f","/director/host_templates.conf":"831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc"}'.
1572[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring old internal file '/.checksums'.
1573[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring old internal file '/.timestamp'.
1574[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/001-director-basics.conf for old checksum: f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f.
1575[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/host_templates.conf for old checksum: 831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc.
1576[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring new internal file '/.checksums'.
1577[2019-08-01 09:20:25 +0200] debug/ApiListener: Ignoring new internal file '/.timestamp'.
1578[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/001-director-basics.conf for new checksum: f86583eec81c9bf3a1823a761991fb53d640bd0dc6cd12bf8c5e6a275359970f.
1579[2019-08-01 09:20:25 +0200] debug/ApiListener: Checking /director/host_templates.conf for new checksum: 831e9b7e3ec1e33288e56a51e63c688da1d6316155349382a101f7fce6229ecc.
[2019-08-01 09:20:25 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/001-director-basics.conf' for zone 'director-global'.
[2019-08-01 09:20:25 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/director-global//director/host_templates.conf' for zone 'director-global'.
1584[2019-08-01 09:20:25 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/director-global' (2209 Bytes).
1585
1586...
1587
1588[2019-08-01 09:20:25 +0200] information/ApiListener: Received configuration updates (4) from endpoint 'icinga2-master1.localdomain' are different to production, triggering validation and reload.
[2019-08-01 09:20:25 +0200] notice/Process: Running command '/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '--close-stdio' '-e' '/var/log/icinga2/error.log' '--validate' '--define' 'System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/': PID 4532
[2019-08-01 09:20:25 +0200] notice/Process: PID 4532 ('/usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2' '--no-stack-rlimit' 'daemon' '--close-stdio' '-e' '/var/log/icinga2/error.log' '--validate' '--define' 'System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/') terminated with exit code 0
[2019-08-01 09:20:25 +0200] information/ApiListener: Config validation for stage '/var/lib/icinga2/api/zones-stage/' was OK, replacing into '/var/lib/icinga2/api/zones/' and triggering reload.
1595[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//.checksums' from config sync staging to production zones directory.
1596[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//.timestamp' from config sync staging to production zones directory.
1597[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//director/001-director-basics.conf' from config sync staging to production zones directory.
1598[2019-08-01 09:20:26 +0200] information/ApiListener: Copying file 'director-global//director/host_templates.conf' from config sync staging to production zones directory.
1599
1600...
1601
1602[2019-08-01 09:20:26 +0200] notice/Application: Got reload command, forwarding to umbrella process (PID 4236)
1603```
1604
1605In case the received configuration updates are equal to what is running in production, a different message is logged and the validation/reload is skipped.
1606
1607```
1608[2020-02-05 15:18:19 +0200] information/ApiListener: Received configuration updates (4) from endpoint 'icinga2-master1.localdomain' are equal to production, skipping validation and reload.
1609```
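
To see which of the two outcomes happened last, you can search the log for the message shown in the examples above. This is a rough sketch: the function name `last_sync_result` is hypothetical, and the log path in the usage comment assumes the main log in `/var/log/icinga2`.

```bash
# Hypothetical helper: report the most recent config sync decision
# found in the given log file.
last_sync_result() {
  local logfile="$1" line
  # `|| true` keeps the function usable under `set -e` when nothing matches.
  line=$(grep -F 'Received configuration updates' "$logfile" | tail -n 1) || true
  case "$line" in
    *"are different to production"*) echo "reload" ;;
    *"are equal to production"*)     echo "skipped" ;;
    *)                               echo "unknown" ;;
  esac
}

# Example usage (the path is an assumption):
# last_sync_result /var/log/icinga2/icinga2.log
```

A result of `unknown` means that no matching message was found, for example because the log was rotated or no config update has been received yet.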
1610
1611
1612#### Syncing Binary Files is Denied <a id="troubleshooting-cluster-config-sync-binary-denied"></a>
1613
1614The config sync is built for syncing text configuration files, wrapped into JSON-RPC messages.
1615Some users have started to use this as binary file sync instead of using tools built for this:
1616rsync, git, Puppet, Ansible, etc.
1617
Starting with 2.11, such files are ignored by the config sync and the attempt is logged.
1619
1620```
1621[2019-08-02 16:03:19 +0200] critical/ApiListener: Ignoring file '/etc/icinga2/zones.d/global-templates/forbidden.exe' for cluster config sync: Does not contain valid UTF8. Binary files are not supported.
1622Context:
1623	(0) Creating config update for file '/etc/icinga2/zones.d/global-templates/forbidden.exe'
1624	(1) Activating object 'api' of type 'ApiListener'
1625```
1626
1627In order to solve this problem, remove the mentioned files from `zones.d` and use an alternate way
1628of syncing plugin binaries to your satellites and agents.
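
To find such files before the daemon complains, you can scan `zones.d` for content that is not valid UTF-8. The sketch below uses `iconv` as a validity check, which is an approximation and not necessarily identical to the check Icinga performs. The function name `find_non_utf8_files` is made up for this example.

```bash
# Hypothetical helper: print every file below a directory whose content
# cannot be decoded as UTF-8.
find_non_utf8_files() {
  local dir="$1" file
  find "$dir" -type f -print | while IFS= read -r file; do
    # iconv exits non-zero when it hits an invalid UTF-8 byte sequence.
    if ! iconv -f UTF-8 -t UTF-8 "$file" >/dev/null 2>&1; then
      printf '%s\n' "$file"
    fi
  done
}

# Example usage:
# find_non_utf8_files /etc/icinga2/zones.d
```

File names containing newlines are not handled by this sketch.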
1629
1630
1631#### Zones in Zones doesn't work <a id="troubleshooting-cluster-config-zones-in-zones"></a>
1632
The cluster config sync only includes a zone directory from
`/etc/icinga2/zones.d` if that zone is configured outside of `zones.d`,
e.g. in `/etc/icinga2/zones.conf`.

If you create a "Zone Inception", for example by defining the
`satellite` zone in `zones.d/master`, the config compiler does not
re-run and include this zone config recursively from `zones.d/satellite`.

Since v2.11, the config compiler only includes directories for which a
zone has been configured. Previously it would also include renamed old zones,
broken zones, etc. These long-standing bugs have now been fixed.
1644
1645A more concrete example: Masters and Satellites still need to know the Zone hierarchy outside of `zones.d` synced configuration.
1646
1647**Doesn't work**
1648
1649```
1650vim /etc/icinga2/zones.conf
1651
1652object Zone "master" {
1653  endpoints = [ "icinga2-master1.localdomain", "icinga2-master2.localdomain" ]
1654}
1655```
1656
1657```
1658vim /etc/icinga2/zones.d/master/satellite-zones.conf
1659
1660object Zone "satellite" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
1662}
1663```
1664
1665```
1666vim /etc/icinga2/zones.d/satellite/satellite-hosts.conf
1667
1668object Host "agent" { ... }
1669```
1670
1671The `agent` host object will never reach the satellite, since the master does not have
1672the `satellite` zone configured outside of zones.d.
1673
1674
1675**Works**
1676
Each instance needs to know the zone hierarchy. Start by defining the endpoints:
1678
1679```
1680vim /etc/icinga2/zones.conf
1681
1682object Endpoint "icinga2-master1.localdomain" { ... }
1683object Endpoint "icinga2-master2.localdomain" { ... }
1684
1685object Endpoint "icinga2-satellite1.localdomain" { ... }
1686object Endpoint "icinga2-satellite2.localdomain" { ... }
1687```
1688
Then define the zone hierarchy. It is required both for the trust relationship and for the config sync inclusion.
1690
1691```
1692vim /etc/icinga2/zones.conf
1693
1694object Zone "master" {
1695  endpoints = [ "icinga2-master1.localdomain", "icinga2-master2.localdomain" ]
1696}
1697
1698object Zone "satellite" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
1700}
1701```
1702
1703Once done, you can start deploying actual monitoring objects into the satellite zone.
1704
1705```
1706vim /etc/icinga2/zones.d/satellite/satellite-hosts.conf
1707
1708object Host "agent" { ... }
1709```
1710
1711That's also explained and described in the [documentation](06-distributed-monitoring.md#distributed-monitoring-scenarios-master-satellite-agents).
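
A quick way to spot this problem is to compare the directory names in `zones.d` with the zones defined in `zones.conf`. The following is a minimal sketch: the function name `find_unconfigured_zone_dirs` is hypothetical, and it assumes that all zones are defined in a single file using the usual `object Zone "<name>"` syntax. Zones defined in other included files are not seen by it; `icinga2 object list --type Zone` shows what the daemon actually knows.

```bash
# Hypothetical helper: print every directory in zones.d that has no
# matching `object Zone "<name>"` definition in the given zones.conf.
find_unconfigured_zone_dirs() {
  local zones_d="$1" zones_conf="$2" dir name
  for dir in "$zones_d"/*/; do
    [ -d "$dir" ] || continue
    name=$(basename "$dir")
    # The name is used inside a regular expression, so a dot in the
    # zone name matches any character. Good enough for a rough check.
    if ! grep -Eq "^[[:space:]]*object[[:space:]]+Zone[[:space:]]+\"$name\"" "$zones_conf"; then
      printf '%s\n' "$name"
    fi
  done
}

# Example usage:
# find_unconfigured_zone_dirs /etc/icinga2/zones.d /etc/icinga2/zones.conf
```

Every name printed is a zone directory which the config compiler would not include on this node.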
1712
There is one exception: agents using `command_endpoint`, as configured in the
Director (Host -> Agent -> yes), have no config sync in place for their own zone.
Therefore it is valid to just sync their zones via the config sync.
1716
1717#### Director Changes
1718
The following steps restore the Zone/Endpoint objects as config objects outside of `zones.d`,
in the `zones.conf` of your masters and satellites, and render them as external objects in the Director.

[Example](06-distributed-monitoring.md#distributed-monitoring-scenarios-master-satellite-agents)
for a three-level setup where the masters and satellites know the zone hierarchy,
defined outside of `zones.d` in [zones.conf](04-configuration.md#zones-conf):
1725
1726```
1727object Endpoint "icinga-master1.localdomain" {
1728  //define 'host' attribute to control the connection direction on each instance
1729}
1730
1731object Endpoint "icinga-master2.localdomain" {
1732  //...
1733}
1734
1735object Endpoint "icinga-satellite1.localdomain" {
1736  //...
1737}
1738
1739object Endpoint "icinga-satellite2.localdomain" {
1740  //...
1741}
1742
1743//--------------
1744// Zone hierarchy with endpoints, required for the trust relationship and that the cluster config sync knows which zone directory defined in zones.d needs to be synced to which endpoint.
1745// That's no different to what is explained in the docs as basic zone trust hierarchy, and is intentionally managed outside in zones.conf there.
1746
1747object Zone "master" {
1748  endpoints = [ "icinga-master1.localdomain", "icinga-master2.localdomain" ]
1749}
1750
1751object Zone "satellite" {
1752  endpoints = [ "icinga-satellite1.localdomain", "icinga-satellite2.localdomain" ]
1753  parent = "master" // trust
1754}
1755```
1756
Prepare the above configuration on all affected nodes. The satellites are likely up to date already.
Then continue with the steps below.
1759
> * Back up your database, just to be on the safe side.
> * Create all non-external Zone/Endpoint objects manually in the local `zones.conf` on all related Icinga master/satellite nodes.
> * While doing so, do NOT restart Icinga and do not run any deployments.
> * Change the type in the Director DB:
1764>
1765> ```sql
1766> UPDATE icinga_zone SET object_type = 'external_object' WHERE object_type = 'object';
1767> UPDATE icinga_endpoint SET object_type = 'external_object' WHERE object_type = 'object';
1768> ```
1769>
> * Render and deploy a new configuration in the Director. It will state that there are no changes. Ignore that and deploy anyway.
1771>
1772> That's it. All nodes should automatically restart, triggered by the deployed configuration via cluster protocol.
1773
1774
1775### Cluster Troubleshooting Overdue Check Results <a id="troubleshooting-cluster-check-results"></a>
1776
1777If your master does not receive check results (or any other events) from the child zones
1778(satellite, clients, etc.), make sure to check whether the client sending in events
1779is allowed to do so.
1780
1781> **Tip**
1782>
1783> General troubleshooting hints on late check results are documented [here](15-troubleshooting.md#late-check-results).
1784
1785The [distributed monitoring conventions](06-distributed-monitoring.md#distributed-monitoring-conventions)
1786apply. So, if there's a mismatch between your client node's endpoint name and its provided
1787certificate's CN, the master will deny all events.
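
Such a mismatch can be checked on the client by comparing the certificate's CN with the endpoint name. The sketch below is an illustration only: the function names are hypothetical, and the certificate path in the usage comment assumes the default location `/var/lib/icinga2/certs/<endpoint>.crt`.

```bash
# Hypothetical helper: extract the CN from an RFC 2253 style subject line
# such as "subject=CN=icinga2-agent1.localdomain".
cn_from_subject() {
  printf '%s\n' "$1" | sed -n 's/^subject=.*CN=\([^,]*\).*$/\1/p'
}

# Hypothetical helper: succeeds if the certificate's CN equals the
# expected endpoint name. Returns 2 if the certificate cannot be read.
cert_matches_endpoint() {
  local cert="$1" endpoint="$2" subject
  subject=$(openssl x509 -noout -subject -nameopt RFC2253 -in "$cert") || return 2
  [ "$(cn_from_subject "$subject")" = "$endpoint" ]
}

# Example usage (path and name are assumptions):
# cert_matches_endpoint /var/lib/icinga2/certs/icinga2-agent1.localdomain.crt \
#   icinga2-agent1.localdomain || echo "CN and endpoint name differ"
```

The endpoint name to compare against is the one used in the `Endpoint` object on the master, which must also match the client's `NodeName`.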
1788
1789> **Tip**
1790>
1791> [Icinga Web 2](02-installation.md#setting-up-icingaweb2) provides a dashboard view
1792> for overdue check results.
1793
1794Enable the [debug log](15-troubleshooting.md#troubleshooting-enable-debug-output) on the master
1795for more verbose insights.
1796
1797If the client cannot authenticate, it's a more general [problem](15-troubleshooting.md#troubleshooting-cluster-unauthenticated-clients).
1798
1799The client's endpoint is not configured on nor trusted by the master node:
1800
1801```
1802Discarding 'check result' message from 'icinga2-agent1.localdomain': Invalid endpoint origin (client not allowed).
1803```
1804
1805The check result message sent by the client does not belong to the zone the checkable object is
1806in on the master:
1807
1808```
1809Discarding 'check result' message from 'icinga2-agent1.localdomain': Unauthorized access.
1810```
1811
1812
1813### Cluster Troubleshooting Replay Log <a id="troubleshooting-cluster-replay-log"></a>
1814
If your `/var/lib/icinga2/api/log` directory keeps growing, it generally means that your cluster
cannot replay the log after a connection loss and re-establishment. A master node, for example,
stores all events for disconnected endpoints in its own zone and in child zones.
1818
1819Check the following:
1820
* Are all clients connected? Verify this e.g. with the [cluster health check](06-distributed-monitoring.md#distributed-monitoring-health-checks).
* Check your [connection](15-troubleshooting.md#troubleshooting-cluster-connection-errors) in general.
* Does the log replay work, i.e. are all events processed and does the directory get cleaned up over time?
* Decrease the `log_duration` attribute value for that specific [endpoint](09-object-types.md#objecttype-endpoint).
1825
1826The cluster health checks also measure the `slave_lag` metric. Use this data to correlate
1827graphs with other events (e.g. disk I/O, network problems, etc).
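
To judge whether the replay log directory is actually shrinking over time, it helps to sample a simple number at regular intervals. A minimal sketch, with the hypothetical function name `replay_log_file_count`, which counts the files in the directory named above:

```bash
# Hypothetical helper: print the number of files in the replay log
# directory (defaults to the path mentioned in this section).
replay_log_file_count() {
  local dir="${1:-/var/lib/icinga2/api/log}" count
  count=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
  printf '%s\n' "$count"
}

# Example usage: sample the count together with a timestamp.
# echo "$(date '+%F %T') $(replay_log_file_count)"
```

A count that only ever increases while all endpoints are connected suggests that the replay is not working. Combine it with `du -sh` on the same directory if you also need the size on disk.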
1828
1829
1830### Cluster Troubleshooting: Windows Agents <a id="troubleshooting-cluster-windows-agents"></a>
1831
1832
1833#### Windows Service Exe Path <a id="troubleshooting-cluster-windows-agents-service-exe-path"></a>
1834
1835Icinga agents can be installed either as x86 or x64 package. If you enable features, or wonder why
1836logs are not written, the first step is to analyse which path the Windows service `icinga2` is using.
1837
Start a new administrative PowerShell and ensure that the `icinga2` service is running.
1839
1840```
1841C:\Program Files\ICINGA2\sbin> net start icinga2
1842```
1843
Use the `Get-WmiObject` cmdlet to extract the Windows service and its path name.
1845
1846```
1847C:\Program Files\ICINGA2\sbin> Get-WmiObject win32_service | ?{$_.Name -like '*icinga*'} | select Name, DisplayName, State, PathName
1848
1849Name    DisplayName State   PathName
1850----    ----------- -----   --------
1851icinga2 Icinga 2    Running "C:\Program Files\ICINGA2\sbin\icinga2.exe" --scm "daemon"
1852```
1853
1854If you have used the `icinga2.exe` from a different path to enable e.g. the `debuglog` feature,
1855navigate into `C:\Program Files\ICINGA2\sbin\` and use the correct exe to control the feature set.
1856
1857
1858#### Windows Agents consuming 100% CPU <a id="troubleshooting-cluster-windows-agents-cpu"></a>
1859
1860> **Note**
1861>
1862> The network stack was rewritten in 2.11. This fixes several hanging connections and threads
1863> on older Windows agents and master/satellite nodes. Prior to testing the below, plan an upgrade.
1864
1865Icinga 2 requires the `NodeName` [constant](17-language-reference.md#constants) in various places to run.
1866This includes loading the TLS certificates, setting the proper check source,
1867and so on.
1868
1869Typically the Windows setup wizard and also the CLI commands populate the [constants.conf](04-configuration.md#constants-conf)
1870file with the auto-detected or user-provided FQDN/Common Name.
1871
If this constant is not set during startup, Icinga tries to resolve the
FQDN and, if that fails, falls back to the hostname. If both fail, it logs
an error and sets the constant to `localhost`. This results in undefined behaviour
if the admin ignores it.

Querying an unreachable DNS server consumes CPU. It may look like Icinga
is busy running lots of checks, while it is actually still starting up.
1879
1880In order to fix this, edit the `constants.conf` file and populate
1881the `NodeName` constant with the FQDN. Ensure this is the same value
1882as the local endpoint object name.
1883
1884```
1885const NodeName = "windows-agent1.domain.com"
1886```
1887
1888
1889
1890#### Windows blocking Icinga 2 with ephemeral port range <a id="troubleshooting-cluster-windows-agents-ephemeral-port-range"></a>
1891
1892When you see a message like this in your Windows agent logs:
1893
1894```
1895critical/TcpSocket: Invalid socket: 10055, "An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full."
1896```
1897
Windows is blocking Icinga 2, and no further TCP connections can be handled.

Depending on the version, patch level and installed applications, Windows changes its
range of [ephemeral ports](https://en.wikipedia.org/wiki/Ephemeral_port#Range).
1902
1903In order to solve this, raise the `MaxUserPort` value in the registry.
1904
1905```
1906HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters
1907
Value Name: MaxUserPort
Value Type: DWORD
1910Value data: 65534
1911```
1912
1913More details in [this blogpost](https://www.netways.de/blog/2019/01/24/windows-blocking-icinga-2-with-ephemeral-port-range/)
1914and this [MS help entry](https://support.microsoft.com/en-us/help/196271/when-you-try-to-connect-from-tcp-ports-greater-than-5000-you-receive-t).
1915