----------------------------------------------------------------------
 gather -- collect and display system statistics
----------------------------------------------------------------------

$Id$

1.Introduction.
2.Installation and configuration.
3.Examples.
4.Timeperiod shortcuts.


1.Introduction.
--------------

Many of those who have worked with computer systems have faced
situations when something goes wrong with the system and needs to be
traced in some way. Under Unix there are many nice tools, such as top,
ps, netstat, vmstat, sysctl and so on, that can be used to get useful
information about the system and trace the problem. But what if the
problem happens only occasionally, and usually when you are away from
the box, have no access to it or are just sleeping? What you have when
you get to the box is some logs and maybe some performance statistics
in rrd. Very often this is not enough to figure out what was wrong
with the system.

To have more info you can write some scripts that run system
utilities in batch mode to get statistics, and run these scripts via
cron. Then, when something has happened, you have tons of files with
the utils' output in which you have to find the information useful to
you. Writing scripts and then digging through thousands of files is a
time-consuming task that would be nice to automate a bit.

This is where the gather utility comes to help. The script runs
system utils to collect statistics and then helps you to analyze the
collected data. You specify the commands you want to run in the
gather.map file, set cron to run gather with the desired periodicity,
and then use the same utility to output and grep the collected
statistics for a specified period. gather's output contains
timestamps, so you can see what happened and when.


2.Installation and configuration.
---------------------------------

The gather utility is a Perl script, so you need Perl installed on
your box to use it. Run

    make

to make the installation files. Then copy the gather script wherever
you want to have it, preferably into some directory from PATH. gather
reads its configuration parameters from the gather.cfg and gather.map
files. Check in the gather script where it looks for configs by
default and put gather.cfg there. You can change some defaults by
running make with additional variables set, e.g.:

    make CONFDIR=$HOME/.gather

See the Makefile for other parameters you can set.

You can also change defaults using command line parameters. Run

    gather help

to see a minihelp and some defaults.

gather.cfg contains configuration variables that specify the location
of the gather.map file, the directory where statistics are collected,
the compression used and some others. Take gather.cfg from the gather
distribution and tune it for your environment and needs. Every
configuration variable is commented, so you shouldn't have problems
with configuration. Please note that gather.cfg is really a Perl
script evaluated by gather when it runs, so be careful not to make
syntax errors if you want the program to work.
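
Since gather.cfg is plain Perl, its syntax can be checked before gather
evaluates it. Below is a minimal sketch using `perl -c', which compiles
a file without running it. The function name check_gather_cfg is
made up for this example, and the check assumes the file does not
depend on anything that is defined only inside gather itself.

```shell
# Sketch: syntax-check a gather config file with `perl -c`.
# check_gather_cfg is a hypothetical helper, not part of gather.
check_gather_cfg() {
    cfg=$1
    # perl -c compiles the file and reports syntax errors without
    # executing it (BEGIN blocks aside).
    if perl -c "$cfg" >/dev/null 2>&1; then
        echo "ok: $cfg"
    else
        echo "syntax error: $cfg" >&2
        return 1
    fi
}
```

It could be used like this, with the path replaced by wherever your
gather.cfg actually lives:

    check_gather_cfg $HOME/.gather/gather.cfg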

Next, specify the commands in the gather.map file. You can use
gather.map from the gather distribution as an example. gather.map
looks something like this:

    %map = ('uptime'   => {'desc' => 'system uptime',
                           'cmd'  => '/usr/bin/uptime'},
            'sysctl'   => {'desc' => 'sysctl variables',
                           'cmd'  => '/sbin/sysctl -a'},
            'sockstat' => {'desc' => 'sockstat output',
                           'cmd'  => '/usr/bin/sockstat'}
            ...

           );

It is rather self-explanatory, but here is a description. In
gather.map you should initialize the Perl hash variable `%map'. Keys
such as 'sysctl' and 'sockstat' are just names identifying your
statistics commands; you can use any name you like here, but you can't
use the same name twice. 'desc' is an optional description of the
command. You can write anything you want here, but try to keep it
informative and short, as it is used in the `gather show utils'
output. 'cmd' is the command to run. All output from the command will
be redirected to the gather database.

When you have gather.cfg and the map configured, you can run gather to
collect data:

    gather collect

gather will run all commands specified in the map and store their
output. You need to set up cron to run this command with the desired
periodicity.

Also, if you don't want to run out of free space, you need to set up
the command:

    gather expire <days>

in crontab to run daily and expire old data. Data older than <days>
days will be deleted.
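
The two cron jobs described above could be generated with a small
helper like the one below. This is only a sketch: the function name,
the 03:00 expiry time and the example path /usr/local/bin/gather are
assumptions, and the `*/N' step syntax requires a cron that supports
it (Vixie cron and its descendants do).

```shell
# Sketch: print crontab entries for periodic collection and daily
# expiry. gather_cron_lines is a hypothetical helper.
#   $1 - path to the gather script (assumed location)
#   $2 - collection interval in minutes
#   $3 - number of days of data to keep
gather_cron_lines() {
    gather_bin=$1
    interval=$2
    keep_days=$3
    # Collect statistics every $interval minutes.
    printf '*/%s * * * * %s collect\n' "$interval" "$gather_bin"
    # Expire old data once a day, at 03:00 (arbitrary choice).
    printf '0 3 * * * %s expire %s\n' "$gather_bin" "$keep_days"
}
```

To append the entries to the current user's crontab one could run:

    ( crontab -l 2>/dev/null; gather_cron_lines /usr/local/bin/gather 5 30 ) | crontab -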

The gather database is actually just a directory where the output of
each command is saved in a separate file in a
YEAR-MONTH-DAY/HOUR/MINUTE subdirectory. Thus you can browse it
looking for the info you need, but you can also use gather to retrieve
and grep data. Run

    gather show help

to see a minihelp about available subcommands. The next section
provides some examples that demonstrate how you can use the gather
utility.
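
For browsing the database by hand, the directory layout above can be
walked with an ordinary shell glob. The sketch below lists the stored
files of one command for one hour. It assumes that each file is named
after its map key and is stored uncompressed; check your gather.cfg
and the database directory itself, as the real file names may differ.
The function name is made up for this example.

```shell
# Sketch: list stored samples of one command for a given hour.
# Assumption: files are named after the map key, without compression.
#   $1 - database directory (as configured in gather.cfg)
#   $2 - YEAR-MONTH-DAY/HOUR, e.g. 2008-09-14/13
#   $3 - map key, e.g. uptime
gather_list_samples() {
    dbdir=$1
    hour=$2
    name=$3
    # The glob expands over the MINUTE subdirectories in sorted order.
    for f in "$dbdir/$hour"/*/"$name"; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For example, assuming the database is kept in /var/db/gather:

    gather_list_samples /var/db/gather 2008-09-14/13 uptime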

2.1 Installing with Chef.
-------------------------

The gather utility can be installed using a Chef cookbook. See further
instructions on Chef Supermarket, the open-source community platform:
https://supermarket.chef.io/cookbooks/gatherit

3.Examples.
-----------

When you have set up the gather utility as described above and
collected some statistics, you can use the `gather show' command to
display and grep data.

3.1.show utils.
---------------

Run

    gather show utils

and you will see the list of commands you have installed in the map
and use to collect data:

    ------------------------------------------------------------------
    name       cmd                desc
    ------------------------------------------------------------------
    ...
    sockstat   /usr/bin/sockstat  sockstat output
    sysctl     /sbin/sysctl -a    sysctl variables
    ...
    uptime     /usr/bin/uptime    system uptime
    ...

3.2.Time periods.
-----------------

When asking gather to display data you have to specify the time period
you want data for. A time period has the following format:
YEAR-MONTH-DAY/HOUR/MINUTE, e.g.:

    2008-09-14/11/10

HOUR and MINUTE are optional, so if you want data for the whole hour,
you can specify:

    2008-09-14/11

and if you want data for the whole day, just specify this day:

    2008-09-14

You can use ranges for setting time periods. E.g. by specifying:

    2008-09-13/11/10--2008-09-14/12

you will get data for the period from 11:10 2008-09-13 to 12:00
2008-09-14.

3.3.show grep.
--------------

To display data you can use the grep subcommand. You should set a
regexp that will filter the data. If you want all output, set the
regexp to '.' (point) or '.*'. E.g.:

    gather show -t '2008-09-14/13' grep '.*' uptime

will output something like this:

    2008-09-14/13/00: 1:00PM up 1:53, 0 users, load averages: 0.16, 0.04, 0.01
    2008-09-14/13/05: 1:05PM up 1:58, 0 users, load averages: 0.16, 0.05, 0.01
    2008-09-14/13/10: 1:10PM up 2:03, 0 users, load averages: 0.16, 0.04, 0.01
    2008-09-14/13/15: 1:15PM up 2:08, 0 users, load averages: 0.16, 0.04, 0.01
    2008-09-14/13/20: 1:20PM up 2:13, 0 users, load averages: 0.16, 0.04, 0.01
    2008-09-14/13/25: 1:25PM up 2:18, 0 users, load averages: 0.00, 0.00, 0.00
    2008-09-14/13/30: 1:30PM up 2:23, 0 users, load averages: 0.16, 0.03, 0.01
    2008-09-14/13/35: 1:35PM up 2:28, 0 users, load averages: 0.08, 0.02, 0.01
    2008-09-14/13/40: 1:40PM up 2:33, 0 users, load averages: 0.16, 0.03, 0.01
    2008-09-14/13/45: 1:45PM up 2:38, 0 users, load averages: 0.18, 0.05, 0.01
    2008-09-14/13/50: 1:50PM up 2:43, 0 users, load averages: 0.23, 0.07, 0.02
    2008-09-14/13/55: 1:55PM up 2:48, 0 users, load averages: 0.08, 0.03, 0.01

But usually you will need a more complicated regexp than just '.' to
filter the needed info. E.g. to see statistics about open files for
several hours, you can run:

    gather show -t '2008-09-14/12--2008-09-14/15' grep '^kern.openfiles:' sysctl

That will output something like this:

    2008-09-14/12/00: kern.openfiles: 197
    2008-09-14/12/05: kern.openfiles: 194
    2008-09-14/12/10: kern.openfiles: 194
    ...
    2008-09-14/15/50: kern.openfiles: 187
    2008-09-14/15/55: kern.openfiles: 188

You can use the '-c' option if you want to count matched lines rather
than display them. E.g. to see the number of sockets used by user www
from 12:00 to 13:00 on 2008-09-14 you can run:

    gather show -t '2008-09-14/12' grep -c '^www\s' sockstat

and get output like this:

    2008-09-14/12/00: 10
    2008-09-14/12/05: 10
    2008-09-14/12/10: 10
    ...

3.4.show filter.
----------------

If you need not just to grep data but to perform some actions on it,
you will want to use the filter subcommand. E.g. to see the number of
logged-in users, you can run:

    gather show -t '2008-09-14/12' filter "perl -pe 's/^.*?(\\d+ users),.*\$/\$1/'" uptime

That will output something like this:

    2008-09-14/12/00: 0 users
    2008-09-14/12/05: 0 users
    2008-09-14/12/10: 0 users
    ...

Remember to properly escape all special characters in the filter
command. If the filter is rather complicated, it is better to write a
separate script to avoid escaping hell and then run:

    gather show -t '2008-09-14/11' filter ./script uptime

Another advantage of this approach is that you can store the written
filter and use it later. If you use gather for some time, you will
soon have a collection of useful filters.
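
As an illustration, the user-count filter could be kept as a small
shell function like the one below. This is only a sketch. It assumes
that gather feeds the stored command output to the filter on standard
input, as the `perl -pe' example suggests, and the function name
count_users is made up.

```shell
# Sketch of a stored filter: keep only the "N users" part of uptime
# output. Reads lines on stdin, prints matches on stdout.
count_users() {
    # [^0-9] before the number makes sure all its digits are captured;
    # s\{0,1\} also accepts the singular form "1 user".
    sed -n 's/^.*[^0-9]\([0-9][0-9]* users\{0,1\}\),.*$/\1/p'
}
```

Saved in an executable file that starts with a `#!/bin/sh' line and
ends with a call to count_users, it can be passed to the filter
subcommand in place of ./script.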

3.5.show assemble.
------------------

Another show subcommand, `assemble', can be useful when analysing the
output of utilities like `sysctl' or `netstat -s', which print a long
list of variables with their values.

E.g. `sysctl -a' output would look something like this:

    ...
    vm.stats.misc.zero_page_count: 8130
    vm.stats.misc.cnt_prezero: 0
    vm.stats.vm.v_kthreadpages: 0
    vm.stats.vm.v_rforkpages: 0
    vm.stats.vm.v_vforkpages: 170509301
    vm.stats.vm.v_forkpages: 1647077180
    vm.stats.vm.v_kthreads: 41928
    vm.stats.vm.v_rforks: 0
    vm.stats.vm.v_vforks: 829962
    vm.stats.vm.v_forks: 9605243
    vm.stats.vm.v_interrupt_free_min: 2
    vm.stats.vm.v_pageout_free_min: 34
    vm.stats.vm.v_cache_max: 44618
    vm.stats.vm.v_cache_min: 22309
    vm.stats.vm.v_cache_count: 12929
    vm.stats.vm.v_inactive_count: 445331
    vm.stats.vm.v_inactive_target: 33463
    vm.stats.vm.v_active_count: 70486
    vm.stats.vm.v_wire_count: 67018
    ...

Using e.g. `show grep vm.stats.vm.v_vforkpages' we could get a listing
for this particular variable over some interesting timeperiod. But
checking all variables this way would be a long process. With the
assemble subcommand it is much faster:

    gather show -t 2010-02-14/08 assemble '^(?k:vm\.stats\..*):\s+(?v:\d+)$' sysctl

    ...

    sysctl: vm.stats.object.collapses:

    2010-02-14/08/00: 35627077 -
    2010-02-14/08/05: 35628981 1904
    2010-02-14/08/10: 35634677 5696
    2010-02-14/08/15: 35636642 1965
    2010-02-14/08/20: 35642462 5820
    2010-02-14/08/25: 35644147 1685
    2010-02-14/08/30: 35649925 5778
    2010-02-14/08/35: 35651872 1947
    2010-02-14/08/40: 35657488 5616
    2010-02-14/08/45: 35659431 1943
    2010-02-14/08/50: 35665174 5743
    2010-02-14/08/55: 35666864 1690

    sysctl: vm.stats.sys.v_intr:

    2010-02-14/08/00: 497713097 -
    2010-02-14/08/05: 497751068 37971
    2010-02-14/08/10: 497772905 21837
    2010-02-14/08/15: 497784808 11903
    2010-02-14/08/20: 497793871 9063
    2010-02-14/08/25: 497805554 11683
    2010-02-14/08/30: 497815321 9767
    2010-02-14/08/35: 497837284 21963
    2010-02-14/08/40: 497845850 8566
    2010-02-14/08/45: 497981716 135866
    2010-02-14/08/50: 497990448 8732
    2010-02-14/08/55: 498002434 11986

    sysctl: vm.stats.sys.v_soft:

    2010-02-14/08/00: 476175765 -
    2010-02-14/08/05: 476231628 55863
    2010-02-14/08/10: 476287825 56197
    2010-02-14/08/15: 476353282 65457
    2010-02-14/08/20: 476414205 60923
    2010-02-14/08/25: 476474890 60685
    2010-02-14/08/30: 476541538 66648
    2010-02-14/08/35: 476602048 60510
    2010-02-14/08/40: 476664288 62240
    2010-02-14/08/45: 476729602 65314
    2010-02-14/08/50: 476796621 67019
    2010-02-14/08/55: 476859315 62694

    ...

Some explanation. '^(?k:vm\.stats\..*):\s+(?v:\d+)$' is a regular
expression with two nonstandard (gather-specific) extensions:

    (?k:<key_regexp>) -- the regexp matches the key.
    (?v:<val_regexp>) -- the regexp matches the value.

So in a string like this:

    vm.stats.sys.v_soft: 476175765

the regular expression above will match vm.stats.sys.v_soft as the key
and 476175765 as the value. As a result, all lines with the
vm.stats.sys.v_soft key will be assembled:

    sysctl: vm.stats.sys.v_soft:

    2010-02-14/08/00: 476175765 -
    2010-02-14/08/05: 476231628 55863
    2010-02-14/08/10: 476287825 56197
    2010-02-14/08/15: 476353282 65457
    2010-02-14/08/20: 476414205 60923
    ...

The first column is the time, the second is the value at this time and
the third is the difference from the previous value, which helps a lot
in finding anomalies. By default the assembled data is written to
stdout, but with the `-d <dir>' option you can specify a directory
where the assembled data will be stored, in a separate file for every
key.
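
To make the meaning of the third column concrete, here is a sketch
that recomputes it with awk from "time: value" lines. It only
illustrates the arithmetic and is not how gather itself is
implemented; the function name add_deltas is made up.

```shell
# Sketch: add a "difference from the previous value" column.
# Input lines on stdin:  <timestamp>: <value>
# Output lines:          <timestamp>: <value> <delta>
add_deltas() {
    awk '{
        # The first sample has no predecessor, so print "-" for it.
        if (NR == 1) delta = "-"; else delta = $2 - prev
        print $1, $2, delta
        prev = $2
    }'
}
```

Fed with the first two vm.stats.object.collapses samples shown
earlier (35627077 and 35628981), it prints "-" for the first line and
1904 for the second.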

Still, the amount of data you need to review is rather large :-). If
you know the exact time when the "problem" occurred (e.g. at about
08:20, i.e. the "2010-02-14/08/20:" lines) you can use the
`-t "2010-02-14/08/20:"' option. This will do some primitive analysis,
looking for variables that had anomalies at this time, and will list
them so you can start your analysis by reviewing these variables
first.

3.6.show plot.
--------------

If you have gnuplot installed you can use the `show plot' subcommand
to produce data plots. As its arguments it expects a regexp and a
dataset name. In the regexp you should use grouping to capture the
parameter you want to display (as a function of time).

For example, let's suppose we want to plot laptop battery life using
sysctl output:

    % sysctl hw.acpi.battery.life
    hw.acpi.battery.life: 70

Our gather is configured to collect sysctl and produces this output:

    % gather show -t 1h grep hw.acpi.battery.life sysctl
    2012-04-28 08:43: hw.acpi.battery.life: 15
    2012-04-28 08:44: hw.acpi.battery.life: 16
    2012-04-28 08:45: hw.acpi.battery.life: 17
    2012-04-28 08:46: hw.acpi.battery.life: 18
    ...

To plot this we can use the following command, which captures the
number after "life:" as group \1:

    gather show -t 1h plot 'hw.acpi.battery.life: (\d+)' sysctl

To plot it into a png file:

    gather show -t 1h plot -t png -o '/tmp/battery.life.png' 'hw.acpi.battery.life: (\d+)' sysctl

If you always want to plot to a file, you may want to change the
default settings in gather.cfg.

Also note that if you don't have gnuplot installed on the host where
you are running gather, you can set 'cat' as the gnuplot command in
the configuration file and produce a gnuplot script with data, which
you can run on a host with gnuplot installed. Or use a
"ssh host | gnuplot" pipe.

4.Timeperiod shortcuts.
-----------------------

The most general form of a timeperiod is:

    YYYY-MM-DD/HH/MM--yyyy-mm-dd/hh/mm

where YYYY-MM-DD/HH/MM is the start of the timeperiod and
yyyy-mm-dd/hh/mm is its end. You can skip MM and HH in the start or
end part of the range. E.g.:

    2008-11-16/14--2008-11-17

This is interpreted as:

    2008-11-16/14/00--2008-11-17/23/59

It is also possible to specify only the first part of a timeperiod. E.g.:

    2008-11-16/14  (interpreted as 2008-11-16/14/00--2008-11-16/14/59)

or

    2008-11-16     (interpreted as 2008-11-16/00/00--2008-11-16/23/59)

If the day, hour or minute in the end part of the timeperiod is the
same as in the start part, you can skip it:

    YYYY-MM-DD/HH/MM--/hh/mm        (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/hh/mm)

    YYYY-MM-DD/HH/MM--//mm          (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/HH/mm)

    YYYY-MM-DD/HH/MM--yyyy-mm-dd//  (interpreted as YYYY-MM-DD/HH/MM--yyyy-mm-dd/HH/MM)

    and so on.

Here are some other shortcuts you can use to reduce typing:

    .      current day

    ./.    current day/current hour

    ././.  current day/current hour/current minute

    $      now (the same as ././.)

    Nd     N days ago

    Nh     N hours ago

    Nm     N minutes ago

If N{d,h,m} is used alone (there is only the start part) then it is
replaced by the timeperiod "from that time until now". I.e. timeperiod
"Nd" is the same as "Nd--$".
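
The rules for skipped HH and MM can be summarized in a few lines of
shell. The sketch below is only an illustration of the expansion as
described in this section, not gather's own parser, and both function
names are made up. It handles absolute dates only. The `--/hh/mm'
style and the `.', `$' and N{d,h,m} shortcuts are left out, and a
period given with a full minute and no end part is assumed to cover
just that minute.

```shell
# Sketch: expand skipped HH and MM in a timeperiod.
# pad_period fills missing parts of YYYY-MM-DD[/HH[/MM]] with defaults.
#   $1 - date, $2 - default hour, $3 - default minute
pad_period() {
    case $1 in
        */*/*) printf '%s\n' "$1" ;;                  # already complete
        */*)   printf '%s/%s\n' "$1" "$3" ;;          # minute skipped
        *)     printf '%s/%s/%s\n' "$1" "$2" "$3" ;;  # hour and minute skipped
    esac
}

# expand_period prints the full start--end form of a timeperiod.
expand_period() {
    case $1 in
        *--*) start=${1%%--*}; end=${1#*--} ;;  # explicit range
        *)    start=$1; end=$1 ;;               # only the start part given
    esac
    # The start is padded with the earliest time, the end with the latest.
    printf '%s--%s\n' "$(pad_period "$start" 00 00)" "$(pad_period "$end" 23 59)"
}
```

For the example above, `expand_period 2008-11-16/14--2008-11-17'
prints 2008-11-16/14/00--2008-11-17/23/59.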

--
Mikolaj Golub <to.my.trociny@gmail.com>
