----------------------------------------------------------------------
          gather -- collect and display system statistics
----------------------------------------------------------------------

$Id$

1.Introduction.
2.Installation and configuration.
3.Examples.
4.Timeperiod shortcuts.


1.Introduction.
--------------

Many of those who have worked with computer systems have faced
situations when something goes wrong with the system and needs to be
traced in some way. Under Unix there are many nice tools such as top,
ps, netstat, vmstat, sysctl and so on that can be used to get useful
information about the system and trace the problem. But what if the
problem happens unexpectedly, usually when you are away from the box,
have no access to it or are just sleeping? What you have when you get
to the box is some logs and maybe some performance statistics in
rrd. Very often this is not enough to figure out what was wrong with
the system. To have more info you can write some scripts that run
system utilities in batch mode to get statistics and run these scripts
via cron; then, when something has happened, you have tons of files
with the utilities' output in which you have to find the information
useful to you. Writing scripts and then digging through thousands of
files is a time-consuming task that would be nice to automate a
bit. This is where the gather utility comes to help. The script runs
system utilities to collect statistics and then helps you analyze the
collected data. You specify the commands you want to run in the
gather.map file, set up cron to run gather at the desired interval,
and then use the same utility to output and grep the collected
statistics for a specified period. gather's output contains
timestamps, so you can see what happened and when.


2.Installation and configuration.
---------------------------------

The gather utility is a Perl script, so you need Perl installed on your
box to use it. Run

  make

to make the installation files. Then copy the gather script wherever
you want to have it, preferably to some directory in your PATH. gather
reads its configuration parameters from the gather.cfg and gather.map
files. Check in the gather script where it looks for configs by
default and put gather.cfg there. You can change some defaults by
running make with additional variables set, e.g.:

  make CONFDIR=$HOME/.gather

See the Makefile for other parameters you can set.

You can also change defaults using command line parameters. Run

  gather help

to see a minihelp and some defaults.

gather.cfg contains configuration variables that specify the location
of the gather.map file, the directory where statistics are collected,
the compression used and some others. Take gather.cfg from the gather
distribution and tune it for your environment and needs. Every
configuration variable is commented, so you shouldn't have problems
with configuration. Please note that gather.cfg is really a Perl
script evaluated by gather when it runs, so be careful not to make
syntax errors if you want the program to work.
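
Since gather.cfg is Perl code, one way to catch syntax errors before
gather does is Perl's own `-c' switch, which checks the syntax of a
file without running the main program. A minimal sketch follows; the
function name `check_gather_config' is a hypothetical helper, not part
of gather.

```shell
# check_gather_config FILE...
# Run `perl -c` (syntax check only) on each file and report the
# result. Returns non-zero if at least one file fails the check.
check_gather_config() {
  status=0
  for file in "$@"; do
    if perl -c "$file" >/dev/null 2>&1; then
      echo "$file: syntax OK"
    else
      echo "$file: syntax error" >&2
      status=1
    fi
  done
  return $status
}
```

With the CONFDIR from the make example above it could be called as
`check_gather_config "$HOME/.gather/gather.cfg"'; adjust the path to
wherever your configuration actually lives.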

The next thing is to specify commands in the gather.map file. You can
use gather.map from the gather distribution as an example. gather.map
looks something like this:

  %map = ('uptime'   => {'desc' => 'system uptime',
                         'cmd'  => '/usr/bin/uptime'},
          'sysctl'   => {'desc' => 'sysctl variables',
                         'cmd'  => '/sbin/sysctl -a'},
          'sockstat' => {'desc' => 'sockstat output',
                         'cmd'  => '/usr/bin/sockstat'}
           ...

         );

It is rather self-explanatory, but here is a description. In gather.map
you should initialize the Perl hash variable `%map'. The keys 'uptime',
'sysctl' and 'sockstat' are just names identifying your statistics
commands; you can use any name you like here, but you can't use the
same name twice. 'desc' is an optional description of the command; you
can write anything you want here, but try to keep it informative and
short, as it is used in `gather show utils' output. 'cmd' is the
command to run. All output from the command will be redirected to the
gather database.

When you have gather.cfg and the map configured, you can run gather to
collect data:

  gather collect

gather will run all commands specified in the map and store their
output. You need to set up cron to run this command at the desired
interval.

Also, if you don't want to run out of free space, you need to set up
the command:

  gather expire <days>

in crontab to run daily and expire old data. Data older than <days>
will be deleted.
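
The two cron jobs described above could look like the entries printed
by the sketch below. The five-minute interval mirrors the sample
output in the Examples section; the install path, the 03:15 expiry
time, the 30-day retention and the helper name are all arbitrary
assumptions. The `*/5' step syntax is understood by the cron
implementations common on Linux and FreeBSD, but not necessarily by
every cron.

```shell
# gather_cron_entries [GATHER_PATH] [DAYS]
# Print crontab lines that collect statistics every five minutes and
# expire data older than DAYS once a day. Path, schedule and retention
# are assumptions for illustration; adjust them for your system.
gather_cron_entries() {
  gather_path=${1:-/usr/local/bin/gather}
  days=${2:-30}
  printf '*/5 * * * * %s collect\n' "$gather_path"
  printf '15 3 * * * %s expire %s\n' "$gather_path" "$days"
}
```

Review the printed lines and add them with `crontab -e' rather than
piping them into crontab, which would replace your existing entries.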

The gather database is actually just a directory where the output of
each command is saved in a separate file in a YEAR-MONTH-DAY/HOUR/MINUTE
subdirectory. Thus you can browse it directly looking for the needed
info, but you can also use gather to retrieve and grep data. Run

  gather show help

to see a minihelp about available subcommands. The next section provides
some examples that demonstrate how you can use the gather utility.
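
Because the database is a plain directory tree, ordinary shell tools
work on it too. The sketch below greps one day's samples by hand. It
assumes that each sample file is named after its key in gather.map and
is stored uncompressed; both are assumptions about the on-disk layout,
so check your gather.cfg and database directory first. `db_grep' is a
hypothetical helper, not a gather command.

```shell
# db_grep DBDIR DAY NAME REGEX
# Grep every sample of NAME collected on DAY (YEAR-MONTH-DAY) and
# prefix each matching line with its DAY/HOUR/MINUTE timestamp.
# Assumes files are named after the map key and are not compressed.
db_grep() {
  dbdir=$1 day=$2 name=$3 regex=$4
  for f in "$dbdir/$day"/*/*/"$name"; do
    [ -f "$f" ] || continue          # skip if the glob matched nothing
    stamp=${f#"$dbdir"/}             # drop the database directory prefix
    stamp=${stamp%/"$name"}          # drop the file name, keep DAY/HOUR/MINUTE
    grep -E -- "$regex" "$f" | sed "s|^|$stamp: |"
  done
}
```

For example, `db_grep /var/db/gather 2008-09-14 sysctl '^kern\.openfiles:''
would list that variable for every sample of the day, where
/var/db/gather stands in for your actual database directory.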

2.1 Installing with Chef.
-------------------------

The gather utility can be installed using a Chef cookbook. See further
instructions on Chef Supermarket, the open-source community platform:
https://supermarket.chef.io/cookbooks/gatherit

3.Examples.
-----------

When you have set up the gather utility as described above and
collected some statistics, you can use the `gather show' command to
display and grep data.

3.1.show utils.
---------------

Run

  gather show utils

and you will see the list of commands you have installed in the map
and use to collect data:

  ------------------------------------------------------------------
  name       cmd                      desc
  ------------------------------------------------------------------
  ...
  sockstat   /usr/bin/sockstat        sockstat output
  sysctl     /sbin/sysctl -a          sysctl variables
  ...
  uptime     /usr/bin/uptime          system uptime
  ...

3.2.Time periods.
-----------------

When asking gather to display data, you have to specify the time
period you want data for. A time period has the following format:
YEAR-MONTH-DAY/HOUR/MINUTE, e.g.:

  2008-09-14/11/10

HOUR and MINUTE are optional, so if you want data for the whole hour,
you can specify:

  2008-09-14/11

and if you want data for the whole day, just specify this day:

  2008-09-14

You can use ranges for setting time periods. E.g. specifying:

  2008-09-13/11/10--2008-09-14/12

you will get data for the period from 11:10 2008-09-13 to 12:00
2008-09-14.

3.3.show grep.
--------------

To display data you can use the grep subcommand. You should set a
regexp that will filter the data. If you want all output, set the
regexp to '.' (a dot) or '.*'. E.g.:

  gather show -t '2008-09-14/13' grep '.*' uptime

will output something like this:

  2008-09-14/13/00:  1:00PM  up  1:53, 0 users, load averages: 0.16, 0.04, 0.01
  2008-09-14/13/05:  1:05PM  up  1:58, 0 users, load averages: 0.16, 0.05, 0.01
  2008-09-14/13/10:  1:10PM  up  2:03, 0 users, load averages: 0.16, 0.04, 0.01
  2008-09-14/13/15:  1:15PM  up  2:08, 0 users, load averages: 0.16, 0.04, 0.01
  2008-09-14/13/20:  1:20PM  up  2:13, 0 users, load averages: 0.16, 0.04, 0.01
  2008-09-14/13/25:  1:25PM  up  2:18, 0 users, load averages: 0.00, 0.00, 0.00
  2008-09-14/13/30:  1:30PM  up  2:23, 0 users, load averages: 0.16, 0.03, 0.01
  2008-09-14/13/35:  1:35PM  up  2:28, 0 users, load averages: 0.08, 0.02, 0.01
  2008-09-14/13/40:  1:40PM  up  2:33, 0 users, load averages: 0.16, 0.03, 0.01
  2008-09-14/13/45:  1:45PM  up  2:38, 0 users, load averages: 0.18, 0.05, 0.01
  2008-09-14/13/50:  1:50PM  up  2:43, 0 users, load averages: 0.23, 0.07, 0.02
  2008-09-14/13/55:  1:55PM  up  2:48, 0 users, load averages: 0.08, 0.03, 0.01

But usually you will need a more complicated regexp than just '.' to
filter the needed info. E.g. to see statistics about open files for
several hours, you can run:

  gather show -t '2008-09-14/12--2008-09-14/15' grep '^kern.openfiles:' sysctl

That will output something like this:

  2008-09-14/12/00: kern.openfiles: 197
  2008-09-14/12/05: kern.openfiles: 194
  2008-09-14/12/10: kern.openfiles: 194
  ...
  2008-09-14/15/50: kern.openfiles: 187
  2008-09-14/15/55: kern.openfiles: 188

You can use the '-c' option if you want to count matching lines rather
than display them. E.g. to see the number of sockets used by user www
from 12:00 to 13:00 on 2008-09-14 you can run:

  gather show -t '2008-09-14/12' grep -c '^www\s' sockstat

and get output like this:

  2008-09-14/12/00: 10
  2008-09-14/12/05: 10
  2008-09-14/12/10: 10
  ...

3.4.show filter.
----------------

If you need not just to grep data but to perform some actions on it,
you will want to use the filter subcommand. E.g. to see the number of
logged-in users, you can run:

  gather show -t '2008-09-14/12' filter "perl -pe 's/^.*?(\\d+ users?),.*\$/\$1/'"  uptime

That will output something like this:

  2008-09-14/12/00: 0 users
  2008-09-14/12/05: 0 users
  2008-09-14/12/10: 0 users
  ...

The non-greedy `.*?' makes sure a multi-digit count is captured whole
rather than only its last digit.

Remember to properly escape all special characters in the filter
command. If the filter is rather complicated, it is better to write a
separate script to avoid escaping hell and then run:

  gather show -t '2008-09-14/11' filter ./script uptime

Another advantage of this approach is that you can store the written
filter and use it later. If you use gather for some time, you will
soon have a collection of useful filters.
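
Such a script could be as small as the sketch below, a sed counterpart
of the perl one-liner above. It assumes that gather feeds each sample
to the filter on standard input and reads the result from standard
output, as the one-liner suggests. The function name is hypothetical;
in a real `./script' the sed command would simply be the script body.

```shell
# users_filter: read uptime output on stdin and print only the
# "N users" (or "1 user") field of each line that contains one.
users_filter() {
  # [^0-9] anchors the match just before the count, so a multi-digit
  # number is captured whole; -n with the p flag drops non-matching lines.
  sed -n 's/^.*[^0-9]\([0-9][0-9]* users*\),.*$/\1/p'
}
```

Keeping the expression in a file means no shell quoting is needed on
the gather command line.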

3.5.show assemble.
------------------

Another show subcommand, `assemble', can be useful when analysing the
output of utilities like `sysctl' or `netstat -s', which print a long
list of variables with their values.

E.g. `sysctl -a' output would look something like this:

  ...
  vm.stats.misc.zero_page_count: 8130
  vm.stats.misc.cnt_prezero: 0
  vm.stats.vm.v_kthreadpages: 0
  vm.stats.vm.v_rforkpages: 0
  vm.stats.vm.v_vforkpages: 170509301
  vm.stats.vm.v_forkpages: 1647077180
  vm.stats.vm.v_kthreads: 41928
  vm.stats.vm.v_rforks: 0
  vm.stats.vm.v_vforks: 829962
  vm.stats.vm.v_forks: 9605243
  vm.stats.vm.v_interrupt_free_min: 2
  vm.stats.vm.v_pageout_free_min: 34
  vm.stats.vm.v_cache_max: 44618
  vm.stats.vm.v_cache_min: 22309
  vm.stats.vm.v_cache_count: 12929
  vm.stats.vm.v_inactive_count: 445331
  vm.stats.vm.v_inactive_target: 33463
  vm.stats.vm.v_active_count: 70486
  vm.stats.vm.v_wire_count: 67018
  ...

Using e.g. `show grep vm.stats.vm.v_vforkpages' we could get a listing
for this particular variable over some interesting time period. But
checking all variables this way would be a long process. With the
assemble subcommand it is much faster:

  gather show -t 2010-02-14/08 assemble '^(?k:vm\.stats\..*):\s+(?v:\d+)$' sysctl

  ...

  sysctl: vm.stats.object.collapses:

  2010-02-14/08/00: 35627077      -
  2010-02-14/08/05: 35628981      1904
  2010-02-14/08/10: 35634677      5696
  2010-02-14/08/15: 35636642      1965
  2010-02-14/08/20: 35642462      5820
  2010-02-14/08/25: 35644147      1685
  2010-02-14/08/30: 35649925      5778
  2010-02-14/08/35: 35651872      1947
  2010-02-14/08/40: 35657488      5616
  2010-02-14/08/45: 35659431      1943
  2010-02-14/08/50: 35665174      5743
  2010-02-14/08/55: 35666864      1690

  sysctl: vm.stats.sys.v_intr:

  2010-02-14/08/00: 497713097     -
  2010-02-14/08/05: 497751068     37971
  2010-02-14/08/10: 497772905     21837
  2010-02-14/08/15: 497784808     11903
  2010-02-14/08/20: 497793871     9063
  2010-02-14/08/25: 497805554     11683
  2010-02-14/08/30: 497815321     9767
  2010-02-14/08/35: 497837284     21963
  2010-02-14/08/40: 497845850     8566
  2010-02-14/08/45: 497981716     135866
  2010-02-14/08/50: 497990448     8732
  2010-02-14/08/55: 498002434     11986

  sysctl: vm.stats.sys.v_soft:

  2010-02-14/08/00: 476175765     -
  2010-02-14/08/05: 476231628     55863
  2010-02-14/08/10: 476287825     56197
  2010-02-14/08/15: 476353282     65457
  2010-02-14/08/20: 476414205     60923
  2010-02-14/08/25: 476474890     60685
  2010-02-14/08/30: 476541538     66648
  2010-02-14/08/35: 476602048     60510
  2010-02-14/08/40: 476664288     62240
  2010-02-14/08/45: 476729602     65314
  2010-02-14/08/50: 476796621     67019
  2010-02-14/08/55: 476859315     62694

  ...

Some explanation: '^(?k:vm\.stats\..*):\s+(?v:\d+)$' is a regular
expression with two nonstandard (gather-specific) extensions:

  (?k:<key_regexp>) -- the regexp matches the key.
  (?v:<val_regexp>) -- the regexp matches the value.

So in a string like this:

  vm.stats.sys.v_soft: 476175765

the regular expression above will match vm.stats.sys.v_soft as the key
and 476175765 as the value. As a result all lines with the
vm.stats.sys.v_soft key will be assembled:

  sysctl: vm.stats.sys.v_soft:

  2010-02-14/08/00: 476175765     -
  2010-02-14/08/05: 476231628     55863
  2010-02-14/08/10: 476287825     56197
  2010-02-14/08/15: 476353282     65457
  2010-02-14/08/20: 476414205     60923
  ...

The first column is the time, the second is the value at this time and
the third is the difference from the previous value, which helps a lot
in finding anomalies. By default the assembled data is written to
stdout, but with the `-d <dir>' option you can specify a directory
where the assembled data will be stored, in a separate file for every
key.
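
The third column is plain arithmetic: each value minus the one before
it, with `-' for the first sample. The sketch below reproduces that
calculation with awk for lines of the form `TIMESTAMP: VALUE'. It only
illustrates the arithmetic and is not how gather itself is
implemented; `add_deltas' is a hypothetical name, and the tab
separator is a simplification of gather's column layout.

```shell
# add_deltas: read "TIMESTAMP: VALUE" lines on stdin and append the
# difference between VALUE and the previous line's VALUE.
add_deltas() {
  awk '{
    # The first sample has no predecessor, so print "-" instead.
    if (NR == 1) delta = "-"; else delta = $2 - prev
    printf "%s %s\t%s\n", $1, $2, delta
    prev = $2
  }'
}
```

Fed with the first two vm.stats.sys.v_soft samples shown above, it
yields 476231628 - 476175765 = 55863 for the second line, the same
figure as in the assembled listing.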

Still, the amount of data you need to review is rather large :-). If
you know the exact time when the "problem" occurs (e.g. at about 08:20,
i.e. the "2010-02-14/08/20:" lines) you can use the
`-t "2010-02-14/08/20:"' option. This will do some primitive analysis,
looking for variables that had anomalies at this time, and will list
them so you can start your analysis by reviewing these variables
first.

3.6.show plot.
--------------

If you have gnuplot installed you can use the `show plot' subcommand to
produce data plots. As its arguments it expects a regexp and a dataset
name. In the regexp you should use grouping to capture the parameter
you want to display (as a function of time).

For example, let's suppose we want to plot laptop battery life using
sysctl output:

  % sysctl hw.acpi.battery.life
  hw.acpi.battery.life: 70

Our gather is configured to collect sysctl and produces this output:

  % gather show -t 1h grep hw.acpi.battery.life sysctl
  2012-04-28 08:43: hw.acpi.battery.life: 15
  2012-04-28 08:44: hw.acpi.battery.life: 16
  2012-04-28 08:45: hw.acpi.battery.life: 17
  2012-04-28 08:46: hw.acpi.battery.life: 18
  ...

To plot this we can use the following command, which captures the
number after "life:" as group \1:

  gather show -t 1h plot 'hw.acpi.battery.life: (\d+)' sysctl

To plot it into a png file:

  gather show -t 1h plot -t png -o '/tmp/battery.life.png' 'hw.acpi.battery.life: (\d+)' sysctl

If you always want to plot to a file you may want to change the default
settings in gather.cfg.

Also note that if you don't have gnuplot installed on the host where
you are running gather, you can set 'cat' as the gnuplot command in the
configuration file and produce a gnuplot script with data, which you
can run on a host with gnuplot installed. Or use a
"ssh host | gnuplot" pipe.

4.Timeperiod shortcuts.
-----------------------

The most general form of a timeperiod is:

  YYYY-MM-DD/HH/MM--yyyy-mm-dd/hh/mm

where YYYY-MM-DD/HH/MM is the start of the timeperiod and
yyyy-mm-dd/hh/mm is its end. You can skip MM and HH in the start or
end part of the range. E.g.:

  2008-11-16/14--2008-11-17

This is interpreted as:

  2008-11-16/14/00--2008-11-17/23/59

It is also possible to specify only the first part of a timeperiod. E.g.:

  2008-11-16/14 (interpreted as 2008-11-16/14/00--2008-11-16/14/59)

or

  2008-11-16 (interpreted as 2008-11-16/00/00--2008-11-16/23/59)
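
The defaults above amount to padding the start part with 00 and the
end part with 23 and 59. The shell sketch below illustrates that
expansion for the absolute forms shown so far. It is an illustration
of the rule, not gather's own parser; `expand_period' and
`fill_period_part' are hypothetical names, and treating a full
start-only period such as 2008-09-14/11/10 as a one-minute range is an
assumption.

```shell
# fill_period_part PART HOUR MINUTE
# Append the default HOUR and/or MINUTE to a DAY or DAY/HOUR part;
# a complete DAY/HOUR/MINUTE part is printed unchanged.
fill_period_part() {
  case $1 in
    */*/*) printf '%s\n' "$1" ;;
    */*)   printf '%s/%s\n' "$1" "$3" ;;
    *)     printf '%s/%s/%s\n' "$1" "$2" "$3" ;;
  esac
}

# expand_period PERIOD
# Print PERIOD as a full START--END range. A period without "--"
# uses the same part for both ends before padding.
expand_period() {
  case $1 in
    *--*) start=${1%%--*}; end=${1#*--} ;;
    *)    start=$1; end=$1 ;;
  esac
  printf '%s--%s\n' "$(fill_period_part "$start" 00 00)" \
                    "$(fill_period_part "$end" 23 59)"
}
```

For instance, `expand_period 2008-11-16/14--2008-11-17' prints
2008-11-16/14/00--2008-11-17/23/59, matching the interpretation given
above.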

If the day, hour or minute in the end part of a timeperiod is the same
as in the start part, you can skip it:

  YYYY-MM-DD/HH/MM--/hh/mm (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/hh/mm)

  YYYY-MM-DD/HH/MM--//mm (interpreted as YYYY-MM-DD/HH/MM--YYYY-MM-DD/HH/mm)

  YYYY-MM-DD/HH/MM--yyyy-mm-dd// (interpreted as YYYY-MM-DD/HH/MM--yyyy-mm-dd/HH/MM)

  and so on.

Here are some other shortcuts you can use to reduce typing:

  .         current day

  ./.       current day/current hour

  ././.     current day/current hour/current minute

  $         now (the same as ././.)

  Nd        N days ago

  Nh        N hours ago

  Nm        N minutes ago

If N{d,h,m} is used alone (there is only a start part), it is replaced
by the timeperiod "from that time until now". I.e. the timeperiod "Nd"
is the same as "Nd--$".

--
Mikolaj Golub <to.my.trociny@gmail.com>