How to configure
================

To use liblognorm, you need three things.

1. An installed and working copy of liblognorm. The installation process
   has been discussed in the chapter :doc:`installation`.
2. Log files.
3. A rulebase, which is the heart of liblognorm configuration.

Log files
---------

A log file is a text file, which typically holds many lines. Each line is
a log message. These are often hard to read, and thus to analyze,
especially if you have many different devices that all create log
messages in different formats.

Rulebase
--------

The rulebase holds all the schemes for your logs. It basically consists of
many lines that reflect the structure of your log messages. When the
normalization process is started, a parse tree is generated from
the rulebase and held in memory. This tree is then used to parse the
log messages.

Each line in the rulebase file is evaluated separately.

Rulebase Versions
-----------------
This documentation is for liblognorm version 2 and above. Version 2 is a
complete rewrite of liblognorm which offers many enhanced features but
is incompatible with some pre-v2 rulebase commands. For details, see
the compatibility document.

Note that liblognorm v2 contains a full copy of the v1 engine. As such
it is fully compatible with old rulebases. In order to use the new v2
engine, you need to explicitly opt in. To do so, you need to add
the line::

    version=2

to the top of your rulebase file. Currently, it is very important that

 * the line is given exactly as above
 * no whitespace within the sequence is permitted (e.g. "version = 2"
   is invalid)
 * no whitespace or comment after the "2" is permitted
   (e.g. "version=2 # comment" is invalid)
 * this line **must** be the **very** first line of the file; this
   also means there **must** not be any comment or empty lines in
   front of it

The v2 engine is used only if the version indicator is properly
detected; otherwise, the v1 engine is used. So if you use v2 features but
got the version line wrong, you'll end up with error messages from the
v1 engine.
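
For reference, a minimal valid v2 rulebase might look like this (the
rule itself is just an illustrative placeholder)::

    version=2
    # comments are fine after the version line
    rule=:%msg:rest%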

The v2 engine understands almost all v1 parsers, and most importantly all
that are typically used. It does not understand these parsers:

 * tokenized
 * recursive
 * descent
 * regex
 * interpret
 * suffixed
 * named_suffixed

The recursive and descent parsers should be replaced by user-defined types.
The tokenized parsers should be replaced by repeat. The interpret functionality
is provided via the parsers' "format" parameters. For the others,
there currently exists no replacement but, with the exception of regex,
replacements will be added based on demand. If you think regex support is
urgently needed, please read our
`related issue on github <https://github.com/rsyslog/liblognorm/issues/143>`_,
where you can also cast
your ballot in favor of it. If you need any of these parsers, you need
to use the v1 engine. That of course means you cannot use the v2 enhancements,
so converting as much as possible makes sense.

Commentaries
------------

To keep your rulebase tidy, you can use commentaries. Start a commentary
with "#" like in many other configurations. It should look like this::

    # The following prefix and rules are for firewall logs

Note that the comment character MUST be in the first column of the line.

Empty lines are just skipped; they can be inserted for readability.

User-Defined Types
------------------

If the line starts with ``type=``, then it contains a user-defined type.
You can use a user-defined type wherever you use a built-in type; they
are equivalent. That also means you can use user-defined types in the
definition of other user-defined types (they can be used recursively).
The only restriction is that you must define a type **before** you can
use it.

This line has the following format::

    type=<typename>:<match description>

Everything before the colon is treated as the type name. User-defined types
must always start with "@". So "@mytype" is a valid name, whereas "mytype"
is invalid and will lead to an error.

After the colon, a match description should be
given. It is exactly the same as the one given in rule lines (see below).

A generic IP address type could look as follows::

    type=@IPaddr:%ip:ipv4%
    type=@IPaddr:%ip:ipv6%

This creates a type "@IPaddr", which consists of either an IPv4 or IPv6
address. Note how we use two different lines to create an alternative
representation. This is how things generally work with types: you can use
as many "type" lines for a single type as you need to define your object.
Note that pure alternatives could also be defined via the "alternative"
parser - which option to choose is left to the user. They are equivalent.
The ability to use multiple type lines for definition, however, brings
more power than just to define alternatives.
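
Once defined, such a type is used just like a built-in one. A hypothetical
rule using the "@IPaddr" type from above could look like this::

    rule=:connection from %src:@IPaddr% accepted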

Includes
--------
Especially with user-defined types, includes come in handy. With an include,
you can include definitions already made elsewhere into the current
rule set (just like the "include" directive works in many programming
languages). An include is done by a line starting with ``include=``
where the rest of the line is the actual file name, just like in this
example::

   include=/var/lib/liblognorm/stdtypes.rb

The definition is included right at the position where it occurs.
Processing of the original file is continued when the included file
has been fully processed. Includes can be nested.
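
For instance, a main rulebase might pull in a shared type library, which
may itself contain further ``include=`` lines (the file contents shown
here are purely illustrative)::

   # main rulebase
   version=2
   include=/var/lib/liblognorm/stdtypes.rb
   rule=:login from %ip:@IPaddr%

   # /var/lib/liblognorm/stdtypes.rb
   type=@IPaddr:%ip:ipv4%
   type=@IPaddr:%ip:ipv6%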

To facilitate repositories of common rules, liblognorm honors the

::

   LIBLOGNORM_RULEBASES

environment variable. If it is set, liblognorm tries to locate the file
inside the path pointed to by ``LIBLOGNORM_RULEBASES`` in the following
cases:

* the provided file cannot be found
* the provided file name is not an absolute path (does not start with "/")

So assuming we have::

   export LIBLOGNORM_RULEBASES=/var/lib/liblognorm

The above example can be re-written as follows::

   include=stdtypes.rb

Note, however, that if ``stdtypes.rb`` exists in the current working
directory, that file will be loaded instead of the one from
``/var/lib/liblognorm``.

This facilitates building a library of standard type definitions. Note
that the liblognorm project also ships type definitions for common
scenarios.

Rules
-----

If the line starts with ``rule=``, then it contains a rule. This line has
the following format::

    rule=[<tag1>[,<tag2>...]]:<match description>

Everything before the colon is treated as a comma-separated list of tags,
which will be attached to a match. After the colon, a match description
should be given. It consists of string literals and field selectors.
String literals must match exactly, whereas field selectors may match
variable parts of a message.

A rule could look like this (in legacy format)::

    rule=:%date:date-rfc3164% %host:word% %tag:char-to:\x3a%: no longer listening on %ip:ipv4%#%port:number%

This excerpt is a common rule. A rule always contains several different
"parts"/properties and reflects the structure of the message you want to
normalize (e.g. Host, IP, Source, Syslogtag...).


Literals
--------

A literal is just a sequence of characters, which must match exactly.
Percent sign characters must be escaped to prevent them from starting a
field accidentally. Replace each "%" with "\\x25" or "%%" when it occurs
in a string literal.
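
For example, either escape form can be used to match a trailing percent
sign in a message like "load at 95%" (these rule sketches are untested
illustrations)::

    rule=:load at %level:number%\x25
    rule=:load at %level:number%%%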

Fields
------

There are different formats for field specification:

 * legacy format
 * condensed format
 * full JSON format

Legacy Format
#############
Legacy format is exactly identical to the v1 engine. This permits you to use
existing v1 rulebases without any modification with the v2 engine, except for
adding the ``version=2`` header line to the top of the file. Remember: some
v1 types are not supported - if you are among the few who use them, you need
to do some manual conversion. For almost all users, manual conversion should
not be necessary.

Legacy format is not documented here. If you want to use it, see the v1
documentation.

Condensed Format
################
The goal of this format is to be as brief as possible, permitting you an
as-clear-as-possible view of your rule. It is very similar to legacy format
and recommended to be used for simple types which do not need any parser
parameters.

Its structure is as follows::

    %<field name>:<field type>{<parameters>}%

**field name** -> that name can be selected freely. It should be a description
of what kind of information the field is holding, e.g. SRC if the field
contains the source IP address of the message. These names should also be
chosen carefully, since the field name can be used in every rule and
therefore should fit for the same kind of information in different rules.

Some special field names exist:

* **dash** ("-"): this field is matched but not saved
* **dot** ("."): this is useful if a parser returns a set of fields. Usually,
  it does so by creating a json subtree. If the field is named ".", then
  no subtree is created but instead the subfields are moved into the main
  hierarchy.
* **two dots** (".."): similar to ".", but can be used at the lower level to denote
  that a field is to be included with the name given by the upper-level
  object. Note that ".." is only acted on if a subelement contains a single
  field. The reason is that if there were more, we could not assign all of
  them to the *single* name given by the upper-level-object. The prime
  use case for this special name is in user-defined types that parse only
  a single value. Without "..", they would always become a JSON subtree, which
  seems unnatural and is different from built-in types. So it is suggested to
  name such fields as "..", which means that the user can assign a name of
  their liking, just like in the case of built-in parsers (see the sketch
  after this list).
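
As a sketch, a user-defined type that parses a single value might use ".."
so that the caller's field name is applied directly (the type and rule
shown are hypothetical)::

    type=@port:%..:number{"maxval":65535}%
    rule=:connection on port %dstport:@port%

With "..", the port value is expected to appear directly under "dstport"
instead of inside a one-element subtree.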

**field type** -> selects the corresponding parser; the available parsers
are described below.

Special characters that need to be escaped when used inside a field
description are "%" and ":". It is strongly recommended **not** to use them.

**parameters** -> This is an optional set of parameters, given in pure JSON
format. Parameters can be generic (e.g. "priority") or specific to a
parser (e.g. "extradata"). Generic parameters are described below in their
own section, parser-specific ones in the relevant type documentation.

As an example, the "char-to" parser accepts a parameter named "extradata"
which describes up to which character it shall match (the name "extradata"
stems back to the legacy v1 system)::

    %tag:char-to{"extradata":":"}%

Whitespace, including LF, is permitted inside a field definition after
the opening percent sign and before the closing one. This can be used to
make complex rules more readable. So the example rule from the overview
section above could be rewritten as::

    rule=:%
          date:date-rfc3164
          % %
          host:word
          % %
          tag:char-to{"extradata":":"}
          %: no longer listening on %
          ip:ipv4
          %#%
          port:number
          %

When doing this, note well that whitespace IS important inside the
literal text. So e.g. in the second example line above "% %" we require
a single SP as literal text. Note that any combination of your liking is
valid, so it could also be written as::

    rule=:%date:date-rfc3164% %host:word% % tag:char-to{"extradata":":"}
          %: no longer listening on %  ip:ipv4  %#%  port:number  %

To prevent a typical user error, continuation lines are **not** permitted
to start with ``rule=``. There are some obscure cases where this could
be a valid rule, and it can be re-formatted in that case. More often, this
is the result of a missing percent sign, as in this sample::

     rule=:test%field:word ... missing percent sign ...
     rule=:%f:word%

If we permitted ``rule=`` at the start of a continuation line, these kinds
of problems would be very hard to detect.

Full JSON Format
################
This format is best for complex definitions or if there are many parser
parameters.

Its structure is as follows::

    %JSON%

Where JSON is the configuration expressed in JSON. To get you started, let's
rewrite the above sample in pure JSON form::

    rule=:%[ {"type":"date-rfc3164", "name":"date"},
             {"type":"literal", "text":" "},
             {"type":"char-to", "name":"host", "extradata":":"},
             {"type":"literal", "text":": no longer listening on "},
             {"type":"ipv4", "name":"ip"},
             {"type":"literal", "text":"#"},
             {"type":"number", "name":"port"}
           ]%

A couple of things to note:

 * we express everything in this example in a *single* parser definition
 * this is done by using a **JSON array**; whenever an array is used,
   multiple parsers can be specified. They are executed one after the
   other in the given order.
 * literal text is matched here via explicit parser calls; as specified
   below, this is recommended only for specific use cases with the
   current version of liblognorm
 * parser parameters (both generic and parser-specific ones) are given
   on the main JSON level
 * the literal text shall not be stored inside an output variable; for
   this reason no name attribute is given (we could also have used
   ``"name":"-"``, which achieves the same effect but is more verbose).

With the literal parser calls replaced by actual literals, the sample
looks like this::

    rule=:%{"type":"date-rfc3164", "name":"date"}
          % %
          {"type":"char-to", "name":"host", "extradata":":"}
          %: no longer listening on %
          {"type":"ipv4", "name":"ip"}
          %#%
          {"type":"number", "name":"port"}
          %

Which format you use and how you exactly use it is up to you.

Some guidelines:

 * using the "literal" parser in JSON should be avoided currently; the
   experimental version does have some rough edges where conflicts
   in literal processing will not be properly handled. This should not
   be an issue in "closed environments", like "repeat", where no such
   conflict can occur.
 * otherwise, JSON is perfect for very complex things (like nesting of
   parsers); it is **not** suggested to use any other format for these
   kinds of things.
 * if a field needs to be matched but the result of that match is not
   needed, omit the "name" attribute; specifically avoid using
   the more verbose ``"name":"-"``.
 * it is a good idea to start each definition with ``"type":"..."``
   as this provides a good quick overview over what is being defined.

Mandatory Parameters
....................

type
~~~~
The field type, selects the parser to use. See "Field types" below for a
description.

Optional Generic Parameters
...........................

name
~~~~
The field name to use. If "-" is used, the field is matched, but not stored.
In this case, you can simply **not** specify a field name, which is the
preferred way of doing this.

priority
~~~~~~~~
The priority to assign to this parser. Priorities are numerical values in the
range from 0 (highest) to 65535 (lowest). If multiple parsers could match at
a given character position of a log line, parsers are tried in priority order.
Different priorities can lead to different parsing. For example, if the
greedy "rest" type is assigned priority 0, and no other parser is assigned the
same priority, no other parser will ever match (because "rest" is very greedy
and always matches the rest of the message).

Note that liblognorm internally
has a parser-specific priority, which is selected by the program developer based
on the specificity of a type. If the user assigns equal priorities, parsers are
executed based on the parser-specific priority.

The default priority value is 30,000.
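
As an illustration, the following sketch gives a catch-all rule based on
"rest" the lowest possible priority, so that the more specific rule is
always tried first (rule contents are hypothetical)::

    rule=:%count:number% items processed
    rule=:%{"type":"rest", "name":"raw", "priority":65535}%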

Field types
-----------
We have legacy and regular field types. Pre-v2, we did not have user-defined
types. As such, there was a relatively large number of parsers that handled
very similar cases, for example for strings. These parsers still work and may
even provide the best performance in extreme cases. In v2, we focus on fewer,
but more generic parsers, which are then tailored via parameters.

There is nothing bad about using legacy parsers and there is no
plan to phase them out at any time in the future. We just wanted to
let you know, especially if you wonder about some "weird" parsers.
In v1, parsers could have only a single parameter, which was called
"extradata" at that time. This is why some of the legacy parsers
require or support a parameter named "extradata" and do not use a
better name for it (internally, the legacy format creates a
v2 parser definition with "extradata" being populated from the
legacy "extradata" part of the configuration).

number
######

One or more decimal digits.

Parameters
..........

format
~~~~~~

Specifies the format of the json object. Possible values are "string" and
"number", with string being the default. If "number" is used, the json
object will be a native json integer.

maxval
~~~~~~

Maximum value permitted for this number. If the value is higher than this,
it will not be detected by this parser definition and an alternate detection
path will be pursued.
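
A brief sketch combining both parameters (the rule text and field name are
hypothetical)::

    rule=:found %cnt:number{"format":"number", "maxval":255}% errors

This matches e.g. "found 17 errors" and stores "cnt" as a native JSON
number; values above 255 are not matched by this rule.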

float
#####

A floating-point number represented in non-scientific form.

Parameters
..........

format
~~~~~~

Specifies the format of the json object. Possible values are "string" and
"number", with string being the default. If "number" is used, the json
object will be a native json floating point number. Note that we try to
preserve the original string serialization format, but keep in mind
that floating point numbers are inherently imprecise, so slight variance
may occur when processing them.
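
A minimal sketch (the rule text is hypothetical)::

    rule=:temperature %temp:float{"format":"number"}% degrees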


hexnumber
#########

A hexadecimal number as seen by this parser begins with the string
"0x", is followed by 1 or more hex digits and is terminated by white
space. Any interleaving non-hex digits will cause non-detection. The
rules are strict to avoid false positives.

Parameters
..........

format
~~~~~~

Specifies the format of the json object. Possible values are "string" and
"number", with string being the default. If "number" is used, the json
object will be a native json integer. Note that json numbers are always
decimal, so if "number" is selected, the hex number will be converted
to decimal. The original hex string is no longer available in this case.

maxval
~~~~~~

Maximum value permitted for this number. If the value is higher than this,
it will not be detected by this parser definition and an alternate detection
path will be pursued. This is most useful if fixed-size hex numbers need to
be processed. For example, for byte values the "maxval" could be set to 255,
which ensures that invalid values are not misdetected.
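
A sketch for parsing byte-sized hex values (the rule text is hypothetical)::

    rule=:mask %mask:hexnumber{"format":"number", "maxval":255}% set

This matches e.g. "mask 0xff set" and stores the decimal value 255.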

kernel-timestamp
################

Parses a Linux kernel timestamp, which has the format::

    [ddddd.dddddd]

where "d" is a decimal digit. The part before the period has to
have at least 5 digits as per kernel code. There is no upper
limit per se inside the kernel, but liblognorm does not accept
more than 12 digits, which seems more than sufficient (we may reduce
the max count if misdetections occur). The part after the period
has to have exactly 6 digits.
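
A minimal sketch (the rule text is hypothetical)::

    rule=:%ts:kernel-timestamp% %msg:rest%

This matches lines like "[12345.678901] usb 1-1: new device".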

whitespace
##########

This parses all whitespace until the first non-whitespace character
is found. This check is performed using the ``isspace()`` C library
function to check for space, horizontal tab, newline, vertical tab,
form feed and carriage return characters.

This parser is primarily a tool to skip to the next "word" if
the exact number of whitespace characters (and type of whitespace)
is not known. The current parsing position MUST be on a whitespace,
else the parser does not match.

Remember that to just parse but not preserve the field contents, the
dash ("-") is used as field name in compact format or the "name"
parameter is simply omitted in JSON format. This is almost always
expected with the *whitespace* type.
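
A sketch that tolerates a variable amount of whitespace between two words
(the rule text is hypothetical)::

    rule=:%first:word%%-:whitespace%%second:word%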

string
######

This is a highly customizable parser that can be used to extract
many types of strings. It is meant to be used for most cases. It
is suggested that specific string types are created as user-defined
types using this parser.

This parser supports:

* various quoting modes for strings
* escape character processing

Parameters
..........

quoting.mode
~~~~~~~~~~~~
Specifies how the string is quoted. Possible modes:

* **none** - no quoting is permitted
* **required** - quotes must be present
* **auto** - quotes are permitted, but not required

Default is ``auto``.
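
For example, a sketch that insists on a quoted value (the rule text is
hypothetical)::

    rule=:user %user:string{"quoting.mode":"required"}% logged in

This matches input like ``user "john doe" logged in`` but not
``user john logged in``.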

quoting.escape.mode
~~~~~~~~~~~~~~~~~~~

Specifies how quote character escaping is handled. Possible modes:

* **none** - there are no escapes, quote characters are *not* permitted in value
* **double** - the ending quote character is duplicated to indicate
  a single quote without termination of the value (e.g. ``""``)
* **backslash** - a backslash is prepended to the quote character (e.g. ``\"``)
* **both** - both double and backslash escaping can happen and are supported

Default is ``both``.

Note that turning on ``backslash`` mode (or ``both``) has the side-effect that
backslash escaping is enabled in general. This usually is what you want
if this option is selected (e.g. otherwise you could no longer represent
a backslash).

**NOTE**: this parameter also affects operation if quoting is **turned off**. That
is somewhat counter-intuitive, but has traditionally been the case - which means
we cannot change it.

quoting.char.begin
~~~~~~~~~~~~~~~~~~

Sets the begin quote character.

Default is ".

quoting.char.end
~~~~~~~~~~~~~~~~

Sets the end quote character.

Default is ".

Note that setting the begin and end quote character permits you to
support more quoting modes. For example, brackets and braces are
used by some software for quoting. To handle such strings, you can for
example use a configuration like this::

   rule=:a %f:string{"quoting.char.begin":"[", "quoting.char.end":"]"}% b

which matches strings like this::

   a [test test2] b

matching.permitted
~~~~~~~~~~~~~~~~~~

This permits specifying a set of characters permitted in the to-be-parsed
field. It is primarily a utility to extract things like programming-language
like names (e.g. consisting of letters, digits and a set of special characters
only), alphanumeric or alphabetic strings.

If this parameter is not specified, all characters are permitted. If it
is specified, only the configured characters are permitted.

Note that this option reliably works only on US-ASCII data. Multi-byte
character encodings may lead to strange results.

There are two ways to specify permitted characters. The simple one is to
specify them directly for the parameter::

  rule=:%f:string{"matching.permitted":"abc"}%

This only supports literal characters and all must be given as a single
parameter. For more advanced use cases, an array of permitted characters
can be provided::

  rule=:%f:string{"matching.permitted":[
                       {"class":"digit"},
                       {"chars":"xX"}
                     ]}%

Here, ``class`` is a specifier for the usual character classes, with
support for:

* digit
* hexdigit
* alpha
* alnum

In contrast, ``chars`` permits to specify literal characters. Both
``class`` as well as ``chars`` may be specified multiple times inside
the array. For example, the ``alnum`` class could also be permitted as
follows::

  rule=:%f:string{"matching.permitted":[
                       {"class":"digit"},
                       {"class":"alpha"}
                     ]}%

matching.mode
~~~~~~~~~~~~~

This parameter controls liblognorm's strict matching requirement, where each
parser must be terminated by a space character. Possible values are:

* **strict** - which requires that space
* **lazy** - which does not

Default is ``strict``. This parameter is available starting with version 2.0.6.

In ``lazy`` mode, the parser always matches if at least one character can be
matched. This can lead to unexpected results, so use it with care.

Example: assume the following message (without quotes)::

    "12:34 56"

And the following parser definition::

  rule=:%f:string{"matching.permitted":[ {"class":"digit"} ]}
                   %%r:rest%

This will be unresolvable, as ":" is not a digit. With this definition::

  rule=:%f:string{"matching.permitted":[ {"class":"digit"} ], "matching.mode":"lazy"}
                   %%r:rest%

it becomes resolvable, and ``f`` will contain "12" and ``r`` will contain
":34 56". This also shows the associated risk: the result obtained may not
necessarily be what was intended.


word
####

One or more characters, up to the next space (\\x20), or
up to the end of line.

string-to
#########

One or more characters, up to the next string given in
"extradata".
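
A sketch using a multi-character terminator (the rule text is
hypothetical)::

    rule=:%user:string-to{"extradata":" logged"}% logged in

Here "user" receives everything up to, but not including, the string
" logged".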

alpha
#####

One or more alphabetic characters, up to the next whitespace, punctuation,
decimal digit or control character.

char-to
#######

One or more characters, up to the next character(s) given in
extradata.

Parameters
..........

extradata
~~~~~~~~~

This is a mandatory parameter. It contains one or more characters, each of
which terminates the match.


char-sep
########

Zero or more characters, up to the next character(s) given in extradata.

Parameters
..........

extradata
~~~~~~~~~

This is a mandatory parameter. It contains one or more characters, each of
which terminates the match.
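
Unlike char-to, char-sep also matches empty fields. A sketch for
comma-separated values where a field may be empty (the rule text is
hypothetical)::

    rule=:%f1:char-sep{"extradata":","}%,%f2:rest%

For input ",b c" this yields an empty "f1" and "f2" set to "b c".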

rest
####

Zero or more characters until the end of line. Must always be at the end of
the rule, even though this condition is currently **not** checked. In any
case, any definitions after *rest* are ignored.

Note that the *rest* syntax should be avoided because it generates
a very broad match. If it needs to be used, the user should assign it
the lowest priority among their parser definitions. Note that the
parser-specific priority is also lowest, so by default it will only
match if nothing else matches.

quoted-string
#############

Zero or more characters, surrounded by double quote marks.
Quote marks are stripped from the match.
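
A minimal sketch (the rule text is hypothetical)::

    rule=:user %name:quoted-string% logged in

For input ``user "alice" logged in``, "name" is set to "alice" (without
the quote marks).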

op-quoted-string
################

Zero or more characters, possibly surrounded by double quote marks.
If the first character is a quote mark, operates like quoted-string.
Otherwise, operates like "word".
Quote marks are stripped from the match.

date-iso
########
Date in ISO format ('YYYY-MM-DD').

time-24hr
#########

Time of format 'HH:MM:SS', where HH is 00..23.

time-12hr
#########

Time of format 'HH:MM:SS', where HH is 00..12.

duration
########

A duration is similar to a timestamp, except that
it tells about time elapsed. As such, hours can be larger than 23
and hours may also be specified by a single digit (this, for example,
is commonly done in Cisco software).

Examples for durations are "12:05:01", "0:00:01" and "37:59:59" but not
"00:60:00" (MM and SS must still be within the usual range for
minutes and seconds).


date-rfc3164
############

Valid date/time in RFC3164 format, i.e.: 'Oct 29 09:47:08'.
This parser implements several quirks to match malformed
timestamps from some devices.

Parameters
..........

format
~~~~~~

Specifies the format of the json object. Possible values are

- **string** - string representation as given in input data
- **timestamp-unix** - string converted to a Unix timestamp (seconds since
  the epoch)
- **timestamp-unix-ms** - a kind of Unix timestamp, but with millisecond
  resolution. This format is understood for example by Elasticsearch. Note
  that RFC3164 does **not** contain subsecond resolution, so this option
  makes no sense for RFC3164 data only. It is useful, however, if processing
  mixed sources, some of which contain higher precision.
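
A sketch that stores the timestamp as a Unix timestamp (the rule text is
hypothetical)::

    rule=:%ts:date-rfc3164{"format":"timestamp-unix"}% %host:word% %msg:rest%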


date-rfc5424
############

Valid date/time in RFC5424 format, i.e.:
'1985-04-12T19:20:50.52-04:00'.
Slightly different formats are allowed.

Parameters
..........

format
~~~~~~

Specifies the format of the json object. Possible values are

- **string** - string representation as given in input data
- **timestamp-unix** - string converted to a Unix timestamp (seconds since
  the epoch). If subsecond resolution is given in the original timestamp,
  it is lost.
- **timestamp-unix-ms** - a kind of Unix timestamp, but with millisecond
  resolution. This format is understood for example by Elasticsearch. Note
  that an RFC5424 timestamp can contain higher than ms resolution. If so,
  the timestamp is truncated to millisecond resolution.


ipv4
####

IPv4 address, in dot-decimal notation (AAA.BBB.CCC.DDD).

ipv6
####

IPv6 address, in textual notation as specified in RFC4291.
All formats specified in section 2.2 are supported, including
embedded IPv4 addresses (e.g. "::13.1.68.3"). Note that a
**pure** IPv4 address ("13.1.68.3") is **not** valid and as
such not recognized.

To avoid false positives, there must be either a whitespace
character after the IPv6 address or the end of string must be
reached.

mac48
#####

The standard (IEEE 802) format for printing MAC-48 addresses in
human-friendly form is six groups of two hexadecimal digits,
separated by hyphens (-) or colons (:), in transmission order
(e.g. 01-23-45-67-89-ab or 01:23:45:67:89:ab).
This form is also commonly used for EUI-64.
From: http://en.wikipedia.org/wiki/MAC_address

cef
###

This parses the ArcSight Common Event Format (CEF) as described in
the "Implementing ArcSight CEF" manual revision 20 (2013-06-15).

It matches a format that closely follows the spec. The header fields
are extracted into the container named by the field; all extensions are
extracted into a container called "Extensions" beneath it.

Example
.......

Rule (compact format)::

    rule=:%f:cef%

Data::

    CEF:0|Vendor|Product|Version|Signature ID|some name|Severity| aa=field1 bb=this is a value cc=field 3

Result::

    {
      "f": {
        "DeviceVendor": "Vendor",
        "DeviceProduct": "Product",
        "DeviceVersion": "Version",
        "SignatureID": "Signature ID",
        "Name": "some name",
        "Severity": "Severity",
        "Extensions": {
          "aa": "field1",
          "bb": "this is a value",
          "cc": "field 3"
        }
      }
    }

checkpoint-lea
##############

This supports the LEA on-disk format. Unfortunately, the format
is underdocumented; the Check Point docs we could get hold of just
describe the API and provide a field dictionary. In a nutshell, what
we do is extract field names up to the colon and values up to the
semicolon. No escaping rules are known to us, so we assume none
exist (and as such no semicolon can be part of a value). This
format needs to continue until the end of the log message.

We have also seen some samples of a LEA format that has data **after**
the format described above. So it does not end at the end of the log line.
We guess that this is LEA when used inside (syslog) messages. We have
one sample where the format ends on a bracket (`; ]`). To support this,
the `terminator` parameter exists (see below).

If someone has a definitive reference or a sample set to contribute
to the project, please let us know and we will check if we need to
add additional transformations.

Parameters
..........

terminator
~~~~~~~~~~
Must be a single character. If used, LEA format is terminated when the
character is hit instead of a field name. Note that the terminator character
is **not** part of LEA. If it should be skipped, it must be specified as
a literal after the parser. We have implemented it in this way as this
provides the most options for this format - about which we do not know any
details.

Example
.......

This configures a LEA parser for use with the syslog transfer format
(if we guess right). It terminates when a bracket is detected.

Rule (condensed format)::

    rule=:%field:checkpoint-lea{"terminator": "]"}%]

Data::

    tcp_flags: RST-ACK; src: 192.168.0.1; ]

Result::

   { "field": { "tcp_flags": "RST-ACK", "src": "192.168.0.1" } }


cisco-interface-spec
####################

A Cisco interface specifier, as for example seen in PIX or ASA.
The format contains a number of optional parts and is described
as follows (in ABNF-like manner where square brackets indicate
optional parts):

::

  [interface:]ip/port [SP (ip2/port2)] [[SP](username)]

Samples for such a spec are:

 * outside:192.168.52.102/50349
 * inside:192.168.1.15/56543 (192.168.1.112/54543)
 * outside:192.168.1.13/50179 (192.168.1.13/50179)(LOCAL\some.user)
 * outside:192.168.1.25/41850(LOCAL\RG-867G8-DEL88D879BBFFC8)
 * inside:192.168.1.25/53 (192.168.1.25/53) (some.user)
 * 192.168.1.15/0(LOCAL\RG-867G8-DEL88D879BBFFC8)

Note that the current version of liblognorm does not permit sole
IP addresses to be detected as a Cisco interface spec. However, we
are reviewing more Cisco messages and need to decide if this is
to be supported. The problem here is that this would create a much
broader parser which would potentially match many things that are
**not** Cisco interface specs.

As this object extracts multiple subelements, it creates a JSON
structure.

Let's for example look at this definition (compact format)::

    %ifaddr:cisco-interface-spec%

and assume the following message is to be parsed::

    outside:192.168.1.13/50179 (192.168.1.13/50179) (LOCAL\some.user)

Then the resulting JSON will be as follows::

    { "ifaddr": { "interface": "outside", "ip": "192.168.1.13", "port": "50179", "ip2": "192.168.1.13", "port2": "50179", "user": "LOCAL\\some.user" } }

Subcomponents that are not given in the to-be-normalized string are
also not present in the resulting JSON.

iptables
########

Name=value pairs, separated by spaces, as in Netfilter log messages.
The name of the selector is not used; names from the line are
used instead. This selector always matches everything up to the
end of the line. It cannot match zero characters.
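
As a hedged sketch (the exact output shape is our assumption based on the
description above), such a rule might look like::

    rule=:%-:iptables%

For input like "IN=eth0 OUT= SRC=10.0.0.1 DST=10.0.0.2", the name=value
pairs would be extracted under the names taken from the line itself.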
json
####
This parses native JSON from the message. All data up to the first non-JSON
character is parsed into the field. There may be any other field after the
JSON, including another JSON section.

Note that any white space after the actual JSON
is considered **to be part of the JSON**. So you cannot filter on whitespace
after the JSON.

Example
.......

Rule (compact format)::

    rule=:%field1:json%interim text %field2:json%

Data::

   {"f1": "1"} interim text {"f2": 2}

Result::

   { "field2": { "f2": 2 }, "field1": { "f1": "1" } }

Note also that the space before "interim" must **not** be given in the
rule, as it is consumed by the JSON parser. However, the space after
"text" is required.

alternative
###########

This type permits specifying alternative ways of parsing within a single
definition. This can make writing rule bases easier. It also permits the
v2 engine to create a more efficient parsing data structure, resulting in
better performance (to be noticed only in extreme cases, though).

An example explains this parser best::

    rule=:a %
            {"type":"alternative",
             "parser": [
                        {"name":"num", "type":"number"},
                        {"name":"hex", "type":"hexnumber"}
                       ]
            }% b

This rule matches messages like these::

   a 1234 b
   a 0xff b

Note that the "parser" parameter here needs to be provided with an array
of *alternatives*. In this case, the JSON array is **not** interpreted as
a sequence. Note, though, that you can nest definitions by using custom
types.

repeat
######
This parser is used to extract a repeated sequence with the same pattern.

An example explains this parser best::

    rule=:a %
            {"name":"numbers", "type":"repeat",
                "parser":[
                           {"type":"number", "name":"n1"},
                           {"type":"literal", "text":":"},
                           {"type":"number", "name":"n2"}
                         ],
                "while":[
                           {"type":"literal", "text":", "}
                        ]
             }% b

This matches lines like this::

    a 1:2, 3:4, 5:6, 7:8 b

and will generate this JSON::

    { "numbers": [
                   { "n2": "2", "n1": "1" },
                   { "n2": "4", "n1": "3" },
                   { "n2": "6", "n1": "5" },
                   { "n2": "8", "n1": "7" }
                 ]
    }

As can be seen, there are two parameters to "repeat". The "parser"
parameter specifies which type should be repeatedly parsed out of
the input data. We could use a single parser for that, but in the example
above we parse a sequence. Note the nested array in the "parser" parameter.

If we just wanted to match a single list of numbers like::

    a 1, 2, 3, 4 b

we could use this definition::

    rule=:a %
            {"name":"numbers", "type":"repeat",
                "parser":
                         {"type":"number", "name":"n"},
                "while":
                         {"type":"literal", "text":", "}
             }% b

Note that in this example we also removed the redundant single-element
array in "while".

The "while" parameter tells "repeat" how long to do repeat processing. It
is specified by any parser, including a nested sequence of parsers (array).
As long as the "while" part matches, the repetition is continued. If it no
longer matches, "repeat" processing is successfully completed. Note that
the "parser" parameter **must** match at least once, otherwise "repeat"
fails.

In the above sample, "while" mismatches after "4", because no ", " follows.
Then, the parser terminates, and according to the definition the literal
" b" is matched, which will result in a successful rule match (note: the
"a ", " b" literals are just here for explanatory purposes and could be any
other rule element).

Sometimes we need to deal with malformed messages. For example, we
could have a sequence like this::

    a 1:2, 3:4,5:6, 7:8 b

Note the missing space after "4,". To handle such cases, we can nest the
"alternative" parser inside "while"::

    rule=:a %
            {"name":"numbers", "type":"repeat",
                "parser":[
                           {"type":"number", "name":"n1"},
                           {"type":"literal", "text":":"},
                           {"type":"number", "name":"n2"}
                         ],
                "while": {
                            "type":"alternative", "parser": [
                                    {"type":"literal", "text":", "},
                                    {"type":"literal", "text":","}
                             ]
                         }
             }% b

This definition handles numbers being delimited by either ", " or ",".

For people with programming skills, the "repeat" parser is described
by this pseudocode::

    do
        parse via parsers given in "parser"
        if parsing fails
            abort "repeat" unsuccessful
        parse via parsers given in "while"
    while the "while" parsers parsed successfully
    if not aborted, flag "repeat" as successful

Parameters
..........

option.permitMismatchInParser
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If set to "True", permits "repeat" to be flagged as successful even when
the parser processing failed. This is false by default, and can be
set to true to cover some border cases where the "while" part cannot
definitely detect the end of processing. An example of such a border
case is a listing of flags, being terminated by a double space where
each flag is delimited by single spaces. For example, Cisco products
generate such messages (note the flags part)::

    Aug 18 13:18:45 192.168.0.1 %ASA-6-106015: Deny TCP (no connection) from 10.252.88.66/443 to 10.79.249.222/52746 flags RST  on interface outside

cee-syslog
##########
This parses CEE syslog from the message. This format has been defined
by MITRE CEE as well as Project Lumberjack.

This format essentially is JSON with additional restrictions:

 * The message must start with "@cee:"
 * a JSON **object** must immediately follow (whitespace before it is
   permitted, but a JSON array is **not** permitted)
 * after the JSON, there must be no other non-whitespace characters.

In other words: the message must consist of a single JSON object only,
prefixed by the "@cee:" cookie.

Note that the cee cookie is case sensitive, so "@CEE:" is **NOT** valid.
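
A minimal sketch (the rule and field name are hypothetical)::

    rule=:%event:cee-syslog%

For input ``@cee: {"user":"alice","action":"login"}``, the parsed object
would be stored under "event".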

Prefixes
--------

Several rules can have a common prefix. You can set it once with this
syntax::

    prefix=<prefix match description>

Prefix match description syntax is the same as rule match description.
Every following rule will be treated as an addition to this prefix.

The prefix can be reset to the default (empty value) by the line::

    prefix=

You can define a prefix for devices that produce the same header in each
message. We assume that you have your rules sorted by device. In such a
case you can take the header of the rules and use it with the prefix
variable. Here is an example of a rule for IPTables (legacy format, to be
converted later)::

    prefix=%date:date-rfc3164% %host:word% %tag:char-to:-\x3a%:
    rule=:INBOUND%INBOUND:char-to:-\x3a%: IN=%IN:word% PHYSIN=%PHYSIN:word% OUT=%OUT:word% PHYSOUT=%PHYSOUT:word% SRC=%source:ipv4% DST=%destination:ipv4% LEN=%LEN:number% TOS=%TOS:char-to: % PREC=%PREC:word% TTL=%TTL:number% ID=%ID:number% DF PROTO=%PROTO:word% SPT=%SPT:number% DPT=%DPT:number% WINDOW=%WINDOW:number% RES=0x00 ACK SYN URGP=%URGP:number%

Usually, every rule would hold what is defined in the prefix at its
beginning. But since we can define the prefix, we can save that work in
every line and just write the rules for the log lines. This saves us a lot
of work and even saves space.

Obviously, you can use multiple prefixes in a rulebase. The prefix will be
used for the following rules. If another prefix is then set, the first one
will be erased, and the new one will be used for the following rules.

Rule tags
---------

Rule tagging capability permits very easy classification of syslog
messages and log records in general. So you can not only extract data from
your various log sources, you can also classify events, for example, as
being a "login", a "logout" or a firewall "denied access". This makes it
very easy to look at specific subsets of messages and process them in ways
specific to the information being conveyed.

To see how it works, let’s first define what a tag is:

A tag is a simple alphanumeric string that identifies a specific type of
object, action, status, etc. For example, we can have object tags for
firewalls and servers. For simplicity, let’s call them "firewall" and
"server". Then, we can have action tags like "login", "logout" and
"connectionOpen". Status tags could include "success" or "fail", among
others. Tags form a flat space; there is no inherent relationship between
them (but this may be added later on top of the current implementation).
Think of tags like the tag cloud in a blogging system. Tags can be defined
for any reason and need. A single event can be associated with as many
tags as required.

Assigning tags to messages is simple. A rule contains both the sample of
the message (including the extracted fields) as well as the tags.
Have a look at this sample::

    rule=:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Here, we have a rule that shows an invalid ssh login request. The various
fields are used to extract information into a well-defined structure. Have
you ever wondered why every rule starts with a colon? Now, here is the
answer: the colon separates the tag part from the actual sample part.
Now, you can create a rule like this::

    rule=ssh,user,login,fail:sshd[%pid:number%]: Invalid user %user:word% from %src-ip:ipv4%

Note the "ssh,user,login,fail" part in front of the colon. These are the
four tags the user has decided to assign to this event. What now happens
is that the normalizer does not only extract the information from the
message if it finds a match, but it also adds the tags as metadata. Once
normalization is done, one can not only query the individual fields, but
also query if a specific tag is associated with this event. For example,
to find all ssh-related events (provided the rules are built that way),
you can normalize a large log and select only that subset of the
normalized log that contains the tag "ssh".

Log annotations
---------------

In short, annotations allow adding arbitrary attributes to a parsed
message, depending on rule tags. The values of these attributes are fixed;
they cannot be derived from variable fields. The syntax is as follows::

    annotate=<tag>:+<field name>="<field value>"

The field value should always be enclosed in double quote marks.

There can be multiple annotations for the same tag.
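
For example, building on the tagged ssh rule above, a fixed attribute could
be attached to every event carrying the "login" tag (the attribute name and
value are chosen freely)::

    annotate=login:+action="user-login"

Any message that matches a rule tagged "login" would then carry the
additional attribute action="user-login".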

Examples
--------

Look at the :doc:`sample rulebase <sample_rulebase>` for configuration
examples and matching log lines. Note that the examples are currently
in legacy format only.