=pod

=head1 NAME

B<rwscan> - Detect scanning activity in a SiLK dataset

=head1 SYNOPSIS

  rwscan [--scan-model=MODEL] [--output-path=PATH]
        [--trw-internal-set=SETFILE]
        [--trw-theta0=PROB] [--trw-theta1=PROB]
        [--no-titles] [--no-columns] [--column-separator=CHAR]
        [--no-final-delimiter] [{--delimited | --delimited=CHAR}]
        [--integer-ips] [--model-fields] [--scandb]
        [--threads=THREADS] [--queue-depth=DEPTH]
        [--verbose-progress=CIDR] [--verbose-flows]
        [ {--verbose-results | --verbose-results=NUM} ]
        [--site-config-file=FILENAME]
        [FILES...]

  rwscan --help

  rwscan --version

=head1 DESCRIPTION

B<rwscan> reads sorted SiLK Flow records, performs scan detection
analysis on those records, and outputs textual columnar output for the
scanning IP addresses.  B<rwscan> writes its output to the
B<--output-path> or to the standard output when B<--output-path> is
not specified.

The types of scan detection analysis that B<rwscan> supports are
Threshold Random Walk (TRW) and Bayesian Logistic Regression (BLR).
Details about these techniques are described in the L</METHOD OF
OPERATION> section below.

B<rwscan> is designed to write its data into a database.  This
database can be queried using the B<rwscanquery(1)> tool.  See the
L</EXAMPLES> section for the recommended database schema.

The input to B<rwscan> should be pre-sorted using B<rwsort(1)> by the
source IP, protocol, and destination IP (i.e.,
B<--fields=sip,proto,dip>).

B<rwscan> reads SiLK Flow records from the files named on the command
line or from the standard input when no file names are specified.  To
read the standard input in addition to the named files, use C<-> or
C<stdin> as a file name.  If an input file name ends in C<.gz>, the
file is uncompressed as it is read.

=head1 OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an
exact match for an option.  A parameter to an option may be specified
as B<--arg>=I<param> or S<B<--arg> I<param>>, though the first form is
required for options that take optional parameters.

=over 4

=item B<--scan-model>=I<MODEL>

Select a specific scan detection model.  If not specified, the default
value for I<MODEL> is C<0>.  See the L</METHOD OF OPERATION> section
for more details.

=over 4

=item S< 0 >

Use the Threshold Random Walk (TRW) and Bayesian Logistic Regression
(BLR) scan detection models in series.

=item S< 1 >

Use only the TRW scan detection model.

=item S< 2 >

Use only the BLR scan detection model.

=back

=item B<--output-path>=I<PATH>

Write the textual output to I<PATH>, where I<PATH> is a filename, a
named pipe, the keyword C<stderr> to write the output to the standard
error, or the keyword C<stdout> or C<-> to write the output to the
standard output (and bypass the paging program).  If I<PATH> names an
existing file, B<rwscan> exits with an error unless the SILK_CLOBBER
environment variable is set, in which case I<PATH> is overwritten.  If
this switch is not given, the output is either sent to the pager or
written to the standard output.

=item B<--trw-internal-set>=I<SETFILE>

Specify an IPset file containing B<all> valid internal IP addresses.
This parameter is required when using the TRW scan detection model,
since the TRW model requires the list of targeted IPs (i.e., the IPs
against which scanning activity is to be detected).  This switch is
ignored when the TRW model is not used.  For information on creating
IPset files, see the B<rwset(1)> and B<rwsetbuild(1)> manual pages.
Prior to SiLK 3.4, this switch was named B<--trw-sip-set>.

=item B<--trw-sip-set>=I<SETFILE>

This is a deprecated alias for B<--trw-internal-set>.

=item B<--trw-theta0>=I<PROB>

Set the theta_0 parameter for the TRW scan model to I<PROB>, which must
be a floating point number between 0 and 1.  theta_0 is defined as the
probability that a connection succeeds given the hypothesis that the
remote source is benign (not a scanner).  The default value for this
option is 0.8.  This option should only be used by experts familiar
with the TRW algorithm.

=item B<--trw-theta1>=I<PROB>

Set the theta_1 parameter for the TRW scan model to I<PROB>, which must
be a floating point number between 0 and 1.  theta_1 is defined as the
probability that a connection succeeds given the hypothesis that the
remote source is malicious (a scanner).  The default value for this
option is 0.2.  This option should only be used by experts familiar
with the TRW algorithm.

=item B<--no-titles>

Turn off column titles.  By default, titles are printed.

=item B<--no-columns>

Disable fixed-width columnar output.

=item B<--column-separator>=I<C>

Use the specified character between columns.  When this switch is not
specified, the default of 'B<|>' is used.

=item B<--no-final-delimiter>

Do not print the column separator after the final column.  Normally a
delimiter is printed.

=item B<--delimited>

=item B<--delimited>=I<C>

Run as if B<--no-columns> B<--no-final-delimiter>
B<--column-separator>=I<C> had been specified.  That is, disable
fixed-width column output; if character I<C> is provided, it is used
as the delimiter between columns instead of the default 'B<|>'.

=item B<--integer-ips>

Print IP addresses as decimal integers instead of in their canonical
representation.

=item B<--model-fields>

Show scan model detail fields.  This switch controls whether
additional informational fields about the scan detection models are
printed.

=item B<--scandb>

Produce output suitable for loading into a database.  Sample database
schemas are given below under L</EXAMPLES>.  This option is equivalent
to B<--no-titles> B<--no-columns> B<--no-final-delimiter>
B<--model-fields> B<--integer-ips>.

=item B<--threads>=I<THREADS>

Specify the number of worker threads to create for scan detection
processing.  By default, one thread will be used.  Changing this
number to match the number of available CPUs will often yield a large
performance improvement.

=item B<--queue-depth>=I<DEPTH>

Specify the depth of the work queue.  The default is to make the work
queue the same size as the number of worker threads, but this can be
changed.  Normally, the default is fine.

=item B<--verbose-progress>=I<CIDR>

Report progress as B<rwscan> processes input data.  The I<CIDR>
argument should be an integer that corresponds to the netblock size of
each line of progress.  For example, B<--verbose-progress>=I<8> would
print a progress message for each /8 network processed.

=item B<--verbose-flows>

Cause B<rwscan> to print very verbose information for each flow.  This
switch is primarily useful for debugging.

=item B<--verbose-results>

=item B<--verbose-results>=I<NUM>

Print detailed information on each IP processed by B<rwscan>.  If a
I<NUM> argument is provided, only print verbose results for sources
that sent at least I<NUM> flows.  This information includes scan model
calculations, overall scan scores, etc.  This option will generate a
lot of output, and is primarily useful for debugging.

=item B<--site-config-file>=I<FILENAME>

Read the SiLK site configuration from the named file I<FILENAME>.
When this switch is not provided, B<rwscan> searches for the site
configuration file in the locations specified in the L</FILES>
section.

=item B<--help>

Print the available options and exit.

=item B<--version>

Print the version number and information about how SiLK was
configured, then exit the application.

=back

=head1 METHOD OF OPERATION

B<rwscan>'s default behavior is to consult two scan detection models
to determine whether a source is a scanner.  The primary model used is
the Threshold Random Walk (TRW) model.  The TRW algorithm takes
advantage of the tendency of scanners to attempt to contact a large
number of IPs that do not exist on the target network.

By keeping track of the number of "hits" (successful connections) and
"misses" (attempts to connect to IP addresses that are not active on
the target network), scanners can be detected quickly and with a high
degree of accuracy.  Sequential hypothesis testing is used to analyze
the probability that a source is a scanner as each flow record is
processed.  Once the scan probability exceeds a configured maximum,
the source is flagged as a scanner, and no further analysis of traffic
from that host is necessary.
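
As a rough illustration of the sequential test, the following shell
sketch updates a likelihood ratio over a string of hits (C<H>) and
misses (C<M>), using theta_0 and theta_1 as defined for the
B<--trw-theta0> and B<--trw-theta1> switches.  It follows the usual
textbook formulation of TRW and is not taken from B<rwscan>'s source;
the function name C<trw_ratio> and the outcome string are
hypothetical, and the thresholds that B<rwscan> compares the ratio
against are not modeled.

```shell

 # Hypothetical helper: print the likelihood ratio after a sequence of
 # outcomes.  $1 = theta_0, $2 = theta_1, $3 = outcomes, e.g. "MMH".
 trw_ratio() {
   LC_ALL=C awk -v t0="$1" -v t1="$2" -v seq="$3" 'BEGIN {
     ratio = 1
     for (i = 1; i <= length(seq); i++) {
       if (substr(seq, i, 1) == "H") {
         ratio *= t1 / t0               # a hit favors "benign"
       } else {
         ratio *= (1 - t1) / (1 - t0)   # a miss favors "scanner"
       }
     }
     printf "%.4f\n", ratio
   }'
 }

 # Three misses in a row with the default theta values
 trw_ratio 0.8 0.2 MMM

```

With the default values, each miss multiplies the ratio by 4 and each
hit divides it by 4, so the call above prints C<64.0000>.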

The TRW model is not 100% accurate, however, and only finds scans in
TCP flow data.  In the case where the TRW model is inconclusive, a
secondary model called BLR is invoked.  BLR stands for "Bayesian
Logistic Regression."  Unlike TRW, the BLR approach must analyze all
traffic from a given source IP to determine whether that IP is a
scanner.

Because of this, BLR operates much more slowly than TRW.  However, the
BLR model has been shown to detect scans that are not detected by the
TRW model, particularly scans in UDP and ICMP data, and vertical TCP
scans which focus on finding services on a single host.  It does this
by calculating metrics from the flow data from each source, and using
those metrics to arrive at an overall likelihood that the flow data
represents scanning activity.

The metrics BLR uses for detecting scans in TCP flow data are:

=over 4

=item *

the ratio of flows with no ACK bit set to all flows

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the average number of source ports per destination IP address

=item *

the ratio of the number of flows that have an average of 60
bytes/packet or greater to all flows

=item *

the ratio of the number of unique destination IP addresses to the
total number of flows

=item *

the ratio of the number of flows where the flag combination indicates
backscatter to all flows

=back
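
Two of these metrics are simple ratios and can be approximated outside
of B<rwscan>.  The sketch below assumes its input is one flow per line
in the form C<packets|bytes> (for example, as B<rwcut(1)> might print
with B<--fields=packets,bytes --no-titles --delimited>).  The function
name C<blr_ratios> is hypothetical, and the way B<rwscan> weights and
combines the metrics is not shown.

```shell

 # Hypothetical helper: read "packets|bytes" lines on stdin and print
 # the small-flow ratio and the large-packet ratio.
 blr_ratios() {
   LC_ALL=C awk -F'|' '
     $1 > 0 {
       flows++
       if ($1 < 3) { small++ }          # fewer than three packets
       if ($2 / $1 >= 60) { large++ }   # 60 or more bytes per packet
     }
     END {
       if (flows) { printf "%.2f %.2f\n", small / flows, large / flows }
     }'
 }

 printf '1|40\n2|120\n10|6000\n5|200\n' | blr_ratios

```

For the four sample flows, two have fewer than three packets and two
average at least 60 bytes per packet, so the output is C<0.50 0.50>.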

The metrics BLR uses for detecting scans in UDP flow data are:

=over 4

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of unique low-numbered (less than 1024) destination
ports contacted on any one host

=item *

the maximum number of consecutive low-numbered destination ports
contacted on any one host

=item *

the average number of unique source ports per destination IP address

=item *

the ratio of flows with 60 or more bytes/packet to all flows

=item *

the ratio of unique source ports (both low and high) to the number of
flows

=back

The metrics BLR uses for detecting scans in ICMP flow data are:

=over 4

=item *

the maximum number of consecutive /24 subnets that were contacted

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of IP addresses contacted in any one /24 subnet

=item *

the total number of IP addresses contacted

=item *

the ratio of ICMP echo requests to all ICMP flows

=back

Because the TRW model has a lower false positive rate than the BLR
model, any source identified as a scanner by TRW will be identified as
a scanner by the hybrid model without consulting BLR.  BLR is only
invoked in the following cases:

=over 4

=item *

The traffic being analyzed is UDP or ICMP traffic, which B<rwscan>'s
implementation of TRW cannot process.

=item *

The TRW model has identified the source as benign.  This occurs when
the scan probability drops below a configured minimum during
sequential hypothesis testing.

=item *

The TRW model has identified the source as unknown (where the scan
probability never exceeded the minimum or maximum thresholds during
sequential hypothesis testing).

=back
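
The cases above amount to a small decision rule.  The sketch below
restates it as a shell function; the name C<needs_blr>, the verdict
strings, and the treatment of protocols other than TCP, UDP, and ICMP
are assumptions made for illustration and do not describe B<rwscan>'s
internals.

```shell

 # $1 = IP protocol number, $2 = TRW verdict (scanner, benign, unknown).
 # Returns 0 (true) when the hybrid model would consult BLR.
 needs_blr() {
   case "$1" in
     1|17) return 0 ;;             # ICMP or UDP: TRW is not applied
     6)    [ "$2" != scanner ] ;;  # TCP: BLR unless TRW found a scanner
     *)    return 1 ;;             # assumption: other protocols skipped
   esac
 }

 needs_blr 6 unknown && echo "consult BLR"

```

Here a TCP source that TRW left as unknown is handed to BLR, so the
example prints C<consult BLR>.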

In situations where the use of one model is preferred, the other model
can be disabled using the B<--scan-model> switch.  This may have an
impact on the performance and/or accuracy of the system.

=head1 LIMITATIONS

B<rwscan> detects scans in IPv4 flows only.

=head1 EXAMPLES

In the following examples, the dollar sign (C<$>) represents the shell
prompt.  The text after the dollar sign represents the command line.
Lines have been wrapped for improved readability, and the backslash
(C<\>) is used to indicate a wrapped line.

=head2 Basic Usage

Assuming a properly sorted SiLK Flow file as input, the basic usage
for Bayesian Logistic Regression (BLR) scan detection requires only
the input file, F<data.rw>, and output file, F<scans.txt>, arguments.

 $ rwscan --scan-model=2 --output-path=scans.txt data.rw

Basic usage of Threshold Random Walk (TRW) scan detection requires the
IP addresses of the targeted network (i.e., the internal IP space),
specified in the F<internal.set> IPset file.

 $ rwscan --trw-internal-set=internal.set --output-path=scans.txt data.rw

=head2 Typical Usage

More commonly, an analyst uses B<rwfilter(1)> to query the data
repository for flow records within a time window.  First, the analyst
has B<rwset(1)> put the source addresses of I<outgoing> flow records
into an IPset, resulting in the IPset containing the IPs of active
hosts on the internal network.  Next, the I<incoming> traffic is piped
to B<rwsort(1)> and then to B<rwscan>.

 $ rwfilter --start=2004/12/29:00 --type=out,outweb --all-dest=stdout \
   | rwset --sip=internal.set

 $ rwfilter --start=2004/12/29:00 --type=in,inweb --all-dest=stdout \
   | rwsort --fields=sip,proto,dip                                  \
   | rwscan --trw-internal-set=internal.set --scan-model=0          \
        --output-path=scans.txt

=head2 Storing Scans in a PostgreSQL Database

Instead of having the analyst run B<rwscan> directly, often the output
from B<rwscan> is put into a database where it can be queried by
B<rwscanquery(1)>.  The output produced by the B<--scandb> switch is
suitable for loading into a database of scans.  The process for using
the PostgreSQL database is described in this section.

Schemas for Oracle, MySQL, and SQLite are provided below, but the
details to create users with the proper roles are not included.

Here is the schema for PostgreSQL:

 CREATE DATABASE scans;

 CREATE SCHEMA scans

 CREATE SEQUENCE scans_id_seq

 CREATE TABLE scans (
   id          BIGINT      NOT NULL    DEFAULT nextval('scans_id_seq'),
   sip         BIGINT      NOT NULL,
   proto       SMALLINT    NOT NULL,
   stime       TIMESTAMP without time zone NOT NULL,
   etime       TIMESTAMP without time zone NOT NULL,
   flows       BIGINT      NOT NULL,
   packets     BIGINT      NOT NULL,
   bytes       BIGINT      NOT NULL,
   scan_model  INTEGER     NOT NULL,
   scan_prob   FLOAT       NOT NULL,
   PRIMARY KEY (id)
 )

 CREATE INDEX scans_stime_idx ON scans (stime)
 CREATE INDEX scans_etime_idx ON scans (etime)
 ;

A database user should be created for the purposes of populating the
scan database, e.g.:

 CREATE USER rwscan WITH PASSWORD 'secret';

 GRANT ALL PRIVILEGES ON DATABASE scans TO rwscan;

Additionally, a user with read-only access should be created for use
by the B<rwscanquery> tool:

 CREATE USER rwscanquery WITH PASSWORD 'secret';

 GRANT USAGE ON SCHEMA scans TO rwscanquery;
 GRANT SELECT ON scans.scans TO rwscanquery;

To import B<rwscan>'s B<--scandb> output into a PostgreSQL database,
use a command similar to the following:

 $ cat /tmp/scans.import.txt            \
   | psql -c                            \
     "COPY scans                        \
         (sip, proto, stime, etime,     \
         flows, packets, bytes,         \
         scan_model, scan_prob)         \
     FROM stdin DELIMITER as '|'" scans
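
Because B<--scandb> implies B<--integer-ips>, the C<sip> column holds
each address as a decimal integer.  A minimal sketch for turning such
a value back into dotted-quad form in the shell follows.  C<int_to_ip>
is a hypothetical helper, and it assumes the shell's arithmetic is
wider than 32 bits, which is typical but not guaranteed.

```shell

 # Hypothetical helper: convert a decimal IPv4 integer to dotted quad.
 int_to_ip() {
   n=$1
   printf '%d.%d.%d.%d\n' \
     $(( (n >> 24) & 255 )) $(( (n >> 16) & 255 )) \
     $(( (n >> 8) & 255 ))  $(( n & 255 ))
 }

 int_to_ip 3232235777

```

The example prints C<192.168.1.1>.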

=head2 Sample Schema for Oracle

 CREATE TABLE scans (
   id          integer unsigned    not null unique,
   sip         integer unsigned    not null,
   proto       tinyint unsigned    not null,
   stime       datetime            not null,
   etime       datetime            not null,
   flows       integer unsigned    not null,
   packets     integer unsigned    not null,
   bytes       integer unsigned    not null,
   scan_model  integer unsigned    not null,
   scan_prob   float unsigned      not null,
   primary key (id)
 );

=head2 Sample Schema for MySQL

 CREATE TABLE scans (
   id          integer unsigned    not null auto_increment,
   sip         integer unsigned    not null,
   proto       tinyint unsigned    not null,
   stime       datetime            not null,
   etime       datetime            not null,
   flows       integer unsigned    not null,
   packets     integer unsigned    not null,
   bytes       integer unsigned    not null,
   scan_model  integer unsigned    not null,
   scan_prob   float unsigned      not null,
   primary key (id),
   INDEX (stime),
   INDEX (etime)
 ) ENGINE=InnoDB;

=head2 Sample Schema and Import Command for SQLite

 CREATE TABLE scans (
   id          INTEGER PRIMARY KEY AUTOINCREMENT,
   sip         INTEGER             NOT NULL,
   proto       SMALLINT            NOT NULL,
   stime       TIMESTAMP           NOT NULL,
   etime       TIMESTAMP           NOT NULL,
   flows       INTEGER             NOT NULL,
   packets     INTEGER             NOT NULL,
   bytes       INTEGER             NOT NULL,
   scan_model  INTEGER             NOT NULL,
   scan_prob   FLOAT               NOT NULL
 );
 CREATE INDEX scans_stime_idx ON scans (stime);
 CREATE INDEX scans_etime_idx ON scans (etime);

To import B<rwscan>'s B<--scandb> output into a SQLite database, use
the following command:

 $ perl -nwe 'chomp;
     print "INSERT INTO scans VALUES (NULL,",
           (join ",",map { / / ? qq("$_") : $_ } split /\|/),
           ");\n";' \
 scans.txt | sqlite3 scans.sqlite

=head1 ENVIRONMENT

=over 4

=item SILK_CLOBBER

The SiLK tools normally refuse to overwrite existing files.  Setting
SILK_CLOBBER to a non-empty value removes this restriction.

=item SILK_CONFIG_FILE

This environment variable is used as the value for the
B<--site-config-file> when that switch is not provided.

=item SILK_DATA_ROOTDIR

This environment variable specifies the root directory of the data
repository.  As described in the L</FILES> section, B<rwscan> may
use this environment variable when searching for the SiLK site
configuration file.

=item SILK_PATH

This environment variable gives the root of the install tree.  When
searching for configuration files, B<rwscan> may use this environment
variable.  See the L</FILES> section for details.

=back

=head1 FILES

=over 4

=item F<${SILK_CONFIG_FILE}>

=item F<${SILK_DATA_ROOTDIR}/silk.conf>

=item F<@SILK_DATA_ROOTDIR@/silk.conf>

=item F<${SILK_PATH}/share/silk/silk.conf>

=item F<${SILK_PATH}/share/silk.conf>

=item F<@prefix@/share/silk/silk.conf>

=item F<@prefix@/share/silk.conf>

Possible locations for the SiLK site configuration file which are
checked when the B<--site-config-file> switch is not provided.

=back

=head1 SEE ALSO

B<rwscanquery(1)>, B<rwfilter(1)>, B<rwsort(1)>, B<rwset(1)>,
B<rwsetbuild(1)>, B<silk(7)>

=head1 BUGS

When used in an IPv6 environment, B<rwscan> converts IPv6 flow records
that contain addresses in the ::ffff:0:0/96 prefix to IPv4.  IPv6
records outside of that prefix are silently ignored.

=cut

$SiLK: rwscan.pod 57cd46fed37f 2017-03-13 21:54:02Z mthomas $

Local Variables:
mode:text
indent-tabs-mode:nil
End: