=pod

=head1 NAME

B<rwscan> - Detect scanning activity in a SiLK dataset

=head1 SYNOPSIS

  rwscan [--scan-model=MODEL] [--output-path=PATH]
        [--trw-internal-set=SETFILE]
        [--trw-theta0=PROB] [--trw-theta1=PROB]
        [--no-titles] [--no-columns] [--column-separator=CHAR]
        [--no-final-delimiter] [{--delimited | --delimited=CHAR}]
        [--integer-ips] [--model-fields] [--scandb]
        [--threads=THREADS] [--queue-depth=DEPTH]
        [--verbose-progress=CIDR] [--verbose-flows]
        [ {--verbose-results | --verbose-results=NUM} ]
        [--site-config-file=FILENAME]
        [FILES...]

  rwscan --help

  rwscan --version

=head1 DESCRIPTION

B<rwscan> reads sorted SiLK Flow records, performs scan detection
analysis on those records, and produces textual columnar output for the
scanning IP addresses.  B<rwscan> writes its output to the
B<--output-path> or to the standard output when B<--output-path> is
not specified.

The types of scan detection analysis that B<rwscan> supports are
Threshold Random Walk (TRW) and Bayesian Logistic Regression (BLR).
Details about these techniques are described in the L</METHOD OF
OPERATION> section below.

B<rwscan> is designed to write its data into a database.  This
database can be queried using the B<rwscanquery(1)> tool.  See the
L</EXAMPLES> section for the recommended database schema.

The input to B<rwscan> should be pre-sorted using B<rwsort(1)> by the
source IP, protocol, and destination IP (i.e.,
B<--fields=sip,proto,dip>).

B<rwscan> reads SiLK Flow records from the files named on the command
line or from the standard input when no file names are specified.  To
read the standard input in addition to the named files, use C<-> or
C<stdin> as a file name.  If an input file name ends in C<.gz>, the
file is uncompressed as it is read.

=head1 OPTIONS

Option names may be abbreviated if the abbreviation is unique or is an
exact match for an option.
A parameter to an option may be specified
as B<--arg>=I<param> or S<B<--arg> I<param>>, though the first form is
required for options that take optional parameters.

=over 4

=item B<--scan-model>=I<MODEL>

Select a specific scan detection model.  If not specified, the default
value for I<MODEL> is C<0>.  See the L</METHOD OF OPERATION> section
for more details.

=over 4

=item S< 0 >

Use the Threshold Random Walk (TRW) and Bayesian Logistic Regression
(BLR) scan detection models in series.

=item S< 1 >

Use only the TRW scan detection model.

=item S< 2 >

Use only the BLR scan detection model.

=back

=item B<--output-path>=I<PATH>

Write the textual output to I<PATH>, where I<PATH> is a filename, a
named pipe, the keyword C<stderr> to write the output to the standard
error, or the keyword C<stdout> or C<-> to write the output to the
standard output (and bypass the paging program).  If I<PATH> names an
existing file, B<rwscan> exits with an error unless the SILK_CLOBBER
environment variable is set, in which case I<PATH> is overwritten.  If
this switch is not given, the output is either sent to the pager or
written to the standard output.

=item B<--trw-internal-set>=I<SETFILE>

Specify an IPset file containing B<all> valid internal IP addresses.
This parameter is required when using the TRW scan detection model,
since the TRW model requires the list of targeted IPs (i.e., the IPs
against which scanning activity is to be detected).  This switch is
ignored when the TRW model is not used.  For information on creating
IPset files, see the B<rwset(1)> and B<rwsetbuild(1)> manual pages.
Prior to SiLK 3.4, this switch was named B<--trw-sip-set>.

=item B<--trw-sip-set>=I<SETFILE>

This is a deprecated alias for B<--trw-internal-set>.

=item B<--trw-theta0>=I<PROB>

Set the theta_0 parameter for the TRW scan model to I<PROB>, which must
be a floating point number between 0 and 1.  theta_0 is defined as the
probability that a connection succeeds given the hypothesis that the
remote source is benign (not a scanner).  The default value for this
option is 0.8.  This option should only be used by experts familiar
with the TRW algorithm.

=item B<--trw-theta1>=I<PROB>

Set the theta_1 parameter for the TRW scan model to I<PROB>, which must
be a floating point number between 0 and 1.  theta_1 is defined as the
probability that a connection succeeds given the hypothesis that the
remote source is malicious (a scanner).  The default value for this
option is 0.2.  This option should only be used by experts familiar
with the TRW algorithm.

=item B<--no-titles>

Turn off column titles.  By default, titles are printed.

=item B<--no-columns>

Disable fixed-width columnar output.

=item B<--column-separator>=I<C>

Use specified character between columns.  When this switch is not
specified, the default of 'B<|>' is used.

=item B<--no-final-delimiter>

Do not print the column separator after the final column.  Normally a
delimiter is printed.

=item B<--delimited>

=item B<--delimited>=I<C>

Run as if B<--no-columns> B<--no-final-delimiter> B<--column-sep>=I<C>
had been specified.  That is, disable fixed-width column output; if
character I<C> is provided, it is used as the delimiter between
columns instead of the default 'B<|>'.

=item B<--integer-ips>

Print IP addresses as decimal integers instead of in their canonical
representation.

=item B<--model-fields>

Show scan model detail fields.  This switch controls whether
additional informational fields about the scan detection models are
printed.

=item B<--scandb>

Produce output suitable for loading into a database.  Sample database
schemas are given below under L</EXAMPLES>.  This option is equivalent
to B<--no-titles> B<--no-columns> B<--no-final-delimiter>
B<--model-fields> B<--integer-ips>.

=item B<--threads>=I<THREADS>

Specify the number of worker threads to create for scan detection
processing.  By default, one thread is used.  Changing this
number to match the number of available CPUs will often yield a large
performance improvement.

=item B<--queue-depth>=I<DEPTH>

Specify the depth of the work queue.  The default is to make the work
queue the same size as the number of worker threads, but this can be
changed.  Normally, the default is fine.

=item B<--verbose-progress>=I<CIDR>

Report progress as B<rwscan> processes input data.  The I<CIDR>
argument should be an integer that corresponds to the netblock size of
each line of progress.  For example, B<--verbose-progress>=I<8> would
print a progress message for each /8 network processed.

=item B<--verbose-flows>

Cause B<rwscan> to print very verbose information for each flow.  This
switch is primarily useful for debugging.

=item B<--verbose-results>

=item B<--verbose-results>=I<NUM>

Print detailed information on each IP processed by B<rwscan>.  If a
I<NUM> argument is provided, only print verbose results for sources
that sent at least I<NUM> flows.  This information includes scan model
calculations, overall scan scores, etc.  This option generates a
lot of output and is primarily useful for debugging.

=item B<--site-config-file>=I<FILENAME>

Read the SiLK site configuration from the named file I<FILENAME>.
When this switch is not provided, B<rwscan> searches for the site
configuration file in the locations specified in the L</FILES>
section.

=item B<--help>

Print the available options and exit.

=item B<--version>

Print the version number and information about how SiLK was
configured, then exit the application.

=back

=head1 METHOD OF OPERATION

B<rwscan>'s default behavior is to consult two scan detection models
to determine whether a source is a scanner.  The primary model used is
the Threshold Random Walk (TRW) model.  The TRW algorithm takes
advantage of the tendency of scanners to attempt to contact a large
number of IPs that do not exist on the target network.

By keeping track of the number of "hits" (successful connections) and
"misses" (attempts to connect to IP addresses that are not active on
the target network), scanners can be detected quickly and with a high
degree of accuracy.  Sequential hypothesis testing is used to analyze
the probability that a source is a scanner as each flow record is
processed.  Once the scan probability exceeds a configured maximum,
the source is flagged as a scanner, and no further analysis of traffic
from that host is necessary.

The TRW model is not 100% accurate, however, and only finds scans in
TCP flow data.  In the case where the TRW model is inconclusive, a
secondary model called BLR is invoked.  BLR stands for "Bayesian
Logistic Regression."  Unlike TRW, the BLR approach must analyze all
traffic from a given source IP to determine whether that IP is a
scanner.

Because of this, BLR operates much more slowly than TRW.  However, the
BLR model has been shown to detect scans that are not detected by the
TRW model, particularly scans in UDP and ICMP data, and vertical TCP
scans which focus on finding services on a single host.  It does this
by calculating metrics from the flow data from each source, and using
those metrics to arrive at an overall likelihood that the flow data
represents scanning activity.
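
The sequential hypothesis test that the TRW model relies on can be
illustrated with a short sketch.  The Python code below is not
B<rwscan>'s implementation; it is a minimal illustration that assumes
the standard sequential probability ratio test, in which each hit or
miss updates a log-likelihood ratio that is compared against an upper
and a lower threshold.  The function name C<trw_classify> and the
parameters C<p_detect> and C<p_false> (the target detection and false
positive rates used to derive the thresholds) are hypothetical; only
the defaults for theta_0 (0.8) and theta_1 (0.2) come from this page.

```python
import math


def trw_classify(outcomes, theta0=0.8, theta1=0.2,
                 p_detect=0.99, p_false=0.01):
    """Classify one source from its sequence of connection outcomes.

    outcomes: iterable of booleans, True for a hit (connection
    succeeded) and False for a miss.
    theta0, theta1: probability of a hit for a benign source and for
    a scanner (defaults taken from --trw-theta0 and --trw-theta1).
    p_detect, p_false: ASSUMED target rates used to derive the two
    thresholds; they are not documented rwscan settings.
    """
    upper = math.log(p_detect / p_false)              # flag as scanner
    lower = math.log((1 - p_detect) / (1 - p_false))  # flag as benign
    llr = 0.0  # log-likelihood ratio: scanner versus benign
    for hit in outcomes:
        if hit:
            llr += math.log(theta1 / theta0)
        else:
            llr += math.log((1 - theta1) / (1 - theta0))
        if llr >= upper:
            return "scanner"
        if llr <= lower:
            return "benign"
    return "unknown"  # neither threshold was crossed
```

With these assumed thresholds, C<trw_classify([False] * 4)> returns
C<scanner> and C<trw_classify([True] * 4)> returns C<benign>, while a
source whose hits and misses alternate stays C<unknown>.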

The metrics BLR uses for detecting scans in TCP flow data are:

=over 4

=item *

the ratio of flows with no ACK bit set to all flows

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the average number of source ports per destination IP address

=item *

the ratio of the number of flows that have an average of 60
bytes/packet or greater to all flows

=item *

the ratio of the number of unique destination IP addresses to the
total number of flows

=item *

the ratio of the number of flows where the flag combination indicates
backscatter to all flows

=back

The metrics BLR uses for detecting scans in UDP flow data are:

=over 4

=item *

the ratio of flows with fewer than three packets to all flows

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of unique low-numbered (less than 1024) destination
ports contacted on any one host

=item *

the maximum number of consecutive low-numbered destination ports
contacted on any one host

=item *

the average number of unique source ports per destination IP address

=item *

the ratio of flows with 60 or more bytes/packet to all flows

=item *

the ratio of unique source ports (both low and high) to the number of
flows

=back

The metrics BLR uses for detecting scans in ICMP flow data are:

=over 4

=item *

the maximum number of consecutive /24 subnets that were contacted

=item *

the maximum run length of IP addresses per /24 subnet

=item *

the maximum number of IP addresses contacted in any one /24 subnet

=item *

the total number of IP addresses contacted

=item *

the ratio of ICMP echo requests to all ICMP flows

=back

Because the TRW model has a
lower false positive rate than the BLR
model, any source identified as a scanner by TRW will be identified as
a scanner by the hybrid model without consulting BLR.  BLR is only
invoked in the following cases:

=over 4

=item *

The traffic being analyzed is UDP or ICMP traffic, which B<rwscan>'s
implementation of TRW cannot process.

=item *

The TRW model has identified the source as benign.  This occurs when
the scan probability drops below a configured minimum during
sequential hypothesis testing.

=item *

The TRW model has identified the source as unknown (where the scan
probability never exceeded the minimum or maximum thresholds during
sequential hypothesis testing).

=back

In situations where the use of one model is preferred, the other model
can be disabled using the B<--scan-model> switch.  This may have an
impact on the performance and/or accuracy of the system.

=head1 LIMITATIONS

B<rwscan> detects scans in IPv4 flows only.

=head1 EXAMPLES

In the following examples, the dollar sign (C<$>) represents the shell
prompt.  The text after the dollar sign represents the command line.
Lines have been wrapped for improved readability, and the backslash
(C<\>) is used to indicate a wrapped line.

=head2 Basic Usage

Assuming a properly sorted SiLK Flow file as input, the basic usage
for Bayesian Logistic Regression (BLR) scan detection requires only
the input file, F<data.rw>, and output file, F<scans.txt>, arguments.

  $ rwscan --scan-model=2 --output-path=scans.txt data.rw

Basic usage of Threshold Random Walk (TRW) scan detection requires the
IP addresses of the targeted network (i.e., the internal IP space),
specified in the F<internal.set> IPset file.

  $ rwscan --trw-internal-set=internal.set --output-path=scans.txt data.rw

=head2 Typical Usage

More commonly, an analyst uses B<rwfilter(1)> to query the data
repository for flow records within a time window.  First, the analyst
has B<rwset(1)> put the source addresses of I<outgoing> flow records
into an IPset, resulting in the IPset containing the IPs of active
hosts on the internal network.  Next, the I<incoming> traffic is piped
to B<rwsort(1)> and then to B<rwscan>.

  $ rwfilter --start=2004/12/29:00 --type=out,outweb --all-dest=stdout \
    | rwset --sip=internal.set

  $ rwfilter --start=2004/12/29:00 --type=in,inweb --all-dest=stdout \
    | rwsort --fields=sip,proto,dip \
    | rwscan --trw-internal-set=internal.set --scan-model=0 \
          --output-path=scans.txt

=head2 Storing Scans in a PostgreSQL Database

Instead of having the analyst run B<rwscan> directly, often the output
from B<rwscan> is put into a database where it can be queried by
B<rwscanquery(1)>.  The output produced by the B<--scandb> switch is
suitable for loading into a database of scans.  The process for using
the PostgreSQL database is described in this section.

Schemas for Oracle, MySQL, and SQLite are provided below, but the
details to create users with the proper roles are not included.

Here is the schema for PostgreSQL:

  CREATE DATABASE scans;

  CREATE SCHEMA scans

  CREATE SEQUENCE scans_id_seq

  CREATE TABLE scans (
      id          BIGINT      NOT NULL    DEFAULT nextval('scans_id_seq'),
      sip         BIGINT      NOT NULL,
      proto       SMALLINT    NOT NULL,
      stime       TIMESTAMP without time zone NOT NULL,
      etime       TIMESTAMP without time zone NOT NULL,
      flows       BIGINT      NOT NULL,
      packets     BIGINT      NOT NULL,
      bytes       BIGINT      NOT NULL,
      scan_model  INTEGER     NOT NULL,
      scan_prob   FLOAT       NOT NULL,
      PRIMARY KEY (id)
  )

  CREATE INDEX scans_stime_idx ON scans (stime)
  CREATE INDEX scans_etime_idx ON scans (etime)
  ;

A database user should be created for the purposes of populating the
scan database, e.g.:

  CREATE USER rwscan WITH PASSWORD 'secret';

  GRANT ALL PRIVILEGES ON DATABASE scans TO rwscan;

Additionally, a user with read-only access should be created for use
by the B<rwscanquery> tool:

  CREATE USER rwscanquery WITH PASSWORD 'secret';

  GRANT SELECT ON DATABASE scans TO rwscanquery;

To import B<rwscan>'s B<--scandb> output into a PostgreSQL database,
use a command similar to the following:

  $ cat /tmp/scans.import.txt \
    | psql -c \
      "COPY scans \
          (sip, proto, stime, etime, \
           flows, packets, bytes, \
           scan_model, scan_prob) \
      FROM stdin DELIMITER as '|'" scans

=head2 Sample Schema for Oracle

  CREATE TABLE scans (
      id            integer unsigned    not null unique,
      sip           integer unsigned    not null,
      proto         tinyint unsigned    not null,
      stime         datetime            not null,
      etime         datetime            not null,
      flows         integer unsigned    not null,
      packets       integer unsigned    not null,
      bytes         integer unsigned    not null,
      scan_model    integer unsigned    not null,
      scan_prob     float unsigned      not null,
      primary key (id)
  );

=head2 Sample Schema for MySQL

  CREATE TABLE scans (
      id            integer unsigned    not null auto_increment,
      sip           integer
                    unsigned            not null,
      proto         tinyint unsigned    not null,
      stime         datetime            not null,
      etime         datetime            not null,
      flows         integer unsigned    not null,
      packets       integer unsigned    not null,
      bytes         integer unsigned    not null,
      scan_model    integer unsigned    not null,
      scan_prob     float unsigned      not null,
      primary key (id),
      INDEX (stime),
      INDEX (etime)
  ) ENGINE=InnoDB;

=head2 Sample Schema and Import Command for SQLite

  CREATE TABLE scans (
      id          INTEGER PRIMARY KEY AUTOINCREMENT,
      sip         INTEGER     NOT NULL,
      proto       SMALLINT    NOT NULL,
      stime       TIMESTAMP   NOT NULL,
      etime       TIMESTAMP   NOT NULL,
      flows       INTEGER     NOT NULL,
      packets     INTEGER     NOT NULL,
      bytes       INTEGER     NOT NULL,
      scan_model  INTEGER     NOT NULL,
      scan_prob   FLOAT       NOT NULL
  );
  CREATE INDEX scans_stime_idx ON scans (stime);
  CREATE INDEX scans_etime_idx ON scans (etime);

To import B<rwscan>'s B<--scandb> output into a SQLite database, use
the following command:

  $ perl -nwe 'chomp;
        print "INSERT INTO scans VALUES (NULL,",
            (join ",",map { / / ? qq("$_") : $_ } split /\|/),
            ");\n";' \
      scans.txt | sqlite3 scans.sqlite

=head1 ENVIRONMENT

=over 4

=item SILK_CLOBBER

The SiLK tools normally refuse to overwrite existing files.  Setting
SILK_CLOBBER to a non-empty value removes this restriction.

=item SILK_CONFIG_FILE

This environment variable is used as the value for the
B<--site-config-file> when that switch is not provided.

=item SILK_DATA_ROOTDIR

This environment variable specifies the root directory of the data
repository.  As described in the L</FILES> section, B<rwscan> may
use this environment variable when searching for the SiLK site
configuration file.

=item SILK_PATH

This environment variable gives the root of the install tree.  When
searching for configuration files, B<rwscan> may use this environment
variable.
See the L</FILES> section for details.

=back

=head1 FILES

=over 4

=item F<${SILK_CONFIG_FILE}>

=item F<${SILK_DATA_ROOTDIR}/silk.conf>

=item F<@SILK_DATA_ROOTDIR@/silk.conf>

=item F<${SILK_PATH}/share/silk/silk.conf>

=item F<${SILK_PATH}/share/silk.conf>

=item F<@prefix@/share/silk/silk.conf>

=item F<@prefix@/share/silk.conf>

Possible locations for the SiLK site configuration file which are
checked when the B<--site-config-file> switch is not provided.

=back

=head1 SEE ALSO

B<rwscanquery(1)>, B<rwfilter(1)>, B<rwsort(1)>, B<rwset(1)>,
B<rwsetbuild(1)>, B<silk(7)>

=head1 BUGS

When used in an IPv6 environment, B<rwscan> converts IPv6 flow records
that contain addresses in the ::ffff:0:0/96 prefix to IPv4.  IPv6
records outside of that prefix are silently ignored.

=cut

$SiLK: rwscan.pod 57cd46fed37f 2017-03-13 21:54:02Z mthomas $

Local Variables:
mode:text
indent-tabs-mode:nil
End: