The Webalizer - A web server log file analysis tool
Copyright 1997-2013 by Bradford L. Barrett

Distributed under the GNU GPL. See the files "COPYING" and
"Copyright" supplied with the distribution for additional info.


What is The Webalizer?
----------------------

The Webalizer is a web server log file analysis program which produces
usage statistics in HTML format for viewing with a browser. The results
are presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser), search string, entry/exit page, username and country
(some information is only available if supported and present in the log
files being processed). Processed data may also be exported into most
database and spreadsheet programs that support tab delimited data formats.

The Webalizer supports CLF (common log format) log files, as well as
Combined log formats as defined by NCSA and others, and variations
of these which it attempts to handle intelligently. In addition, The
Webalizer supports wu-ftpd xferlog (FTP) formatted logs, squid proxy logs
and W3C extended format logs.

Gzip compressed logs may be used as input directly. Any log filename
that ends with a '.gz' extension will be assumed to be in gzip format and
uncompressed on the fly as it is being read. The Webalizer now also has
the ability to handle BZip2 compressed logs, if enabled at compile time.
Similar to gzipped logs, any log filename that ends with a '.bz2' will be
assumed to be in bzip2 format and uncompressed on the fly as it is being
read.

For sites that do not enable hostname lookups (DNS resolution) on their
web servers (and have only IP addresses in their logs), The Webalizer
provides its own internal DNS lookup capability as well as geolocation
services (GeoDB).
The optional GeoIP library from MaxMind Inc. is also
supported and may be used instead of the native GeoDB database.

A utility program, "The Webalizer (DNS) Cache file Manager", or 'wcmgr'
is also provided which allows the creation and manipulation of the DNS
cache files used and produced by the webalizer. See the file DNS.README
for additional information regarding DNS support.

This documentation applies to The Webalizer Version 2.23

Running the Webalizer
---------------------

The Webalizer was designed to be run from a Unix command line prompt or
as a cron job. There are several command line options which will modify
the results it produces, and configuration files can be used as well.
The format of the command line is:

    webalizer [options ...] [log-file]

Where 'options' can be one or more of the supported command line
switches described below. 'log-file' is the name of the log file
to process (see below for more detailed information). If a dash
("-") is specified for the log-file name, STDIN will be used.


Once executed, the general flow of the program follows:

o A default configuration file is scanned for. A file named
  'webalizer.conf' is searched for in the current directory, and if
  found, its configuration data is parsed. If the file is not
  present in the current directory, the file '/etc/webalizer.conf'
  is searched for and, if found, is used instead.

o Any command line arguments given to the program are parsed. This
  may include the specification of a configuration file, which is
  processed at the time it is encountered.

o If a log file was specified, it is opened and made ready for
  processing. If no log file was given, or the filename '-' is
  specified on the command line, STDIN is used for input.

o If an output directory was specified, the program does a 'chdir' to
  that directory in preparation for generating output.
  If no output
  directory was given, the current directory is used.

o If a non-zero number of DNS Children processes were specified, they
  will be started, and the specified log file will be processed,
  either creating or updating the specified DNS cache file.

o If no hostname was given, the program attempts to get the hostname
  using a uname system call. If that fails, 'localhost' is used.

o A history file is searched for. This file keeps previous month
  totals used on the main index.html page. The default file is
  named 'webalizer.hist' and is kept in the specified output directory,
  but the name may be changed using the "HistoryName" configuration
  file keyword.

o If incremental processing was specified, a data file is searched for
  and loaded if found, containing the 'internal state' data of the
  program at the end of a previous run. The default file is named
  'webalizer.current' and is kept in the specified output directory, but
  the name may be changed using the "IncrementalName" configuration
  file keyword.

o Main processing begins on the log file. If the log spans multiple
  months, a separate HTML document is created for each month.

o After main processing, the main 'index.html' page is created, which
  has totals by month and links to each month's HTML document.

o A new history file is saved to disk, which includes totals generated
  by The Webalizer during the current run.

o If incremental processing was specified, a data file is written that
  contains the 'internal state' data at the end of this run.


Incremental Processing
----------------------

Version 1.2x of The Webalizer adds incremental run capability. Simply
put, this allows processing large log files by breaking them up into
smaller pieces, and processing these pieces instead.
What this means
in real terms is that you can now rotate your log files as often as you
want, and still be able to produce monthly usage statistics without the
loss of any detail. This is accomplished by saving and restoring all
relevant internal data to a disk file between runs. Doing so allows the
program to 'start where it left off' so to speak, and allows the
preservation of detail from one run to the next.

Some special precautions need to be taken when using the incremental
run capability of The Webalizer. Configuration options should not be
changed between runs, as that could cause corruption of the internal
stored data. For example, changing the MangleAgents level will cause
different representations of user agents to be stored, producing invalid
results in the user agents section of the report. If you need to change
configuration options, do it at the end of the month after normal
processing of the previous month and before processing the current month.
You may also want to delete the 'webalizer.current' file (or
whatever name was specified using the "IncrementalName" configuration
option).

The Webalizer also attempts to prevent data duplication by keeping
track of the timestamp of the last record processed. This timestamp
is then compared to current records being processed, and any records
that were logged prior to that timestamp are ignored. This, in
theory, should allow you to re-process logs that have already been
processed, or process logs that contain a mix of processed/not yet
processed records, and not produce duplication of statistics. The
only time this may break is if you have duplicate timestamps in two
separate log files... any records in the second log file that have
the same timestamp as the last record in the previous log file processed
will be discarded as if they had already been processed.
There are
lots of ways to prevent this, however; for example, stopping the web
server before rotating logs will prevent this situation. This setup
also necessitates that you always process logs in chronological order,
otherwise data loss will occur as a result of the timestamp compare.


Output Produced
---------------

The Webalizer produces several reports (html) and graphics for each
month processed. In addition, a summary page is generated for the
current and previous months (up to 12), a history file is created
and, if incremental mode is used, the current month's processed data
is saved. The exact location and names of these files can be changed
using configuration files and command line options. The files produced
(default names) are:

index.html               - Main summary page (extension may be changed)
usage.png                - Yearly graph displayed on the main index page
usage_YYYYMM.html        - Monthly summary page (extension may be changed)
usage_YYYYMM.png         - Monthly usage graph for specified month/year
daily_usage_YYYYMM.png   - Daily usage graph for specified month/year
hourly_usage_YYYYMM.png  - Hourly usage graph for specified month/year
site_YYYYMM.html         - All sites listing (if enabled)
url_YYYYMM.html          - All urls listing (if enabled)
ref_YYYYMM.html          - All referrers listing (if enabled)
agent_YYYYMM.html        - All user agents listing (if enabled)
search_YYYYMM.html       - All search strings listing (if enabled)
webalizer.hist           - Previous month history (may be changed)
webalizer.current        - Incremental Data (may be changed)
site_YYYYMM.tab          - tab delimited sites file
url_YYYYMM.tab           - tab delimited urls file
ref_YYYYMM.tab           - tab delimited referrers file
agent_YYYYMM.tab         - tab delimited user agents file
user_YYYYMM.tab          - tab delimited usernames file
search_YYYYMM.tab        - tab delimited search string file

The yearly (index) report shows statistics for a 12 month period, and
links to each month.
The monthly report has detailed statistics for
that month with additional links to any URLs and referrers found.
The various totals shown are explained below.

Hits

  Any request made to the server which is logged is considered a 'hit'.
The requests can be for anything... html pages, graphic images, audio
files, CGI scripts, etc... Each valid line in the server log is
counted as a hit. This number represents the total number of requests
that were made to the server during the specified report period.

Files

  Some requests made to the server require that the server then send
something back to the requesting client, such as an html page or graphic
image. When this happens, it is considered a 'file' and the files
total is incremented. The relationship between 'hits' and 'files' can
be thought of as 'incoming requests' and 'outgoing responses'.

Pages

  Pages are, well, pages! Generally, any HTML document, or anything
that generates an HTML document, would be considered a page. This
does not include the other stuff that goes into a document, such as
graphic images, audio clips, etc... This number represents the number
of 'pages' requested only, and does not include the other 'stuff' that
is in the page. What actually constitutes a 'page' can vary from
server to server. The default action is to treat anything with the
extension '.htm', '.html' or '.cgi' as a page. A lot of sites will
probably define other extensions, such as '.phtml', '.php3' and '.pl'
as pages as well. Some people consider this number as the number of
'pure' hits... I'm not sure if I totally agree with that viewpoint.
Some other programs (and people :) refer to this as 'Pageviews'.

Sites

  Each request made to the server comes from a unique 'site', which can
be referenced by a name or ultimately, an IP address.
The 'sites'
number shows how many unique IP addresses made requests to the server
during the reporting time period. This DOES NOT mean the number of
unique individual users (real people) that visited, which is impossible
to determine using just logs and the HTTP protocol (however, this
number might be about as close as you will get).

Visits

  Whenever a request is made to the server from a given IP address
(site), the amount of time since a previous request by the address
is calculated (if any). If the time difference is greater than a
pre-configured 'visit timeout' value (or the address has never made a
request before), it is considered a 'new visit', and this total is
incremented (both for the site, and the IP address). The default
timeout value is 30 minutes (can be changed), so if a user visits your
site at 1:00 in the afternoon, and then returns at 3:00, two visits
would be registered.
Note: in the 'Top Sites' table, the visits total should be discounted
on 'Grouped' records, and thought of as the "Minimum number of visits"
that came from that grouping instead. Note: Visits only occur on
PageType requests, that is, for any request whose URL is one of the
'page' types defined with the PageType and PagePrefix option, and not
excluded by the OmitPage option. Due to the limitations of the HTTP
protocol, log rotations and other factors, this number should not be
taken as absolutely accurate, rather, it should be considered a pretty
close "guess".

KBytes

  The KBytes (kilobytes) value shows the amount of data, in KB, that
was sent out by the server during the specified reporting period. This
value is generated directly from the log file, so it is up to the
web server to produce accurate numbers in the logs (some web servers
do stupid things when it comes to reporting the number of bytes).
In
general, this should be a fairly accurate representation of the amount
of outgoing traffic the server had, regardless of the web server's
reporting quirks.

Note: A kilobyte is 1024 bytes, not 1000 :)

Top Entry and Exit Pages

  The Top Entry and Exit tables give a rough estimate of what URLs
are used to enter your site, and what the last pages viewed are.
Because of limitations in the HTTP protocol, log rotations, etc...
these numbers should be considered a good "rough guess" of the actual
numbers, but will give a good indication of the overall trend in
where users come into, and exit, your site.


Command Line Options
--------------------

The Webalizer supports many different configuration options that will
alter the way the program behaves and generates output. Most of these
can be specified on the command line, while some can only be specified
in a configuration file. The command line options are listed below,
with references to the corresponding configuration file keywords.

--------------------------------------------------------------------------

General Options
---------------

-h       Display all available command line options and exit program.

-v       Be Verbose. This will cause the program to print additional
         information at run time. It is the same as specifying
         "Quiet no", "ReallyQuiet no" and "Debug yes" config options.

-V       Display the program version and exit. Additional program
         specific information will be displayed if 'verbose' mode is
         also used (e.g. '-vV'), which can be useful when submitting
         bug reports.

-d       Display additional 'debugging' information for errors and
         warnings produced during processing. This normally would
         not be used except to determine why you are getting all those
         errors and want to see the actual data.
         Normally The
         Webalizer will just tell you it found an error, not the
         actual data. This option will display the data as well.
         Config file keyword: Debug

-F       Specify the log file type to process. Normally, the
         Webalizer expects to find a valid CLF or Combined format
         web server log file. This option allows you to process
         wu-ftpd xferlogs, squid and W3C formatted web logs as well.
         Values can be either 'clf', 'ftp', 'squid' or 'w3c' with
         'clf' being the default. Only the first character needs
         to be specified (eg: -Fs will process a squid log).
         Config file keyword: LogType

-f       Fold out of sequence log records back into analysis, by
         treating them as if they were the same date/time as the
         last good record. Normally, out of sequence log records
         are ignored. If you run apache, don't worry about this.
         Config file keyword: FoldSeqErr

-i       Ignore history file. USE WITH CAUTION. This causes The
         Webalizer to ignore any existing history file produced from
         previous runs and generate its output from scratch. The
         effect will be as if The Webalizer is being run for the
         first time, and any previous statistics on the main
         index.html (yearly) web page will be lost (although the
         HTML documents, if any, will not be deleted).
         Config file keyword: IgnoreHist

-b       Ignore incremental data file. USE WITH CAUTION. This causes
         The Webalizer to ignore any existing incremental (state) data
         file produced by previous runs. By ignoring the incremental
         data file, all previous processing for the current month will
         be lost, and those logs must be re-processed.
         Config file keyword: IgnoreState

-p       Preserve state (incremental processing). This allows the
         processing of partial logs in increments. At the end of
         the program, all relevant internal data is saved, so that
         it may be restored the next time the program is run.
         This
         allows sites that must rotate their logs more than once a
         month to still be able to use The Webalizer, and not worry
         about having to gather and feed an entire month's logs to
         the program at the end of the month. See the section on
         "Incremental Processing" above for additional information.
         The default is to not perform incremental processing. Use
         this command line option to enable the feature.
         Config file keyword: Incremental

-q       Quiet mode. Normally, The Webalizer will produce various
         messages while it runs letting you know what it's doing.
         This option will suppress those messages. It should be
         noted that this WILL NOT suppress errors and warnings, which
         are output to STDERR.
         Config file keyword: Quiet

-Q       ReallyQuiet mode. This allows suppression of _all_ messages
         generated by The Webalizer, including warnings and errors.
         Useful when The Webalizer is run as a cron job.
         Config file keyword: ReallyQuiet

-T       Display timing information. The Webalizer keeps track of the
         time it begins and ends processing, and normally displays the
         total processing time at the end of each run. If quiet mode
         (-q or 'Quiet yes' in configuration file) is specified, this
         information is not displayed. This option forces the display
         of timing totals if quiet mode has been specified, otherwise
         it is redundant and will have no effect.
         Config file keyword: TimeMe

-c file  This option specifies a configuration file to use. Configuration
         files allow greater control over how The Webalizer behaves, and
         there are several ways to use them. As of version 0.98, The
         Webalizer searches for a default configuration file in the
         current directory named "webalizer.conf", and if not found,
         will search in the /etc/ directory for a file of the same name.
         In addition, you may specify a configuration file to use with
         this command line option.
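
         As an illustration only, a small configuration file might look
         like the sketch below. The keywords are ones named in this
         document; every value shown (paths, hostname, table size) is a
         hypothetical placeholder, not a recommended setting. Comments
         sit on their own lines, since any text after a keyword is taken
         as part of its value.

```conf
# Hypothetical example values; adjust all of them for your own site.
LogFile        /var/log/httpd/access_log
OutputDir      /var/www/usage
HostName       www.example.com
Incremental    yes
PageType       htm*
PageType       cgi
VisitTimeout   1800
HideSite       *example.com
TopSites       30
```

         Assuming the file were saved as 'mysite.conf' (a made-up name),
         it could be selected with 'webalizer -c mysite.conf'.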

-n name  This option specifies the hostname for the reports generated.
         The hostname is used in the title of all reports, and is also
         prepended to URLs in the reports. This allows The Webalizer
         to be run on log files for 'virtual' web servers or web servers
         that are different than the machine the reports are located on,
         and still allows clicking on the URLs to go to the proper
         location. If a hostname is not specified, either on the
         command line or in a configuration file, The Webalizer attempts
         to determine the hostname using a 'uname' system call. If this
         fails, "localhost" will be used as the hostname.
         Config file keyword: HostName

-o dir   This option specifies the output directory for the reports.
         If not specified here or in a configuration file, the current
         default directory will be used for output.
         Config file keyword: OutputDir

-x name  This option allows the generated pages to have an extension
         other than '.html', which is the default. Do not include the
         leading period ('.') when you specify the extension.
         Config file keyword: HTMLExtension

-P name  Specify the file extensions for 'pages'. Pages (sometimes
         called 'PageViews') are normally html documents and CGI
         scripts that display the whole page, not just parts of it.
         Some systems will need to define a few more, such as 'phtml',
         'php3' or 'pl' in order to have them counted as well. The
         default is 'htm*' and 'cgi' for web logs and 'txt' for ftp.
         Config file keyword: PageType

-O name  Specify URLs which are not counted as 'pages'. Requests
         matching one of these URLs will not be counted as a page, even
         if they have an extension matching one of the PageTypes defined
         above or have no extension at all.
         Config file keyword: OmitPage

-t name  This option specifies the title string for all reports.
         This
         string is used, in conjunction with the hostname (if not blank)
         to produce the actual title. If not specified, the default of
         "Usage Statistics for" will be used.
         Config file keyword: ReportTitle

-Y       Suppress Country graph. Normally, The Webalizer produces
         country statistics in both Graph and Columnar forms. This
         option will suppress the Country Graph from being generated.
         Config file keyword: CountryGraph

-G       Suppress hourly graph. Normally, The Webalizer produces
         hourly statistics in both Graph and Columnar forms. This
         option will suppress the Hourly Graph only from being generated.
         Config file keyword: HourlyGraph

-H       Suppress Hourly statistics. Normally, The Webalizer produces
         hourly statistics in both Graph and Columnar forms. This
         option will suppress the Hourly Statistics table only from
         being generated.
         Config file keyword: HourlyStats

-K num   Specify how many months should be displayed in the main index
         (yearly summary) table. Default is 12 months. Can be set to
         anything between 12 and 120 months (1 to 10 years).
         Config file keyword: IndexMonths

-k num   Specify how many months should be displayed in the main index
         (yearly summary) graph. Default is 12 months. Can be set to
         anything between 12 and 72 months (1 to 6 years).
         Config file keyword: GraphMonths

-L       Disable Graph Legends. The color coded legends displayed on
         the in-line graphs can be disabled with this option. The
         default is to display the legends.
         Config file keyword: GraphLegend

-l num   Graph Lines. Specify the number of background reference
         lines displayed on the in-line graphics produced. The default
         is 2 lines, but the value can range anywhere from zero ('0')
         for no lines, up to 20 lines (looks funny!).
         Config file keyword: GraphLines

-P name  Page type.
         This is the extension of files you consider to
         be pages for Pages calculations (sometimes called 'pageviews').
         The default is 'htm*' and 'cgi' (plus whatever HTMLExtension
         you specified if it is different). Don't use a period!

-m num   Specify a 'visit timeout'. Visits are calculated by looking at
         the time difference between the current and last request made
         by a specific host. If the difference is greater than the
         visit timeout value, the request is considered a new visit.
         This value is specified in number of seconds. The default
         is 30 minutes (1800).
         Config file keyword: VisitTimeout

-M num   Mangle user agent names. Normally, The Webalizer will keep
         track of the user agent field verbatim. Unfortunately, there are
         a ton of different names that user agents go by, and the field
         also reports other items such as machine type and OS used. For
         example, Netscape 4.03 running on Windows 95 will report a
         different string than Netscape 4.03 running on Windows NT, so even
         though they are the same browser type, they will be considered
         as two totally different browsers by The Webalizer. For that
         matter, Netscape 4.0 running on Windows NT will report different
         names if one is run on an Alpha and the other on an Intel
         processor! Internet Exploder is even worse, as it reports itself
         as if it were Netscape and you have to search the given string a
         little deeper to discover that it is really MSIE! In order to
         consolidate generic browser types, this option will cause The
         Webalizer to 'mangle' the user agent field, attempting to
         consolidate generic browser types. There are 6 levels that can be
         specified, each producing different levels of detail. Level 5
         displays only the browser name (MSIE or Mozilla) and the major
         version number. Level 4 will also display the minor version
         number (single decimal place).
         Level 3 will display the minor
         version number to two decimal places. Level 2 will add any
         sub-level designation (such as Mozilla/3.01Gold or MSIE 3.0b).
         Level 1 will also attempt to add the system type. The default
         Level 0 will disable name mangling and leave the user agent
         field unmodified, producing the greatest amount of detail.
         Configuration file keyword: MangleAgents

-g num   This option allows you to specify the level of domain name
         grouping to be performed. The numeric value represents the
         level of grouping, and can be thought of as the 'number of
         dots' to be displayed. The default value of 0 disables any
         domain name grouping.
         Configuration file keyword: GroupDomains

-D name  This allows the specification of a DNS Cache file name. This
         filename MUST be specified if you have dns lookups enabled
         (using the -N command line switch or DNSChildren configuration
         keyword). The filename is relative to the default output
         directory if an absolute path is not specified (ie: starts
         with a leading '/'). This option is only available if DNS
         support was enabled at compile time, otherwise an 'Invalid
         Keyword' error will be generated. See the DNS.README file
         for additional information regarding DNS lookups.
         Configuration file keyword: DNSCache

-N num   Number of DNS child processes to use for reverse DNS lookups.
         If specified, a DNSCache name MUST be specified also. If you
         do not wish a DNS cache file to be generated, specify a value
         of zero ('0') to disable it. This does not prevent using an
         existing cache file, only the generation of one at run time.
         See the DNS.README file for additional information.
         Configuration file keyword: DNSChildren

-j       Enable native GeoDB geolocation services.
         Configuration file keyword: GeoDB

-J name  Specify an alternate GeoDB database filename to use. This
         shouldn't normally be needed.
         If used, the filename 'name'
         is relative to the output directory being used unless an
         absolute path is specified (ie: starts with a leading '/').
         Configuration file keyword: GeoDBDatabase

-w       Enable GeoIP support if it is available.
         Configuration file keyword: GeoIP

-W name  Specify an alternate GeoIP database filename to use. This
         shouldn't normally be needed. If used, the filename 'name'
         is relative to the specified output directory unless an
         absolute name is given (ie: starts with a leading '/').
         Configuration file keyword: GeoIPDatabase

-z name  Specify location of the country flag graphics and enable
         their display in the top country table. The directory name
         is relative to the output directory unless an absolute path
         is specified (ie: starts with a leading '/').
         Configuration file keyword: FlagDir

Hide Options
------------

The following options take a string argument to use as a comparison
for matching. Except for the IndexAlias option, the string argument
can be plain text, or plain text that either starts or ends with the
wildcard character '*'.

For Example:

Given the string "yourmama/was/here", the arguments "was", "*here" and
"your*" will all produce a match.


-a name  This option allows hiding of user agents (browsers) from the
         "Top User Agents" table in the report. This option really
         isn't too useful as there are a zillion different names that
         current browsers go by, depending where they were obtained,
         however you might have some particular user agents that hit
         your site a lot that you would like to exclude from the list.
         You must have a web server that includes user agents in its
         log files for this option to be of any use. In addition, it
         is also useless if you disable the user agent table in the
         report (see the -A command line option or "TopAgents"
         configuration file keyword).
         You can specify as many of these
         as you want on the command line. The wildcard character '*'
         can be used either in front of or at the end of the string.
         (ie: Mozilla/4.0* would match anything that starts with the
         string "Mozilla/4.0").
         Config file keyword: HideAgent

-r name  This option allows hiding of referrers from the "Top Referrer"
         table in the report. Referrers are URLs, either on your own
         local site or a remote site, that referred the user to a URL
         on your web server. This option is normally used to hide
         your own server from the table, as your own pages are usually
         the top referrers to your own pages (well, you get the idea).
         You must have a web server that includes referrer information
         in the log files for this option to be of any use. In addition,
         it is also useless if you disable the referrers table in the
         report (see the -R command line option or "TopReferrers"
         configuration file keyword). You can specify as many of these
         as you like on the command line.
         Config file keyword: HideReferrer

-s name  This option allows hiding of sites from the "Top Sites" table
         in the report. Normally, you will only want to hide your own
         domain name from the report, as it usually is one of the top
         sites to visit your web server. This option is of no use if
         you disable the top sites table in the report (see the -S
         command line option or "TopSites" configuration file option).
         Config file keyword: HideSite

-X       This causes all individual sites to be hidden, which results
         in only grouped sites being displayed on the report.
         Config file keyword: HideAllSites

-u name  This option allows hiding of URLs from the "Top URLs" table
         in the report. Normally, this option is used to hide images,
         audio files and other objects your web server dishes out that
         would otherwise clutter up the table.
         This option is of no
         use if you disable the top URLs table in the report (see the
         -U command line option or "TopURLs" configuration file keyword).
         Config file keyword: HideURL

-I name  This option allows you to specify additional index.html aliases.
         The Webalizer usually strips the string 'index.*' from URLs
         before processing (unless disabled using the 'DefaultIndex'
         config option), which has the effect of turning a URL such
         as /somedir/index.html into just /somedir/ which is really the
         same URL and should be treated as such. This option allows you
         to specify _additional_ strings that are to be treated the same
         way. Use with care, improper use could cause unexpected results.
         For example, if you specify the alias string of 'home', a URL
         such as /somedir/homepages/brad/home.html would be converted
         into just /somedir/ which probably isn't what was intended.
         This option is useful if your web server uses a different default
         index page other than the standard 'index.html' or 'index.htm',
         such as 'home.html' or 'homepage.html'. The string specified
         is searched for _anywhere_ in the URL, so "home.htm" would
         turn both "/somedir/home.htm" and "/somedir/home.html" into
         just "/somedir/". Wildcards are _not_ allowed on this one.
         Config file keyword: IndexAlias

Table Size Options
------------------

-e num   This option specifies the number of entries to display in the
         "Top Entry Pages" table. To disable the table, use a value of
         zero (0).
         Config file keyword: TopEntry

-E num   This option specifies the number of entries to display in the
         "Top Exit Pages" table. To disable the table, use a value of
         zero (0).
         Config file keyword: TopExit

-A num   This option specifies the number of entries to display in the
         "Top User Agents" table. To disable the table, use a value of
         zero (0).
          Config file keyword: TopAgents

-C num    This option specifies the number of entries to display in the
          "Top Countries" table. To disable the table, use a value of
          zero (0).
          Config file keyword: TopCountries

-R num    This option specifies the number of entries to display in the
          "Top Referrers" table. To disable the table, use a value of
          zero (0).
          Config file keyword: TopReferrers

-S num    This option specifies the number of entries to display in the
          "Top Sites" table. To disable the table, use a value of
          zero (0).
          Config file keyword: TopSites

-U num    This option specifies the number of entries to display in the
          "Top URLs" table. To disable the table, use a value of
          zero (0).
          Config file keyword: TopURLs

--------------------------------------------------------------------------


CONFIGURATION FILES
-------------------

The Webalizer allows configuration files to be used in order to simplify
life for all. There are several ways that configuration files are accessed
by the Webalizer. When The Webalizer first executes, it looks for a
default configuration file named "webalizer.conf" in the current directory,
and if not found there, will look for "/etc/webalizer.conf". In addition,
configuration files may be specified on the command line with the '-c'
option. There are lots of different ways you can combine the use of
configuration files and command line options to produce various results.
The Webalizer always looks for and reads configuration options from a
default configuration file before doing anything else. Because of this,
you can override options found in the default file by use of additional
configuration files specified on the command line or command line options
themselves. If you specify a configuration file on the command line, you
can override options in it by additional command line options which follow.
For example, most users will likely want to create the default file
/etc/webalizer.conf and place options in it to specify the hostname, log
file, table options, etc... At the end of the month when a different log
file is to be used (the end of month log), you can run The Webalizer as
usual, but put the different filename on the end of the command line, which
will override the log file specified in the configuration file. It should
be noted that you cannot override some configuration file options by the
use of command line arguments. For example, if you specify "Quiet yes" in
a configuration file, you cannot override this with a command line argument,
as the command line option only _enables_ the feature (-q option).

The configuration files are standard ASCII text files that may be created
or edited using any standard editor. Blank lines and lines that begin
with a pound sign ('#') are ignored. Any other lines are considered to
be configuration lines, and have the form "Keyword Value", where the
'Keyword' is one of the currently available configuration keywords defined
below, and 'Value' is the value to assign to that particular option. Any
text found after the keyword up to the end of the line is considered the
keyword's value, so you should not include anything after the actual value
on the line that is not actually part of the value being assigned. The
file "sample.conf" provided with the distribution contains lots of useful
documentation and examples as well. It should be noted that you do not
have to use any configuration files at all, in which case, default values
will be used (which should be sufficient for most sites).

--------------------------------------------------------------------------

General Configuration Keywords
------------------------------

LogFile         This defines the log file to use.
                It should be a fully qualified name (ie: contain the
                path), but relative names will work as well. If not
                specified, the logfile defaults to STDIN.

LogType         This specifies the log file type being used. Normally, The
                Webalizer processes web logs in either CLF or Combined format.
                You may also process wu-ftpd xferlog formatted logs, squid
                proxy logs or W3C formatted web logs by setting the appropriate
                type using this keyword. Values may be either 'clf', 'ftp',
                'squid' or 'w3c'. Ensure that you specify the proper file type,
                otherwise you will be presented with a long stream of 'invalid
                record' messages when the Webalizer is run ;)
                Command line argument: -F

OutputDir       This defines the output directory to use for the reports. If
                it is not specified, the current directory is used.
                Command line argument: -o

HistoryName     Allows specification of a history path/filename if desired.
                The default is to use the file named 'webalizer.hist', kept
                in the normal output directory (OutputDir above). Any name
                specified is relative to the normal output directory unless
                an absolute path name is given (ie: starts with a '/').

ReportTitle     This specifies the title to use for the generated reports.
                It is used in conjunction with the hostname (unless blank)
                to produce the final report titles. If not defined, the
                default of "Usage Statistics for" is used.
                Command line argument: -t

HostName        This defines the hostname. The hostname is used in the
                report title as well as being prepended to URLs in the
                "Top URLs" table. This allows The Webalizer to be run
                on "virtual" web servers, or servers that do not reside
                on the local machine, and allows clicking on the URL to
                go to the right place. If not specified, The Webalizer
                attempts to get the hostname via a 'uname' system call,
                and if that fails, will default to "localhost".
                Command line argument: -n

UseHTTPS        Causes the links in the 'Top URLs' table to use 'https://'
                instead of the default 'http://' prefix. Not much use if
                you run a mix of secure/insecure servers on your machine.
                Only useful if you run the analysis on a secure server's
                logs, and want the links in the table to work properly.

HTAccess        Enables the creation of a default .htaccess file in the
                output directory. If enabled, the file will be created
                (with a single "DirectoryIndex" directive), unless one
                already exists. The default is 'no', which disables the
                creation of any .htaccess files.

Quiet           This allows you to enable or disable informational messages
                while it is running. The values for this keyword can be
                either 'yes' or 'no'. Using "Quiet yes" will suppress these
                messages, while "Quiet no" will enable them. The default
                is 'no' if not specified, which will allow The Webalizer
                to display informational messages. It should be noted that
                this option has no effect on Warning or Error messages that
                may be generated, as they go to STDERR.
                Command line argument: -q

ReallyQuiet     This allows all generated output to be suppressed, including
                warning and error messages. The values for this keyword
                can be either 'yes' or 'no', with 'no' being the default.
                Command line argument: -Q

TimeMe          This allows you to display timing information regardless of
                any "quiet mode" specified. Useful only if you did in fact
                tell the webalizer to be quiet either by using the -q command
                line option or the "Quiet" keyword, otherwise timing stats
                are normally displayed anyway. Values may be either 'yes'
                or 'no', with the default being 'no'.
                Command line argument: -T

GMTTime         This keyword allows timestamps to be displayed in GMT (UTC)
                time instead of local time.
                Normally The Webalizer will display timestamps in the
                time-zone of the local machine (ie: PST or EDT). This
                keyword allows you to specify the display of timestamps
                in GMT (UTC) time instead. Values may be either 'yes' or
                'no'. Default is 'no'.

Debug           This tells The Webalizer to display additional information
                when it encounters Warnings or Errors. Normally, The
                Webalizer will just tell you it found a bad record or
                field. This option will enable the display of the actual
                data that produced the Warning or Error as well. Useful
                only if you start getting lots of Warnings or Errors and
                want to determine the cause. Values may be either 'yes'
                or 'no', with the default being 'no'.
                Command line argument: -d

IgnoreHist      This suppresses the reading of a history file. USE WITH
                EXTREME CAUTION as the history file is how The Webalizer
                keeps track of previous months. The effect of this option
                is as if The Webalizer was being run for the very first
                time, and any previous data is discarded. Values may be
                either 'yes' or 'no', with the default being 'no'.
                Command line argument: -i

IgnoreState     This suppresses the reading of an existing incremental
                data file. USE WITH EXTREME CAUTION! By ignoring an
                existing incremental data file, all previous processing
                for the current month will be lost, and those logs must
                be re-processed. Values may be 'yes' or 'no', with the
                default being 'no'.
                Command line argument: -b

FoldSeqErr      Allows log records that are out of sequence to be folded
                back into the analysis, by treating them as if they had
                the same date/time as the last good record. Normally,
                out of sequence log records are simply ignored. If you
                run apache, don't worry about this.

VisitTimeout    Set the 'visit timeout' value. Visits are determined by
                looking at the time difference between the current and last
                request made by a specific site.
                If the difference in time is greater than the visit
                timeout value, the request is considered a new visit.
                The value is in number of seconds, and defaults to
                30 minutes (1800).
                Command line argument: -m

PageType        Allows you to define the 'page' type extension. Normally,
                people consider HTML and CGI scripts as 'pages'. This
                option allows you to specify what extensions you consider
                a page. Default is 'htm*' and 'cgi' for web logs, and
                'txt' for ftp logs.
                Command line argument: -P

PagePrefix      Allows all requests with a specified prefix to be considered
                as 'pages'. Useful if you want everything under /documents
                to be treated as pages no matter what their extension is.
                Also useful if you have cgi-scripts with PATH_INFO.

OmitPage        Allows specified URLs to not be counted as pages under any
                circumstance, even if they have an extension matching a
                PageType or PagePrefix as defined above.

GraphLegend     Enable/disable the display of color coded legends on the
                produced graphs. Default is 'yes', to display them.
                Command line argument: -L

GraphLines      Specify the number of background reference lines to display
                on produced graphs. The default is 2. To disable the use
                of background lines, use zero ('0').
                Command line argument: -l

IndexMonths     Specify the number of months to display in the main index
                (yearly summary) table. Default is 12 months. Can be set
                to anything between 12 and 120 months (1 to 10 years).
                Command line argument: -K

YearHeaders     Enable/disable the display of year headers in the main index
                (yearly summary) table. If enabled, year headers will be
                shown when the table is displaying more than 16 months worth
                of data. Values can be 'yes' or 'no'. Default is 'yes'.

GraphMonths     Specify the number of months to display in the main index
                (yearly summary) graph. Default is 12 months.
                Can be set to anything between 12 and 72 months
                (1 to 6 years).
                Command line argument: -k

CountryGraph    This keyword is used to either enable or disable the creation
                and display of the Country Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.
                Command line argument: -Y

CountryFlags    Enables or disables the display of flags in the top country
                table. If enabled, the default directory 'flags' directly
                under the output directory will be used unless a different
                path is specified with the 'FlagDir' option below.
                Command line argument: -zflags

FlagDir         Specifies the location of flag graphics. If not specified,
                the default is in the 'flags' directory directly under the
                output directory being used for the reports. If specified,
                the display of flags will be enabled by default.
                Command line argument: -z

DailyGraph      This keyword is used to either enable or disable the creation
                and display of the Daily Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.

DailyStats      This keyword is used to either enable or disable the creation
                and display of the Daily Usage statistics table. Values may
                be either 'yes' or 'no', with the default being 'yes'.

HourlyGraph     This keyword is used to either enable or disable the creation
                and display of the Hourly Usage graph. Values may be either
                'yes' or 'no', with the default being 'yes'.
                Command line argument: -G

HourlyStats     This keyword is used to either enable or disable the creation
                and display of the Hourly Usage statistics table. Values may
                be either 'yes' or 'no', with the default being 'yes'.
                Command line argument: -H

IndexAlias      This allows additional 'index.html' aliases to be defined.
                Normally, The Webalizer scans for and strips the string
                "index."
                from URLs before processing them (unless disabled
                using the DefaultIndex config option below). This turns a
                URL such as /somedir/index.html into just /somedir/ which
                is really the same URL. This keyword allows _additional_
                names to be treated in the same fashion for sites that use
                different default names, such as "home.html". The string
                is scanned for anywhere in the URL, so care should be used
                if and when you define additional aliases. For example,
                if you were to use an alias such as 'home', the URL
                /somedir/homepages/brad/home.html would be turned into just
                /somedir/ which probably isn't the intended result. Instead,
                you should have specified 'home.htm' which would correctly
                turn the URL into /somedir/homepages/brad/ as intended.
                It should also be noted that specified aliases are scanned
                for in EVERY log record... A bunch of aliases will noticeably
                degrade performance as each record has to be scanned for
                every alias defined. You don't have to specify 'index.' as
                it is always the default (unless disabled with the config
                option "DefaultIndex" described below).
                Command line argument: -I

DefaultIndex    This option is used to enable/disable the use of "index." as
                a default index name to be stripped from the end of a URL.
                Most sites should not need to use this option, however some
                may find it useful, particularly those whose default index
                file name is something different, or those sites that use
                'index.php' or similar URLs to generate dynamic content.
                This option does not affect any of the names that may be
                defined using the IndexAlias option, and those names will
                still function as described. Values may be 'yes' or 'no',
                with 'yes' being the default.

MangleAgents    The MangleAgents keyword specifies the level of user agent
                name mangling, if any. There are 6 levels that may be specified,
                each producing a different level of detail displayed.
                Level 5 displays only the browser name (MSIE or Mozilla)
                and the major version number. Level 4 adds the minor
                version (single decimal place). Level 3 adds the minor
                version to two decimal places. Level 2 will also add any
                sub-level designation (such as Mozilla/3.01Gold or
                MSIE 3.0b). Level 1 will also attempt to add the system
                type. The default level 0 will leave the user agent field
                unmodified and produces the greatest amount of detail.
                Command line argument: -M

SearchEngine    This keyword allows specification of search engines and
                their query strings. Search strings are obtained from
                the referrer field in the record, and in order to work
                properly, the Webalizer needs to know what query strings
                different search engines use. The SearchEngine keyword
                allows you to specify the search engine and its query
                string to parse the search string from. The line is
                formatted as: "SearchEngine engine-string query-string"
                where 'engine-string' is a substring for matching the
                search engine with, such as "yahoo.com" or "altavista".
                The 'query-string' is the unique query string that is
                added to the URL for the search engine, such as "search="
                or "MT=" with the actual search strings appended to the
                end. There is no command line option for this keyword.

SearchCaseI     The SearchCaseI option specifies if search strings should
                be lowercased (case insensitive) or not. Since most
                search engines use case insensitive searches (ie: a
                search for "Hello" is the same as "HELLO" or "hello"),
                converting to lowercase will improve keyword accuracy,
                which is the default. If desired, case sensitivity can
                be forced with this option. The value can be 'yes' or
                'no', with 'yes' (case insensitive) being the default.

Incremental     This allows incremental processing to be enabled or disabled.
                Incremental processing allows processing partial logs without
                the loss of detail data from previous runs in the same month.
                This feature saves the 'internal state' of the program so that
                it may be restored in following runs. See the section above
                titled "Incremental Processing" for additional information.
                The value may be 'yes' or 'no', with the default being 'no'.
                Command line argument: -p

IncrementalName
                Allows specification of the incremental data filename if
                desired. Normally, the file named 'webalizer.current' is
                used, kept in the standard output directory. If specified,
                filenames are relative to the standard output directory,
                unless an absolute name is given (ie: starts with '/').

StripCGI        Determines if CGI variables should be stripped from the
                end of URLs or not. Normally, these variables are removed
                from URLs to improve accuracy, however some sites may wish
                to keep them preserved (particularly on highly dynamic
                sites). Values may be either 'yes' or 'no', with 'yes'
                being the default.

TrimSquidURL    Allows squid log URLs to be reduced in granularity by
                truncating them after a specified number of '/' path
                separators after the http:// portion. A value of 1 will
                cause all URLs to be summarized by domain only. The
                default value is zero (0), which leaves URLs unmodified.

DNSCache        Specifies the DNS cache filename. This name is relative
                to the default output directory unless an absolute name
                is given (ie: starts with '/'). See the DNS.README file
                for additional information.
                Command line argument: -D

DNSChildren     The number of DNS children processes to run in order to
                create/update the DNS cache file. If specified, the DNS
                cache filename must also be specified (see above). Use
                a value of zero ('0') to disable. See the DNS.README
                file for additional information.
                Command line argument: -N

CacheIPs        Specifies if unresolved addresses should also be cached
                in the DNS database. If enabled, unresolved IP addresses
                will be stored along with resolved addresses. This may
                be useful on some sites that have lots of unresolved IPs
                visiting so they are not looked up each time the program
                is run. Values may be 'yes' or 'no'. Default is 'no'.

CacheTTL        Specifies the Time To Live (TTL) value for cached DNS
                entries in days. Default value is 7 (1 week). Can be
                any value between 1 and 100.

GeoDB           Controls the use of the native GeoDB geolocation services
                provided by The Webalizer. Values may be 'yes' or 'no'
                with 'no' being the default.
                Command line argument: -j

GeoDBDatabase   Specifies an alternate GeoDB database filename to use.
                This is relative to the output directory being used unless
                an absolute path is given (ie: starts with a '/').
                Command line argument: -J

GeoIP           Controls the use of GeoIP geolocation services. If The
                Webalizer was compiled with GeoIP support, it is used by
                default. Values may be 'yes' or 'no'. Default is 'yes'.
                Command line argument: -w

GeoIPDatabase   Specifies an alternate GeoIP database filename to use.
                This name is relative to the default output directory
                unless an absolute name is given (ie: starts with '/').
                Command line argument: -W


Top Table Keywords
------------------

TopAgents       This allows you to specify how many "Top" user agents are
                displayed in the "Top User Agents" table. The default
                is 15. If you do not want to display user agent statistics,
                specify a value of zero (0). The display of user agents
                will only work if your web server includes this information
                in its log file (ie: a combined log format file).
                Command line argument: -A

AllAgents       Will cause a separate HTML page to be generated for all
                normally visible User Agents. A link will be added to
                the bottom of the "Top User Agents" table if enabled.
                Value can be either 'yes' or 'no', with 'no' being the
                default.

TopCountries    This allows you to specify how many "Top" countries are
                displayed in the "Top Countries" table. The default is
                30. If you want to disable the countries table, specify
                a value of zero (0).
                Command line argument: -C

TopReferrers    This allows you to specify how many "Top" referrers are
                displayed in the "Top Referrers" table. The default is
                30. If you want to disable the referrers table, specify
                a value of zero (0). The display of referrer information
                will only work if your web server includes this information
                in its log file (ie: a combined log format file).
                Command line argument: -R

AllReferrers    Will cause a separate HTML page to be generated for all
                normally visible Referrers. A link will be added to the
                "Top Referrers" table if enabled. Value can be either
                'yes' or 'no', with 'no' being the default.

TopSites        This allows you to specify how many "Top" sites are
                displayed in the "Top Sites" table. The default is 30.
                If you want to disable the sites table, specify a value
                of zero (0).
                Command line argument: -S

TopKSites       Identical to TopSites, except for the 'by KByte' table.
                Default is 10. No command line switch for this one.

AllSites        Will cause a separate HTML page to be generated for all
                normally visible Sites. A link will be added to the
                bottom of the "Top Sites" table if enabled. Value can
                be either 'yes' or 'no', with 'no' being the default.

TopURLs         This allows you to specify how many "Top" URLs are
                displayed in the "Top URLs" table. The default is 30.
                If you want to disable the URLs table, specify a value
                of zero (0).
                Command line argument: -U

TopKURLs        Identical to TopURLs, except for the 'by KByte' table.
                Default is 10. No command line switch for this one.

AllURLs         Will cause a separate HTML page to be generated for all
                normally visible URLs. A link will be added to the
                bottom of the "Top URLs" table if enabled. Value can
                be either 'yes' or 'no', with 'no' being the default.

TopEntry        Allows you to specify how many "Top Entry Pages" are
                displayed in the table. The default is 10. If you
                want to disable the table, specify a value of zero (0).
                Command line argument: -e

TopExit         Allows you to specify how many "Top Exit Pages" are
                displayed in the table. The default is 10. If you
                want to disable the table, specify a value of zero (0).
                Command line argument: -E

TopSearch       Allows you to specify how many "Top Search Strings" are
                displayed in the table. The default is 20. If you
                want to disable the table, specify a value of zero (0).
                Only works if using combined log format (ie: contains
                referrer information).

TopUsers        This allows you to specify how many "Top" usernames are
                displayed in the "Top Usernames" table. Usernames are
                only available if you use http authentication on your
                web server, or when processing wu-ftpd xferlogs. The
                default value is 20. If you want to disable the Username
                table, specify a value of zero (0).

AllUsers        Will cause a separate HTML page to be generated for all
                normally visible usernames. A link will be added to the
                bottom of the "Top Usernames" table if enabled. Value
                can be either 'yes' or 'no', with 'no' being the default.

AllSearchStr    Will cause a separate HTML page to be generated for all
                normally visible Search Strings.
                A link will be added to the bottom of the "Top Search
                Strings" table if enabled. Value can be either 'yes'
                or 'no', with 'no' being the default.


Hide Object Keywords
--------------------

These keywords allow you to hide user agents, referrers, sites, URLs
and usernames from the various "Top" tables. The values for these keywords
are the same as those used in their command line counterparts. You
can specify as many of these as you want without limit. Refer to the
section above on "Command Line Options" for a description of the string
formatting used as the value. Values cannot exceed 80 characters in
length.

HideAgent       This allows specified user agents to be hidden from the
                "Top User Agents" table. Not very useful, since there
                are a zillion different names that browsers go by today,
                but could be useful if there is a particular user agent
                (ie: robots, spiders, real-audio, etc..) that hits your
                site frequently enough to make it into the top user agent
                listing. This keyword is useless if 1) your log file does
                not provide user agent information or 2) you disable the
                user agent table.
                Command line argument: -a

HideReferrer    This allows you to hide specified referrers from the
                "Top Referrers" table. Normally, you would only specify
                your own web server to be hidden, as it is usually the
                top generator of references to your own pages. Of course,
                this keyword is useless if 1) your log file does not include
                referrer information or 2) you disable the top referrers
                table.
                Command line argument: -r

HideSite        This allows you to hide specified sites from the "Top
                Sites" table. Normally, you would only specify your own
                web server or other local machines to be hidden, as they
                are usually the highest hitters of your web site, especially
                if you have their browser's home page pointing to it.
                Command line argument: -s

HideAllSites    This allows hiding all individual sites from the display,
                which can be useful when a lot of groupings are being
                used (since grouped records cannot be hidden). It is
                particularly useful in conjunction with the GroupDomain
                feature, however can be useful in other situations as well.
                Value can be either 'yes' or 'no', with 'no' the default.
                Command line argument: -X

HideURL         This allows you to hide URLs from the "Top URLs" table.
                Normally, this is used to hide items such as graphic files,
                audio files or other 'non-html' files that are transferred
                to the visiting user.
                Command line argument: -u

HideUser        This allows you to hide Usernames from the "Top Usernames"
                table. Usernames are only available if you use http based
                authentication on your web server.


Group Object Keywords
---------------------

The Group* keywords allow object grouping based on Site, URL, Referrer,
User Agent and Usernames. Combined with the Hide* keywords, you can
customize exactly what will be displayed in the 'Top' tables. For example,
to only display totals for a particular directory, use a GroupURL and
HideURL with the same value (ie: '/help/*'). Group processing is only
done after the individual record has been fully processed, so name mangling
and site total updates have already been performed. Because of this, groups
are not counted in the main site total (as that would cause duplication).
Groups can be displayed in bold and shaded as well. Grouped records are
not, by default, hidden from the report. This allows you to display a
grouped total, while still being able to see the individual records, even
if they are part of the group. If you want to hide the detail records,
follow the Group* directive with a Hide* one using the same value.
There are no command line switches for these keywords. The Group* keywords
also accept an optional label to be displayed instead of the actual value
used. This label should be separated from the value by at least one
whitespace character, such as a space or tab character. If the match string
contains whitespace (spaces or tabs), the string should be quoted, using
either single or double quotes. See the sample configuration file for
examples.

GroupReferrer   Allows grouping Referrers. Can be handy for some of the
                major search engines that have multiple host names a
                referral could come from.

GroupURL        This keyword allows grouping URLs. Useful for grouping
                complete directory trees.

GroupSite       This keyword allows grouping Sites. Most used for
                grouping top level domains and unresolved IP addresses
                for local dial-ups, etc...

GroupAgent      Groups User Agents. A handy example of how you could use
                this one is to use "Mozilla" and "MSIE" as the values for
                GroupAgent and HideAgent keywords. Make sure you put the
                "MSIE" one first.

GroupDomains    Allows automatic grouping of domains. The numeric value
                represents the level of grouping, and can be thought of
                as 'the number of dots' to display. A 1 will display
                second level domains only (xxx.xxx), a 2 will display
                third level domains (xxx.xxx.xxx) etc... The default
                value of 0 disables any domain grouping.
                Command line argument: -g

GroupUser       Allows grouping of usernames. Combined with a group
                name, this can be handy for displaying statistics on
                a particular group of users without displaying their
                real usernames.

GroupShading    Allows shading of table rows for groups. Value can be
                'yes' or 'no', with the default being 'yes'.

GroupHighlight  Allows bolding of table rows for groups. Value can be
                'yes' or 'no', with the default being 'yes'.
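
As an illustration of the GroupDomains 'number of dots' rule described
above, the following Python sketch shows one way such a grouping could be
computed. It is not taken from The Webalizer source. The function name
group_domain is hypothetical, and the handling of short names and of
numeric (unresolved) addresses is an assumption made for this example only.

```python
from typing import Optional


def group_domain(hostname: str, level: int) -> Optional[str]:
    """Return the domain group for hostname, keeping `level` dots.

    Hypothetical helper, not Webalizer code. A level of 0 means
    grouping is disabled, so None is returned.
    """
    if level <= 0:
        return None
    labels = hostname.strip(".").split(".")
    # Assumption: numeric addresses are left alone in this sketch.
    if all(label.isdigit() for label in labels):
        return hostname
    # A name shown with `level` dots has level + 1 labels.
    keep = level + 1
    if len(labels) <= keep:
        return hostname
    return ".".join(labels[-keep:])


# Level 1 keeps the second level domain, level 2 the third level domain.
second = group_domain("www.example.com", 1)
third = group_domain("a.b.example.com", 2)
```

With these inputs, 'second' holds "example.com" and 'third' holds
"b.example.com", matching the xxx.xxx and xxx.xxx.xxx forms given for
levels 1 and 2.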


Ignore/Include Object Keywords
------------------------------

These keywords allow you to completely ignore log records when generating
statistics, or to force their inclusion regardless of ignore criteria.
Records can be ignored or included based on site, URL, user agent, referrer
and username. Be aware that by choosing to ignore records, the accuracy of
the generated statistics becomes skewed, making it impossible to produce
an accurate representation of load on the web server. These keywords
behave identically to the Hide* keywords above, where the value can have
a leading or trailing wildcard '*'. These keywords, like the Hide* ones,
have an absolute limit of 80 characters for their values. These keywords
do not have any command line switch counterparts, so they may only be
specified in a configuration file. It should also be pointed out that
using the Ignore/Include combination to selectively exclude an entire
site while including a particular 'chunk' is _extremely_ inefficient,
and should be avoided. Try grep'ing the records into a separate file
and process it instead.

IgnoreSite      This allows specified sites to be completely ignored from
                the generated statistics.

IgnoreURL       This allows specified URLs to be completely ignored from
                the generated statistics. One use for this keyword would
                be to ignore all hits to a 'temporary' directory where
                development work is being done, but is not accessible to
                the outside world.

IgnoreReferrer  This allows records to be ignored based on the referrer
                field.

IgnoreAgent     This allows specified User Agent records to be completely
                ignored from the statistics. Maybe useful if you really
                don't want to see all those hits from MSIE :)

IgnoreUser      This allows specified username records to be completely
                ignored from the statistics.
                Usernames can only be used if you use http authentication
                on your server.

IncludeSite     Force the record to be processed based on hostname. This
                takes precedence over the Ignore* keywords.

IncludeURL      Force the record to be processed based on URL. This takes
                precedence over the Ignore* keywords.

IncludeReferrer Force the record to be processed based on referrer.
                This takes precedence over the Ignore* keywords.

IncludeAgent    Force the record to be processed based on user agent.
                This takes precedence over the Ignore* keywords.

IncludeUser     Force the record to be processed based on username.
                Usernames are only available if you use http based
                authentication on your server. This takes precedence over
                the Ignore* keywords.


Dump Object Keywords
--------------------

The Dump* Keywords allow text files to be generated that can then be used
for import into most database, spreadsheet and other external programs.
The file is a standard tab delimited text file, meaning that each column
is separated by a tab (0x09) character. A header record may be included
if required, using the 'DumpHeader' keyword. Since these files contain
all records that have been processed, including normally hidden records,
an alternate location for the files can be specified using the 'DumpPath'
keyword, otherwise they will be located in the default output directory.

DumpPath        Specifies an alternate location for the dump files. The
                default output location will be used otherwise. The value
                is the path portion to use, and normally should be an
                absolute path (ie: has a leading '/' character), however
                relative path names can be used as well, and will be
                relative to the output directory location.

DumpExtension   Allows the dump filename extensions to be specified.
                The default extension is "tab", but it may be changed
                with this option.

DumpHeader      Allows a header record to be written as the first record
                of the file. Value can be either 'yes' or 'no', with
                the default being 'no'.

DumpSites       Dump tab delimited sites file. Value can be either 'yes'
                or 'no', with the default being 'no'. The filename used
                is site_YYYYMM.tab (YYYY=year, MM=month).

DumpURLs        Dump tab delimited URL file. Value can be either 'yes' or
                'no', with the default being 'no'. The filename used is
                url_YYYYMM.tab (YYYY=year, MM=month).

DumpReferrers   Dump tab delimited referrer file. Value can be either
                'yes' or 'no', with the default being 'no'. Filename
                used is ref_YYYYMM.tab (YYYY=year, MM=month). Referrer
                information is only available if present in the log
                file (ie: combined web server log).

DumpAgents      Dump tab delimited user agent file. Value can be either
                'yes' or 'no', with the default being 'no'. Filename
                used is agent_YYYYMM.tab (YYYY=year, MM=month). User
                agent information is only available if present in the
                log file (ie: combined web server log).

DumpUsers       Dump tab delimited username file. Value can be either
                'yes' or 'no', with the default being 'no'. Filename
                used is user_YYYYMM.tab (YYYY=year, MM=month). The
                username data is only available if processing a wu-ftpd
                xferlog or http authentication is used on the web server
                and that information is present in the log.

DumpSearchStr   Dump tab delimited search string file. Value can be
                either 'yes' or 'no', with the default being 'no'.
                Filename used is search_YYYYMM.tab (YYYY=year, MM=month).
                The search string data is only available if referrer
                information is present in the log being processed and
                recognized search engines were found and processed.
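
As an illustration of how a dump file might be consumed by an external
program, here is a minimal Python sketch. It is not part of The Webalizer.
The helper names (dump_filename, read_dump) and the sample column names
are made up for the example; the actual columns depend on which dump file
is read and are not listed in this section. The sketch only relies on what
is stated above: tab delimited fields, an optional header record, and the
kind_YYYYMM.ext file naming.

```python
import csv
import io


def dump_filename(kind, year, month, ext="tab"):
    """Build a dump file name such as url_201301.tab (hypothetical helper)."""
    return "%s_%04d%02d.%s" % (kind, year, month, ext)


def read_dump(stream, has_header=True):
    """Read a tab delimited dump and return (header, rows).

    'header' is None when the file was written with DumpHeader set to 'no'.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]  # skip empty lines
    header = rows.pop(0) if has_header and rows else None
    return header, rows


if __name__ == "__main__":
    # Made-up sample data; real column names and order may differ.
    sample = io.StringIO("Hits\tURL\n12\t/index.html\n3\t/docs/\n")
    header, rows = read_dump(sample)
    print(dump_filename("url", 2013, 1))
    print(header)
    for row in rows:
        print(row)
```

To read a real file, open it with open(path, newline="") and pass the file
object to read_dump, setting has_header to match your DumpHeader value.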



HTML Generation Keywords
------------------------

These keywords allow you to customize the HTML code that The Webalizer
produces, such as adding a corporate logo or links to other web pages.
You can specify as many of these keywords as you like, and they will be
used in the order that they are found in the file. Values cannot exceed
80 characters in length, so you may have to break long lines up into two
or more lines. With the exception of HTMLExtension (-x), there are no
command line counterparts to these keywords.

HTMLExtension   Allows generated pages to use something other than the
                default 'html' extension for the filenames. Do not
                include the leading period ('.') when you specify the
                extension.
                Command line argument: -x

HTMLPre         Allows code to be inserted at the very beginning of the
                HTML files. Defaults to the standard HTML 3.2 DOCTYPE
                record. Be careful not to include any HTML here, as it
                is inserted _before_ the <HTML> tag in the file. Use it
                for server-side scripting capabilities, such as php3, to
                insert scripting files and other directives.

HTMLHead        Allows you to insert HTML code inside the <HEAD></HEAD>
                block. There is no default. Useful for adding scripts
                to the HTML page, such as Javascript or php3, or even
                just for adding a few META tags to the document.

HTMLBody        This keyword defines HTML code to be placed immediately
                after the <HEAD> section of the report, just before the
                title and "summary period/generated on" lines. If used,
                the first HTMLBody line MUST include a <BODY> tag. Put
                whatever else you want in subsequent lines, but keep in
                mind the placement of this code in relation to the title
                and other aspects of the web page. Some typical uses
                are to change the page colors and possibly add a corporate
                logo (graphic) in the top right.
                If not specified, a default <BODY> tag is used that
                defines page color, text color and link colors (see
                "sample.conf" file for example).

HTMLPost        This keyword defines HTML code that is placed after the
                title and "summary period/generated on" lines, just before
                the initial horizontal rule <HR> tag. Normally this keyword
                isn't needed, but is provided in case you included a large
                graphic or some other weird formatting tag in the HTMLBody
                section that needs to be cleaned up or terminated before the
                main report section.

HTMLTail        This keyword defines HTML code that is placed at the bottom
                right side of the report. It is inserted in a <TABLE> section
                between table data <TD>..</TD> tags, and is top and right
                aligned within the table. Normally this keyword is used to
                provide a link back to your home page or insert a small
                graphic at the bottom right of the page.

HTMLEnd         This allows insertion of closing code, at the very end of
                the page. The default is to put the closing </BODY> and
                </HTML> tags. If specified, you _must_ specify these tags
                yourself.

LinkReferrer    This specifies if the referrers listed in the top referrer
                table should be displayed as plain text, or as a link to the
                referrer. Values can be either 'yes' or 'no', with 'no'
                being the default.


Graph Color Commands
--------------------

These keywords allow altering the colors used in the various graphs
produced by the Webalizer. The value is specified as a standard HTML
RGB hexadecimal color string, without the leading '#' character. The
value is case insensitive. If not specified, the default color shown
will be used.

ColorHit        Color used for 'Hits'. Default is '00805C' (green)

ColorFile       Color used for 'Files'. Default is '0040FF' (blue)

ColorSite       Color used for 'Sites'.
                Default is 'FF8000' (orange)

ColorKbyte      Color used for 'KBytes'. Default is 'FF0000' (red)

ColorPage       Color used for 'Pages'. Default is '00E0FF' (cyan)

ColorVisit      Color used for 'Visits'. Default is 'FFFF00' (yellow)

ColorMisc       Color used for miscellaneous titles in various 'Top'
                tables (not graphs). Default is '00E0FF' (cyan)

PieColor1       Pie Chart color #1. Default is '800080' (purple)

PieColor2       Pie Chart color #2. Default is '80FFC0' (lt. green)

PieColor3       Pie Chart color #3. Default is 'FF00FF' (lt. purple)

PieColor4       Pie Chart color #4. Default is 'FFC080' (tan)


--------------------------------------------------------------------------


Notes on Web Log Files
----------------------

The Webalizer supports CLF log formats, which should work for just
about everyone. If you want User Agent or Referrer information, you
need to make sure your web server supplies this information in its
log file, and in a format that the Webalizer can understand. While
The Webalizer will try to handle many of the subtle variations in
log formats, some will not work at all. Most web servers output
CLF format logs by default. For Apache, in order to produce the
proper log format, add the following to the httpd.conf file:

LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\""

This instructs the Apache web server to produce a 'combined' log
that includes the referrer and user agent information on the end of
each record, enclosed in quotes (This is the standard recommended
by both Apache and NCSA). Netscape and other web servers have
similar capabilities to alter their log formats. (note: the above
works for apache servers up to V1.2. V1.3 and higher now have additional
ways to specify log formats... refer to included documentation).
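
To make the record layout concrete, the following Python sketch splits a
log record written with the LogFormat above into its fields. It is an
illustration only and is not how The Webalizer itself parses logs. The
pattern name COMBINED, the function parse_line and the sample record are
made up for the example. The two trailing quoted fields are optional, so a
plain CLF record is accepted as well.

```python
import re

# One group per LogFormat field: %h %l %u %t "%r" %s %b "referrer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?$'
)


def parse_line(line):
    """Return a dict of fields, or None if the record does not match.

    For plain CLF records 'referrer' and 'agent' are None.
    """
    match = COMBINED.match(line.rstrip("\n"))
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Made-up sample record in 'combined' format.
    record = ('192.0.2.7 - - [31/Jan/2013:23:52:10 -0500] '
              '"GET /index.html HTTP/1.0" 200 2326 '
              '"http://www.example.com/start.html" '
              '"Mozilla/4.08 [en] (Win98)"')
    fields = parse_line(record)
    print(fields["host"], fields["status"], fields["referrer"])
```

This simple pattern does not cope with quote characters embedded inside
the request, referrer or user agent fields; such records return None.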

Notes on FTP Log Files
----------------------

The Webalizer supports ftp logs produced by wu-ftpd, proftpd and others,
as a standard 'xferlog'. To process an ftp log, you must either use the
-Ff command line option or have "LogType ftp" in your configuration file.
It is recommended that you create a separate configuration file for ftp
analysis, since the values used for your web server will most likely not
be suited for ftp log analysis (ie: page types, hostname, etc. should
be different).

Because of the difference in web and ftp logs, there are a few limitations:

o Because there is no concept of a 'response code' in the ftp world,
  response codes are restricted to either 200 (OK) or 206 (Partial
  Content), based on the completion status found in xferlog (for wu-ftpd,
  'i'=incomplete and will generate a 206, 'c'=complete and will generate
  a 200). If your ftp server doesn't supply the completion status, all
  requests will be assigned a response code of 200. This allows the usage
  graph to display all transfer requests (hits), and how many of those
  completed successfully (files - ie: 200 response codes).

o Page totals won't accurately reflect reality, since there isn't really
  the concept of a 'page' in regards to ftp services. I have found that
  setting the PageType value to "README", "FIRST", etc... seems to work
  fairly well however, and will give a pretty good indication of how
  many 'non-binary' files were requested. Of course, the content of your
  ftp site will be different, so your results may vary.

o Visit totals also won't accurately reflect reality, since visits are
  triggered on PageType requests (see above). What you usually wind up
  with is visits=sites in most cases.

o Entry/Exit pages will not be calculated for ftp logs.

o For obvious reasons, referrers and user agents are not supported.

o You _cannot_ analyze both web and ftp logs at the same time; they must
  be done separately in different runs.


Notes on Referrers
------------------

Referrers are weird critters... They take many shapes and forms, which makes
them much harder to analyze than a typical URL, which at least has some
standardization. What is contained in the referrer field of your log
files varies depending on many factors, such as what site did the referral,
what type of system it comes from and how the actual referral was generated.
Why is this? Well, because a user can get to your site in many ways... They
may have your site bookmarked in their browser, they may simply type your
site's URL into the URL field of their browser, they could have clicked on
a link on some remote web page or they may have found your site from one of
the many search engines and site indexes found on the web. The Webalizer
attempts to deal with all this variation in an intelligent way by doing
certain things to the referrer string which make it easier to analyze. Of
course, if your web server doesn't provide referrer information, you
probably don't really care and are asking yourself why you are reading this
section...

Most referrers will take the form of "http://somesite.com/somepage.html",
which is what you will get if the user clicks on a link somewhere on the
web in order to get to your site. Some will be a variation of this, and
look something like "file:/some/such/sillyname", which is a reference from
an HTML document on the user's local machine. Several variations of this can
be used, depending on what type of system the user has, if he/she is on
a local network, the type of network, etc...
To complicate things even more, dynamic HTML documents and HTML documents
that are generated by CGI scripts or external programs produce lots of
extra information which is tacked on to the end of the referrer string in
an almost infinite number of ways. If the user just typed your URL into
their browser or clicked on a bookmark, there won't be any information in
the referrer field, and it will take the form "-".

In order to handle all these variations, The Webalizer parses the referrer
field in a certain way. First, if the referrer string begins with "http",
it assumes it is a normal referral and converts the "http://" and following
hostname to lowercase in order to simplify hiding if desired. For example,
the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become
"http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the
"http://" and hostname are converted to lower case... The rest of the
referrer field is left alone. This follows standard convention, as the
actual method (HTTP) and hostname are always case insensitive, while the
document name portion is case sensitive.

Referrers that came from search engines, dynamic HTML documents, CGI
scripts and other external programs usually tack on additional information
that was used to create the page. A common example of this can be found
in referrals that come from search engines and site indexes common on the
web. Sometimes, these referrer URLs can be several hundred characters
long and include all the information that the user typed in to search for
your site. The Webalizer deals with this type of referrer by stripping
off all the query information, which starts with a question mark '?'.
The Referrer "http://search.yahoo.com/search?p=usa%26global%26link" will
be converted to just "http://search.yahoo.com/search".
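
The two steps just described (lowercasing the method and hostname, and
stripping the query information) can be sketched in a few lines of Python.
This is only an illustration of the described behavior, not The
Webalizer's actual code. The function name normalize_referrer is made up,
and the order in which the two steps are applied is an assumption (the
result is the same either way for the examples given).

```python
def normalize_referrer(ref):
    """Lowercase the scheme and hostname and drop any query information."""
    # Strip everything from the first question mark onward.
    ref = ref.split("?", 1)[0]

    # Only referrers that begin with "http" are treated as normal referrals.
    if ref[:4].lower() == "http":
        sep = ref.find("://")
        if sep != -1:
            # The hostname ends at the first '/' after the "://".
            end = ref.find("/", sep + 3)
            if end == -1:
                end = len(ref)
            # Lowercase scheme and hostname, leave the document part alone.
            ref = ref[:end].lower() + ref[end:]
    return ref


if __name__ == "__main__":
    print(normalize_referrer(
        "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html"))
    print(normalize_referrer(
        "http://search.yahoo.com/search?p=usa%26global%26link"))
```

Referrers that do not begin with "http", such as the "file:" form shown
earlier, pass through this sketch with only the query information removed.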

When a user comes to your site by using one of their bookmarks or by
typing your URL directly into their browser, the referrer field is
blank, and looks like "-". Most sites will get more of these referrals
than any other type. The Webalizer converts this type of referral into
the string "- (Direct Request)". This is done in order to make it easier
to hide via a command line option or configuration file option. This is
because the character "-" is a valid character elsewhere in a referrer
field, and if not turned into something unique, could not be hidden without
possibly hiding other referrers that shouldn't be.


Notes on Character Escaping
---------------------------

The HTTP protocol defines certain ways that URLs can look and behave. To
some extent, referrer fields follow most of the same conventions. Character
escaping is a technique by which non-printable or other non-ASCII (and even
some ASCII) characters can be used in a URL. This is done by placing the
Hexadecimal value of the character in the URL, preceded by a percent sign '%'.
Since Hex values are made up of ASCII characters, any character can be
escaped to ensure only printable ASCII characters are present in the URL.
Some systems take this concept to the extreme and escape all sorts of stuff,
even characters that don't need to be escaped. To deal with this, The
Webalizer will un-escape URLs and referrers before they are processed. For
Example, the URL "/www.webalizer.org/%7Efoo/bar.html" is the same URL as
"/www.webalizer.org/~foo/bar.html", a very common form of a URL to access
users' web pages. If the URLs were not un-escaped, they would be treated as
two separate documents, even though they are really one and the same.
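
A small Python sketch shows why un-escaping matters when counting hits.
It is an illustration under simple assumptions, not The Webalizer's own
routine: the names unescape and count_hits are made up, each %XX sequence
is decoded as a single character, and malformed sequences are left as
they are.

```python
import re

# A percent sign followed by exactly two hexadecimal digits.
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape(url):
    """Replace every %XX sequence with the character it stands for."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), url)


def count_hits(urls):
    """Count hits per URL after un-escaping, so equivalent forms merge."""
    totals = {}
    for url in urls:
        key = unescape(url)
        totals[key] = totals.get(key, 0) + 1
    return totals


if __name__ == "__main__":
    hits = ["/www.webalizer.org/%7Efoo/bar.html",
            "/www.webalizer.org/~foo/bar.html"]
    print(count_hits(hits))
```

Without the call to unescape, the two requests above would be counted as
one hit each on two different documents instead of two hits on one.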


Search String Analysis
----------------------

The Webalizer will do a minimal analysis on referrer strings that
it finds, looking for well known search string patterns. Most of
the major search engines are supported, such as Yahoo!, Altavista,
Lycos, etc... Unfortunately, search engines are always changing
their internal/CGI query formats, new search engines are coming on
line every day, and the ability to detect _all_ search strings is
nearly impossible. However, it should be accurate enough to give
a good indication of what users were searching for when they stumbled
across your site. Note: as of version 1.31, search engines can now
be specified within a configuration file. See the sample.conf file
for examples of how to specify additional search engines.



Notes on Visits/Entry/Exit Figures
----------------------------------

The majority of data analyzed and reported on by The Webalizer is
as accurate and correct as possible based on the input log file.
However, due to the limitation of the HTTP protocol, the use of
firewalls, proxy servers, multi-user systems, the rotation of your
log files, and a myriad of other conditions, some of these numbers
cannot be calculated with absolute accuracy. In particular,
Visits, Entry Pages and Exit Pages are subject to random errors
due to the above and other conditions. The reason for this is
twofold: 1) Log files are finite in size and time interval, and
2) There is no way to tell multiple individual users apart
given only an IP address. Because log files are finite, they have
a beginning and ending, which can be represented as a fixed time
period. There is no way of knowing what happened previous to this
time period, nor is it possible to predict future events based on
it.
Also, because it is impossible to tell individual users
apart, multiple users that have the same IP address all appear to
be a single user, and are treated as such. This is most common where
corporate users sit behind a proxy/firewall to the outside world,
and all requests appear to come from the same location (the address
of the proxy/firewall itself). Dynamic IP assignment (used with
dial-up Internet accounts) also presents a problem, since the same
user will appear to come from multiple places.

For example, suppose two users visit your server from XYZ company,
which has their network connected to the Internet by a proxy server
'fw.xyz.com'. All requests from the network look as though they
originated from 'fw.xyz.com', even though they were really initiated
from two separate users on different PCs. The Webalizer would
see these requests as from the same location, and would record only
1 visit, when in reality, there were two. Because entry and exit
pages are calculated in conjunction with visits, this situation
would also only record 1 entry and 1 exit page, when in reality,
there should be 2.

As another example, say a single user at XYZ company is surfing
around your website. They arrive at 11:52pm the last day of
the month, and continue surfing until 12:30am, which is now a
new day (in a new month). Since a common practice is to rotate
(save then clear) the server logs at the end of the month, you
now have the user's visit logged in two different files (current
and previous months). Because of this (and the fact that the
Webalizer clears history between months), the first page the
user requests after midnight will be counted as an entry page.
This is unavoidable, since it is the first request seen from that
particular IP address in the new month.
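
Both examples can be reproduced with a toy visit counter. The Python
sketch below is a simplified model written for this illustration, not The
Webalizer's algorithm: it treats every record as a page request, keys
visits on the host name alone, and assumes a 30 minute idle timeout
(the timeout value is an assumption and is not stated in this section).
The names TIMEOUT and count_visits are made up.

```python
from datetime import datetime, timedelta

# Assumed idle time after which a host starts a new visit.
TIMEOUT = timedelta(minutes=30)


def count_visits(records, timeout=TIMEOUT):
    """Count visits and entry pages for one run.

    'records' is a list of (host, timestamp, url) tuples in time order.
    Returns (number_of_visits, list_of_entry_pages).
    """
    last_seen = {}      # host -> time of its most recent request
    visits = 0
    entry_pages = []
    for host, when, url in records:
        previous = last_seen.get(host)
        # First request from this host in this run, or idle too long.
        if previous is None or when - previous > timeout:
            visits += 1
            entry_pages.append(url)
        last_seen[host] = when
    return visits, entry_pages


if __name__ == "__main__":
    t = datetime(2013, 1, 31, 23, 52)
    # Two different people behind the same proxy look like one visitor.
    proxy = [("fw.xyz.com", t, "/index.html"),
             ("fw.xyz.com", t + timedelta(minutes=1), "/news.html")]
    print(count_visits(proxy))

    # One visit split across two monthly logs yields an entry page in each.
    january = [("fw.xyz.com", t, "/index.html")]
    february = [("fw.xyz.com", datetime(2013, 2, 1, 0, 10), "/docs.html")]
    print(count_visits(january), count_visits(february))
```

Processing the January and February records in a single run gives one
visit with one entry page; processing them as separate months gives an
entry page in each, which is the effect described above.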

For the most part, the numbers shown for visits, entry and exit
pages are pretty good 'guesses', even though they may not be 100%
accurate. They do provide a good indication of overall trends,
and shouldn't be far enough off from the real numbers to matter much.
You should probably consider them as the 'minimum' amount possible,
since the actual (real) values should always be equal to or greater
than the reported values.


Exporting Webalizer Data
------------------------

The Webalizer now has the ability to dump all object tables to tab
delimited ASCII text files, which can then be imported into most
popular database and spreadsheet programs. The files are not normally
produced, as on some sites they could become quite large, and are only
enabled by the use of the Dump* configuration keywords. The filename
extensions default to '.tab' but may be changed using the
'DumpExtension' keyword. Since this data contains all items, even
those normally hidden, it may not be desirable to have them located
in the output directory where they may be visible to normal web users.
For this reason, the 'DumpPath' configuration keyword is available,
and allows the placement of these files somewhere outside the normal
web server document tree. An optional 'header' record may be written
to these files as well, and is useful when the data is to be imported
into a spreadsheet; databases will not normally need the header. If
enabled, the header is simply the column names as the first record of
the file, tab separated.


Log files and The Webalizer
---------------------------

Most sites will choose to have The Webalizer run from cron at specified
intervals. Care should be taken to ensure that data is not lost as a
result of log file rotations.
A suggested practice is to rotate your
web server logs at the end of each month as close to midnight as possible,
then have The Webalizer process the 'end of month' log file before running
statistics on the new, current log. On our systems, a shell script called
'rotate_logs' is run at midnight at the end of each month. This script file
looks like:

------------------------- file: rotate_logs ------------------------------
#!/bin/sh

# halt the server
kill `cat /var/lib/httpd/logs/httpd.pid`

# define backup names
OLD_ACCESS_LOG=/var/lib/httpd/logs/old/access_log.`date +%y%m%d-%H%M%S`
OLD_ERROR_LOG=/var/lib/httpd/logs/old/error_log.`date +%y%m%d-%H%M%S`

# make end of month copy for analyzer
cp /var/lib/httpd/logs/access_log /var/lib/httpd/logs/access_log.backup

# move files to archive directory
mv /var/lib/httpd/logs/access_log "$OLD_ACCESS_LOG"
mv /var/lib/httpd/logs/error_log "$OLD_ERROR_LOG"

# restart web server
/usr/sbin/httpd

# compress the archived files
/bin/gzip "$OLD_ACCESS_LOG"
/bin/gzip "$OLD_ERROR_LOG"
------------------------- end of file ------------------------------------

This script first stops the web server using a 'kill' command. Apache
keeps the PID of the server in the file httpd.pid, so we use it as the
argument for the kill. Next, it defines some names for the backup files,
which are basically the name of the files with the date and time appended
to the end of them. It then makes a copy of the log file, appended with
'.backup' in the log directory, moves the current log files to an archive
directory (/var/lib/httpd/logs/old) and restarts the server. This setup
allows the web server to be down for the minimum amount of time needed,
which is important for busy sites.
If you don't want to stop the server,
you can remove the initial 'kill' command, and replace the '/usr/sbin/httpd'
line with a "kill -1 `cat /var/lib/httpd/logs/httpd.pid`" command instead.
On most web servers, this will cause a restart of the server and create
the new log files in the process...

At this point, we have made copies of the previous month's logs, the web
server is going about its business as usual, and we have all the time in
the world to do any other additional processing we want. The last two
lines of the script compress the archived logs using the GNU zip program
(gzip). Remember, we still have a copy of the log which we can now run
The Webalizer on without having to do any further processing.

Next, we define two crontab entries. The first runs the above 'rotate_logs'
script at midnight at the end of the month. The second runs The Webalizer
on the '.backup' log file created above at 5 minutes after midnight. This
gives other end of month processing jobs a chance to run so we don't bog
the system down too much. If you have lots of end of month stuff going on,
you can change the timing to suit your needs. The crontab entries look
something like:

------------------------- crontab entries --------------------------------
# Rotate web server logs and run monthly analysis
0 0 1 * * /usr/local/adm/rotate_logs
5 0 1 * * /usr/bin/webalizer -Q /var/lib/httpd/logs/access_log.backup
------------------------- end of crontab ---------------------------------

As you can see, the log rotations occur at midnight, and the analysis
is done at 5 minutes after. Once you verify that The Webalizer ran
successfully, the access_log.backup file can be deleted as it isn't
needed any more. If you need to re-run the analysis, you still have
the compressed archive copy that the shell script created.
In order
for the above analysis to work properly, you should have already
created an /etc/webalizer.conf configuration file suitable for your
site, or otherwise specify configuration options or a configuration
file on the crontab command line above.

If you want The Webalizer to be run more often than once a month, you
can specify additional crontab entries to do this as well. Care should
be taken however to ensure that The Webalizer is not running when the
end of month processing above occurs, or unpredictable results may
happen (such as an inability to rotate the logs due to a file lock).
The easiest way is to run it on the half hour with a crontab entry like:

30 * * * * /usr/bin/webalizer


Reverse DNS Lookups
-------------------

The Webalizer fully supports both IPv4 and IPv6 DNS lookups, and
maintains a cache of those lookups to reduce processing the same
addresses in subsequent runs. The cache file can be created at
run-time, or may be created before running the webalizer using either
the stand alone 'webazolver' program, or The Webalizer (DNS) Cache
file Manager program 'wcmgr'. In order to perform reverse lookups,
a DNS Cache file must be specified, either on the command line or in
a configuration file. In order to create/update the cache file at
run-time, the number of DNS Children must also be specified, and can
be anything between 1 and 100. This specifies the number of child
processes to be forked, each of which will perform network DNS
queries in order to look up the addresses and update the cache.
Cached entries that are older than a specified TTL (time to live)
will be expired, and if encountered again in a log, will be looked
up at that time in order to 'freshen' them (verify the name is still
the same and update its timestamp).
The default TTL is 7 days, but it
may be set to anything between 1 and 100 days. Using the 'wcmgr'
program, entries may also be marked as 'permanent', in which case
they will persist (with an infinite TTL) in the cache until manually
removed. See the file DNS.README for additional information.


Geolocation Lookups
-------------------

The Webalizer has the ability to perform geolocation lookups on IP
addresses using either its own internal GeoDB database or optionally
the GeoIP database from MaxMind, Inc. (www.maxmind.com). If used,
unresolved addresses will be searched for in the database and their
country of origin will be returned if found. This actually produces
more accurate Country information than DNS lookups, since the DNS
address space has additional gTLDs (generic top level domains) that do
not necessarily map to a specific country (such as '.net' and '.com').
It is possible to use both DNS lookups and geolocation lookups at the
same time, which will cause any addresses that could not be resolved
using DNS lookups to then be looked up in the database, greatly reducing
the number of 'Unknown/Unresolved' entries in the generated reports. The
native GeoDB geolocation database provided by The Webalizer fully supports
IPv4 and IPv6 lookups, is updated regularly, and is the preferred
geolocation method for use with The Webalizer. The most current
version of the database can be obtained from our ftp site.


Language Support
----------------

Version 1.0x of The Webalizer added language support. This
support is only provided at compile time in the form of an
include file containing all the strings used by The Webalizer.
The source distribution contains all language files that were
available at the time, with English being the default as
that is the only human language I speak fluently, and me
Espanol es muy malo.
Several people have already indicated
the desire to do translations into various languages, and as
I receive the language files, I will make them available via
ftp at ftp://ftp.mrunix.net/pub/webalizer/lang. Unless there
happens to be a binary distribution in the language you need,
you will need to grab the source distribution and compile the
program yourself. See the file INSTALL that comes in the source
distribution for information on how to use a language other than
English.

It should also be noted that the GD graphics library, used to
produce the in-line graphics in the output HTML, doesn't
support extended character sets, so if you are translating
the language file, you will no doubt encounter this problem.

New: You can now specify the language to use when you are building
     the program from source, using the configure script. Just add
     --with-language=language_name , where 'language_name' is the
     name of a valid language file in the /lang/ directory. For
     example, --with-language=french will build using French as
     the default language. You should consult the INSTALL file
     for additional information on building the program from source.


Known Issues
------------

  o  Memory Usage. The Webalizer makes liberal use of memory for internal
     data structures during analysis. Lack of real physical memory will
     noticeably degrade performance by doing lots of swapping between memory
     and disk. One user who had a rather large log file noticed that The
     Webalizer took over 7 hours to run with only 16 Meg of memory. Once
     memory was increased, the time was reduced to a few minutes.


  o  Performance. The Hide*, Group*, Ignore*, Include* and IndexAlias
     configuration options can cause a performance decrease if lots of
     them are used. The reason for this is that every log record must
     be scanned for each item in each list.
     For example, if you are
     Hiding 20 objects, Grouping 20 more, and Ignoring 5, each record
     is scanned, at most, 46 times (20+20+5 + an IndexAlias scan).
     On really large log files, this can have a profound impact. It
     is recommended that you use as few of these configuration
     options as you can, as it will greatly improve performance.


Final Notes
-----------

A lot of time and effort went into making The Webalizer, and into ensuring
that the results are as accurate as possible. If you find any abnormalities
or inconsistent results, bugs, errors, omissions or anything else that
doesn't look right, please let me know so I can investigate the problem or
correct the error. This goes for the minimal documentation as well.
Suggestions for future versions are also welcome and appreciated.