          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9g
          =========================================================


The WWWOFFLE programs simplify World Wide Web browsing from computers that use
intermittent connections to the internet.

Description
-----------

The WWWOFFLE server is a proxy web server with special features for use with
intermittent internet links.  This means that it is possible to browse web pages
and read them without having to remain connected.

Basic Features
    - Caching of the HTTP, FTP and finger protocols.
    - Allows the 'GET', 'HEAD', 'POST' and 'PUT' HTTP methods.
    - Interactive or command line control of online/offline/autodial status.
    - Highly configurable.
    - Low maintenance; start/stop and online/offline status can be automated.

While Online
    - Caching of pages that are viewed, for later review.
    - Conditional fetching to only get pages that have changed.
        - Based on expiration date, time since last fetched or once per session.
    - Non-cached support for SSL (Secure Sockets Layer, e.g. https).
    - Caching of https connections (compile time option).
    - Can be used with one or more external proxies, chosen based on the web page.
    - Control which pages cannot be accessed.
        - Allows replacement of blocked pages.
    - Control which pages are not to be stored in the cache.
    - Creates backups of cached pages when the server cannot be contacted.
        - Option to create a backup when the server sends back an error page.
    - Requests compressed pages from web servers (compile time option).
    - Requests chunked transfer encoding from web servers.

While Offline
    - Can be configured to use dial-on-demand for pages that are not cached.
    - Selection of pages to download next time online.
        - Using a normal browser to follow links.
        - Command line interface to select pages for downloading.
    - Control which pages can be requested when offline.
    - Provides non-cached access to intranet servers.

Automated Download
    - Downloading of specified pages non-interactively.
    - Options to automatically fetch objects in requested pages.
        - Understands various types of pages:
            - HTML 4.0, Java classes, VRML (partial), XML (partial).
        - Options to fetch different classes of objects:
            - Images, stylesheets, frames, scripts, Java or other objects.
        - Option not to fetch webbug images (images of 1 pixel square).
    - Automatically follows links for pages that have been moved.
    - Can monitor pages at regular intervals to fetch those that have changed.
    - Recursive fetching:
        - To a specified depth.
        - On any host, or limited to the same server or the same directory.
        - Chosen from the command line or from the browser.
        - Control over which links can be fetched recursively.

Convenience
    - Optional information footer on HTML pages showing date cached and options.
    - Options to modify HTML pages:
        - Remove scripts.
        - Remove Java applets.
        - Remove stylesheets.
        - Remove Shockwave Flash animations.
        - Indicate cached and uncached links.
        - Remove the blink tag.
        - Remove the marquee tag.
        - Remove refresh tags.
        - Remove links to pages that are in the DontGet list.
        - Remove inline frames (iframes) that are in the DontGet list.
        - Replace images that are in the DontGet list.
        - Replace webbug images (images of 1 pixel square).
        - Demoronise HTML character sets.
        - Fix mixed Cyrillic character sets.
        - Stop animated GIFs.
        - Remove cookies in meta tags.
    - Provides information about cached pages:
        - Headers, raw and modified.
        - Contents, images, links etc.
        - Source code unmodified by WWWOFFLE.
    - Automatic proxy configuration with a Proxy Auto-Config file.
    - Searchable cache with the addition of the ht://Dig, mnoGoSearch
      (UdmSearch), Namazu or Hyper Estraier programs.
    - Built-in simple web server for local pages.
        - HTTP and HTTPS access (compile time option).
        - Allows CGI scripts.
    - Timeouts to stop proxy lockups:
        - DNS name lookups.
        - Remote server connection.
        - Data transfer.
    - Continue or stop downloads interrupted by the client.
        - Based on file size or fraction downloaded.
    - Purging of pages from the cache:
        - Based on URL matching.
        - To keep the cache size below a specified limit.
        - To keep the free disk space above a specified limit.
        - Interactive or command line control.
        - Compression of cached pages based on age.
    - Provides compressed pages to the web browser (compile time option).
    - Uses chunked transfer-encoding to the web browser.

Indexes
    - Multiple indexes of pages stored in the cache:
        - Servers for each protocol (http, ftp ...).
        - Pages on each server.
        - Pages waiting to be fetched.
        - Pages requested last time offline.
        - Pages fetched last time online.
        - Pages monitored on a regular basis.
    - Configurable indexes:
        - Sorted by name, date, server domain name or type of file.
        - Options to delete, refresh or monitor pages.
        - Selection of the complete list of pages, or hiding of uninteresting
          pages.

Security
    - Works with pages that require basic username/password authentication.
    - Automates proxy authentication for external proxies that require it.
    - Control over access to the proxy:
        - Defaults to local host access only.
        - Host access configured by hostname or IP address.
        - Optional proxy authentication for user-level access control.
    - Optional password control for proxy management functions.
    - HTTPS access to all proxy management web pages (compile time option).
    - Can censor incoming and outgoing HTTP headers to maintain user privacy.

Configuration
    - All options controlled using a configuration file.
    - Interactive web page to allow editing of the configuration file.
    - User-customisable error and information pages.
    - Log file or syslog reporting with user-specified error level.


Configuring A Web Browser
-------------------------

To use the WWWOFFLE programs, your web browser must be set up to use WWWOFFLE
as a proxy.  The proxy hostname will be 'localhost' (or the name of the host
that wwwoffled is running on), and the port number will be the one that is used
by wwwoffled (default 8080).
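
As a sketch of what this means for a client, the following Python fragment
points a URL opener at the proxy.  It assumes wwwoffled is running on
'localhost' port 8080 (the default described above); the example URL is
hypothetical.

```python
import urllib.request

# Point HTTP and FTP traffic at the WWWOFFLE proxy (default host:port).
proxy = urllib.request.ProxyHandler({
    "http": "http://localhost:8080",
    "ftp": "http://localhost:8080",  # WWWOFFLE proxies FTP as well as HTTP
})
opener = urllib.request.build_opener(proxy)
# opener.open("http://www.example.com/") would now be routed via WWWOFFLE.
```

A browser's proxy settings dialog achieves the same thing: the same hostname
and port are entered for each protocol that WWWOFFLE should handle.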
143
144There are lots of different browsers and it is not possible to list all the ways
145to configure them here.  There should be an option in one of the menus or
146described in the manual for the browser that explains how to configure a proxy.
147
148You will also need to disable the caching that the web browser performs itself
149between sessions to get the best out of the program.
150
151Depending on which browser you use and which version, it is possible to request
152pages to be refreshed while offline.  This is done using the 'reload' or
153'refresh' button or key on the browser.  On many browsers, there are two ways of
154doing this, one forces the proxy to reload the page, and this is the one that
155will cause the page to be refreshed.
156
157
Welcome Page
------------

There is a welcome page at the URL 'http://localhost:8080/' that gives a very
brief description of the program and has links to the index pages, the
interactive control page and the WWWOFFLE internet home pages.

The most important place to get information about WWWOFFLE is the WWWOFFLE
homepage, which has information about WWWOFFLE in general:

http://www.gedanken.org.uk/software/wwwoffle/

Index Of Cached Files
---------------------

To get the index of cached files, use the URL 'http://localhost:8080/index/'.
There are sufficient links on each of the index pages to allow easy navigation.

The indexes provide several levels of information:
   A list of the requests in the outgoing directory.
   A list of the files fetched the last time that the program was online.
      And for each of the previous 5 times before that.
   A list of the files requested the last time that the program was offline.
      And for each of the previous 5 times before that.
   A list of the files that are being monitored.
   A list of all hosts for each of the protocols (http, ftp etc.).
      A list of all of the files on a particular host.

These indexes can be sorted in a number of ways:
   No sorting (directory order on disk).
   By time of last modification (update).
   By time of last access.
   By date of last update, with markers for each day.
   Alphabetically.
   By file extension.
   Random.

For each of the pages that are cached there are options to delete the page,
refresh it, select the interactive refresh page with the URL already filled in,
or add the page to the list that is monitored regularly.

It is also possible to specify in the configuration file which URLs are not to
be listed in the indexes.

Interactive Refresh Page
------------------------

Pages can be requested using whatever method the browser provides, or, as an
alternative, there is an interactive refresh page.  This allows the user to
enter a URL and then fetch it if it is not currently cached, or refresh it if
it is in the cache.  There is also the option here to recursively fetch the
pages that are linked to by the specified page.  This recursive fetching can be
limited to pages from the same host, narrowed down to links in the same
directory (or a subdirectory), or widened to fetch pages from any web server.
This functionality is also provided in the 'wwwoffle' command line program.


Monitoring Web-Pages
--------------------

Pages can be specified that are to be checked at regular intervals.  This can
either be every time that WWWOFFLE is online or at user-specifiable times.  The
page will be monitored when the four specified conditions are all met:
A month of the year that it can be fetched in (can be set to all months).
A day of the month that the page can be fetched on (can be set to all days).
A day of the week that the page can be fetched on (can be set to all days).
An hour of the day that the page should be fetched after (can be more than one).

For example, to get a URL every Saturday morning, use the following:

Month of year: all
Day of Month : all
Day of week  : Saturday
Hour of day  : 0 (24hr clock)
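
The four conditions above can be sketched as a simple predicate.  This is
illustrative only: the function name, parameters and the exact "fetched after"
semantics are assumptions, not WWWOFFLE's actual scheduling code.

```python
from datetime import datetime

def matches(when, months="all", mdays="all", wdays="all", hours=(0,)):
    """Return True when all four monitoring conditions hold.

    'all' means any value of that field is acceptable; hours lists the
    hour(s) of the day after which the page should be fetched.
    """
    if months != "all" and when.strftime("%B") not in months:
        return False
    if mdays != "all" and when.day not in mdays:
        return False
    if wdays != "all" and when.strftime("%A") not in wdays:
        return False
    return when.hour >= min(hours)

# The Saturday-morning example from the text: all months, all days of the
# month, Saturdays only, after hour 0 (24hr clock).
print(matches(datetime(2024, 6, 1, 0, 30), wdays=("Saturday",)))  # True: a Saturday
```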


Interactive Control Page
------------------------

The behaviour and mode of operation of the WWWOFFLE daemon can be controlled
from an interactive control page at 'http://localhost:8080/control/'.  This has
a number of buttons that change the mode of the proxy server.  These provide
the same functionality as the 'wwwoffle' command line program.  To provide
security, this page can be password protected.  There is also the facility to
delete pages from the cache or from the spooled outgoing requests directory.

Interactive Configuration File Editing Page
-------------------------------------------

The interactive configuration file editing page allows the configuration file
wwwoffle.conf to be edited.  This facility can be reached via the configuration
editing page 'http://localhost:8080/configuration/'.  Each item in the
configuration file has a separate web page with a form that lists the current
entries in the configuration file and allows each entry to be edited
individually.  When an entry has been updated, the configuration file needs to
be re-read.


Searching the Cache
-------------------

The web indexing programs ht://Dig, mnoGoSearch (UdmSearch), Namazu and Hyper
Estraier can be used to create an index of the pages in the WWWOFFLE cache for
later searching.

For ht://Dig, version 3.1.0b4 or later is required; it can be found at
http://www.htdig.org/.

For mnoGoSearch (previously called UdmSearch), version 3.1.0 or later is
required; it can be found at http://mnogosearch.org/.

For Namazu, version 2.0.0 or later is required; it can be found at
http://www.namazu.org/.  Also required is mknmz-wwwoffle, which can be found at
http://www.naney.org/comp/distrib/mknmz-wwwoffle/.

For Hyper Estraier, version 0.5.3 or later is required; it can be found at
http://hyperestraier.sourceforge.net/.

The search forms for these programs are 'http://localhost:8080/search/htdig/',
'http://localhost:8080/search/mnogosearch/',
'http://localhost:8080/search/namazu/' and
'http://localhost:8080/search/hyperestraier/'.  These allow the search part of
each program to be run to find the cached web pages that you want.

For more information about configuring these programs to work with WWWOFFLE you
should read the file README.htdig, README.mnogosearch, README.namazu or
README.hyperestraier as appropriate.


Built-In Web-Server
-------------------

Any URLs to WWWOFFLE on port 8080 that refer to files in the directory '/' refer
to the files that are stored in the 'html' subdirectory.  This directory also
contains the message templates that WWWOFFLE uses to generate the internal web
pages.  When a file is requested from either of these locations it is first
looked for in the language-specific sub-directory specified in the browser's
request header.  If it is not found in that location then it is looked for in
the directory named 'default', which by default is a symbolic link to the
English language pages but can be changed.  If it is not found in this location
then it is looked for in the English language directory (since that will have a
full set of pages).
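
The lookup order amounts to trying a few candidate directories in turn.  The
sketch below is illustrative; the directory name 'default' comes from the
description above, while 'en' as the English directory name is an assumption.

```python
def candidate_paths(filename, browser_languages):
    """Order in which a requested file is looked for, as described above:
    the browser's language sub-directories first, then 'default', then
    the English directory (which has a full set of pages)."""
    order = list(browser_languages) + ["default", "en"]
    return ["html/%s/%s" % (lang, filename) for lang in order]

print(candidate_paths("messages/index.html", ["de"]))
# ['html/de/messages/index.html', 'html/default/messages/index.html',
#  'html/en/messages/index.html']
```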

Any URLs that refer to files in the directory '/local/' are taken from the files
in the 'local' sub-directory of the spool directory if they exist.  If they do
not exist then they are searched for in the language subdirectories of the
'html' directory as described above.  This allows for trivial web pages to be
provided without a separate web server.  CGI scripts are available but disabled
by the default configuration file.  The MIME types used for these files are
those that are specified in the configuration file.

Important: The local web-page server will follow symbolic links, but will only
           allow access to files that are world readable.  See the FAQ for
           security issues.


Deleting Requests
-----------------

If no password is used for the control pages then it is possible for anybody to
delete requests that are recorded.  If a password is assigned then users that
know this password can delete any request (or cached file or other item).
Individual users that do not know the password can delete pages that they have
requested, provided that they do it as soon as the "Will Get" page appears; the
"Delete" button on that page has a once-only password that will delete that
request.


Backup Copies of Pages
----------------------

When a page is fetched while online, a remote server error would overwrite any
existing cached web page.  In this case a backup copy of the page is made, so
that once the error message has been read while offline the backup copy is
placed back into the cache.  This is automatic for all files that have remote
server errors (and that do not use external proxies); no user intervention is
required.


Lock Files
----------

When one WWWOFFLE process is downloading a file, any other WWWOFFLE process that
tries to read that file will not be able to until the first one has finished.
This removes the problem of an incomplete page being displayed in the second
browser, or of a second copy of the page being fetched.  If the lock file is
not removed by the first process within a timeout period then the second
process will produce an error message indicating the problem.

This is a configurable option; the default is that lock files are not used.
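
The idea behind the lock files can be sketched like this (illustrative Python,
not WWWOFFLE's C implementation; the function names and default timeout are
assumptions):

```python
import os
import time

def acquire_lock(lockfile):
    """The first process creates the lock file exclusively; any other
    process attempting the same creation fails and must wait."""
    try:
        fd = os.open(lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def wait_for_unlock(lockfile, timeout=30.0, poll=0.1):
    """The second process waits for the lock to disappear; after the
    timeout it gives up, corresponding to the error message mentioned
    above."""
    deadline = time.monotonic() + timeout
    while os.path.exists(lockfile):
        if time.monotonic() > deadline:
            raise TimeoutError("lock file not removed within timeout")
        time.sleep(poll)
```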


HTTPS Access to Internal Pages
------------------------------

All of the web pages that are available through normal HTTP access on port 8080
(e.g. http://localhost:8080/*) are also available with secure HTTPS access on
port 8443 if WWWOFFLE is compiled with the libgnutls encryption library.  This
applies to all pages: indexes, the built-in web server, and the control and
configuration pages.


Caching of HTTPS Web Pages
--------------------------

It is possible to configure WWWOFFLE so that it will intercept and cache
selected HTTPS connections.  This is disabled by default and there are three
steps to enable it: WWWOFFLE must be compiled with encryption support, the
enable-caching option in the SSLOptions section of the configuration file must
be set to true, and the list of hosts to cache for must be set.

When WWWOFFLE is configured to cache an HTTPS web page it will request the page,
decrypt it, re-encrypt it and pass it to the browser.  The copy of the page that
is stored in the cache will be stored without encryption.  With this option all
of the other WWWOFFLE features, like the DontGet section, the ModifyHTML
section, the OnlineOptions section and others, will be used.  Normally most of
these options cannot be applied to HTTPS pages because the exact URL is not
known to WWWOFFLE and the unencrypted contents are not visible.


HTTPS Server Certificates
-------------------------

To handle the encryption functions described above, WWWOFFLE will create and
manage a set of server certificates.  One master certificate is used to sign all
of the other certificates that WWWOFFLE creates.  The created certificates are
either for the WWWOFFLE server's own HTTPS access pages or fake certificates
created for each server that is cached.  The certificates that are captured by
WWWOFFLE and stored are the certificates that are sent back by the real HTTPS
servers.  The final set of certificates are the trusted certificates that
WWWOFFLE can use to confirm that a remote server is the one it claims to be.

The full set of certificates that WWWOFFLE stores can be seen through the
WWWOFFLE URL http://localhost:8080/certificates/, but this is only available if
WWWOFFLE was compiled with encryption support.

To add trusted certificates to WWWOFFLE, place the certificate file (in PEM
format) into the directory '/var/spool/wwwoffle/certificates/trusted' and
restart WWWOFFLE.


Spool Directory Layout
----------------------

In the spool directory there is a directory for each of the network protocols
that are handled.  In each of these there is a directory for each hostname that
has been contacted and has pages cached; these directories have the name of the
host.  In each of these directories there is an entry for each of the pages
that are cached, named using a hashing function to give a constant length.
Each entry consists of two files: one prefixed with 'D' that contains the data
and one prefixed with 'U' that contains the URL.

The outgoing directory is a single directory that contains all of the pending
requests.  The format is the same, with two files for each entry, but using 'O'
for the file containing the request instead of 'D', and one prefixed with 'U'
that contains the URL.

The lasttime (and prevtime*) directories are single directories that contain an
entry for each of the files that were fetched the last time that the program
was online.  Each entry consists of two files: one prefixed with 'D' that is a
hard-link to the real file and one prefixed with 'U' that contains the URL.

The lastout (and prevout*) directories are single directories that contain an
entry for each of the files that were requested the last time that the program
was offline.  Each entry consists of two files: one prefixed with 'D' that is a
hard-link to the real file and one prefixed with 'U' that contains the URL.

The monitor directory is a single directory that contains all of the regularly
monitored requests.  The format is the same as the outgoing directory, with two
files for each entry using the 'O' and 'U' prefixes.  There is also a file with
an 'M' prefix that contains the information about when to monitor the URL.
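
The naming scheme for a cached page can be illustrated as follows.  Note that
WWWOFFLE uses its own hash function; the MD5 prefix here is only a stand-in to
show the constant-length names and the 'D'/'U' file pair.

```python
import hashlib
import os
from urllib.parse import urlsplit

def spool_entry(spool_dir, url):
    """Build the (data file, URL file) pair for a cached page, following
    the layout described above: <spool>/<protocol>/<host>/{D,U}<hash>."""
    parts = urlsplit(url)
    name = hashlib.md5(url.encode()).hexdigest()[:22]  # constant-length name
    host_dir = os.path.join(spool_dir, parts.scheme, parts.hostname)
    return os.path.join(host_dir, "D" + name), os.path.join(host_dir, "U" + name)

data_file, url_file = spool_entry("/var/spool/wwwoffle",
                                  "http://www.example.com/index.html")
print(data_file)  # /var/spool/wwwoffle/http/www.example.com/D<hash>
```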

The Programs and Configuration File
-----------------------------------

There are two programs that make up this utility, with three distinct functions:

wwwoffle  - A program to interact with and control the HTTP proxy daemon.
wwwoffled - A daemon process that acts as an HTTP proxy.
wwwoffles - A server that actually does the fetching of the web pages.

The wwwoffles function has been combined with the wwwoffled function into the
wwwoffled program from version 1.1 onwards.  This simplifies the procedure of
starting servers and allows for future improvements.

The configuration file, called wwwoffle.conf by default, contains all of the
parameters that are used to control the way that the wwwoffled and wwwoffles
functions work.  The default installation location for this file is the
directory /etc/wwwoffle.


WWWOFFLE - User control program
-------------------------------

The control program (wwwoffle) is used to control the action of the daemon
program (wwwoffled), or to request pages that are not in the cache.

The daemon program needs to know if the system is online or offline, when to
fetch the pages that have been previously requested and when to purge the cache
of old pages.


The first mode of operation is for controlling the daemon process.  These are
the functions that are also available on the interactive control page (except
kill).

wwwoffle -online        Indicates to the daemon that the system is online.

wwwoffle -autodial      Indicates to the daemon that the system is in autodial
                        mode; this will use cached pages if they exist and use
                        the network as a last resort, for dial-on-demand
                        systems.

wwwoffle -offline       Indicates to the daemon that the system is offline.

wwwoffle -fetch         Commands the daemon to fetch the pages that were
                        requested by clients while the system was offline.
                        wwwoffle exits when the fetching is complete.
                        (This requires the daemon to be told it is online.)

wwwoffle -config        Causes the configuration file for the daemon process to
                        be re-read.  The config file can also be re-read by
                        sending a HUP signal to the wwwoffled process.

wwwoffle -purge         Commands the daemon to purge from the cache the pages
                        that are older than the number of days specified in the
                        configuration file, using modification or access time.
                        Or, if a maximum size is specified, it deletes the
                        oldest pages until the maximum size is not exceeded.
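
The size-based purge rule amounts to dropping the oldest pages first until the
cache fits.  A minimal sketch of that logic (illustrative only, not WWWOFFLE's
actual purge code):

```python
def purge_to_size(pages, max_size):
    """pages: list of (age_in_days, size_in_bytes) tuples.
    Return the pages kept after deleting the oldest ones until the
    total size no longer exceeds max_size."""
    kept = sorted(pages, key=lambda p: p[0])  # youngest first
    total = sum(size for _, size in kept)
    while total > max_size and kept:
        age, size = kept.pop()  # remove the oldest remaining page
        total -= size
    return kept

print(purge_to_size([(1, 100), (5, 200), (10, 300)], 350))
# [(1, 100), (5, 200)]
```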

wwwoffle -status        Requests from the wwwoffled proxy server the current
                        status of the program.  The online or offline mode, the
                        fetch and purge statuses, and the number of current
                        processes and their PIDs are displayed.

wwwoffle -kill          Causes the daemon to exit cleanly at a convenient point.


The second mode of operation is to specify URLs to get.

wwwoffle <URL> .. <URL> Specifies to the daemon URLs that must be fetched.
                        If online then they are fetched immediately, otherwise
                        the requests are stored for a later fetch.

wwwoffle <filename> ... The specified HTML file is read and all of the links
                        in it are used as if they had been specified on the
                        command line.

wwwoffle -post <URL>    Sends a request using the POST method; the data is read
                        from stdin and should be provided correctly url-encoded.

wwwoffle -put <URL>     Sends a request using the PUT method; the data is read
                        from stdin and should be provided correctly url-encoded.
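
A url-encoded body suitable for piping into 'wwwoffle -post' or
'wwwoffle -put' on stdin can be produced, for example, with Python (the field
names below are made up for illustration):

```python
from urllib.parse import urlencode

# Hypothetical form fields; the printed line is what would be piped
# into 'wwwoffle -post <URL>' on standard input.
body = urlencode({"name": "J Smith", "age": "42"})
print(body)  # name=J+Smith&age=42
```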


wwwoffle -F             Forces the wwwoffle server to refresh the URL.
                        (Or fetch it if not cached.)

wwwoffle -g[Sisfo]      Specifies that the URLs when fetched are to be parsed
                        for stylesheets (S), images (i), scripts (s), frames (f)
                        or objects (o), and these are also to be fetched.  Using
                        -g without any following letters will get none of them.

wwwoffle -r[<depth>]    Specifies that the URL when fetched is to have its links
                        followed and those pages also fetched (to a depth
                        specified by the optional depth parameter, default 1).
                        Only links on the same server are fetched.

wwwoffle -R[<depth>]    This is the same as the '-r' option except that all of
                        the links are followed, even those to other servers.

wwwoffle -d[<depth>]    This is the same as the '-r' option except that links
                        are only followed if they are in the same directory or
                        a sub-directory.

                        (If the -F, -(d|r|R) or -g[Sisfo] options are set they
                        override the options in the FetchOptions section of the
                        config file and only the -g[Sisfo] options are fetched.)


The third mode of operation is to get a URL from the cache.

wwwoffle <URL>          Specifies the URL to get.

wwwoffle -o             Gets the URL and outputs it on the standard output.
                        (Or requests it if not already cached.)

wwwoffle -O             Gets the URL and outputs it on the standard output,
                        including the HTTP headers.
                        (Or requests it if not already cached.)


The last mode of operation is to provide help in using the other modes.

wwwoffle -h             Gives help about the command line options.


With any of the first three modes of operation the WWWOFFLE server can be
specified in one of three different ways.

wwwoffle -c <config-file>
                        Can be used to specify the configuration file that
                        contains the port numbers, server hostname (the first
                        entry in the LocalHost section) and the password (if
                        required for the first mode of operation).  If there is
                        a password then this is the only way to specify it.

wwwoffle -p <host>[:<port>]
                        Can be used to specify the hostname and port number that
                        the daemon program listens to for control messages
                        (first mode) or proxy connections (second and third
                        modes).

WWWOFFLE_PROXY          An environment variable that can be used to specify
                        either the argument to the -c option (must be the full
                        pathname) or the argument to the -p option.  (In this
                        case two ports can be specified, the first for the proxy
                        connection and the second for the control connection,
                        e.g. 'localhost:8080:8081' or 'localhost:8080'.)
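
The host-and-port form of WWWOFFLE_PROXY can be split up like this (a sketch;
how a single-port value maps to the control connection is not stated above, so
the parser simply reports the control port as absent in that case):

```python
def parse_wwwoffle_proxy(value):
    """Split a WWWOFFLE_PROXY value of the form 'host:proxy-port' or
    'host:proxy-port:control-port' into its parts.  When only one port
    is given, the control port is returned as None."""
    host, *ports = value.split(":")
    proxy_port = int(ports[0])
    control_port = int(ports[1]) if len(ports) > 1 else None
    return host, proxy_port, control_port

print(parse_wwwoffle_proxy("localhost:8080:8081"))  # ('localhost', 8080, 8081)
```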


WWWOFFLED - Daemon program
--------------------------

The daemon program (wwwoffled) runs as an HTTP proxy and also accepts
connections from the control program (wwwoffle).

The daemon program needs to maintain the current state of the system, online or
offline, as well as the other parameters in the configuration file.

As HTTP proxy requests come in, the program forks a copy of itself (the
wwwoffles function) to handle the requests.  The server program can also be
forked in response to the wwwoffle program requesting pages to be fetched.


wwwoffled -c <config-file>      Starts the daemon with the named configuration
                                file.

wwwoffled -d [level]            Starts the daemon in debugging mode, i.e. it
                                does not detach from the terminal and uses
                                standard error for the log messages.  The
                                optional numeric level (0 for none to 5 for all,
                                or 6 for more) specifies the level of error
                                messages sent to standard error; if it is not
                                specified then the log-level from the config
                                file is used.

wwwoffled -f                    Starts the daemon in debugging mode (implies -d)
                                and, when the first HTTP request comes in,
                                handles it without creating a child process and
                                then exits.

wwwoffled -p                    Prints the PID of the daemon on standard output
                                before detaching from the terminal.

wwwoffled -h                    Gives help about the command line options.


There are a number of error and informational messages that are generated by the
program as it runs.  By default (in the config file) these go to syslog; by
using the -d flag the daemon does not detach from the terminal and the errors
also appear on standard error.

By using the run-uid and run-gid options in the config file, it is possible to
change the user id and group id that the program runs as.  This requires that
the program is started by root and that the specified user has read/write
access to the spool directory.

630
WWWOFFLES - Server program
--------------------------

The server (wwwoffles) starts by being forked from the daemon (wwwoffled) in one
of four different modes.

Real  - When the system is online and acting as a proxy for a client.
        All requests for web pages are handled by forking a new server which
        will connect to the remote host and fetch the page.  This page is then
        stored in the cache as well as being returned to the client.  If the
        page is already in the cache then the remote server is asked for a newer
        page if one exists, else the cached one is used.

Fetch - When the system is online and fetching pages that have been requested.
        All web page requests in the outgoing directory are fetched by the
        server connecting to the remote host to get the page.  This page is then
        stored in the cache; there is no client active.  If the page has been
        moved then the link is followed and that one is fetched.

Spool - When the system is offline and acting as a proxy for a client.
        All requests for web pages are handled by forking a server that will
        either return a cached page or store the request.  If the page is
        cached, it is returned to the client, else a dummy page is returned
        (and stored in the cache), and the outgoing request is stored.
        If the cached page refers to a page that failed to be downloaded then it
        will be deleted from the cache.

SpoolOrReal - When the system is in autodial mode and it has not yet been
        decided whether to use Spool or Real mode.  Spool mode is selected if
        the page is already cached, and Real mode otherwise as a last resort.

Depending on the existence of files in the spool and other conditions, the mode
can be changed to one of several other modes.

RealNoCache - For requests for pages on the server machine or those specified
        not to be cached in the configuration file.

RealRefresh - Used by the refresh button on the index or the wwwoffle program
        to re-fetch a page while the system is online.

RealNoPassword - Used when a password was provided and two copies of the page
        are required, one with and one without the password.

FetchNoPassword - Used when a password was provided and two copies of the page
        are required, one with and one without the password.

SpoolGet - Used when the page does not exist in the cache so a request needs to
        be stored for it in the outgoing directory.

SpoolRefresh - Used when the refresh button on the index or the wwwoffle program
        is used; the existing spooled page (if there is one) is not
        overwritten, but a request is stored.

SpoolPragma - Used when the client requests the cache to refresh the page
        using the 'Pragma: no-cache' header; the existing spooled page (if there
        is one) is not overwritten, but a request is stored.

InternalPage - Used when the program is generating a web-page internally or is
        spooling a web-page with modifications.
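The initial mode selection described above can be sketched roughly as follows.
This is a minimal illustration written for this document, not the actual C
logic; the function and parameter names are invented:

```python
def initial_mode(state, url_is_cached):
    """Pick the initial wwwoffles mode from the system state.

    state: one of "online", "fetch", "offline" or "autodial".
    url_is_cached: whether the requested URL is already in the cache.
    """
    if state == "online":
        return "Real"          # proxy for a client, fetch from the remote host
    if state == "fetch":
        return "Fetch"         # fetch requested pages, no client active
    if state == "offline":
        return "Spool"         # serve from the cache or store the request
    if state == "autodial":
        # SpoolOrReal: prefer the cache, use Real mode only as a last resort
        return "Spool" if url_is_cached else "Real"
    raise ValueError("unknown state: " + state)
```

The mode may then be changed to one of the other modes listed below depending
on the spool contents and the request.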


WWWOFFLE-TOOLS - Cache maintenance program
------------------------------------------

This is a quick hack program that I wrote to allow you to list the contents of
the cache or move files around in it.  The programs are all named after common
UNIX commands with a 'wwwoffle-' prefix.

All of the programs should be invoked from the spool directory.

wwwoffle-rm     - Deletes the URL that is specified on the command line.
                  To delete all URLs from a host it is easier to use
                  'rm -r http/foo' than to use this.

wwwoffle-mv     - Renames URLs under one path in the spool to another path.
                  Because the URL is encoded in the filename, just renaming the
                  files or the directory will not work.  Instead of 'mv http/foo
                  http/bar' use 'wwwoffle-mv http/foo http/bar'.  It also works
                  for complex cases: 'wwwoffle-mv http://foo/bar http://bar/foo'.

wwwoffle-ls     - Lists a cached URL or the files in a cache directory in the
                  style of 'ls -l'.  For example, use 'wwwoffle-ls http/foo' to
                  list all of the URLs in the cache directory 'http/foo', use
                  'wwwoffle-ls http://foo/' to list the single URL 'http://foo/'
                  or use 'wwwoffle-ls outgoing' to list the outgoing requests.

wwwoffle-read   - Reads data directly from the cache for the URL named on the
                  command line and outputs it on stdout.

wwwoffle-write  - Writes data directly to the cache for the URL named on the
                  command line from stdin.  Note that this requires an HTTP
                  header to be included first or clients may get confused.
                  (echo "HTTP/1.0 200 OK" ; echo "" ; cat bar.html ) | \
                  wwwoffle-write http://www.foo/bar.html

wwwoffle-hash   - Prints WWWOFFLE's encoding of the submitted URL.  This is
                  useful for scripts working on the WWWOFFLE cache.

wwwoffle-fsck   - Checks the WWWOFFLE cache for consistency; it will rename any
                  files where the filename does not match the hash of the URL.

wwwoffle-gzip   - Compresses the contents of the cache so that they take less
                  space but WWWOFFLE can still read them.

wwwoffle-gunzip - Uncompresses the contents of the cache.

All of the programs are the same executable and the name of the file determines
the function.  The wwwoffle-tools executable can also be used with a command
line parameter, for example 'wwwoffle-ls' is the same as 'wwwoffle-tools -ls'.

This program also accepts the '-c <config-file>' option and uses the
'WWWOFFLE_PROXY' environment variable so that the wwwoffle-write program uses
the correct permissions and uid/gid.

These are basically hacks and should not be considered fully featured or fully
debugged programs.


audit-usage.pl - Perl script to check log files
-----------------------------------------------

The audit-usage.pl script (in the contrib directory) can be used to get audit
information from the output of the wwwoffled program.

If wwwoffled is started as

wwwoffled -c /etc/wwwoffle/wwwoffle.conf -d 4

then information about the program as it runs will be generated on standard
error.  The debug level needs to be 4 so that the URL information is output.

If this output is captured into a log file then it can be analysed by the
audit-usage.pl program.  When run, it reports the host that each connection was
made from and the URL that was requested.  It also includes the timestamp
information and connections to the WWWOFFLE control connection.


Test Programs
-------------

In the testprogs directory are two test programs that can be compiled if
required.  They are not needed for WWWOFFLE to work, but if you are customising
the information pages that WWWOFFLE uses or trying to debug the HTML parser
then they will be of use.

These are even more hack-like than the wwwoffle-tools programs; use at your own
risk.


Author and Copyright
--------------------

The two programs wwwoffle and wwwoffled were written and are copyright by Andrew
M. Bishop 1996-2011.

The programs known as wwwoffle-tools were written and are copyright by Andrew
M. Bishop 1997-2011.

The Perl scripts update-config.pl and audit-usage.pl were written and are
copyright by Andrew M. Bishop 1998-2011.

They can be freely distributed according to the terms of the GNU General Public
License (see the file `COPYING').


Ht://Dig
- - - -

The htdig package is copyrighted by Andrew Scherpbier.  The icons in the
html/search/htdig directory come from htdig, as does the search form
html/search/htdig/search.html and the configuration files in search/htdig/conf/*
(with modifications by myself).


mnoGoSearch (UdmSearch)
- - - - - - - - - - - -

The mnoGoSearch package is copyrighted by Lavtech.Com Corp and released under
the GPL.  The mnoGoSearch icon in the html/search/mnogosearch directory comes
from mnoGoSearch, as does the search form html/search/mnogosearch/search.html
and the configuration files in search/mnogosearch/conf/* (with modifications by
myself).


Namazu
- - -

The Namazu package is copyrighted by the Namazu Project and mknmz-wwwoffle is
copyrighted by WATANABE Yoshimasa; both programs are released under the GPL.
The configuration files in search/namazu/conf/* come from Namazu (with
modifications by myself).


Hyper Estraier
- - - - - - -

The Hyper Estraier package is copyrighted by Mikio Hirabayashi and is released
under the LGPL.  The configuration files in search/hyperestraier/conf/* come
from Hyper Estraier (with modifications by myself).


With Source Code Contributions From
- - - - - - - - - - - - - - - - - -

Yannick Versley <sa6z225@public.uni-hamburg.de>
        Initial syslog code (much rewritten before inclusion).

Axel Rasmus Wienberg <2wienbe@informatik.uni-hamburg.de>
        Code to run wwwoffled as a specified uid/gid.

Andreas Dietrich <quasi@baccus.franken.de>
        Code to detach the program from the terminal like a *real* daemon.

Ullrich von Bassewitz <uz@wuschel.ibb.schwaben.com>
        Better handling of signals.
        Optimisation of the file handling in the outgoing directory.
        The log-level, max-servers and max-fetch-servers config options.

Tilman Bohn <tb@bohn.isdn.uni-heidelberg.de>
        Autodial mode.

Walter Pfannenmueller <pfn@online.de>
        Document parsing for Java/VRML/XML and some HTML.

Ben Winslow <rain@insane.loonybin.net>
        The optional replacement URL in the configuration file DontGet section.
        New FTP commands to get file size and modification time.

Ingo Kloecker <kloecker@math.u-bordeaux.fr>
        Disabling animated GIFs (code now removed and rewritten).

David McNab <david@rebirthing.co.nz>
        Workaround for a winsock bug in cygwin (now a lingering close on all
        systems).

Olaf Buddenhagen <olafbuddenhagen@web.de>
        A patch to do random sorting in the indexes.

Jan Lukoschus <jan+wof@lukoschus.de>
        The patch for wwwoffle-hash (for wwwoffle-tools).

Paul A. Rombouts <p.a.rombouts@home.nl>
        The patch to force re-requests of redirection URLs.
        The patch to allow wildcards to have more than two '*' characters.
        The patch to allow local CGI scripts to be run.
        The patch to keep the backup copy of a page in case of server error.

Marc Boucher <MarcB@box100.com>
        The patch to perform case insensitive matching of URL-SPECs.
        The patch to handle FTP requests made with a password (like HTTP).

Ilya Dogolazky <ilyad@math.uni-bonn.de>
        The patch for the fix-mixed-cyrillic option.

Dieter <netbsd@sopwith.solgatos.com>
        A patch with some 64-bit/32-bit compatibility fixes (that prompted me to
        go and find and fix a lot more).

Andreas Mohr <andi@rhlx01.fht-esslingen.de>
        A patch to add "const" to lots of structures and function parameters
        (this prompted me to go and do a lot more).

Nils Kassube <kassube@gmx.net>
        The patch for the referer-from option.


And Other Useful Contributions From
- - - - - - - - - - - - - - - - - -

Too many people to mention - (everybody that e-mailed me).
        Suggestions and bug reports.


README.1st

          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
          ========================================================


There are several help files; this is a meta-help file that will point you to
the correct one to read.


If you want to know what WWWOFFLE is and does:

doc/README


If you have not used WWWOFFLE before:

doc/README
doc/INSTALL
doc/README.lang             [Non-English versions]
doc/README.CONF
doc/README.win32            [Win32 users]


If you have used WWWOFFLE before:

doc/NEWS
doc/CHANGES.CONF
doc/README.CONF


If you want to know how to install the program or change compile time options:

doc/INSTALL
doc/README.lang             [Non-English versions]
doc/README.htdig
doc/README.mnogosearch
doc/README.namazu
doc/README.hyperestraier
doc/README.win32            [Win32 users]


If you want to know how to use the programs when installed:

wwwoffle(1)                 [UNIX users]
wwwoffled(8)                [UNIX users]
doc/README
doc/README.htdig
doc/README.mnogosearch
doc/README.namazu
doc/README.hyperestraier


If you want to know how to configure the program:

wwwoffle.conf               [UNIX users]
wwwoffle.conf(5)            [UNIX users]
doc/README.CONF
doc/CHANGES.CONF
doc/README.htdig
doc/README.mnogosearch
doc/README.namazu
doc/README.hyperestraier


If you want to customise the error message and other pages:

html/$LANG/messages/README


If you want to do more with WWWOFFLE:

contrib/README          [UNIX users]
contrib-win32/README    [Win32 users]



The location and purpose of the sources of information are listed below.

Files in the doc directory (see also the language specific subdirectories):

ANNOUNCE           The announcement of this release of WWWOFFLE.

COPYING            The WWWOFFLE licence (the GNU General Public Licence).

NEWS               A list of the changes in functionality in previous versions.

INSTALL            Help on installing the program and the compile-time options.

README             The main README file, the usual README information about the
                   program.
README.1st         This file - a meta-index.

README.CONF        Help on the format of the config file.
CHANGES.CONF       A list of the config file changes from version 1.3 to now.

README.PWD         Information about how password protected pages are stored.
README.URL         Information about how URLs are processed (decoding/encoding).
README.compress    Information about the problems with using cache compression.
README.https       Information about https, SSL/TLS, certificates and trust.

README.lang        A description of the non-English language web-page versions.

README.htdig         How to use ht://Dig to search the cache.
README.mnogosearch   How to use mnoGoSearch to search the cache.
README.namazu        How to use Namazu to search the cache.
README.hyperestraier How to use Hyper Estraier to search the cache.

README.win32       Information about the Win32 version of WWWOFFLE.


Manual pages (installed names listed, files are in the doc directory):

wwwoffle(1)        The wwwoffle user interface program UNIX manual page.
wwwoffle.conf(5)   The wwwoffle.conf configuration file UNIX manual page.
wwwoffled(8)       The wwwoffled server daemon program UNIX manual page.


Other files:

html/$LANG/messages/README Information on how to customise the error messages.

contrib/README        Various scripts for use with WWWOFFLE.
contrib-win32/README  Various scripts for use with WWWOFFLE.


README.CONF

              WWWOFFLE - Configuration File - Version 2.9
              ===========================================

Introduction
------------

The configuration file (wwwoffle.conf) specifies all of the parameters that
control the operation of the proxy server.  The file CHANGES.CONF explains the
changes in the configuration file between this version of the program and
previous ones.

The file is split into named sections, each of which can be empty or contain
one or more lines of configuration information.  The order in which the
sections appear in the file is not important.

The general format of each of the sections is the same.  The name of the
section is on a line by itself to mark the start.  The contents of the section
are enclosed between a pair of lines containing only the '{' and '}'
characters or the '[' and ']' characters.  When the '{' and '}' characters are
used, the lines between them contain configuration information.  When the '['
and ']' characters are used, there must be only a single non-empty line
between them, containing the name of a file (in the same directory) that holds
the configuration information for the section.

Comments are marked by a '#' character at the start of the line and are
ignored.  Blank lines are also allowed and ignored.

The phrases URL-SPECIFICATION (or URL-SPEC for short) and WILDCARD have
specific meanings in the configuration file and are described at the end.  Any
item enclosed in '(' and ')' in the descriptions is a parameter supplied by
the user, anything enclosed in '[' and ']' is optional, and the '|' symbol
denotes alternative choices.  Some options apply to specific URLs only; this
is indicated by a URL-SPECIFICATION enclosed between '<' and '>' in the
option, and the first URL-SPECIFICATION to match is used.  If no
URL-SPECIFICATION is given then the option matches all URLs.
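As an illustration of this syntax, here is a hypothetical fragment showing both
section styles (the option shown is from the Options section below; the
included filename is invented for this example):

```
# The Options section with its contents given inline.
Options
{
 log-level = important
}

# The StartUp section with its contents held in a file in the same directory.
StartUp
[
 wwwoffle.conf.startup
]
```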

--------------------------------------------------------------------------------

StartUp
-------

This contains the parameters that are used when the program starts; changes to
these are ignored if the configuration file is re-read while the program is
running.

bind-ipv4 = (hostname) | (ip-address) | none
        Specify the hostname or IP address to bind the HTTP proxy and WWWOFFLE
        control port sockets to using IPv4 (default='0.0.0.0').  If 'none' is
        specified then no IPv4 socket is bound.  If this is changed from the
        default value then the first entry in the LocalHost section may need
        to be changed to match.

bind-ipv6 = (hostname) | (ip-address) | none
        Specify the hostname or IP address to bind the HTTP proxy and WWWOFFLE
        control port sockets to using IPv6 (default='::').  If 'none' is
        specified then no IPv6 socket is bound.  This requires the IPv6
        compilation option.  If this is changed from the default value then
        the first entry in the LocalHost section may need to be changed to
        match.

http-port = (port)
        An integer specifying the port number for connections to access the
        internal WWWOFFLE pages and for HTTP/HTTPS/FTP proxying
        (default=8080).  This is the port number that must be specified in the
        client to connect to the WWWOFFLE proxy for HTTP/HTTPS/FTP proxying.

https-port = (port)
        An integer specifying the port number for encrypted connections to
        access the internal WWWOFFLE pages and for HTTP/FTP proxying
        (default=8443).  Requires the gnutls compilation option.

wwwoffle-port = (port)
        An integer specifying the port number for the WWWOFFLE control
        connections to use (default=8081).

spool-dir = (dir)
        The full pathname of the top level cache directory (spool directory)
        (default=/var/spool/wwwoffle or whatever was used when the program was
        compiled).

run-uid = (user) | (uid)
        The username or numeric uid to change to when the WWWOFFLE server is
        started (default=none).  This option only works if the server is
        started by the root user on UNIX-like systems.

run-gid = (group) | (gid)
        The group name or numeric gid to change to when the WWWOFFLE server is
        started (default=none).  This option only works if the server is
        started by the root user on UNIX-like systems.

use-syslog = yes | no
        Whether to use the syslog facility for messages or not (default=yes).

password = (word)
        The password used for authentication of the control pages, for
        deleting cached pages etc. (default=none).  For the password to be
        secure the configuration file must be set so that only authorised
        users can read it.

max-servers = (integer)
        The maximum number of server processes that are started for online and
        automatic fetching (default=8).

max-fetch-servers = (integer)
        The maximum number of server processes that are started to fetch pages
        that were marked in offline mode (default=4).  This value must be less
        than max-servers or you will not be able to use WWWOFFLE interactively
        online while fetching.
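For example, a StartUp section using only the options documented above might
look like this (the port numbers and pathname are the defaults; the 'proxy'
user and group are illustrative):

```
StartUp
{
 http-port         = 8080
 wwwoffle-port     = 8081
 spool-dir         = /var/spool/wwwoffle
 run-uid           = proxy
 run-gid           = proxy
 use-syslog        = yes
 max-servers       = 8
 max-fetch-servers = 4
}
```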
--------------------------------------------------------------------------------

Options
-------

Options that control how the program works.

log-level = debug | info | important | warning | fatal
        The minimum log level for messages in syslog or on stderr
        (default=important).

socket-timeout = (time)
        The time in seconds that WWWOFFLE will wait for data on a socket
        connection before giving up (default=120).

dns-timeout = (time)
        The time in seconds that WWWOFFLE will wait for a DNS (Domain Name
        Service) lookup before giving up (default=60).

connect-timeout = (time)
        The time in seconds that WWWOFFLE will wait for the socket connection
        to be made before giving up (default=30).

connect-retry = yes | no
        Whether WWWOFFLE should try again after a short delay if a connection
        cannot be made to a remote server (default=no).

dir-perm = (octal int)
        The directory permissions to use when creating spool directories
        (default=0755).  This option overrides the umask of the user and must
        be in octal starting with a '0'.

file-perm = (octal int)
        The file permissions to use when creating spool files (default=0644).
        This option overrides the umask of the user and must be in octal
        starting with a '0'.

run-online = (filename)
        The full pathname of a program to run when WWWOFFLE is switched to
        online mode (default=none).  The program is started in the background
        with a single parameter set to the current mode name "online".

run-offline = (filename)
        The full pathname of a program to run when WWWOFFLE is switched to
        offline mode (default=none).  The program is started in the background
        with a single parameter set to the current mode name "offline".

run-autodial = (filename)
        The full pathname of a program to run when WWWOFFLE is switched to
        autodial mode (default=none).  The program is started in the background
        with a single parameter set to the current mode name "fetch".

run-fetch = (filename)
        The full pathname of a program to run when a WWWOFFLE fetch starts or
        stops (default=none).  The program is started in the background with two
        parameters, the first is the word "fetch" and the second is "start" or
        "stop".
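As a sketch of how the run-* hooks receive their parameters, the following is a
hypothetical hook script (not shipped with WWWOFFLE; any executable file named
by these options will do):

```python
#!/usr/bin/env python3
"""Hypothetical run-* hook: report the mode changes that WWWOFFLE signals."""
import sys

def describe(args):
    # run-online/run-offline/run-autodial pass one parameter (the mode name);
    # run-fetch passes two ("fetch" followed by "start" or "stop").
    if len(args) == 2 and args[0] == "fetch":
        return "fetch " + args[1]
    if len(args) == 1:
        return "switched to mode: " + args[0]
    return "unexpected arguments: " + " ".join(args)

if __name__ == "__main__":
    sys.stderr.write(describe(sys.argv[1:]) + "\n")
```

Such a script could, for example, bring a dial-up link up or down when the mode
changes.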

lock-files = yes | no
        Enable the use of lock files to stop more than one WWWOFFLE process
        from downloading the same URL at the same time (default=no).  Disabling
        the lock files may result in incomplete pages being displayed or many
        copies being downloaded if multiple requests are made for the same URL
        at the same time.

reply-compressed-data = yes | no
        Whether the replies that are made to the client are to contain
        compressed data when requested (default=no).  Requires the zlib
        compilation option.

reply-chunked-data = yes | no
        Whether the replies that are made to the client are to use chunked
        encoding when possible (default=yes).

exec-cgi = (pathname)
        Enable the use of CGI scripts for the local pages on the WWWOFFLE
        server that match the wildcard pathname (default=none).
--------------------------------------------------------------------------------

OnlineOptions
-------------

Options that control how WWWOFFLE behaves when it is online.

[<URL-SPEC>] pragma-no-cache = yes | no
        Whether to request a new copy of a page if the request from the client
        has 'Pragma: no-cache' (default=yes).  This option takes precedence
        over the request-changed and request-changed-once options.

[<URL-SPEC>] cache-control-no-cache = yes | no
        Whether to request a new copy of a page if the request from the client
        has 'Cache-Control: no-cache' (default=yes).  This option takes
        precedence over the request-changed and request-changed-once options.

[<URL-SPEC>] cache-control-max-age-0 = yes | no
        Whether to request a new copy of a page if the request from the client
        has 'Cache-Control: max-age=0' (default=yes).  This option takes
        precedence over the request-changed and request-changed-once options.

[<URL-SPEC>] cookies-force-refresh = yes | no
        Whether to force the refresh of a page if the request from the client
        contains a cookie (default=no).  This option takes precedence over
        the request-changed and request-changed-once options.

[<URL-SPEC>] request-changed = (time)
        While online, pages will only be fetched if the cached version is older
        than this specified time in seconds (default=600).  Setting this value
        negative indicates that cached pages are always used while online.
        Longer times can be specified with an 'm', 'h', 'd' or 'w' suffix for
        minutes, hours, days or weeks (e.g. 10m=600).

[<URL-SPEC>] request-changed-once = yes | no
        While online, pages will only be fetched if the cached version has not
        already been fetched once this online session (default=yes).  This
        option takes precedence over the request-changed option.

[<URL-SPEC>] request-expired = yes | no
        While online, pages that have expired will always be requested again
        (default=no).  This option takes precedence over the request-changed
        and request-changed-once options.

[<URL-SPEC>] request-no-cache = yes | no
        While online, pages that ask not to be cached will always be requested
        again (default=no).  This option takes precedence over the
        request-changed and request-changed-once options.

[<URL-SPEC>] request-redirection = yes | no
        While online, pages that redirect the client to another URL temporarily
        will be requested again (default=no).  This option takes precedence
        over the request-changed and request-changed-once options.

[<URL-SPEC>] request-conditional = yes | no
        While online, requests that are made to the server will be conditional
        requests so that the server only sends data if the page has changed
        (default=yes).

[<URL-SPEC>] validate-with-etag = yes | no
        When making a conditional request to a server, enable the use of the
        HTTP/1.1 cache validator 'ETag' as well as the modification time
        'If-Modified-Since' (default=yes).  The request-conditional option
        must also be selected for this option to take effect.

[<URL-SPEC>] try-without-password = yes | no
        If a request is made for a URL that contains a username and password
        then a request is also made for the same URL without a username and
        password specified (default=yes).  This allows requests for the URL
        without a password to redirect the client to the passworded version.

[<URL-SPEC>] intr-download-keep = yes | no
        If the client closes the connection while online then the currently
        downloaded incomplete page should be kept (default=no).

[<URL-SPEC>] intr-download-size = (integer)
        If the client closes the connection while online then the page should
        continue to download if it is smaller than this size in kB
        (default=1).

[<URL-SPEC>] intr-download-percent = (integer)
        If the client closes the connection while online then the page should
        continue to download if it is more than this percentage complete
        (default=80).

[<URL-SPEC>] timeout-download-keep = yes | no
        If the server connection times out while reading then the currently
        downloaded incomplete page should be kept (default=no).

[<URL-SPEC>] keep-cache-if-not-found = yes | no
        If the remote server replies with an error message or a redirection
        while there is a cached version with status 200 then the previously
        cached version should be kept (default=no).

[<URL-SPEC>] request-compressed-data = yes | no
        Whether the requests that are made to the server are to request
        compressed data (default=yes).  Requires the zlib compilation option.

[<URL-SPEC>] request-chunked-data = yes | no
        Whether the requests that are made to the server are to request chunked
        encoding (default=yes).
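The time suffixes accepted by the request-changed option can be illustrated
with a small sketch (written in Python for this document; the function name is
invented):

```python
# Convert a wwwoffle.conf time value such as '600', '10m' or '2w' to seconds.
# Suffixes (as described above): 'm' minutes, 'h' hours, 'd' days, 'w' weeks.
SUFFIX_SECONDS = {"m": 60, "h": 3600, "d": 86400, "w": 604800}

def time_to_seconds(value):
    value = value.strip()
    if value and value[-1].lower() in SUFFIX_SECONDS:
        return int(value[:-1]) * SUFFIX_SECONDS[value[-1].lower()]
    return int(value)  # a bare number is already in seconds
```

So 'request-changed = 1h' means that while online a cached page is reused
unless it is more than 3600 seconds old.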
291--------------------------------------------------------------------------------
292
293OfflineOptions
294--------------
295
296Options that control how WWWOFFLE behaves when it is offline.
297
298[<URL-SPEC>] pragma-no-cache = yes | no
299        Whether to request a new copy of a page if the request from the client
300        has 'Pragma: no-cache' (default=yes).  This option should be set to
301        'no' if when browsing offline all pages are re-requested by a 'broken'
302        browser.
303
304[<URL-SPEC>] cache-control-no-cache = yes | no
305        Whether to request a new copy of a page if the request from the client
306        has 'Cache-Control: no-cache' (default=yes).  This option should be
307        set to 'no' if when browsing offline all pages are re-requested by a
308        'broken' browser.
309
310[<URL-SPEC>] cache-control-max-age-0 = yes | no
311        Whether to request a new copy of a page if the request from the client
312        has 'Cache-Control: max-age=0' (default=yes).  This option should be
313        set to 'no' if when browsing offline all pages are re-requested by a
314        'broken' browser.
315
316[<URL-SPEC>] confirm-requests = yes | no
317        Whether to return a page requiring user confirmation instead of
318        automatically recording requests made while offline (default=no).
319
320[<URL-SPEC>] dont-request = yes | no
321        Do not request any URLs that match this when offline (default=no).
322
323--------------------------------------------------------------------------------
324
325SSLOptions
326----------
327
328Options that control how WWWOFFLE behaves when a connection is made to it for an
329https or Secure Sockets Layer (SSL) server.  Normally only tunnelling (with no
330decryption or caching of the data) is possible.  When WWWOFFLE is compiled with
331the GnuTLS library it is possible to configure WWWOFFLE to decrypt, cache and
332re-encrypt the connections.
333
334quick-key-gen = yes | no
335        Normally generation of secret keys for the SSL/https functions uses the
336        default GnuTLS random-number source.  This can be slow on some
337        machines, so this option selects a quicker but less secure random
338        number source (default = no).  Requires GnuTLS compilation option.
339
340expiration-time = (age)
341        The length of time after creation before each certificate expires
342        (default = 1y).  Requires GnuTLS compilation option.
343
344enable-caching = yes | no
345        If caching (involving decryption and re-encryption) of SSL/https
346        server connections is allowed (default = no).  Requires GnuTLS
347        compilation option.
348
349allow-tunnel = (host[:port])
350        A hostname and port number (a WILDCARD match) for an https/SSL server
351        that can be connected to using WWWOFFLE as a tunnelling proxy (no
352        caching or decryption of the data) (default is no hosts or ports
353        allowed).  This option should be set to *:443 to allow https to the
354        default port number.  There can be more than one option for other
355        ports or hosts as required.  This option takes precedence over the
356        allow-cache option.  The host value is matched against the URL as
357        presented; no hostname-to-IP or IP-to-hostname lookups are performed
358        to find alternative equivalent names.
359
360disallow-tunnel = (host[:port])
361        A hostname and port number (a WILDCARD match) for an https/SSL server
362        that can not be connected to using WWWOFFLE as a tunnelling proxy.
363        There can be more than one option for other ports or hosts as
364        required.  This option takes precedence over the allow-tunnel option.
365        The host value is matched against the URL as presented; no
366        hostname-to-IP or IP-to-hostname lookups are performed to find alternative
367        equivalent names.
368
369allow-cache = (host[:port])
370        A hostname and port number (a WILDCARD match) for an https/SSL server
371        that can be connected to using WWWOFFLE as a caching proxy (decryption
372        of the data) (default is no hosts or ports allowed).  This option
373        should be set to *:443 to allow https to the default port number.
374        There can be more than one option for other ports or hosts as
375        required.  The host value is matched against the URL as presented;
376        no hostname-to-IP or IP-to-hostname lookups are performed to find
377        alternative equivalent names.  Requires GnuTLS compilation option.
378
379disallow-cache = (host[:port])
380        A hostname and port number (a WILDCARD match) for an https/SSL server
381        that can not be connected to using WWWOFFLE as a caching proxy.  This
382        option takes precedence over the allow-cache option.  The host value
383        is matched against the URL as presented; no hostname-to-IP or
384        IP-to-hostname lookups are performed to find alternative equivalent names.
385        Requires GnuTLS compilation option.
386
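A sketch of how the tunnel and cache options fit together, using a placeholder
hostname (the section layout is the usual wwwoffle.conf style and is
illustrative only):

```
SSLOptions
{
 enable-caching  = yes
 allow-tunnel    = *:443
 disallow-tunnel = secure.example.com:443
 allow-cache     = secure.example.com:443
}
```

Because disallow-tunnel takes precedence over allow-tunnel, and allow-tunnel
over allow-cache, all https hosts are tunnelled except secure.example.com,
which is decrypted and cached.
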
387--------------------------------------------------------------------------------
388
389FetchOptions
390------------
391
392Options that control what linked elements are downloaded when fetching pages
393that were requested while offline.
394
395[<URL-SPEC>] stylesheets = yes | no
396        If style sheets are to be fetched (default=no).
397
398[<URL-SPEC>] images = yes | no
399        If images are to be fetched (default=no).
400
401[<URL-SPEC>] webbug-images = yes | no
402        If images that are declared in the HTML to be 1 pixel square are also to
403        be fetched; this requires the images option to also be selected
404        (default=yes).  If these images are not fetched then the
405        replace-webbug-images option in the ModifyHTML section can be used to
406        stop browsers requesting them.
407
408[<URL-SPEC>] icon-images = yes | no
409        If icons (also called favourite icons or shortcut icons) as used by
410        browsers for bookmarks are to be fetched (default=no).
411
412[<URL-SPEC>] only-same-host-images = yes | no
413        If the only images that are fetched are the ones that are on the same
414        host as the page that references them; this requires the images option to
415        also be selected (default=no).
416
417[<URL-SPEC>] frames = yes | no
418        If frames are to be fetched (default=no).
419
420[<URL-SPEC>] iframes = yes | no
421        If inline frames (iframes) are to be fetched (default=no).
422
423[<URL-SPEC>] scripts = yes | no
424        If scripts (e.g. Javascript) are to be fetched (default=no).
425
426[<URL-SPEC>] objects = yes | no
427        If objects (e.g. Java class files) are to be fetched (default=no).
428
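For example, a section that fetches the elements needed for page layout while
skipping scripts and 1-pixel tracking images could look like this (layout
illustrative):

```
FetchOptions
{
 stylesheets   = yes
 images        = yes
 webbug-images = no
 frames        = yes
 scripts       = no
}
```

Note that webbug-images only has an effect here because the images option is
also enabled.
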
429--------------------------------------------------------------------------------
430
431IndexOptions
432------------
433
434Options that control what is displayed in the indexes.
435
436create-history-indexes = yes | no
437        Enables creation of the lasttime/prevtime and lastout/prevout indexes
438        (default=yes).  The indexes are always cycled and flushed even if
439        this option is disabled.
440
441cycle-indexes-daily = yes | no
442        Cycles the lasttime/prevtime and lastout/prevout indexes daily instead
443        of each time online or fetching (default = no).
444
445<URL-SPEC> list-outgoing = yes | no
446        Choose if the URL is to be listed in the outgoing index (default=yes).
447
448<URL-SPEC> list-latest = yes | no
449        Choose if the URL is to be listed in the lasttime/prevtime and
450        lastout/prevout indexes (default=yes).
451
452<URL-SPEC> list-monitor = yes | no
453        Choose if the URL is to be listed in the monitor index (default=yes).
454
455<URL-SPEC> list-host = yes | no
456        Choose if the URL is to be listed in the host indexes (default=yes).
457
458<URL-SPEC> list-any = yes | no
459        Choose if the URL is to be listed in any of the indexes (default=yes).
460
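A short illustrative section (hostname is a placeholder) that cycles the
indexes daily and hides a hypothetical advert host from all indexes:

```
IndexOptions
{
 cycle-indexes-daily = yes
 <http://ads.example.com/*> list-any = no
}
```
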
461--------------------------------------------------------------------------------
462
463ModifyHTML
464----------
465
466Options that control how the HTML that is provided from the cache is modified.
467
468[<URL-SPEC>] enable-modify-html = yes | no
469        Enable the HTML modifications in this section (default=no).  With this
470        option disabled the following HTML options will not have any effect.
471        With this option enabled there is a small speed penalty.
472
473[<URL-SPEC>] add-cache-info = yes | no
474        Adds the date that the page was cached and some navigation buttons
475        at the bottom of every spooled page (default=no).
476
477[<URL-SPEC>] anchor-cached-begin = (HTML code) |
478        Anchors (links) in the spooled page that are in the cache are to have
479        the specified HTML inserted at the beginning (default="").
480
481[<URL-SPEC>] anchor-cached-end = (HTML code) |
482        Anchors (links) in the spooled page that are in the cache are to have
483        the specified HTML inserted at the end (default="").
484
485[<URL-SPEC>] anchor-requested-begin = (HTML code) |
486        Anchors (links) in the spooled page that are not in the cache but have
487        been requested for download are to have the specified HTML inserted at
488        the beginning (default="").
489
490[<URL-SPEC>] anchor-requested-end = (HTML code) |
491        Anchors (links) in the spooled page that are not in the cache but have
492        been requested for download are to have the specified HTML inserted at
493        the end (default="").
494
495[<URL-SPEC>] anchor-not-cached-begin = (HTML code) |
496        Anchors (links) in the spooled page that are not in the cache or
497        requested are to have the specified HTML inserted at the beginning
498        (default="").
499
500[<URL-SPEC>] anchor-not-cached-end = (HTML code) |
501        Anchors (links) in the spooled page that are not in the cache or
502        requested are to have the specified HTML inserted at the end
503        (default="").
504
505[<URL-SPEC>] disable-script = yes | no
506        Removes all scripts and scripted events (default=no).
507
508[<URL-SPEC>] disable-applet = yes | no
509        Removes all Java applets (default=no).
510
511[<URL-SPEC>] disable-style = yes | no
512        Removes all stylesheets and style references (default=no).
513
514[<URL-SPEC>] disable-blink = yes | no
515        Removes the <blink> tag from HTML but does not disable blink in
516        stylesheets (default=no).
517
518[<URL-SPEC>] disable-marquee = yes | no
519        Removes the <marquee> tag from HTML to stop scrolling text
520        (default=no).
521
522[<URL-SPEC>] disable-flash = yes | no
523        Removes any Shockwave Flash animations (default=no).
524
525[<URL-SPEC>] disable-iframe = yes | no
526        Removes any inline frames (the <iframe> tag) from HTML (default=no).
527
528[<URL-SPEC>] disable-meta-refresh = yes | no
529        Removes any meta tags in the HTML header that re-direct the client to
530        change to another page after an optional delay (default=no).
531
532[<URL-SPEC>] disable-meta-refresh-self = yes | no
533        Removes any meta tags in the HTML header that re-direct the client to
534        reload the same page after a delay (default=no).
535
536[<URL-SPEC>] disable-meta-set-cookie = yes | no
537        Removes any meta tags in the HTML header that cause cookies to be set
538        (default=no).
539
540[<URL-SPEC>] disable-dontget-links = yes | no
541        Disables any links to URLs that are in the DontGet section of the
542        configuration file (default=no).
543
544[<URL-SPEC>] disable-dontget-iframes = yes | no
545        Disables inline frame (iframe) URLs that are in the DontGet section of
546        the configuration file (default=no).
547
548[<URL-SPEC>] replace-dontget-images = yes | no
549        Replaces image URLs that are in the DontGet section of the
550        configuration file with a static URL (default=no).
551
552[<URL-SPEC>] replacement-dontget-image = (URL)
553        The replacement image to use for URLs that are in the DontGet section
554        of the configuration file (default=/local/dontget/replacement.gif).
555
556[<URL-SPEC>] replace-webbug-images = yes | no
557        Replaces image URLs that are 1 pixel square with a static URL
558        (default=no).  The webbug-images option in the FetchOptions section
559        can be used to stop these images from being automatically downloaded.
560
561[<URL-SPEC>] replacement-webbug-image = (URL)
562        The replacement image to use for images that are 1 pixel square
563        (default=/local/dontget/replacement.gif).
564
565[<URL-SPEC>] demoronise-ms-chars = yes | no
566        Replaces strange characters that some Microsoft applications put into
567        HTML with character equivalents that most browsers can display
568        (default=no).  The idea for this comes from the public domain
569        Demoroniser perl script.
570
571[<URL-SPEC>] fix-mixed-cyrillic = yes | no
572        Replaces cp-1251 encoded punctuation characters that appear mixed
573        with koi-8 encoded text in some Cyrillic web pages.
574
575[<URL-SPEC>] disable-animated-gif = yes | no
576        Disables the animation in animated GIF files (default=no).
577
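As a sketch (layout and the HTML markers are illustrative), a section that
marks uncached links in italics and strips scripts might read:

```
ModifyHTML
{
 enable-modify-html      = yes
 add-cache-info          = yes
 anchor-not-cached-begin = <i>
 anchor-not-cached-end   = </i>
 disable-script          = yes
 replace-webbug-images   = yes
}
```

Remember that none of these options do anything unless enable-modify-html is
set to yes.
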
578--------------------------------------------------------------------------------
579
580LocalHost
581---------
582
583A list of hostnames that the host running the WWWOFFLE server may be known by.
584This is so that the proxy does not need to contact itself if the request uses a
585different name for the same server.
586
587(host)
588        A hostname or IP address that in connection with the port number (in
589        the StartUp section) specifies the WWWOFFLE proxy HTTP server.  The
590        hostnames must match exactly; this is not a WILDCARD match.  The first
591        named host is used as the server name for several features so must be
592        a name that will work from any client host on the network.  The
593        entries can be hostnames, IPv4 addresses or IPv6 addresses enclosed
594        within '[...]'.  None of the hosts named here are cached or fetched
595        via a proxy.
596
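An illustrative section (all names are placeholders); the first entry is the
one used as the server name for several features:

```
LocalHost
{
 proxy.example.com
 localhost
 127.0.0.1
 [::1]
}
```
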
597--------------------------------------------------------------------------------
598
599LocalNet
600--------
601
602A list of hostnames whose web servers are always accessible even when offline
603and are not to be cached by WWWOFFLE because they are on a local network.
604
605(host)
606        A hostname or IP address that is always available and is not to be
607        cached by WWWOFFLE.  The host name matching uses WILDCARDs.  A host
608        can be excluded by prepending a '!' to the name.  The host value is
609        matched against the URL as presented; no hostname-to-IP or
610        IP-to-hostname lookups are performed to find alternative equivalent
611        names.  The entries can be hostnames, IPv4 addresses or IPv6 addresses
612        enclosed within '[...]'.  All entries here are assumed to be reachable
613        even when offline.  None of the hosts named here are cached or fetched
614        via a proxy.
615
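For example (placeholder names), a local network with one host excluded by
the '!' prefix:

```
LocalNet
{
 *.intranet.example.com
 !gateway.intranet.example.com
 192.168.0.*
}
```
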
616--------------------------------------------------------------------------------
617
618AllowedConnectHosts
619-------------------
620
621A list of client hostnames that are allowed to connect to the server.
622
623(host)
624        A hostname or IP address that is allowed to connect to the server.
625        The host name matching uses WILDCARDs.  A host can be excluded by
626        prepending a '!' to the name.  If the IP address or
627        hostname (if available) of the machine connecting matches then it is
628        allowed.  The entries can be hostnames, IPv4 addresses or IPv6
629        addresses enclosed within '[...]'.  All of the hosts named in
630        LocalHost are also allowed to connect.
631
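An illustrative section (placeholder names) allowing a domain and a subnet to
connect while excluding one host:

```
AllowedConnectHosts
{
 *.example.com
 !guest.example.com
 10.0.0.*
}
```
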
632--------------------------------------------------------------------------------
633
634AllowedConnectUsers
635-------------------
636
637A list of the users that are allowed to connect to the server and their
638passwords.
639
640(username):(password)
641        The username and password of the users that are allowed to connect to
642        the server.  If this section is left empty then no user authentication
643        is done.  The username and password are both stored in plaintext
644        format.  This requires the use of clients that handle the HTTP/1.1
645        proxy authentication standard.
646
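For example (hypothetical users; remember the passwords are stored in
plaintext):

```
AllowedConnectUsers
{
 alice:wonderland
 bob:builder
}
```
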
647--------------------------------------------------------------------------------
648
649DontCache
650---------
651
652A list of URLs that are not to be cached by WWWOFFLE.
653
654[!]URL-SPECIFICATION
655        Do not cache any URLs that match this.  The URL-SPECIFICATION can be
656        negated to allow matches to be cached.  The URLs that are not cached
657        will not have requests recorded if offline or fetched automatically.
658
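A short illustrative section (placeholder host) showing how negation lets one
subtree of an otherwise uncached server be cached:

```
DontCache
{
 http://webmail.example.com/*
 !http://webmail.example.com/help/*
}
```
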
659--------------------------------------------------------------------------------
660
661DontGet
662-------
663
664A list of URLs that are not to be got by WWWOFFLE when it is fetching and not
665to be served from the WWWOFFLE cache even if they exist.
666
667[!]URL-SPECIFICATION
668        Do not get any URLs that match this.  The URL-SPECIFICATION can be
669        negated to allow matches to be got.
670
671[<URL-SPEC>] replacement = (URL)
672        The URL to use to replace any URLs that match the URL-SPECIFICATIONs
673        instead of using the standard error message (default=none).  The URLs
674        in /local/dontget/ are suggested replacements (e.g. replacement.gif or
675        replacement.png which are 1x1 pixel transparent images or
676        replacement.js which is an empty javascript file).
677
678<URL-SPEC> get-recursive = yes | no
679        Choose whether to get URLs that match this when doing a recursive
680        fetch (default=yes).
681
682<URL-SPEC> location-error = yes | no
683        When a URL reply contains a 'Location' header that redirects to a URL
684        that is not got (specified in this section) then the reply is modified
685        to be an error message instead (default=no).  This will stop ISP
686        proxies from redirecting users to adverts if the advert URLs are
687        in this section.
688
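The options above might combine as follows (the advert hostname is a
placeholder; the replacement image is one of the suggested files in
/local/dontget/):

```
DontGet
{
 http://ads.example.com/*
 <http://ads.example.com/*> replacement = /local/dontget/replacement.gif
 <*://*/*.exe> get-recursive = no
}
```
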
689--------------------------------------------------------------------------------
690
691DontCompress
692------------
693
694A list of MIME types and file extensions that are not to be compressed by
695WWWOFFLE (because they are already compressed or not worth compressing).
696Requires zlib compilation option.
697
698mime-type = (mime-type)/(subtype)
699        The MIME type of a URL that is not to be compressed in the cache (when
700        purging) or when providing pages to clients.
701
702file-ext = .(file-ext)
703        The file extension of a URL that is not to be requested compressed
704        from a server.
705
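An illustrative section listing some already-compressed types and extensions:

```
DontCompress
{
 mime-type = image/jpeg
 mime-type = application/zip
 file-ext  = .gz
 file-ext  = .jpg
}
```
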
706--------------------------------------------------------------------------------
707
708CensorHeader
709------------
710
711A list of HTTP header lines that are to be removed from the requests sent to
712web servers and the replies that come back from them.
713
714[<URL-SPEC>] (header) = yes | no | (string)
715        A header field name (e.g. From, Cookie, Set-Cookie, User-Agent) and
716        the string to replace the header value with (default=no).  The header
717        name is case sensitive and does not have a ':' at the end.  A value of
718        "no" leaves the header unmodified, "yes" (or no string at all) removes
719        the header, and any other string replaces the header's value.  This
720        only replaces headers that are found; it does not add any new
721        ones.  An option for Referer here will take precedence over the
722        referer-self and referer-self-dir options.
723
724[<URL-SPEC>] referer-self = yes | no
725        Sets the Referer header to the same as the URL being requested
726        (default=no).  This will add the Referer header if none is contained
727        in the original request.
728
729[<URL-SPEC>] referer-self-dir = yes | no
730        Sets the Referer header to the directory name of the URL being
731        requested (default=no).  This will add the Referer header if none is
732        contained in the original request.  This option takes precedence over
733        referer-self.
734
735[<URL-SPEC>] referer-from = yes | no
736        Removes the Referer header based on a match of the referring URL
737        (default=no).
738
739[<URL-SPEC>] force-user-agent = yes | no
740        Forces a User-Agent header to be inserted into all requests that are
741        made by WWWOFFLE (default=no).  This User-Agent is added only if there
742        is not an existing User-Agent header and is set to the value
743        WWWOFFLE/<version-number>.  This header is inserted before censoring
744        and may be changed by the normal header censoring method.
745
746[<URL-SPEC>] pass-url-unchanged = yes | no
747        Forces WWWOFFLE to ignore the requirements on the correct formatting of
748        URLs and to pass through to the server the URL that was passed to it by
749        the browser (default=no).
750
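A sketch (hostname and replacement string are illustrative) removing one
header, replacing another, and enabling referer-self-dir for one site:

```
CensorHeader
{
 From       = yes
 User-Agent = Mozilla/4.0 (compatible)
 <http://*.example.com/*> referer-self-dir = yes
}
```
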
751--------------------------------------------------------------------------------
752
753FTPOptions
754----------
755
756Options to use when fetching files using the ftp protocol.
757
758anon-username = (string)
759        The username to use for anonymous ftp (default=anonymous).
760
761anon-password = (string)
762        The password to use for anonymous ftp (default determined at run
763        time).  If using a firewall then this may contain a value that is not
764        valid to the FTP server and may need to be set to a different value.
765
766<URL-SPEC> auth-username = (string)
767        The username to use on a host instead of the default anonymous
768        username.
769
770<URL-SPEC> auth-password = (string)
771        The password to use on a host instead of the default anonymous
772        password.
773
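For example (placeholder host and credentials):

```
FTPOptions
{
 anon-username = anonymous
 anon-password = user@example.com
 <ftp://ftp.example.com/*> auth-username = alice
 <ftp://ftp.example.com/*> auth-password = secret
}
```
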
774--------------------------------------------------------------------------------
775
776MIMETypes
777---------
778
779MIME Types to use when serving files that were not fetched using HTTP or for
780files on the built-in web-server.
781
782default = (mime-type)/(subtype)
783        The default MIME type (default=text/plain).
784
785.(file-ext) = (mime-type)/(subtype)
786        The MIME type to associate with a file extension.  The '.' must be
787        included in the file extension.  If more than one extension matches
788        then the longest one is used.
789
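An illustrative section; because the longest matching extension wins, a file
named foo.tar.gz gets application/x-gtar rather than the .gz type:

```
MIMETypes
{
 default = text/plain
 .html   = text/html
 .gz     = application/x-gzip
 .tar.gz = application/x-gtar
}
```
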
790--------------------------------------------------------------------------------
791
792Proxy
793-----
794
795This contains the names of the HTTP (or other) proxies to use external to the
796WWWOFFLE server machine.
797
798[<URL-SPEC>] proxy = (host[:port])
799        The hostname and port number to use as the proxy.
800
801<URL-SPEC> auth-username = (string)
802        The username to use on a proxy host to authenticate WWWOFFLE to it.
803        The URL-SPEC in this case refers to the proxy and not the URL being
804        retrieved.
805
806<URL-SPEC> auth-password = (string)
807        The password to use on a proxy host to authenticate WWWOFFLE to it.
808        The URL-SPEC in this case refers to the proxy and not the URL being
809        retrieved.
810
811[<URL-SPEC>] ssl = (host[:port])
812        A proxy server that should be used for https or Secure Socket Layer
813        (SSL) connections.  Note that for the <URL-SPEC> only the host is
814        checked and that the other parts must be '*' WILDCARDs.
815
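A sketch (placeholder firewall host) sending http and ftp requests through an
upstream proxy and using the same host for SSL connections:

```
Proxy
{
 <http://*/*> proxy = firewall.example.com:8080
 <ftp://*/*>  proxy = firewall.example.com:8080
 ssl = firewall.example.com:8080
}
```
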
816--------------------------------------------------------------------------------
817
818Alias
819-----
820
821A list of aliases that are used to replace the server name and path with
822another server name and path.
823
824URL-SPECIFICATION = URL-SPECIFICATION
825        Any requests that match the first URL-SPECIFICATION are replaced by
826        the second URL-SPECIFICATION.  The first URL-SPECIFICATION is a
827        wildcard match for the protocol and host/port, the path must match the
828        start of the requested URL exactly and includes all subdirectories.
829
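For example (placeholder hosts); the path part of the left-hand side must
match the start of the requested URL exactly:

```
Alias
{
 http://mirror.example.com/software/ = http://www.example.com/pub/software/
}
```
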
830--------------------------------------------------------------------------------
831
832Purge
833-----
834
835The method to determine which pages to purge, the default and host-specific
836maximum ages of the pages in days, and the maximum cache size.
837
838use-mtime = yes | no
839        The method to use to decide which files to purge, last access time
840        (atime) or last modification time (mtime) (default=no).
841
842max-size = (size)
843        The maximum size for the cache in MB after purging (default=-1).  A
844        maximum cache size of -1 (or 0 for backwards compatibility) means
845        there is no limit to the size.  If this and the min-free options are
846        both used the smaller cache size is chosen.  This option takes into
847        account the URLs that are never purged when measuring the cache size
848        but will not purge them.
849
850min-free = (size)
851        The minimum amount of free disk space in MB after purging
852        (default=-1).  A minimum disk free of -1 (or 0) means there is no
853        limit to the free space.  If this and the max-size options are both
854        used the smaller cache size is chosen.  This option takes into account
855        the URLs that are never purged when measuring the cache size but will
856        not purge them.
857
858use-url = yes | no
859        If true then use the URL to decide on the purge age, otherwise use the
860        protocol and host only (default=no).
861
862del-dontget = yes | no
863        If true then delete the URLs that match the entries in the DontGet
864        section (default=no).
865
866del-dontcache = yes | no
867        If true then delete the URLs that match the entries in the DontCache
868        section (default=no).
869
870[<URL-SPEC>] age = (age)
871        The maximum age in the cache for URLs that match this (default=14).
872        An age of zero means always to delete, negative means not to delete.
873        The URL-SPECIFICATION matches only the protocol and host unless
874        use-url is set to true. Longer times can be specified with a 'w', 'm'
875        or 'y' suffix for weeks, months or years (e.g. 2w=14).
876
877[<URL-SPEC>] compress-age = (age)
878        The maximum age in the cache for URLs that match this to be stored
879        uncompressed (default=-1).  Requires zlib compilation option.  An age
880        of zero means always to compress, negative means never to compress.
881        The URL-SPECIFICATION matches only the protocol and host unless
882        use-url is set to true. Longer times can be specified with a 'w', 'm'
883        or 'y' suffix for weeks, months or years (e.g. 2w=14).
884
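An illustrative section (placeholder host) limiting the cache to 100 MB, with
a 14-day default age, a longer age for ftp and a shorter age for a frequently
changing news site:

```
Purge
{
 use-mtime = no
 max-size  = 100
 age = 14
 <ftp://*/*> age = 4w
 <http://news.example.com/*> age = 2
}
```
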
885--------------------------------------------------------------------------------
886
887WILDCARD
888--------
889
890A WILDCARD match is one that uses the '*' character to represent any group of
891characters.
892
893This is basically the same as the command line file matching expressions in
894DOS or the UNIX shell, except that the '*' can match the '/' character.
895
896For example
897
898*.gif       matches  foo.gif and bar.gif
899
900*.foo.com   matches  www.foo.com and ftp.foo.com
901
902/foo/*      matches  /foo/bar.html and /foo/bar/foobar.html
903
904--------------------------------------------------------------------------------
905
906URL-SPECIFICATION
907-----------------
908
909When specifying a host and protocol and pathname in many of the sections a
910URL-SPECIFICATION can be used; this is a way of recognising a URL.
911
912For the purposes of this explanation a URL is considered to be made up of five
913parts.
914
915proto          The protocol that is used (e.g. 'http', 'ftp')
916
917host           The server hostname (e.g. 'www.gedanken.org.uk').
918
919port           The port number on the host (e.g. default of 80 for HTTP).
920
921path           The pathname on the host (e.g. '/bar.html') or a directory name
922               (e.g. '/foo/').
923
924args           Optional arguments with the URL used for CGI scripts etc.
925               (e.g. 'search=foo').
926
927For example the WWWOFFLE homepage: http://www.gedanken.org.uk/software/wwwoffle/
928The protocol is 'http', the host is 'www.gedanken.org.uk', the port is
929the default (in this case 80), and the pathname is '/software/wwwoffle/'.
930
931In general this is written as (proto)://(host)[:(port)]/[(path)][?(args)]
932
933Where [] indicates an optional feature, and () indicate a user supplied name
934or number.
935
936Some example URL-SPECIFICATION options are the following:
937
938*://*/*             Any protocol, Any host, Any port, Any path, Any args
939                    (This is the default for options that can have a <URL-SPEC>
940                    prefix when none is specified).
941
942*://*/(path)        Any protocol, Any host, Any port, Named path, Any args
943
944*://*/*?            Any protocol, Any host, Any port, Any path, No args
945
946*://*/(path)?*      Any protocol, Any host, Any port, Named path, Any args
947
948*://(host)          Any protocol, Named host, Any port, Any path, Any args
949
950(proto)://*/*       Named proto, Any host, Any port, Any path, Any args
951
952(proto)://(host)/*  Named proto, Named host, Any port, Any path, Any args
953
954(proto)://(host):/* Named proto, Named host, Default port, Any path, Any args
955
956*://(host):(port)/* Any protocol, Named host, Named port, Any path, Any args
957
958The matching of the host, the path and the args use the WILDCARD matching that
959is described above.  The matching of the path has the special condition that a
960WILDCARD of '/*/foo' will match '/foo' and '/any/path/foo'; in other words it
961matches any path prefix.
962
963In some sections that accept URL-SPECIFICATIONs they can be negated by
964inserting the '!' character before it.  This will mean that the comparison
965of a URL with the URL-SPECIFICATION will return the logically opposite value
966to what would be returned without the '!'.  If all of the URL-SPECIFICATIONs
967in a section are negated and '*://*/*' is added to the end then the sense of
968the whole section is negated.
969
970In all sections that accept URL-SPECIFICATIONs the comparison can be made case
971insensitive for the path and arguments part by inserting the '~' character
972before it.  (The host and the protocol comparisons are always case
973insensitive).
974

README.PWD

1          WWWOFFLE - World Wide Web Offline Explorer - Version 2.6
2          ========================================================
3
4
5This is the logic that the WWWOFFLE program follows when handling requests for
6URLs that have a password in the header or in the URL itself.
7
8
9Background Information
10----------------------
11
121) When a browser first requests a page that is password protected a normal
13   request is sent without a password in it.  This is obvious since there is no
14   way to decide in advance which pages have passwords.
15
162) When a server receives a request for a page that requires authentication, but
17   for which there is none in the request, it sends back a '401 Unauthorized'
18   response.  This contains a "realm" which defines the range of pages over
19   which this username/password pair is valid.  A realm is not a well defined
20   range, it can be any set of pages on the same server, there is no requirement
21   for them to be related, although they normally are.
22
233) When a browser receives a '401' reply it will prompt the user for a username
24   and password if it does not already have one for the specified realm.  If one
25   is already known then there is no need to prompt the user again.
26
274) The request that the browser sends back this time includes in the header the
28   username and password pair, but otherwise the same request as in (1).
29
305) The server now sends back the requested page.
31
326) Some browsers follow steps (1)-(5) for all pages on the server.  Others try
33   to guess the range of pages that are covered by a realm, they then send the
34   username/password pair for all pages in the same directory for example.  This
35   means that they follow steps (3)-(5) and miss out steps (1) and (2) for these
36   pages.
37
38
39WWWOFFLE Implementation
40-----------------------
41
421) If a password is specified in the request then it is handled as if it were in
43   the URL itself.  This means that the spool file name is hashed in the same
44   way as normal, but it contains the username/password.
45
462) A page is always placed in the cache without a username/password for every
47   page that has a username/password.  This ensures that when the page is later
48   requested while offline the version without the password can be sent to
49   prompt the browser.  This solves a problem with browsers that send
50   username/password pairs for all pages: when such a browser is closed and
51   restarted, a request for one of the pages (bookmarked perhaps) would not
52   work, since the page without the username/password would not be present
53   and would instead be requested for later fetching.
54
553) The mode of operation of the WWWOFFLE server is as follows:
56
57URL   = URL without password
58URLpw = URL with password
59
60WWWOFFLES mode - See README
61
62
63WWWOFFLES | Password  |   URL   |  URLpw  | Action to take
64   mode   | provided? | cached? | cached? |
65----------+-----------+---------+---------+-------------------------------------
66  Spool   |    No     |   No    |   n/a   | Request URL (->F)
67  Spool   |    No     |   Yes   |   n/a   | Spool URL
68  Spool   |    Yes    |   No    |   No    | Request URLpw (->F)
69  Spool   |    Yes    |   No    |   Yes   | Spool URLpw, Request URL (->F)
70  Spool   |    Yes    |   Yes   |   No    | if(!401) Spool URL
71  Spool   |    Yes    |   Yes   |   No    | if(401)  Request URLpw (->F)
72  Spool   |    Yes    |   Yes   |   Yes   | if(!401) Spool URL
73  Spool   |    Yes    |   Yes   |   Yes   | if(401)  Spool URLpw
74----------+-----------+---------+---------+-------------------------------------
75  Fetch   |    No     |   n/a   |   n/a   | Get URL
76  Fetch   |    Yes    |   No    |   n/a   | Get URL, if(401) GET URLpw
77  Fetch   |    Yes    |   Yes   |   n/a   | if(!401) Get URL
78  Fetch   |    Yes    |   Yes   |   n/a   | if(401)  Get URLpw
79----------+-----------+---------+---------+-------------------------------------
80  Real    |    No     |   n/a   |   n/a   | Get URL
81  Real    |    Yes    |   No    |   n/a   | Get URL, if(401) Get URLpw
82  Real    |    Yes    |   Yes   |   n/a   | if(!401) Get URL
83  Real    |    Yes    |   Yes   |   n/a   | if(401)  Get URLpw
84----------+-----------+---------+---------+-------------------------------------
85
86The other minor modes (SpoolOrReal, RealPragma etc.) act like the one that they
87are based on.
88
894) When fetching recursively, a supplied username/password is used only on the
90   same server, but for all requests (fetch mode sorts out which need it).
91
925) When a username is supplied but no password (e.g. an FTP URL with the
93   username in the URL) then a page prompting for a password is always returned.
94
956) When the configuration option try-without-password is false (it defaults to
96   true) this behaviour is modified.  If a URL is requested with a password then
97   the existence or not of the same URL without a password is ignored.  The
98   behaviour is then the same as for a request for a page that does not have a
99   password: the decision is based only on the requested page itself.
100
101
102Andrew M. Bishop
10317th September 2000
104

README.URL

1            WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2            ========================================================
3
4
5This is the logic that WWWOFFLE applies when it is parsing URLs.  This is
6complicated by a number of rules that appear in various standards documents and
7the many different places in the program where URLs are processed.  Also
8described is the handling of WWWOFFLE command URLs, using arguments or form
9data.
10
11
12A lot of extra effort has gone into version 2.6 of WWWOFFLE to ensure that
13URL handling is much cleaner and less error prone.  Places where changes have
14been made compared to previous versions of the program are noted.
15
16
17Relevant Standards
18------------------
19
20The RFCs and other relevant documents to this README are:
21
22RFC 1738 - Uniform Resource Locators (URL)
23           Section 2.2 specifies how URL-encoding is performed and why.
24
25RFC 1808 - Relative Uniform Resource Locators
26           This describes how relative URLs are to be handled.
27
28RFC 2396 - Uniform Resource Identifiers (URI): Generic Syntax
29           This describes URLs in more detail and updates RFC 1808 where the
30           "parameters" part of a URL path is concerned.
31
32HTML 4   - The HTML 4.0 specification from the World Wide Web Consortium.
33           Section 17.13.3 specifies how URLs for HTML form data are created.
34
35
36URL Format in WWWOFFLE
37----------------------
38
39In WWWOFFLE all URLs are held in a type named URL, which is a typedef for a
40structure that contains the information; the type is defined in misc.h.  All
41manipulation of URL information is performed using this URL type, and the
42conversion from string to URL is almost the first thing done on incoming requests.
43
44The general structure for a URL in WWWOFFLE is the following:
45
46<protocol>://[<username>:<password>@]<hostname>[:<port>]/<pathname>[?<arguments>]
47
48
49Because of the different ways in which URLs may be encoded it is possible
50for more than one string to refer to the same URL object.
51
52The most common example is URL-encoding where characters can be replaced by
53their hexadecimal form following a '%' character.  For example ':' is equivalent
54to '%3a'.  The process of URL-decoding is unambiguous: any URL-encoded string
55can be decoded to give a usable representation.  The process of URL-encoding is
56ambiguous because different sets of characters need to be URL-encoded for
57different parts of the URL.  In addition data that results from a POSTed form or
58the arguments to a URL uses a modified version of URL-encoding where the space
59character is replaced by a plus sign.
60
61The nature of URL encoding and decoding means that it must be performed at the
62correct times on the correct data.  If URL encoding is performed twice on the
63same data then errors occur since the '%' characters inserted by the first
64encoding will themselves be encoded the second time.  Similar arguments apply to
65decoding multiple times.
66
67
68
69String to URL Conversion
70------------------------
71
72A string that consists of an unparsed URL is converted to the URL type by
73calling the SplitURL() function.  This will parse the string into a URL datatype
74and return it.  One part of the URL datatype is a canonical form of the URL that
75is used often in the subsequent processing.
76
77The rules that apply to this process are the following (all parsing is done with
78heuristics to handle malformed URLs):
79
80protocol
81- - - -
82
831) If there is no protocol part then protocol=http.
84
852) In other cases a protocol is extracted from the string.
86
87   a) The protocol is not case sensitive and to avoid confusion it is converted
88      to lower case.
89
90
91username & password
92- - - - - - - - - -
93
941) If there is no username and password part then username=NULL, password=NULL.
95
962) In other cases a username and/or password is extracted from the string.
97
98   a) In RFC 1738 Section 3.1 it is specified that the characters '@', ':' and
99      '/' in the username or password must be URL-encoded.
100
101   b) The strings for the username and password are converted using
102      URLDecodeGeneric().
103
104   c) The strings for the username and password are converted using
105      URLEncodePassword() before being put back into the canonical URL string.
106
107
108hostname
109- - - -
110
1111) If the first character is '/' then it is a local URL and hostname=LocalHost
112   (the entry in wwwoffle.conf).
113
1142) In other cases a hostname is extracted from the string.
115
116   a) The hostname is not case sensitive and to avoid confusion it is converted
117      to lower case.
118
119
120port
121- -
122
123It should be noted that the port number is not considered as a separate entity,
124but is part of the hostname in the decoded URL.
125
1261) If no port is specified then nothing is done.
127
1282) If the default port for the protocol is specified (e.g. 80 for http) then the
129   port is removed.
130
1313) In other cases the port number in the string is kept.
132
133
134pathname
135- - - -
136
1371) If no path part is given then pathname='/'.
138
1392) In other cases a pathname is extracted from the string.
140
141   a) The pathname may have been URL-encoded in different ways by different user
142      agents.  A canonical format is required in WWWOFFLE since it is used to
143      form the cache filename.
144
145   b) The pathname is converted using URLDecodeGeneric().
146
147   c) The pathname is converted using URLEncodePath().
148
149
150parameters
151- - - - -
152
153| The handling of the "parameters" part of a URL (as described in RFC 1808 and
154| RFC 2396) is changed in version 2.9; they are now considered part of the path.
155| Between version 2.6 and version 2.8 the "parameters" part of the path was
156| handled as separate from the path itself and only one "parameter" was allowed.
157| In RFC 2396 it is made clear that although the parameters are separate from
158| the path they are handled in exactly the same way as the path component that
159| they are attached to.
160
161
162arguments
163- - - - -
164
165The name 'arguments' is my own; the same thing is called 'query' in the RFCs.
166
1671) If no arguments are given then arguments=NULL.
168
1692) In other cases the arguments are extracted from the string.
170
171   a) The arguments may have been URL encoded in different ways by different
172      user agents.  A canonical format is required in WWWOFFLE since it is used
173      to form the cache filename.
174
175   b) The arguments are converted to canonical form using URLRecodeFormArgs().
176
177   c) The arguments may have used '&amp;' in place of '&' since the former is
178      valid HTML for an href and the latter is not.  A replacement is performed
179      to replace '&amp;' with '&'.
180
181
182| This is a change since version 2.5; previously no decoding/encoding of the
183| arguments was performed.  This led to the problem where the same URL could
184| be referred to by different names due to URL encoding differences.
185
186| This is a change since version 2.8; previously no replacement of '&amp;'
187| with '&' was performed.
188
189
190
191URL to String Conversion
192------------------------
193
194In most places in the program explicit URL to string conversion is not required
195since the String to URL conversion will have created a canonical string version
196of the current URL.  In places that a new URL needs to be created from nothing
197care is taken to ensure that it is valid, either by inspection or by performing
198string to URL conversion and using the string contained in the result.
199
200
201
202Using URL Arguments or POSTed Form Data
203---------------------------------------
204
205Many of the indexes and other sub-pages that are generated by WWWOFFLE contain
206information that is encoded in the arguments to the URL or returned using the
207POST method in a form.
208
209The format of the arguments to a URL or of form data is as follows (where '&'
210and ';' are interchangeable):
211
212<key1>=<data1>&<key2>=<data2>&<key3>=<data3>&...
213
214Each of the keys and the data may be URL encoded (since the arguments as a whole
215are URL encoded except for the characters '&', ';' and '=').  This means that
216they must be URL decoded before they can be used.
217
218A function called SplitFormArgs() is used to split up the string into a list of
219the separate <key>=<data> strings.  No URL-encoding/decoding is performed in
220this function since it is assumed that arguments will already have been recoded
221(see above) and that form data can be recoded using URLRecodeFormArgs() before
222it is used.
223
224When a URL is passed to a WWWOFFLE function in the arguments of a URL or in a
225form then it will have been URL-encoded.  It must therefore be decoded before it
226is used.  Care needs to be taken to ensure that this is performed correctly or
227URLs will be corrupted.
228
229
230
231----------------
232Andrew M. Bishop
23313 March 2001
234

README.compress

1            WWWOFFLE - World Wide Web Offline Explorer - Version 2.8b
2            =========================================================
3
4
5One feature that has often been requested in WWWOFFLE is compression.  This is
6either for compression from servers on the internet or for compression of files
7in the cache.  Since adding compression of any sort is a big step I have
8implemented it for both these cases and also from WWWOFFLE to the client.
9
10The compression options are selectable at compile time and the individual
11options are chosen using the WWWOFFLE configuration file.  This means that if
12you are not interested in compression, or the compression library is not
13available then the rest of the WWWOFFLE functions are still available.  If the
14program is compiled with the compression library then you are not forced to use
15it.
16
17
18zlib
19----
20
21The simplest way of adding the compression functionality to WWWOFFLE is by
22compiling with zlib.  This will provide support for the deflate and gzip
23compression methods.
24
25The zlib README file describes the programs as:
26
27    zlib 1.1.3 is a general purpose data compression library.  All the code
28    is thread safe.  The data format used by the zlib library
29    is described by RFCs (Request for Comments) 1950 to 1952 in the files
30    ftp://ds.internic.net/rfc/rfc1950.txt (zlib format), rfc1951.txt (deflate
31    format) and rfc1952.txt (gzip format).
32
33The zlib library is not GPL software (like WWWOFFLE is), but the copyright file
34for it says:
35
36    Copyright (C) 1995-1998 Jean-loup Gailly and Mark Adler
37
38    This software is provided 'as-is', without any express or implied
39    warranty.  In no event will the authors be held liable for any damages
40    arising from the use of this software.
41
42
43The zlib library adds different types of functions for the different compression
44methods.
45
46For deflate/inflate there are functions that will take a block of memory and
47compress (deflate) or uncompress (inflate) the contents.  The block of memory is
48considered as part of a large compressed block and therefore the output of the
49compression function depends on the previous inputs.  There are also
50miscellaneous functions that are used to initialise the compression functions,
51to flush out the data at the end and to finish with the compression functions.
52
53For gzip/gunzip there are functions that can open a compressed file and read
54from it (uncompression) or write to it (compression).  All of the gzip/gunzip
55compression/uncompression functions are based on file compression/uncompression
56and not in-memory compression/uncompression.
57
58
59The non-availability of in-memory block-by-block gzip/gunzip functions is a
60problem since WWWOFFLE needs to be able to compress data as it is flowing from
61the server to the client.  A temporary file could be used, but this is not a
62good solution in general.  To work around this problem the gzip/gunzip source
63code in the zlib library was examined and their operation (using the deflate and
64inflate algorithms with some extra fiddling at start and end) was implemented as
65in-memory block-by-block functions.
66
67
68Compressed Files From Server To WWWOFFLE
69----------------------------------------
70
71HTTP/1.1 and Content Negotiation
72- - - - - - - - - - - - - - - -
73
74For the compression functions to be worth having on the link from the server to
75the client there must be servers that support the function.
76
77Fortunately the HTTP/1.1 standard defines a mechanism by which clients can
78indicate to servers that they will accept compressed data.  The servers reply
79with compressed data and indicate the method by which the data has been
80compressed.  This is a specific instance of the content negotiation functions of
81HTTP/1.1.
82
83Unfortunately the definition of HTTP/1.1 and content encoding leads to ambiguous
84results.
85
86
87Theory
88- - -
89
90The way that it works is the following (in theory) for gzip compression.
91
921) The client makes request with a header of 'Accept-Encoding: gzip'
93   this means that it can handle a gzipped version of the URL data.
94
952) The server looks at the request and supplies a 'Content-Encoding: gzip'
96   reply and a compressed version of the data for the requested URL.
97
983) The client receives the data, sees the 'Content-Encoding: gzip'
99   header and decompresses the data before using it.
100
101An important remark to make at this point is that the HTTP standard defines the
102'Content-Encoding' header to apply to the complete link from server to client
103through any proxies.  It is not intended to apply separately for the server to
104proxy and proxy to client links.  There is a 'Transfer-Encoding' header for
105this, but it is not generally used.
106
107
108Problem 1
109- - - - -
110
111The use of compression is fairly rare and there are problems with clients, even
112without the use of WWWOFFLE.
113
114For example Netscape version 4.76 will ask for gzip compressed data and will
115display the HTML fine.  The problem is that if the images in the page are also
116sent compressed then they are displayed as the 'broken image' icon.  If you view
117any single image from the page then it is OK.  This indicates to me that the
118browser knows how to handle gzipped data for the HTML page and for the images,
119but not for images inside a page!
120
121Mozilla version M18 works fine with the same page and same images, so it must be
122a client problem.
123
124
125Problem 2
126- - - - -
127
128When a request is sent for a URL that is naturally compressed then even if no
129'Accept-Encoding' header is sent the data comes back with a 'Content-Encoding'
130header.  So for example if a user requests the URL http://www.foo/bar.tar.gz
131then the data comes back gzipped with a 'Content-Encoding: gzip' header.  If the
132user saves the file from the browser then he expects that it is saved to a file
133called bar.tar.gz and that it contains a compressed tar file.
134
135The problem here is that WWWOFFLE adds in an 'Accept-Encoding' header on all
136requests and decompresses the ones that come back with a 'Content-Encoding'
137header and removes the header.  This just breaks what I have described above:
138the file bar.tar.gz that the browser writes out is actually a tar file and not a
139compressed tar file.  WWWOFFLE has no way to know whether the data that was
140requested was compressed in its natural form or the compression was added as
141part of the content negotiation.
142
143
144Problem 3
145- - - - -
146
147When WWWOFFLE is used and it performs the uncompression for the client then this
148can also cause problems.
149
150The Debian Linux package manager program 'apt-get' requests files called
151Packages.gz from the server.  If WWWOFFLE uncompresses these and sends them to
152the client uncompressed then apt-get fails because the file is not compressed
153like it expects.
154
155Solution
156- - - -
157
158The only solution that I can see is that WWWOFFLE does not decompress any files
159that it thinks might be naturally compressed, e.g. *.gz files.
160
161This means that the configuration file for WWWOFFLE needs to contain a list of
162files that it does not request compressed and does not try to decompress.
163
164
165Problem 4
166- - - - -
167
168Due to the browser problems quoted above (Problem 1) there are servers that will
169only send compressed content to browsers that they know will accept it.  This
170relies on the User-Agent header that the browser sends in the request.
171
172The problem here is that when people hide the browser that they are using by
173changing the User-Agent (either in the browser or using the WWWOFFLE
174CensorHeader options) the compression may not be performed.
175
176One server that does this is www.google.com which only sends compressed data to
177clients that it thinks can handle it.
178
179Solution
180- - - -
181
182There are two solutions here, either the user has to choose a fake User-Agent
183that will work (but there is no list of those and different servers may use
184different ones) or the server needs to be modified.
185
186
187Compressed Cache
188----------------
189
190The problems described above with ambiguity in the meaning of the
191'Content-Encoding' header also cause problems with the compressed cache.
192
193
194Problem 1
195- - - - -
196
197If the file is stored in the cache with a 'Content-Encoding' header then
198WWWOFFLE would need to decide if it should decompress it before sending it to
199the browser.  This needs the same list of files not to compress that is
200mentioned in the solution above.
201
202
203Solution 1
204- - - - -
205
206Two solutions to this problem present themselves.
207
2081) Make the 'Content-Encoding' something that WWWOFFLE will recognise as being
209compressed by itself.  For example 'Content-Encoding: wwwoffle-deflate' could be
210used to indicate files that WWWOFFLE compressed in the cache and that need to be
211uncompressed when they are read out again.
212
2132) Add another header into the cached file and use a standard content encoding.
214There is a header called 'Pragma' that can be added to any HTTP header; its
215meaning can be defined by the user, and unrecognised headers should be ignored.
216
217The first option is the simplest, but leaves the cache files in a non-standard
218format.  The second option means that the file itself is still a valid HTTP
219header followed by data.
220
221The second option is the one that is implemented.
222
223
224Problem 2
225- - - - -
226
227Another problem with the compressed cache format is that many files are already
228compressed.  For example images will nearly always be compressed (GIF, JPEG and
229PNG all include compression) and will not benefit from being compressed again.
230
231
232Solution 2
233- - - - -
234
235As for the solution listed for the server transfer problem a list of files not
236to compress in the cache is needed.  In this case since the file exists in the
237cache already it is possible to add a list of MIME types not to be compressed,
238e.g. image/jpeg.
239
240
241Compressed Files From WWWOFFLE To Client
242----------------------------------------
243
244Now that the problems of the previous two cases have been examined this one is
245largely solved.  The list of MIME types that is used for the cache compression
246is also used to decide if it is worth compressing the file sent to the browser.
247
248
249Problems with Compression Formats
250---------------------------------
251
252Problems with the format of data sent from servers to WWWOFFLE have caused a
253variety of problems.
254
255The format of data that is normally sent back from servers when deflate
256compression is requested is not what is described in RFC 2616, the HTTP/1.1
257specification.  The format of the data in this case should have a 2 byte header
258and 4 byte trailer (as described in RFC 1950) around the deflated data (as
259described in RFC 1951).  The common format that is used is that the extra header
260and trailer are not sent, just the deflated data.
261
262If this is the de-facto standard on the internet then it would not be a problem
263and WWWOFFLE could request deflated data and not have a problem reading it.
264Unfortunately it is not this simple; there is still the possibility of receiving
265the correct zlib formatted data.  There are also web servers that are even worse
266because they send back a 10 byte gzip header followed by the 2 byte zlib header
267and then the deflated data.
268
269The only solution to this is that WWWOFFLE waits for the first few bytes of data
270to be received and then makes a choice about the format based on what it sees.
271This is the approach that is now taken in version 2.8 of WWWOFFLE: the first 16
272bytes of data are accumulated and then a decision is made.
273
274
275Andrew M. Bishop
2769 December 2003
277

README.htdig

1          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2          ========================================================
3
4
5The program ht://Dig is a free (GPL) internet indexing and search program.  The
6ht://Dig documentation describes itself as follows:
7
8        The ht://Dig system is a complete world wide web indexing and
9        searching system for a small domain or intranet. This system
10        is *not* meant to replace the need for powerful internet-wide
11        search systems like Lycos, Infoseek, Webcrawler and AltaVista.
12        Instead it is meant to cover the search needs for a single
13        company, campus, or even a particular sub section of a web site.
14
15        As opposed to some WAIS-based or web-server based search
16        engines, ht://Dig can span several web servers at a site.  The
17        type of these different web servers doesn't matter as long as
18        they understand the HTTP 1.0 protocol.
19
20        ht://Dig was developed at San Diego State University as a way
21        to search the various web servers on the campus network.
22
23
24I have written WWWOFFLE so that ht://Dig can be used with it to allow the
25entire cache of pages to be indexed.  There are three stages to using the
26program that are described in this document; installation, digging and
27searching.
28
29
30Getting ht://Dig
31----------------
32
33ht://Dig is available from the web site
34
35        http://www.htdig.org/
36
37You need to have version 3.1.0b4 or later of htdig.
38
39No special compile-time configuration of htdig is required to be able to use it
40with WWWOFFLE.
41
42
43I tested with version 3.1.6 using the official Debian package.
44
45
46Configure ht://Dig to run with WWWOFFLE
47---------------------------------------
48
49If you have already got ht://Dig installed on your system, for example as part
50of a Linux distribution, then you may need to make some changes to the
51configuration files.  The problem is that ht://Dig sets some of the parameters
52at compile time in the HTML templates that it uses.  This makes it impossible to
53use the same "common files" with more than one search path on the same system.
54
55Using WWWOFFLE to run ht://Dig will mean that the base URL for the htsearch form
56and all images is '/search/htdig/'.  If the configuration file has a different
57compiled-in value (often '/htdig/') then the images will not be found.  The
58changes that you need to make are the following:
59
60In htsearch.conf (in /var/spool/wwwoffle/search/htdig/conf) you need to add the
61following lines (I have already done it in the default config file):
62
63allow_in_form: image_url_prefix
64image_url_prefix: /search/htdig
65
66The HTML template files that htsearch uses are called footer.html, header.html,
67nomatch.html, syntax.html and wrapper.html.  They will be installed in different
68places depending on where your version of ht://Dig came from (for Debian
69GNU/Linux they are in /etc/htdig).  You need to replace image references like:
70
71<img src="/htdig/htdig.gif" ...>
72
73with
74
75<img src="$(IMAGE_URL_PREFIX)/htdig.gif" ...>
76
77Unfortunately making this change will mean that the template files will no
78longer work with any other ht://Dig database that uses them.  You could make a
79copy of the template files somewhere else and modify the config file to
80reference them.
81
82Obviously if you are going to make this change then you may as well just edit
83the files and not bother with the $(IMAGE_URL_PREFIX) variable but use
84'/search/htdig' instead.
85
86
87Configure WWWOFFLE to run with ht://Dig
88---------------------------------------
89
90The configuration files for the ht://Dig programs as used with WWWOFFLE will
91have been installed in /var/spool/wwwoffle/search/htdig/conf when WWWOFFLE was
92installed.  The scripts used to run the htdig programs will have been installed
93in /var/spool/wwwoffle/search/htdig/scripts when WWWOFFLE was installed.  In
94both these cases the directory /var/spool/wwwoffle can be changed at compile
95time with options to the configure script.
96
97These files should be correct if the information at the time of running
98configure was set correctly.  Check them; they should have the spool directory
99and the proxy hostname and port set correctly.
100
101Also they should be checked to ensure that the ht://Dig programs are on the path
102(you can edit the PATH variable here if they are not in /usr/local/bin).  The
103merging process can use a lot of disk space when the sort program is run, you
104can change the location of the temporary directory used for this with the TMPDIR
105variable.
106
107
108The Fuzzy Database
109------------------
110
111The ht://Dig programs use a database of fuzzy word endings and synonyms.  This
112needs to be created just once; there is a script provided with WWWOFFLE that
113does this.
114
115        /var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htfuzzy
116
117If you have an existing ht://Dig installation then this step will probably have
118already been performed and is not required again.
119
120Note: When you do this it will take a *long* time since it produces two
121      databases that htsearch uses to help in matching words.
122
123
124Digging and Merging
125-------------------
126
127Digging is the name that is given to the process of searching through the
128web-pages to make the list of words.  Merging is the process of converting the
129raw list of words into a database that can be searched.
130
131The ht://Dig installation will include a script called 'rundig' that
132demonstrates how digging and merging is supposed to work.  To work with WWWOFFLE
133I have produced my own scripts that should be used instead.
134
135        /var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-full
136        /var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-incr
137        /var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htdig-lasttime
138
139The first of these scripts will do a full search and index all of the URLs in
140the cache.  The second one will do an incremental search and will only index
141those that have changed since the last full search was done.  The third will add
142the files in the lasttime index to the database.
143
144Unfortunately due to the way that the htmerge program works, it will take almost
145as long to do an incremental search or a lasttime search as to do a full search.
146The only difference is that for the incremental search and lasttime search the
147WWWOFFLE cache is only accessed for the files that have changed.
148
149If you cannot get htdig version 3.1.6 or 3.2.0 to index any pages then try
150removing the line in the file /var/spool/wwwoffle/html/en/robots.txt that says
151'Disallow: /index' since it triggers a bug in htdig that stops it searching
152properly.
153
154
155Searching
156---------
157
158The search page for ht://Dig is located at http://localhost:8080/search/htdig/
159and is linked to from the "Welcome Page".  The word or words that you want to
160search for should be entered here.
161
162This form actually calls the script
163
164        /var/spool/wwwoffle/search/htdig/scripts/wwwoffle-htsearch
165
166to do the searching so it is possible to edit this to modify it if required.
167
168
169Thanks to
170---------
171
172I would like to thank the htdig maintainer (Geoffrey.R.Hutchison@williams.edu)
173for the help that he has provided to get me started with htdig and the patches
174and comments that he has accepted from me into the htdig program.
175
176
177Andrew M. Bishop
1786th Aug 2001
179

README.https

1            WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2            ========================================================
3
4
5This README describes how HTTPS connections work, the relationship of HTTPS to
6SSL/TLS and how certificates work.  More detail can be found in lots of places
7on the internet.
8
9The HTTPS protocol is a secure version of the HTTP protocol that is used for
10transmitting web pages from the server to the browser.  To allow the server to
11distinguish between them a different port number is used.  By default HTTP will
12use port 80 and HTTPS will use port 443.  WWWOFFLE will use port 8080 by default
13for HTTP connections and when compiled with the GNUTLS library it will use port
148443 by default for HTTPS connections.
15
16An HTTPS connection is just an HTTP connection but using SSL (Secure Socket
17Layer) or TLS (Transport Layer Security) to encrypt and authenticate the data
18between the server and the client.  To be able to make a secure connection
19there are various things that must be performed and options that can be used on
20top of this for extra security.
21
22
23Essential Security
24------------------
25
The essential feature of a secure connection is that the server and client can
exchange data without anybody else being able to find out what the data is.
This means that even though the data that is transmitted goes through an
insecure network it has been encrypted so that nobody else can decode it.  The
link must also be protected against an attacker being able to modify the data,
either by replacing the real data, adding more data or deleting data.
32
33To establish a secure link the steps that are taken by the server and the client
34are the following:
35
36    Agree on an encryption algorithm that both understand.
37    Exchange a secret key to use with the encryption.
38
39The exchange of the encryption key must be performed so that anybody that has
40visibility of the data being transmitted cannot determine the key.  There are
41several methods for doing this, Diffie-Hellman key exchange is one, RSA
42public/private key encryption is another.
43
44After these steps the data on the link can be encrypted with the chosen
45algorithm and the chosen secret key.
46
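As a rough illustration of the key exchange idea, here is a toy
Diffie-Hellman run using bash arithmetic with deliberately tiny numbers.
Real implementations use numbers hundreds of digits long; this sketch only
shows why both sides end up with the same secret.

```shell
# Toy Diffie-Hellman key exchange -- NOT secure, illustration only (bash).
# Public parameters known to everybody: a prime p and a generator g.
p=23; g=5

# Each side picks a private number and publishes g^private mod p.
a=6;  A=$(( (g ** a) % p ))     # client publishes A
b=15; B=$(( (g ** b) % p ))     # server publishes B

# Each side combines the other's public value with its own private number.
# Both results equal g^(a*b) mod p, but an eavesdropper who only sees
# p, g, A and B cannot easily compute it.
client_secret=$(( (B ** a) % p ))
server_secret=$(( (A ** b) % p ))
echo "client: $client_secret  server: $server_secret"
```

Both sides print the same value, which becomes the secret key used with
the agreed encryption algorithm.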
47
48Server Authentication
49---------------------
50
51An optional feature that is generally used for HTTPS is to have the server
52authenticate itself to the client so that the client can be sure that the server
53is the one it claims to be.  This uses certificates and public/private key
54cryptography to allow certificates to be signed by trusted certificate
55authorities.  If the browser trusts the certificate authority then it can also
56trust any certificate that has been signed by that certificate authority.
57
58
59Optional Client Authentication
60------------------------------
61
62As a further check on the link the server can require that the client
63authenticates itself.  This is supported in most browsers although it is rarely
64used.  Generally only bank or government HTTPS servers require this type of
65authentication.  A certificate must be loaded into the browser to identify the
66user when connected to the server.  The authentication is performed in the same
67way that the server authenticates itself as described above.
68
69
70
71Certificates
72------------
73
74With SSL and TLS the certificates that are used are defined as X509 objects.
75This X509 object contains information about who the certificate represents (the
76name, and location of the entity), the validity of the certificate (start date
77and end date), a version number and serial number and what the certificate is
78for.
79
80For each certificate there is also a public and private key that is created at
81the same time.  The public key is included in the certificate and the
82certificate contains information that can only be created by somebody with
83knowledge of the private key.  Using the public key it is possible to verify
84that the certificate really does belong to the person that knows the private
85key.
86
87The certificate will also be signed by a certificate authority to make it
88useful.  When a certificate has been signed by a certificate authority the
89contents cannot be changed.  If the certificate was changed then this would
90invalidate the certificate authority's signature.  The certificate authority's
91signature can be checked by using the certificate authority's public key.  All
92that an unsigned certificate proves is that the person that created it knows the
93private key.  It does not prove that the person that created it is who they say
94they are.
95
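To see these X509 fields for yourself you can create a throwaway
self-signed certificate with the openssl command line tool and print its
contents.  The file names and the subject name are invented for the
example; any recent openssl should accept these options.

```shell
# Make a throwaway key and self-signed certificate (no passphrase).
openssl req -x509 -newkey rsa:2048 -nodes \
        -keyout /tmp/demo-key.pem -out /tmp/demo-cert.pem \
        -days 30 -subj "/CN=demo.example.org" 2>/dev/null

# Print the X509 fields discussed above: the subject (who the certificate
# represents), the validity dates and the serial number.
openssl x509 -in /tmp/demo-cert.pem -noout -subject -dates -serial
```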
96
97Creating Certificates
98---------------------
99
100A certificate authority will create a master certificate whose usage is
101certificate signing and will then sign that certificate themselves.  This self
102signed certificate is important since it is the root of the chain of trust that
103ends in the browser.  The certificate authority must make sure that this
104certificate is made publicly available, and in particular that it is included
105with all browsers.
106
107The operator of a server who wants to be trusted by its users will create their
108own certificate with the information about themselves and set the usage of the
109certificate to be encryption.  This certificate will be submitted to a
110certificate authority who will sign it with their private key that belongs to
111the master certificate.  This signed certificate is then returned to the server
112operator for use.
113
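The signing flow described above can be sketched with openssl.  All of the
names are invented for the example, and a real certificate authority
obviously keeps its private key to itself rather than running both halves
on one machine like this.

```shell
cd "$(mktemp -d)"

# 1. The certificate authority creates its self signed master certificate.
openssl req -x509 -newkey rsa:2048 -nodes -keyout ca-key.pem \
        -out ca-cert.pem -days 365 -subj "/CN=Example CA" 2>/dev/null

# 2. The server operator creates a key and a certificate signing request.
openssl req -newkey rsa:2048 -nodes -keyout server-key.pem \
        -out server.csr -subj "/CN=www.example.org" 2>/dev/null

# 3. The CA signs the request with the private key of its master
#    certificate and returns the signed certificate.
openssl x509 -req -in server.csr -CA ca-cert.pem -CAkey ca-key.pem \
        -CAcreateserial -out server-cert.pem -days 90 2>/dev/null

# 4. Anybody holding the CA certificate can now verify the server's.
openssl verify -CAfile ca-cert.pem server-cert.pem
```

The final command succeeds because the chain of trust leads from the
server certificate back to the self signed CA certificate.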
114
115Trust
116-----
117
118The server can use this signed certificate during HTTPS connections.  The browser
119will be able to use the certificate authority's public key contained in the
120certificate authority's certificate to verify the identity of the server.  This
121process relies on the browser trusting the certificate authority and the
122certificate authority trusting the server operator.
123
124To see the list of certificate authorities that are included in Firefox (version
1251.0x was current at the time of writing this) select from the menu Edit ->
126Preferences (or Tools -> Options on MS Windows).  In the dialog box select the
127"Advanced" tab and go down to "Certificates" and select "Manage Certificates".
128In the certificate manager dialog box select "Authorities".  There will be a
129long list of certificates, many of these are certificate authorities for signing
130other server certificates.
131
132When you make a connection to a server that has a certificate signed by any of
133these certificate authorities then your browser will not give you a warning.  If
134the server that you visit has a certificate that is signed by a different
135certificate authority then the browser will warn you.  This warning only means
136that you cannot be sure who is operating the server that you have connected to.
137The data that is transmitted between you and the server will be encrypted even
138if the certificate is not trusted.
139
140If there are any certificate authorities that are listed in your browser that
141you do not trust then there is the risk of sending data to the wrong server.
142When a server sends you a certificate during an HTTPS connection it will be
143accepted by the browser if it validates with any of the loaded certificate
144authorities.  If you do not trust this certificate authority then the server may
145not be who it says it is.
146
147If any website ever asks you to load in a trusted certificate then be very
148careful that it is not a certificate authority certificate.  (You can identify a
149certificate authority certificate because it will have a usage for certificate
150signing.)  If it is then you must trust the person giving it to you because this
151certificate could be used to sign any other certificate and circumvent the normal
152trust mechanism.
153
154
155WWWOFFLE
156--------
157
158If you have enabled GNUTLS in WWWOFFLE then you have enabled some HTTPS
159capability.
160
161The WWWOFFLE server will create its own certificate authority certificate and a
162certificate for each server name that the server can be accessed by.  These can
163be identified by the name information in the certificate that will contain the
164word "WWWOFFLE" very clearly.
165
166You can either load the certificate authority certificate from WWWOFFLE into
167your browser (see the warning above about loading in new certificate authority
168certificates) or you can accept the server certificate when your browser warns
169you about it.
170
171It is possible to use URLs like https://localhost:8443/ instead of
172http://localhost:8080/ for access to the WWWOFFLE internal pages.
173
174If you decide to allow it then you can cache HTTPS connections by abusing the
175trust relationship between the browser and the certificate authority.  When
176accessing HTTPS servers through WWWOFFLE and caching the results WWWOFFLE will
177re-encrypt the data from the server and send it to the browser using a fake
certificate.  If you configure your browser to trust WWWOFFLE as a certificate
authority then the fake certificate from WWWOFFLE will be trusted just as if it
were the real one from the original server.
181
182
183
184Andrew M. Bishop
1857th Jan 2006
186

README.hyperestraier

1            WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2            ========================================================
3
4
5The program Hyper Estraier is a free (LGPL) text indexing and search program.
6
7        Hyper Estraier is a full-text search system.  You can search lots of
8        documents for some documents including specified words.  If you run a
9        web site, it is useful as your own search engine for pages in your site.
10        Also, it is useful as search utilities of mail boxes and file
11        servers.
12
13Hyper Estraier can be used to search files of many different sorts, including
14HTML files.  There are lots of ways that the files can get into Hyper Estraier,
15but the one used with WWWOFFLE is to read the cache directly.
16
17
18Getting Hyper Estraier
19----------------------
20
21Hyper Estraier is available from the web site
22
23      http://hyperestraier.sourceforge.net/
24
25You need to have version 0.5.7 or later to have support for indexing the
WWWOFFLE cache.  (Versions back to 0.5.3 worked, but did not index all cached
pages.)
28
29No special compile-time configuration of Hyper Estraier is required to be able
30to use it with WWWOFFLE.
31
32
33I tested with version 0.9.0 using the official Debian package.
34
35
36Configure WWWOFFLE to run with Hyper Estraier
37---------------------------------------------
38
39The configuration files for the Hyper Estraier programs as used with WWWOFFLE
40will have been installed in /var/spool/wwwoffle/search/hyperestraier/conf when
41WWWOFFLE was installed.  The scripts used to run the Hyper Estraier programs
42will have been installed in /var/spool/wwwoffle/search/hyperestraier/scripts
43when WWWOFFLE was installed.  In both these cases the directory
44/var/spool/wwwoffle can be changed at compile time with options to the configure
45script.
46
47These files should be correct if the information at the time of running
48configure was set correctly.  Check them, they should have the spool directory
49set correctly.
50
51Also they should be checked to ensure that the Hyper Estraier and wwwoffle-ls
52programs are on the path (you can edit the PATH variable here if they are not in
53/usr/local/bin).
54
55
56Indexing
57--------
58
59Indexing is the name that is given to the process of searching through the
60web-pages to make the search database.
61
To work with WWWOFFLE I have produced my own scripts that should be used
to call the Hyper Estraier indexer program (estcmd).
64
65   /var/spool/wwwoffle/search/hyperestraier/scripts/wwwoffle-estcmd-full
66
67This script will do a full search and index all of the HTTP URLs in the cache.
68
69
70Searching
71---------
72
73The search page for using Hyper Estraier with WWWOFFLE is
74http://localhost:8080/search/hyperestraier/ and is linked to from the "Welcome
75Page".  The word or words that you want to search for should be entered here.
76
77This form actually calls the script
78
79   /var/spool/wwwoffle/search/hyperestraier/scripts/wwwoffle-estseek
80
81to do the searching so it is possible to edit this to modify it if required.
82
83
84Thanks to
85---------
86
87Thanks to Mikio Hirabayashi <mikio@users.sourceforge.net> for writing the Hyper
88Estraier and making it work with WWWOFFLE.
89
90
91
92Andrew M. Bishop
935th Sep 2005
94

README.lang

1                      WWWOFFLE TRANSLATIONS - Version 2.8
2                      ===================================
3
4There are several non-English language options for WWWOFFLE.  These consist of
5translated documentation and WWWOFFLE message files.
6
All of the language files are installed when WWWOFFLE is installed.  These are
(or can be) the HTML messages used by WWWOFFLE, HTML documentation and plain
text documentation.
10
11The HTML files are selected at runtime based on the browser language settings.
12The default language can be selected by changing the 'default' symbolic link in
13the html directory.
14
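For example, the default could be switched to German by re-pointing the
symbolic link.  This is shown in a scratch directory; on a real
installation the directory is the html directory under the spool area,
normally /var/spool/wwwoffle/html.

```shell
# Scratch-directory demonstration of switching the default language.
HTMLDIR=$(mktemp -d)
mkdir "$HTMLDIR/en" "$HTMLDIR/de"

ln -sfn de "$HTMLDIR/default"   # German becomes the default
readlink "$HTMLDIR/default"     # prints: de
```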
15
16Notes for translators
17---------------------
18
19When making a translation of the WWWOFFLE messages and documentation priority
20should be given to the most important files.  The order that I would recommend
21that you translate the files is as follows:
22
23        html/messages/*
24
25        doc/README.CONF         (automatically generates html/README.CONF.html,
26                                 the configuration editing pages and more.)
27
28        html/Welcome.html
29        doc/FAQ                 (automatically generates html/FAQ.html)
30
31        doc/README
32        doc/INSTALL
33
34To automatically update the html/FAQ.html and html/README.CONF.html files from
35the plain text versions run 'make' in the doc directory.
36
37From version 2.8 I have changed all of the HTML files to use the HTML 4.01
38Transitional DTD (look at one of the English messages DOCTYPE header).  I have
39also been through all of the pages to ensure that they produce valid HTML.
40
41
42German = de
43-----------
44
45WWWOFFLE Version: 2.7e
46
47        Nicolai Lissner <nlissne@linux01.gwdg.de>
48
49Previous by: Jens Benecke <jens@jensbenecke.de>
50
51Original by: Nicolai Lissner <nlissne@linux01.gwdg.de>
52
53
54        doc/CHANGES.CONF        (most recent information only)
55        doc/INSTALL
56        doc/LSM
57        doc/NEWS                (most recent information only)
58        doc/README.1st
59        doc/README.html         (HTML version of README)
60        html/Welcome.html
61        html/search/htdig/search.html
62        html/messages/*
63        html/local/index.html
64
65
66French = fr
67-----------
68
69WWWOFFLE Version: 2.7-beta
70
71Jacques L'helgoualc'h <lhh@free.fr>
72
73Original by: Anthony Baire <popov@mail.dotcom.fr>
74             Roland Trique <roland.trique@easynet.fr>
75
76        doc/README.CONF
77        html/Welcome.html
78        html/search/htdig/search.html
79        html/search/udmsearch/search.html
80        html/search/namazu/search.html
81        html/messages/*
82
83
84Spanish = es
85------------
86
87WWWOFFLE Version: 2.6
88
89Gorka Olaizola <gorka@escomposlinux.org>
90
91        doc/CONVERT
92        doc/INSTALL
93        doc/LSM
94        doc/README
95        doc/README.1st
96        doc/README.CONF
97        doc/README.PWD
98        doc/README.URL
99        doc/README.htdig
100        doc/README.udmsearch
101        doc/README.win32
102        contrib/README
103        html/Welcome.html
104        html/FAQ.html
105        html/README.CONF.html   (auto-generated from doc/README.CONF)
106        html/search/*.html
107        html/search/htdig/search.html
108        html/search/udmsearch/search.html
109        html/robots.txt
110        html/wwwoffle.pac
111        html/messages/*
112
113
114Russian = ru
115------------
116
117WWWOFFLE Version: 2.6c
118
119Maxim Popov <popovm@mail.primorye.ru>
120
121        html/Welcome.html
122        html/FAQ.html
123        html/local/index.html
124        html/messages/*
125
126
127Polish = pl
128-----------
129
130WWWOFFLE Version: 2.7-beta
131
132Grzegorz Kowal <g_kowal@poczta.onet.pl>
133
134        doc/README.1st
135        doc/README.lang
136        doc/INSTALL
137        doc/LSM
138        html/Welcome.html
139        html/search/namazu/*.html
140        html/search/htdig/search.html
141        html/search/udmsearch/search.html
142        html/search/udmsearch/results.html
143        html/search/namazu/search.html
144        html/local/index.html
145        html/messages/*
146
147
148Italian = it
149------------
150
151WWWOFFLE Version: 2.6
152
153Antonio Fragola <mrshark@tiscalinet.it>
154
155        html/Welcome.html
156        html/FAQ.html
157        html/search/htdig/search.html
158        html/search/udmsearch/search.html
159        html/messages/*
160
161
162Dutch = nl
163----------
164
165WWWOFFLE Version: 2.8a
166
167Paul Slootman <paul@debian.org>
168
169        html/messages/*
170

README.mnogosearch

1          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2          ========================================================
3
4
The program mnoGoSearch (formerly known as UdmSearch) is a free (GPL) internet
indexing and search program.  The mnoGoSearch web-page describes itself as
follows:
8
9        mnoGoSearch (formerly known as UdmSearch) is a full-featured search
10        software for intranet and internet servers. mnoGoSearch is a free
11        software covered by the GNU General Public License.
12
13        mnoGoSearch software has a number of unique features, which makes it
14        appropriate for a wide range of applications from search within your
15        site to specialized search systems such as cooking recipes or newspaper
16        searches, ftp archive search, MP3 search, news articles search or even
17        national-wide portal search engine.
18
I have written WWWOFFLE so that mnoGoSearch can be used with it to allow the
entire cache of pages to be indexed.  There are three stages to using the
program that are described in this document: installation, indexing and
searching.
23
24
25Installing mnoGoSearch
26----------------------
27
28mnoGoSearch is available from the web site
29
30        http://mnogosearch.org/
31
32You need to have version 3.1.0 or later of mnoGoSearch.
33
34No special compile-time configuration of mnoGoSearch is required to be able to
35use it with WWWOFFLE.
36
37
38I tested with version 3.1.0.
39
40
41Configure WWWOFFLE to run with mnoGoSearch
42------------------------------------------
43
44The configuration files for the mnoGoSearch programs as used with WWWOFFLE will
45have been installed in /var/spool/wwwoffle/search/mnogosearch/conf when WWWOFFLE
46was installed.  The scripts used to run the mnoGoSearch programs will have been
installed in /var/spool/wwwoffle/search/mnogosearch/scripts when WWWOFFLE was
48installed.  In both these cases the directory /var/spool/wwwoffle can be changed
49at compile time with options to the configure script.
50
51These files should be correct if the information at the time of running
52configure was set correctly.  Check them, they should have the spool directory
53and the proxy hostname and port set correctly.
54
55Also they should be checked to ensure that the mnoGoSearch programs are on the
56path (you can edit the PATH variable here if they are not in /usr/local/bin).
57
58
59Configure database to work with mnoGoSearch
60-------------------------------------------
61
62MySQL
63- - -
64
65Create the MySQL database using the 'mysqladmin' command
66
67$ mysqladmin create mnogosearch
68
69Setup the database structure for the mnoGoSearch database.
70
71$ mysql mnogosearch < mnogosearch-3.1.13/create/mysql/create.txt
72$ mysql mnogosearch < mnogosearch-3.1.13/create/mysql/crc-multi.txt
73
74Create the MySQL user called wwwoffle and allow access to the mnogosearch
75database.  This requires running the mysql program and entering commands at the
76'mysql>' prompt (I have broken up the second line to allow it to fit, it should
77all be one line).
78
79$ mysql -u root mysql
80
81mysql> INSERT INTO user (Host,User,Password) VALUES('localhost','wwwoffle','');
82mysql> INSERT INTO db (Host,Db,User,Select_priv,Insert_priv,Update_priv,
83       Delete_priv,Create_priv,Drop_priv)
84       VALUES('localhost','mnogosearch','wwwoffle','Y','Y','Y','Y','Y','Y');
85mysql> FLUSH PRIVILEGES;
86mysql> quit
87
88
89Postgres SQL
90- - - - - -
91
The Postgres database server needs to be configured so that it uses TCP/IP and
so that access is allowed from the host that the mnoGoSearch indexer will be
run from.
95
96The option PGALLOWTCPIP=yes in postmaster.init needs to be set to allow TCP/IP
97access.
98
The option PGFSYNC=no in postmaster.init needs to be set to get good
performance.
101
102You will need to create a database user and set up the database for mnoGoSearch.
103
104$ createuser -U postgres --createdb --no-adduser wwwoffle
105$ createdb -U wwwoffle mnogosearch
106
107Setup the database structure for the mnogosearch database.
108
109$ psql -U wwwoffle mnogosearch < mnogosearch-3.1.13/create/pgsql/create.txt
110$ psql -U wwwoffle mnogosearch < mnogosearch-3.1.13/create/pgsql/crc-multi.txt
111
112
113Indexing
114--------
115
116Indexing is the name that is given to the process of searching through the
117web-pages to make the search database.
118
119To work with WWWOFFLE I have produced my own scripts that should be used
120to call the mnoGoSearch indexer.
121
122   /var/spool/wwwoffle/search/mnogosearch/scripts/wwwoffle-mnogosearch-full
123   /var/spool/wwwoffle/search/mnogosearch/scripts/wwwoffle-mnogosearch-lasttime
124
125The first of these scripts will do a full search and index all of the URLs in
126the cache.  The second one will do a search of the URLs in the lasttime
127directory.
128
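One convenient arrangement (not part of WWWOFFLE itself, just a
suggestion) is to run these scripts from cron, with a full re-index once a
week and a lasttime index every day.  Adjust the paths if you changed the
spool directory at compile time, and the times to suit your connection
habits.

```shell
# Example crontab entries (edit with 'crontab -e'): full re-index early
# on Sunday morning, incremental index of recent pages every morning.
30 4 * * 0  /var/spool/wwwoffle/search/mnogosearch/scripts/wwwoffle-mnogosearch-full
45 4 * * *  /var/spool/wwwoffle/search/mnogosearch/scripts/wwwoffle-mnogosearch-lasttime
```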
129
130Searching
131---------
132
133The search page for mnoGoSearch is at http://localhost:8080/search/mnogosearch/
134and is linked to from the "Welcome Page".  The word or words that you want to
135search for should be entered here.
136
137This form actually calls the script
138
139   /var/spool/wwwoffle/search/mnogosearch/scripts/wwwoffle-mnogosearch
140
141to do the searching so it is possible to edit this to modify it if required.
142
143
144Thanks to
145---------
146
147Thanks to Volker Wysk <vw@volker-wysk.de> for providing the initial information
148about using mnoGoSearch.  I have used his useful e-mail about how to configure
149the mnoGoSearch program and MySQL in this document (with modifications).
150
151
152
153Andrew M. Bishop
15412th Aug 2000
155

README.namazu

1          WWWOFFLE - World Wide Web Offline Explorer - Version 2.9
2          ========================================================
3
4
5The program namazu is a free (GPL) web server indexing and search program.
6
7        Namazu is a full-text search engine intended for easy
8        use. Not only does it work as a small or medium scale Web
9        search engine, but also as a personal search system for
10        email or other files.
11
12Namazu is different from most such programs in that it only searches the
13filesystem that is used by the web server and does not access the server
14directly.
15
16The program that performs the searching & indexing for the namazu package is
17called mknmz.  The program mknmz-wwwoffle has been written by WATANABE Yoshimasa
18to allow indexing of the WWWOFFLE cache.  This is performed directly on the
19files in the cache.
20
21
22Getting Namazu
23--------------
24
25Namazu is available from the web site
26
27        http://www.namazu.org/
28
29You need to have version 2.0 or later.
30
31No special compile-time configuration of Namazu is required to be able to use it
32with WWWOFFLE.
33
34
35I tested with version 2.0.5 using the official Debian package.
36
37
38Getting mknmz-wwwoffle
39----------------------
40
41The mknmz-wwwoffle program is available from the web site
42
43        http://www.naney.org/comp/distrib/mknmz-wwwoffle/
44
45You need to have version 0.7 or later.
46
47No special configuration of mknmz-wwwoffle is required to be able to use it with
48WWWOFFLE.
49
50
51I tested with version 0.7.2 using a Debian package from the mknmz-wwwoffle site.
52
53
54Configure WWWOFFLE to run with Namazu & mknmz-wwwoffle
55------------------------------------------------------
56
57The configuration files for the Namazu programs as used with WWWOFFLE will have
58been installed in /var/spool/wwwoffle/search/namazu/conf when WWWOFFLE was
59installed.  The scripts used to run the Namazu programs will have been installed
60in /var/spool/wwwoffle/search/namazu/scripts when WWWOFFLE was installed.  In
61both these cases the directory /var/spool/wwwoffle can be changed at compile
62time with options to the configure script.
63
64These files should be correct if the information at the time of running
65configure was set correctly.  Check them, they should have the spool directory
66and the proxy hostname and port set correctly.
67
68Also they should be checked to ensure that the Namazu programs are on the path
69(you can edit the PATH variable here if they are not in /usr/local/bin).
70
71
72Indexing
73--------
74
75Indexing is the name that is given to the process of searching through the
76web-pages to make the search database.
77
To work with WWWOFFLE I have produced my own scripts that should be used
to call the Namazu indexer program (mknmz).
80
81   /var/spool/wwwoffle/search/namazu/scripts/wwwoffle-mknmz-full
82   /var/spool/wwwoffle/search/namazu/scripts/wwwoffle-mknmz-lasttime
83
84The first of these scripts will do a full search and index all of the URLs in
85the cache.  The second one will do a search on the files in the lasttime
86directory.
87
88
89Searching
90---------
91
92The search page for Namazu is at http://localhost:8080/search/namazu/ and is
93linked to from the "Welcome Page".  The word or words that you want to search
94for should be entered here.
95
96This form actually calls the script
97
98   /var/spool/wwwoffle/search/namazu/scripts/wwwoffle-namazu
99
100to do the searching so it is possible to edit this to modify it if required.
101
102
103Thanks to
104---------
105
106Thanks to WATANABE Yoshimasa <naney@naney.org> for writing the mknmz-wwwoffle
107program without which the Namazu program and WWWOFFLE could not have been used
108together.
109
110
111
112Andrew M. Bishop
11312th Aug 2001
114
115

README.win32

1      WWWOFFLE - World Wide Web Offline Explorer - Version 2.7 - Win32
2      ================================================================
3
4
5    *************************************************************************
6    ** This file is rather out of date because it was written for earlier  **
7    ** versions of WWWOFFLE and cygwin.  If you compile WWWOFFLE for Win32 **
8    ** and can suggest changes then please let me know.                    **
9    *************************************************************************
10
11
12This is the Windows 32-bit version of the World Wide Web OFFline Explorer,
otherwise known as WWWOFFLE.  A UNIX version of this program has been available
14since the start of 1997.  The possibility of a Windows version of the program
15was brought to my attention by an investigation of the Cygwin development kit.
16
17
18The Cygwin Development Kit
19--------------------------
20
21The Cygwin development kit is described in its FAQ as follows:
22
23        The Cygwin tools are Win32 ports of the popular GNU development
24        tools for Windows NT, 95, and 98. They function through the use
25        of the Cygwin library which provides a UNIX-like API on top of
26        the Win32 API.
27
28        Use the tools to:
29        o Develop Win32 console or GUI applications, using the Win32 API.
30        o Easily port many significant UNIX programs to Windows NT/9x
31          without making significant source code changes. Configure and
32          build most GNU software from source using standard Unix build
33          procedures.
34        o Work in a fairly complete UNIX-like environment, with access to
35          many common UNIX utilities (from both the provided bash shell
36          and the standard Win32 command shell).
37
38More information about the Cygnus development kit and the GNU tools that are in
39the development kit can be obtained from their web-sites.
40
41http://cygwin.com/
42http://www.gnu.org/
43
44To compile WWWOFFLE you should use the latest version of the cygwin library.
45
46
47Using WWWOFFLE
48--------------
49
50Because this version of WWWOFFLE is a port of a UNIX program (with negligible
51changes from the UNIX version) some of the concepts and features may not be
52familiar to users of MS Windows.
53
54Filenames
55- - - - -
56
57On UNIX systems the '/' character is used as a path separator, DOS uses '\', you
58should use the UNIX format in the wwwoffle.conf configuration file and in the
59command line arguments.  One other change that has been made is that on DOS the
60':' character is not allowed in filenames so the '!' character has been used in
61the host sub-directory names instead.
62
63On UNIX systems the filenames are case-sensitive and can be longer than 8.3
64characters.  WWWOFFLE requires that the files that it creates keep their case
65and are longer than 8.3 characters.
66
67On UNIX systems there is no distinction between the separate disk drives like
68there is under DOS.  With a DOS system there are drives 'A:', 'C:', 'D:' etc, on
69UNIX all of the disks are accessed from the root directory '/'.  In the Cygnus
70CDK and hence in WWWOFFLE all pathnames are expected to be in this format, the
71drive that the operating system booted from (normally drive C:) is '/', drive
72'A:' would be '//a/', drive 'D:' would be '//d/'.  You must use this format in
73the wwwoffle.conf configuration file and in the command line arguments.
74
75The default installation location for WWWOFFLE on UNIX is different from that
76for Windows-32.
77
78                        UNIX                                    Windows-32
79
80Cached files:   /var/spool/wwwoffle                     /wwwoffle
81Config file:    /etc/wwwoffle/wwwoffle.conf             /wwwoffle/wwwoffle.conf
82Executables:    /usr/local/bin & /usr/local/sbin        /wwwoffle/bin
83Documentation:  /usr/local/man/man*                     /wwwoffle/doc
84
85In the documentation and the program you may find references to these pathnames
86and filenames, you should make the appropriate conversions.
87
88Other Terms
89- - - - - -
90
91Syslog        - The system logfile, many daemon processes (servers) write their
92                status to this file.
93Daemon        - A program (usually some type of server) that runs in the
94                background and sleeps until it is called upon to do anything.
95Username/uid  - Users on a UNIX system have to log on and are assigned a
username and a numeric user ID (uid).
97                [Not the same as a Windows 95/98 logon username]
98Groupname/gid - Users on UNIX are also assigned to a group that has a name and a
99                numeric Group ID (gid).
100                [Not applicable to Windows 95/98]
101
102
103Running WWWOFFLE
104----------------
105
The WWWOFFLE server program 'wwwoffled' is typically started from the boot
time scripts (the equivalent of autoexec.bat on DOS). On a Win32 system I do not
108know the best way of starting a program at boot time so I leave it to you to
109decide.
110
111The WWWOFFLE helper program 'wwwoffle' is run each time that the dial-up
112connection is started or stopped.  This is normally done by the scripts that are
113automatically run by the PPP connection process.  Again I do not know the best
114way of doing this on Win32, the graphical interface to DUN does not appear to
115allow for this.
116
117Quick Demonstration
118- - - - - - - - - -
119
120To see what WWWOFFLE does, use the following steps for a quick demonstration.
121
1221) Edit the configuration file
123   c:\wwwoffle\wwwoffle.conf
124
1252) Start the WWWOFFLE demon running.
126   c:\wwwoffle\bin\wwwoffled
127
1283) Start your web browser and set up localhost:8080 as the proxy.
129   Disable caching between sessions within the browser.
130
1314) a) Connect to the internet
132   b) Tell the WWWOFFLE demon that you are online
133        c:\wwwoffle\bin\wwwoffle -online
134   c) Start browsing
135   d) Tell the WWWOFFLE demon that you are offline
136        c:\wwwoffle\bin\wwwoffle -offline
137   e) Disconnect from the internet
138
1395) Go back and browse the pages again while not connected, follow some different
140   links this time (you will see a WWWOFFLE server message in the browser).
141
1426) a) Connect to the internet
143   b) Tell the WWWOFFLE daemon that you are online
144        c:\wwwoffle\bin\wwwoffle -online
145   c) Tell the WWWOFFLE daemon to fetch the new pages
146        c:\wwwoffle\bin\wwwoffle -fetch
147   d) Tell the WWWOFFLE daemon that you are offline
148        c:\wwwoffle\bin\wwwoffle -offline
149   e) Disconnect from the internet
150
1517) a) Go to http://localhost:8080/index/ and find the newly downloaded pages.
152   b) Browse the new pages that have just been fetched.
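
The online parts of the demonstration (steps 4 and 6) can be sketched as a
single script.  This assumes a Cygwin-style shell with the wwwoffle program on
the PATH; substitute c:\wwwoffle\bin\wwwoffle as used above if appropriate:

```shell
#!/bin/sh
# Illustrative sketch of demonstration steps 6b-6d; it assumes the
# dial-up connection is already established before it is run.
wwwoffle -online     # tell the daemon the connection is up
wwwoffle -fetch      # download all pages requested while offline
wwwoffle -offline    # return to offline mode before disconnecting
```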
153
154
155Running WWWOFFLE as a service
156-----------------------------
157
158One way to run WWWOFFLE is to install it as a service using the cygrunsrv
159program that comes with the Cygwin toolset.
160
161The command that you would use to do this is (all on one line):
162
163cygrunsrv -I wwwoffle -p /usr/local/sbin/wwwoffled.exe -o
164          -a "-c /etc/wwwoffle/wwwoffle.conf -d"
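
Once installed, the service can be managed with the usual cygrunsrv options,
for example:

```shell
cygrunsrv -S wwwoffle    # start the service
cygrunsrv -E wwwoffle    # stop the service
cygrunsrv -R wwwoffle    # remove the service again
```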
165
166
167Win32 Problems
168--------------
169
170If accessing WWWOFFLE is very slow then the cause may be problems performing
171DNS lookups to find out which hosts are connecting to WWWOFFLE.  Windows can
172use a UNIX-style hosts file to specify local hostnames.  The file is named
173c:\windows\hosts or c:\winnt\hosts and contains a list of IP addresses and
174hostnames.
175
176For example, if you have two hosts on the 192.168.0.* subnet called host1 and
177host2, you would put the following in the hosts file:
178
179192.168.0.1   host1
180192.168.0.2   host2
181
182
183Other Information
184-----------------
185
186You should read the rest of the documentation about WWWOFFLE, in particular the
187FAQ and the README.1st file.  These should answer your questions or at least
188point you in the direction of how to contact me for information.
189
190There may be other UNIX-biased features of WWWOFFLE in the documentation of the
191program itself.  Since this is the first version of WWWOFFLE that works on Win32
192platforms I hope that you will try to work around any problems.  I will try to
193make sure that the next version has more applicable information.
194
195
196
197Andrew M. Bishop
198August 24th 2001
199