Name                          Date         Size      #Lines  LOC
..                            03-May-2022  -
README.contrib                19-May-2019  1.8 KiB   53      34
README.randomtrain            19-May-2019  2.5 KiB   54      45
bfproxy.pl                    03-May-2022  24 KiB    763     266
bogo.R                        19-May-2019  1.5 KiB   62      37
bogofilter-milter.pl          03-May-2022  43.4 KiB  1,442   972
bogofilter-qfe.sh             19-May-2019  1.8 KiB   88      49
bogogrep.c                    19-May-2019  2.9 KiB   129     100
bogominitrain.pl              03-May-2022  7.7 KiB   193     161
dot-qmail-bogofilter-default  19-May-2019  36        2       1
mailfilter.example            10-Oct-2019  1.3 KiB   48      36
mime.get.rfc822.pl            03-May-2022  2.2 KiB   78      24
parmtest.sh                   19-May-2019  3.3 KiB   141     76
printmaildir.pl               03-May-2022  1 KiB     52      40
procmailrc.example            10-Oct-2019  1.9 KiB   83      62
randomtrain.sh                03-May-2022  5.2 KiB   170     130
scramble.sh                   03-May-2022  3.9 KiB   157     123
spamitarium.pl                03-May-2022  45.5 KiB  1,231   561
stripsearch.pl                03-May-2022  13.6 KiB  481     153
trainbogo.sh                  03-May-2022  8.9 KiB   286     211
vm-bogofilter.el              19-May-2019  14.9 KiB  392     179

README.contrib

Files in this directory
~~~~~~~~~~~~~~~~~~~~~~~

README.contrib - the file you are now reading.

QDBM-transactions.patch - a patch to teach QDBM to use transactions,
    to show what QDBM transaction enhancements might look like.

bfproxy.pl - performs bogofilter functions via email

bogo.R - Script to check calculations performed by bogofilter.

bogofilter-milter.pl - Sendmail::Milter Perl script for filtering mail
    using individual users' bogofilter databases.

bogogrep.c - This file emulates GNU grep -ab with a plain text pattern
    anchored to the left.

bogominitrain.pl - Script for training to exhaustion from mbox, see FAQ.

dot-qmail-bogofilter-default - Sample .qmail or .qmail-default file to
    go along with ...
bogofilter-qfe.sh - qmail-specific bogofilter frontend script which allows
    the use of a centralized bogofilter running on an SMTP mail server.

mailfilter.example - Sample maildrop configuration file.

mime.get.rfc822.pl - Perl script to extract message/rfc822 attachments into
    an mbox.

parmtest.sh - Script to test a bogofilter option.

printmaildir.pl - Converts a Maildir to mbox format.

procmailrc.example - Sample .procmailrc file.

randomtrain.sh - Script to train on error with random order from mbox,
    see FAQ. Comes with (and uses) ...
scramble.sh - Script needed by randomtrain.

spamitarium.pl - helps to remove unnecessary noise from email headers
    and to highlight just the portions which contribute positively to
    spam filtering using statistical methods.

stripsearch.pl - Stripsearch investigates the body of your emails for
    evidence of spamvertized URLs by looking them up in a Realtime
    BlockList (RBL) such as surbl.org or spamhaus.org.

trainbogo.sh - Script to train from qmail maildirs.

vm-bogofilter.el - An interface between the VM mail reader and the
    bogofilter spam filter.

README.randomtrain

It seems that training bogofilter on its errors _only_ is a very good
way to train, at least with the Robinson-Fisher or Bayes chain rule
calculation methods.  The way this works is: messages from the training
corpus are picked at random (without replacement, i.e. no message is used
more than once) and fed to bogofilter for evaluation.  If bogofilter
gets the classification right, nothing further is done.  If it's wrong
(or uncertain, when ternary mode is in use), then the message is fed to
bogofilter again with the -s or -n option, as appropriate.
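
The loop described above can be sketched in a few lines of Python.  This
is only an illustration of the idea, not the randomtrain script itself:
`classify` and `train` are hypothetical callables standing in for
running bogofilter to evaluate a message and for running it again with
-s or -n, and the verdict strings are names chosen for this sketch.

```python
import random
from typing import Callable, Iterable, List, Tuple

# Verdict labels used by this sketch only (not bogofilter output).
SPAM, NONSPAM, UNSURE = "spam", "nonspam", "unsure"


def train_on_error(
    corpus: Iterable[Tuple[str, bool]],
    classify: Callable[[str], str],
    train: Callable[[str, bool], None],
    rng: random.Random,
) -> int:
    """Visit every (message, is_spam) pair once, in random order.

    A message is registered only when the classifier's verdict is wrong
    or unsure.  Returns the number of messages trained on.
    """
    messages: List[Tuple[str, bool]] = list(corpus)
    rng.shuffle(messages)  # random order, each message used exactly once
    trained = 0
    for text, is_spam in messages:
        expected = SPAM if is_spam else NONSPAM
        if classify(text) != expected:  # wrong, or unsure in ternary mode
            train(text, is_spam)        # stands in for bogofilter -s / -n
            trained += 1
    return trained
```

Passing the random generator in explicitly keeps a run reproducible,
which helps when comparing training results.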

That's all very well, except that it's not an easy process to execute
with just a couple of shell commands.  I've now written a bash script
that does the job [Matthias Andree: this has been changed so it may now
work on a regular POSIX-compliant sh, too; feedback is welcome]; you
give it the directory in which to build the bogofilter database, and a
list of files flagged with either -s or -n to indicate spam or nonspam,
and it performs training-on-error using all the messages in all the
files in random order.

My production version of bogofilter returns the following exit codes:
0 for spam
1 for nonspam
2 for uncertain
3 for error

Normal bogofilter returns (I think)
0 for spam
1 for nonspam
2 for error

This script will work with either.
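
One way a script can be indifferent to the two conventions (an
assumption about the approach, not a reading of randomtrain.sh) is to
look only at codes 0 and 1, which mean the same thing in both, and to
treat every other code as a reason to train.  The function name below is
hypothetical.

```python
SPAM_CODE = 0      # both conventions listed above use 0 for spam
NONSPAM_CODE = 1   # and 1 for nonspam


def needs_training(exit_code: int, is_spam: bool) -> bool:
    """True when the exit code is anything but the correct verdict.

    Codes 2 and 3 (uncertain or error, depending on the convention)
    never match the expected code, so they always trigger training.
    """
    expected = SPAM_CODE if is_spam else NONSPAM_CODE
    return exit_code != expected
```

Under this rule an error code on an empty database also counts as a
miss, so the first message evaluated is always trained on.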

You can use it to build from scratch; the first message evaluated will
return the error exit code, and randomtrain (as this script is called)
will train with that message, thus creating the databases.

The script needs rather a lot of auxiliary commands (they're listed in
the comments at the top of the file); in particular, perl is called for
the randomization function.  (The embedded perl script is "useful" in
its own right: it takes text on standard input and returns the lines in
random sequence.)  Known portability issue: on HP-UX (10.20 at least),
grep -b returns a block offset instead of a byte offset, so randomtrain
won't work unless GNU grep is substituted for the HP-UX one.
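
For readers without the script at hand, the line-shuffling step can be
sketched as follows.  This is a Python stand-in written for
illustration, not the embedded perl code; the function name is
hypothetical.

```python
import random
from typing import Iterable, List, Optional


def shuffled_lines(
    lines: Iterable[str], rng: Optional[random.Random] = None
) -> List[str]:
    """Return the input lines in random order, each exactly once."""
    result = list(lines)             # copy, so the caller's data is untouched
    (rng or random).shuffle(result)  # in-place Fisher-Yates shuffle
    return result
```

Used as a filter, it would be called as
`sys.stdout.writelines(shuffled_lines(sys.stdin))` after `import sys`.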

I rebuilt my training lists with randomtrain.  The training corpus
consists of 9878 spams and 7896 nonspams.  The message counts from
bogoutil -w bogodir are 1475 and 408.  The database sizes from full
training were 10 and 4 MB; randomtrain produced .db files of 7 and 1.2
MB.  I don't yet have figures comparing discrimination by bogofilter
with these two training sets, but yesterday's smaller-scale test (which
motivated me to write this script) clearly indicated an improvement
could be expected.
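
To put those figures in proportion, the short calculation below works
out what share of the corpus was actually trained on.  It assumes the
pairs of numbers are given in the order spam, nonspam; the text does not
say so explicitly.

```python
# Figures quoted above; the spam/nonspam order of the second and later
# pairs is an assumption.
corpus = {"spam": 9878, "nonspam": 7896}
trained = {"spam": 1475, "nonspam": 408}
db_full_mb = {"spam": 10.0, "nonspam": 4.0}
db_random_mb = {"spam": 7.0, "nonspam": 1.2}

# Share of each corpus that training-on-error had to register.
fraction_trained = {k: trained[k] / corpus[k] for k in corpus}

# Database size relative to full training.
db_ratio = {k: db_random_mb[k] / db_full_mb[k] for k in corpus}

# Share of all messages registered, spam and nonspam together.
overall = sum(trained.values()) / sum(corpus.values())

for kind in corpus:
    print(f"{kind}: {fraction_trained[kind]:.1%} trained, "
          f"database at {db_ratio[kind]:.0%} of full size")
print(f"overall: {overall:.1%} of messages trained")
```

On that reading, roughly 15% of the spam and 5% of the nonspam had to be
registered to reach a database that classifies the rest correctly.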

Greg Louis <glouis@dynamicro.on.ca>