Name | Date | Size | #Lines | LOC
---|---|---|---|---
README.contrib | 19-May-2019 | 1.8 KiB | 53 | 34
README.randomtrain | 19-May-2019 | 2.5 KiB | 54 | 45
bfproxy.pl | 03-May-2022 | 24 KiB | 763 | 266
bogo.R | 19-May-2019 | 1.5 KiB | 62 | 37
bogofilter-milter.pl | 03-May-2022 | 43.4 KiB | 1,442 | 972
bogofilter-qfe.sh | 19-May-2019 | 1.8 KiB | 88 | 49
bogogrep.c | 19-May-2019 | 2.9 KiB | 129 | 100
bogominitrain.pl | 03-May-2022 | 7.7 KiB | 193 | 161
dot-qmail-bogofilter-default | 19-May-2019 | 36 B | 2 | 1
mailfilter.example | 10-Oct-2019 | 1.3 KiB | 48 | 36
mime.get.rfc822.pl | 03-May-2022 | 2.2 KiB | 78 | 24
parmtest.sh | 19-May-2019 | 3.3 KiB | 141 | 76
printmaildir.pl | 03-May-2022 | 1 KiB | 52 | 40
procmailrc.example | 10-Oct-2019 | 1.9 KiB | 83 | 62
randomtrain.sh | 03-May-2022 | 5.2 KiB | 170 | 130
scramble.sh | 03-May-2022 | 3.9 KiB | 157 | 123
spamitarium.pl | 03-May-2022 | 45.5 KiB | 1,231 | 561
stripsearch.pl | 03-May-2022 | 13.6 KiB | 481 | 153
trainbogo.sh | 03-May-2022 | 8.9 KiB | 286 | 211
vm-bogofilter.el | 19-May-2019 | 14.9 KiB | 392 | 179

README.contrib

Files in this directory
~~~~~~~~~~~~~~~~~~~~~~~

README.contrib - the file you are now reading.

QDBM-transactions.patch - a patch to teach QDBM to use transactions,
    to show what QDBM transaction enhancements might look like.

bfproxy.pl - performs bogofilter functions via email.

bogo.R - Script to check calculations performed by bogofilter.

bogofilter-milter.pl - Sendmail::Milter Perl script for filtering mail
    using individual users' bogofilter databases.

bogogrep.c - This file emulates GNU grep -ab with a plain text pattern
    anchored to the left.

bogominitrain.pl - Script for training to exhaustion from mbox, see FAQ.

dot-qmail-bogofilter-default - Sample .qmail or .qmail-default file to
    go along with ...
bogofilter-qfe.sh - qmail-specific bogofilter frontend script which allows
    the use of a centralized bogofilter running on an SMTP mail server.

mailfilter.example - Sample maildrop configuration file.

mime.get.rfc822.pl - Perl script to extract message/rfc822 attachments into
    an mbox.

parmtest.sh - Script to test a bogofilter option.

printmaildir.pl - Converts a Maildir to mbox format.

procmailrc.example - Sample .procmailrc file.

randomtrain.sh - Script to train on error with random order from mbox,
    see FAQ. Comes with (and uses) ...
scramble.sh - Script needed by randomtrain.

spamitarium.pl - helps to remove unnecessary noise from email headers
    and to highlight just the portions which contribute positively to
    spam filtering using statistical methods.

stripsearch.pl - Stripsearch investigates the body of your emails for
    evidence of spamvertized URLs by looking them up in a Realtime
    BlockList (RBL) such as surbl.org or spamhaus.org.

trainbogo.sh - Script to train from qmail maildirs.

vm-bogofilter.el - An interface between the VM mail reader and the
    bogofilter spam filter.

README.randomtrain

It seems that training bogofilter on its errors _only_ is a very good
way to train, at least with the Robinson-Fisher or Bayes chain rule
calculation methods. The way this works is: messages from the training
corpus are picked at random (without replacement, i.e. no message is used
more than once) and fed to bogofilter for evaluation. If bogofilter
gets the classification right, nothing further is done. If it's wrong,
or uncertain if ternary mode is in use, then the message is fed to
bogofilter again with the -s or -n option, as appropriate.

That's all very well, except that it's not an easy process to execute
with just a couple of shell commands. I've now written a bash script
that does the job [Matthias Andree: this has been changed so it may now
work on a regular POSIX compliant sh, too, feedback is welcome]; you
give it the directory in which to build the bogofilter database, and a
list of files flagged with either -s or -n to indicate spam or nonspam,
and it performs training-on-error using all the messages in all the
files in random order.

My production version of bogofilter returns the following exit codes:
    0 for spam
    1 for nonspam
    2 for uncertain
    3 for error

Normal bogofilter returns (I think)
    0 for spam
    1 for nonspam
    2 for error

This script will work with either.

You can use it to build from scratch; the first message evaluated will
return the error exit code, and randomtrain (as this script is called)
will train with that message, thus creating the databases.

The script needs rather a lot of auxiliary commands (they're listed in
the comments at the top of the file); in particular, perl is called for
the randomization function. (The embedded perl script is "useful" in
its own right: it takes text on standard input and returns the lines in
random sequence.) Known portability issue: on HP-UX (10.20 at least),
grep -b returns a block offset instead of a byte offset, so randomtrain
won't work unless GNU grep is substituted for the HP-UX one.

I rebuilt my training lists with randomtrain. The training corpus
consists of 9878 spams and 7896 nonspams. The message-counts from
bogoutil -w bogodir are 1475 and 408. The database sizes from full
training were 10 and 4 MB; randomtrain produced .db files of 7 and 1.2
MB. I don't yet have figures comparing discrimination by bogofilter
with these two training sets, but yesterday's smaller-scale test (which
motivated me to write this script) clearly indicated an improvement
could be expected.

Greg Louis <glouis@dynamicro.on.ca>
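
The training-on-error procedure described at the top of this file can be
sketched in Python as follows. This is only an illustration of the idea,
not the randomtrain script itself. The names train_on_error, classify and
train are hypothetical: classify stands in for a bogofilter evaluation
run, and train stands in for feeding the message back with -s or -n.

```python
import random

# Labels used in this sketch only; they are not bogofilter identifiers.
SPAM, NONSPAM, UNSURE = "spam", "nonspam", "unsure"


def train_on_error(messages, classify, train, rng=None):
    """Train only on messages the classifier gets wrong or is unsure about.

    messages: iterable of (text, label) pairs, label being SPAM or NONSPAM.
    classify: hypothetical callback, text -> SPAM, NONSPAM or UNSURE.
    train:    hypothetical callback, (text, label) -> None.
    Returns the number of messages that were used for training.
    """
    if rng is None:
        rng = random.Random()

    # Shuffling a copy visits every message exactly once in random
    # order, i.e. random selection without replacement.
    order = list(messages)
    rng.shuffle(order)

    trained = 0
    for text, label in order:
        verdict = classify(text)
        # A correct verdict needs no action; a wrong or uncertain one
        # means the message is registered under its true label.
        if verdict != label:
            train(text, label)
            trained += 1
    return trained
```

With an empty database every verdict starts out as uncertain, so the
first message is always trained on, which mirrors the build-from-scratch
behaviour described above.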