README
 Bayespam is a spam filter for qmail inspired by Paul Graham's "A Plan for
Spam" <http://www.paulgraham.com/spam.html>. It uses Bayesian classification
to determine whether a particular piece of email is spam. I won't go into the
theory here; I recommend you read Mr. Graham's paper, which explains it
very well.

NOTE:
 Bayespam 0.9.2 is NOT backwards compatible with Bayespam 0.9! You
should delete the old Perl scripts, and especially be sure to delete
any ratings files you have made -- Bayespam 0.9.2 cannot understand
them!

NOTE:
 I have only just finished writing, testing, and deploying this filter on my
own email server. I don't guarantee it's 100% foolproof yet, so use it with
caution. Please send me any problems, improvements, or reports on how well it
works!

Building the Corpus Rating File
===============================
 The first step in setting up Bayespam is building what I call the corpus
rating file, a file that describes the "spam probability" of any particular
token. To build this, you'll need a directory of spam emails and a directory
of non-spam emails. (You should use your own emails; each corpus is
individually "tuned" to its user's email.) Personally, I did this using the
contents of the appropriate qmail Maildir directories.

 Once you have these directories of files, say named "spam_mail" and
"not_spam_mail", execute this command:
bayes_process_email.pl --good not_spam_mail --spam spam_mail -o bayes_rating.db
 This will build the file bayes_rating.db, which is used by the
bayes_spam_check.pl script. You can build an individualized corpus file for
each user, keeping each one in that user's directory, or make a single global
corpus -- it's up to you. Check the INSTALL file for how to use the corpus.
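
 As a rough sketch of the directory setup described above (this helper is
not part of Bayespam), the function below copies the messages from a Maildir's
cur/ and new/ subdirectories into a flat corpus directory. The function name
collect_maildir and every path are illustrative assumptions; adjust them to
wherever your good mail and spam actually live.

```shell
# Hypothetical helper, not shipped with Bayespam.
# Copy every message from a Maildir's cur/ and new/ subdirectories into a
# flat corpus directory. Both paths are supplied by the caller.
collect_maildir() {
    maildir=$1
    corpus=$2
    mkdir -p "$corpus" || return 1
    for sub in cur new; do
        # Skip subdirectories that do not exist in this Maildir.
        [ -d "$maildir/$sub" ] || continue
        # A Maildir stores one message per regular file.
        find "$maildir/$sub" -type f -exec cp {} "$corpus/" \;
    done
}
```

 Assuming good mail in $HOME/Maildir and spam in a separate Maildir (the
spam path here is only an example), usage might look like:
    collect_maildir "$HOME/Maildir" not_spam_mail
    collect_maildir "$HOME/Maildir/.Spam" spam_mail
followed by the bayes_process_email.pl command shown above.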

Hints, Tips, Ideas, and Suggestions from Users
==============================================
 Some hints and ideas on getting the most out of Bayespam, sent in by
several Bayespam users. I haven't necessarily tested any of these, so caveat
user. Please use caution when trying these out!

 Nate Underwood suggests using mbox2maildir to convert mbox files to
maildir format. [Get it at http://www.qmail.org/mbox2maildir . I've found
another program that does the same thing, though I've not used either of these:
http://www.firstpr.com.au/web-mail/mb2md/ -g.]
 Nate also suggests being sure to use a comprehensive (i.e., more than a
couple of messages) email corpus to train Bayespam.

 Brian Asker has some ideas for getting Bayespam to work with
sendmail/procmail/mbox at http://www.asker.net/software/bayespam/ . I can't
say how applicable these changes are to Bayespam 0.9.2, but hopefully they
are. :)

 Grahame Bowland has this code to make Bayespam work with procmail:
Procmailrc file:
----------------
:0fw
| /usr/local/bin/bayes_spam_check.pl /home/<USER>/corpus.dat <your email address>
:0:
* ^X-Spam-Status: Yes
spam
-----------------------
Change the end of bayes_spam_check.pl to:
-----------------------
my $probability_of_spam = $prod / ( $prod + $minus_one_prod );

# Append diagnostic information to the headers: the first empty line
# marks the end of the header block, so emit the new headers just before it.
my $in_header = 1;
foreach my $line (@email_message) {
    if ($in_header && $line =~ /^$/) {
        if ($probability_of_spam > $spam_minimum) {
            print "X-Spam-Status: Yes\n";
        } else {
            print "X-Spam-Status: No\n";
        }
        printf("X-Spam-Probability: %.3f\n", $probability_of_spam);
        $in_header = 0;
    }
    print $line;
}

# Always exit successfully so procmail does not treat the filter as failed
exit 0;
-----------------------
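
 To make the first line of that snippet concrete, here is a minimal sketch
of the combining step from Paul Graham's essay. It assumes that $prod is the
product of the per-token spam probabilities and $minus_one_prod the product
of their complements; the variable names suggest this, but the rest of the
script is not shown here. The function name combine_probabilities is
hypothetical, and awk stands in for the Perl arithmetic.

```shell
# Illustrative only; not part of bayes_spam_check.pl.
# Combine per-token spam probabilities (each strictly between 0 and 1)
# into one score: prod(p) / (prod(p) + prod(1 - p)).
combine_probabilities() {
    printf '%s\n' "$@" | awk '
        BEGIN { prod = 1; minus_one_prod = 1 }
        { prod *= $1; minus_one_prod *= (1 - $1) }
        END { printf "%.3f\n", prod / (prod + minus_one_prod) }
    '
}
```

 For example, "combine_probabilities 0.9 0.9" prints 0.988: two moderately
spammy tokens reinforce each other, while "combine_probabilities 0.5 0.5"
stays at 0.500.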

 Several users report that using a very small corpus, someone else's
spam, or a very large spam corpus can give bad results. I think that
we'll find the best results are obtained with your own spam and email, and
when both corpora are approximately the same size.
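
 One way to check the "approximately the same size" advice before building
the rating file is to count the messages in each directory. The sketch below
is not part of Bayespam; the function names and the factor-of-two threshold
are arbitrary illustrative choices, not a tested recommendation.

```shell
# Hypothetical helpers, not shipped with Bayespam.
# Count the regular files (one message per file) in a corpus directory.
count_messages() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Print both counts and warn when one corpus is more than twice the other.
# The factor of two is an arbitrary illustrative threshold.
check_corpus_balance() {
    good=$(count_messages "$1")
    spam=$(count_messages "$2")
    echo "good: $good, spam: $spam"
    if [ "$good" -gt $((spam * 2)) ] || [ "$spam" -gt $((good * 2)) ]; then
        echo "warning: corpora differ in size by more than 2x"
        return 1
    fi
    return 0
}
```

 With the directory names used earlier, this would be run as:
    check_corpus_balance not_spam_mail spam_mail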

Comments, suggestions?
======================
 Please be sure to contact me if you have any problems or suggestions, or
if you just want to let me know how well the filter works for you.

Gary Arnold <garnold@garyarnold.com>

-g.