Name                    Date         Size      #Lines  LOC
Bayespam/               01-Sep-2002  -         58      26
AUTHORS                 01-Sep-2002  59        4       2
COPYING                 01-Sep-2002  17.6 KiB  341     281
Changelog               01-Sep-2002  644       22      19
INSTALL                 01-Sep-2002  1,013     26      22
README                  03-May-2022  3.9 KiB   99      85
TODO                    01-Sep-2002  919       30      28
bayes_process_email.pl  03-May-2022  7.8 KiB   275     181
bayes_spam_check.pl     03-May-2022  3.8 KiB   150     82
bayestest.pl            03-May-2022  1.7 KiB   54      23
qmail.sample            03-May-2022  215       3       2

README

	Bayespam is a spam filter for qmail inspired by Paul Graham's "A Plan for
Spam" <http://www.paulgraham.com/spam.html>.  It uses Bayesian classification
to determine if a particular piece of email is spam or not.  I'd get into the
theory more, but I recommend you read Mr. Graham's paper -- it explains it
very well.

NOTE:
	Bayespam 0.9.2 is NOT backwards compatible with Bayespam 0.9!  You
should delete the old Perl scripts, and especially be sure to delete
any ratings files you have made -- Bayespam 0.9.2 cannot understand
them!

NOTE:
	I just got through writing, testing, and deploying this filter on my own
email server.  I don't guarantee it's 100% foolproof yet, so use it with
caution.  Please send me any problems, improvements, or reports on how well
it works!

Building the Corpus Rating File
===============================
	The first step in setting up Bayespam is building what I call the corpus
rating file, a file that describes the "spam probability" of any particular
token.  To build this, you'll need a directory of spam emails and a directory
of non-spam emails.  (You should use your own emails...each corpus is
individually "tuned" to each user's email).  Personally, I did this using the
contents of the appropriate qmail Maildir directories.

	Once you have these directories of files, say named "spam_mail" and
"not_spam_mail", execute this command:
bayes_process_email.pl --good not_spam_mail --spam spam_mail -o bayes_rating.db
	This will build the file bayes_rating.db, which is used by the
bayes_spam_check.pl script.  You can build an individualized corpus file for
each user, keeping each one in that user's directory, or make a single global
corpus -- it's up to you.  Check the INSTALL file for how to use the corpus.

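	To give a feel for what a per-token "spam probability" is, here is a
minimal sketch of the rating formula described in "A Plan for Spam".  It is
not taken from bayes_process_email.pl, which may compute its ratings
differently; the sub name token_spam_probability and its arguments are made
up for illustration only.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper (not part of Bayespam): rate one token using the
# formula from "A Plan for Spam".
#   $good, $bad   - occurrences of the token in the good and spam corpora
#   $ngood, $nbad - number of messages in the good and spam corpora
# Returns nothing (undef in scalar context) when the token cannot be rated.
sub token_spam_probability {
    my ($good, $bad, $ngood, $nbad) = @_;
    return unless $ngood > 0 && $nbad > 0;   # need both corpora

    my $g = 2 * $good;          # good hits count double to limit false positives
    return if $g + $bad < 5;    # token is too rare to say anything about

    my $bad_ratio  = min(1, $bad / $nbad);
    my $good_ratio = min(1, $g / $ngood);
    my $p = $bad_ratio / ($good_ratio + $bad_ratio);

    return max(0.01, min(0.99, $p));   # never fully certain either way
}

# Small demonstration with two corpora of 100 messages each.
for my $case ([0, 20], [10, 10], [1, 1]) {
    my ($good, $bad) = @$case;
    my $p = token_spam_probability($good, $bad, 100, 100);
    printf "good=%d bad=%d -> %s\n", $good, $bad,
        defined $p ? sprintf('%.2f', $p) : 'unrated';
}
```

	Clamping the result to the range 0.01 to 0.99 keeps a single token from
deciding the outcome on its own, which matters once the per-token values are
multiplied together.
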
Hints, Tips, Ideas, and Suggestions from Users
==============================================
	Some hints and ideas on getting the most out of Bayespam, sent in by
several Bayespam users.  I haven't necessarily tested any of these, so caveat
user.  Please use caution when trying these out!

	Nate Underwood suggests using mbox2maildir to convert mbox files to
maildir format. [Get it at http://www.qmail.org/mbox2maildir .  I've found
another program that does that, though I've not used either of these:
http://www.firstpr.com.au/web-mail/mb2md/ -g.]
	Nate also suggests being sure to use a comprehensive (i.e., more than a
couple of messages) email corpus to train Bayespam.

	Brian Asker has some ideas for getting Bayespam to work with
sendmail/procmail/mbox at http://www.asker.net/software/bayespam/ .  I can't
say how applicable these changes are to Bayespam 0.9.2, but hopefully they
are. :)

	Grahame Bowland has this code to make Bayespam work with procmail:
Procmailrc file:
----------------
:0fw
| /usr/local/bin/bayes_spam_check.pl /home/<USER>/corpus.dat
<your email address>
:0:
* ^X-Spam-Status: Yes
spam
-----------------------
Change the end of bayes_spam_check.pl to:
-----------------------
my $probability_of_spam = $prod / ( $prod + $minus_one_prod );

# Insert diagnostic headers just before the blank line that ends the
# header block, then pass the rest of the message through unchanged.
my $in_header = 1;
foreach my $line (@email_message) {
  if ($in_header && $line =~ /^$/) {
    if ($probability_of_spam > $spam_minimum) {
      print "X-Spam-Status: Yes\n";
    } else {
      print "X-Spam-Status: No\n";
    }
    printf("X-Spam-Probability: %.3f\n", $probability_of_spam);
    $in_header = 0;
  }
  print $line;
}

# Always exit 0 so procmail does not treat the filter as failed
exit 0;
-----------------------

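	The excerpt above uses $prod and $minus_one_prod without showing where
they come from.  Here is a minimal sketch, assuming they are the product of
the per-token spam probabilities and the product of their complements, as in
"A Plan for Spam".  The sub name combined_spam_probability and the 0.5
fallback are assumptions made for illustration; bayes_spam_check.pl may do
this differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bayespam): combine per-token spam
# probabilities into a single score for the whole message.
sub combined_spam_probability {
    my @probs = @_;
    my ($prod, $minus_one_prod) = (1, 1);
    for my $p (@probs) {
        $prod           *= $p;        # chance that every token points to spam
        $minus_one_prod *= 1 - $p;    # chance that every token points to non-spam
    }
    my $denominator = $prod + $minus_one_prod;
    return 0.5 if $denominator == 0;  # assumed neutral fallback, avoids dividing by zero
    return $prod / $denominator;
}

# Two strongly spammy tokens outweigh one mildly innocent token.
printf "X-Spam-Probability: %.3f\n",
    combined_spam_probability(0.99, 0.99, 0.20);
```

	With no tokens at all, both products stay at 1 and the score comes out
as 0.5, i.e. no opinion either way.
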
	Several users report that using very small corpora, using someone else's
spam, or using a very large spam corpus can give bad results.  I think that
we'll find the best results are obtained with your own spam and email, and
when both corpora are approximately the same size.

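	One way to act on this advice is to compare the two directories before
building the rating file.  The following is only a sketch and not part of
Bayespam: the sub names, the minimum of 100 messages, and the size ratio of 3
are arbitrary assumptions chosen for illustration, and it assumes one message
per file directly inside each directory.

```perl
use strict;
use warnings;

# Hypothetical helper: count the regular files (one message each) in a directory.
sub count_messages {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my $count = grep { -f "$dir/$_" } readdir($dh);
    closedir($dh);
    return $count;
}

# Hypothetical helper: return a list of warnings about the two corpus sizes.
# $min_size and $max_ratio are illustrative thresholds, not tested values.
sub corpus_warnings {
    my ($ngood, $nspam, $min_size, $max_ratio) = @_;
    my @warnings;
    push @warnings, "good corpus is small ($ngood messages)" if $ngood < $min_size;
    push @warnings, "spam corpus is small ($nspam messages)" if $nspam < $min_size;
    my ($small, $large) = sort { $a <=> $b } ($ngood, $nspam);
    push @warnings, "corpora differ in size by more than a factor of $max_ratio"
        if $small == 0 || $large / $small > $max_ratio;
    return @warnings;
}

# Runs only when given two directories, e.g.: not_spam_mail spam_mail
if (@ARGV == 2) {
    my ($good_dir, $spam_dir) = @ARGV;
    my @problems = corpus_warnings(count_messages($good_dir),
                                   count_messages($spam_dir), 100, 3);
    print "Warning: $_\n" for @problems;
    print "Corpora look reasonably balanced.\n" unless @problems;
}
```
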
Comments, suggestions?
======================
	Please be sure to contact me if you have any problems or suggestions, or
if you just want to let me know how well the filter works for you.

Gary Arnold <garnold@garyarnold.com>

-g.