Name                            Date         Size      #Lines    LOC

bayes-testing/                  20-Dec-2021  -          2,351  1,598
contrib/automasscheck-minimal/  20-Dec-2021  -            227    177
corpora/                        20-Dec-2021  -          1,334    966
evolve_metarule/                20-Dec-2021  -            598    389
graphs/                         20-Dec-2021  -             31     20
plugins/                        20-Dec-2021  -            298    195
rule-dev/                       20-Dec-2021  -          1,752  1,055
rule-qa/                        20-Dec-2021  -          5,695  4,110
rule-update-score-gen/          20-Dec-2021  -          1,066    638
tenpass/                        20-Dec-2021  -            383    264
CORPUS_POLICY                   20-Dec-2021  1.9 KiB       46     33
CORPUS_SUBMIT                   20-Dec-2021  1.6 KiB       33     26
CORPUS_SUBMIT_NIGHTLY           20-Dec-2021  921           30     19
Makefile                        20-Dec-2021  1.5 KiB       57     35
README                          20-Dec-2021  3.1 KiB      101     60
README.perceptron               20-Dec-2021  6.3 KiB      160    119
compare-models                  20-Dec-2021  4.6 KiB      166    102
config                          20-Dec-2021  61             6      5
config.set0                     20-Dec-2021  60             6      5
config.set1                     20-Dec-2021  61             6      5
config.set2                     20-Dec-2021  61             6      5
config.set3                     20-Dec-2021  61             6      5
cpucount                        20-Dec-2021  1.4 KiB       55     51
enable-all-evolved-rules        20-Dec-2021  1.9 KiB       62     27
extract-results                 20-Dec-2021  682           32     18
find-extremes                   20-Dec-2021  10.2 KiB     355    281
force-publish-active-rules      20-Dec-2021  703           33     23
fp-fn-statistics                20-Dec-2021  7.1 KiB      241    127
fp-fn-to-tcr                    20-Dec-2021  3.1 KiB       89     48
freqdiff                        20-Dec-2021  4.6 KiB      199    155
garescorer.c                    20-Dec-2021  36.8 KiB   1,314  1,043
generate-corpus                 20-Dec-2021  521           12     10
generate-translation            20-Dec-2021  5.9 KiB      218    160
hit-frequencies                 20-Dec-2021  28.3 KiB   1,036    672
lint-rules-from-freqs           20-Dec-2021  10.2 KiB     339    247
log-grep-recent                 20-Dec-2021  2.3 KiB       89     51
logdiff                         20-Dec-2021  1.2 KiB       61     49
logrulediff                     20-Dec-2021  1.4 KiB       64     48
logs-to-c                       20-Dec-2021  15.1 KiB     595    407
logs-to-corpus-report           20-Dec-2021  6.5 KiB      237    173
mass-check                      20-Dec-2021  84.7 KiB   2,679  1,869
mass-check.cf                   20-Dec-2021  57             4      3
mboxget                         20-Dec-2021  4.6 KiB      185     86
mk-baseline-results             20-Dec-2021  1.4 KiB       55     37
mk-roc-graphs                   20-Dec-2021  7.6 KiB      283    184
model-statistics                20-Dec-2021  1.5 KiB       64     40
overlap                         20-Dec-2021  3.6 KiB      131     63
perceptron.c                    20-Dec-2021  12.9 KiB     481    322
perceptron.pod                  20-Dec-2021  790           31     21
post-ga-analysis.pl             20-Dec-2021  2.4 KiB      110     92
remove-ids-from-mclog           20-Dec-2021  1.7 KiB       64     30
rewrite-cf-with-new-scores      20-Dec-2021  13.5 KiB     522    326
runGA                           20-Dec-2021  5 KiB        184    108
runPerceptron                   20-Dec-2021  2.9 KiB      104     68
score-ranges-from-freqs         20-Dec-2021  4.5 KiB      166    111
uniq-scores                     20-Dec-2021  950           29      9
validate-model                  20-Dec-2021  3.2 KiB      131     95
README

RESCORING SURVEY: HOW TO TAKE PART
----------------------------------

The tools in this directory are used to optimise the scoring system used for
incoming mails, using a genetic algorithm to search for optimal values.

Since this works best with a very large dataset, it would be *great* if you
(as a user) could run this and submit the results.

The analysis script will not include text from the mails themselves, so
it will not give away private details from your mail spool.  The only
details you'll give away will be your email address (and I promise *NEVER*
to give that out or use it for spammy stuff) -- and how many mails you
have sitting around in folders!


CONDITIONS
----------

1. First of all, you must be running it on a UNIX system; it's not portable to
other OSes yet.  Also, it currently only reads UNIX mailbox format files or MH
spool directories.

2. This will not work unless you have separated the mail messages you'll be
analysing into separate "spam" and "non-spam" piles.  It doesn't matter how
many mailboxes contain spam, or how many mailboxes contain non-spam; you just
need to be sure you know which set is which!

The latter point is the most important.  If you have occasional spams scattered
through your mailboxes, or occasional non-spam messages in your trapped spam
folder, the analysis will be useless.

See the CORPUS_POLICY file for more details.


HOW TO SUBMIT RESULTS BACK TO US
--------------------------------

See the file CORPUS_SUBMIT in this directory.


HOW IT WORKS
------------

If you're interested, here's a quick description of the other tools in this
directory and what they do:

mass-check :

  This script is used to perform "mass checks" of a set of mailboxes, Cyrus
  folders, and/or MH mail spools.  It generates summary lines like this:

  Y  7 /home/jm/Mail/Sapm/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS

  or, for mailboxes:

  .  1 /path/to/mbox:<5.1.0.14.2.20011004073932.05f4fd28@localhost> TRACKER_ID,BALANCE_FOR_LONG

  listing the path to the message (or its message ID), its score, and the
  tests that triggered on that mail.

  Using this info, and a score optimization tool, I can figure out which tests
  get good hits with few false positives, etc., and re-score the tests to
  optimise the ratio.

  This script relies on the SpamAssassin distribution directory living in "..".

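These summary lines are easy to post-process.  As a rough sketch (illustrative
Python only -- the real tools here are Perl), a parser for the format above
might look like:

```python
# Parse a mass-check summary line:
#   "<Y|.> <score> <path-or-mbox:msgid> <TEST1,TEST2,...>"
# Hypothetical helper for illustration; not part of the masses/ tools.

def parse_mass_check_line(line):
    flag, score, source, tests = line.split(None, 3)
    return {
        "is_spam": flag == "Y",     # "Y" marks spam, "." marks non-spam
        "score": int(score),        # total score for the message
        "source": source,           # message path, or mbox path + message ID
        "tests": tests.split(","),  # names of the rules that hit
    }

rec = parse_mass_check_line(
    "Y  7 /home/jm/Mail/Sapm/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS"
)
```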

logs-to-c :

  Takes the "spam.log" and "nonspam.log" files and converts them into C
  source files and simplified data files for use by the C score optimization
  algorithm.  (Called by "make" when you build the perceptron, so generally
  you won't need to run it yourself.)


hit-frequencies :

  Analyses the log files and computes how often each test hits, overall,
  for spam mails and for non-spam.

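  In spirit, this boils down to a simple tally over the mass-check logs.  A
  minimal sketch (hypothetical Python; the real hit-frequencies is a much more
  featureful Perl script):

```python
from collections import Counter

def hit_frequencies(log_lines):
    """Per-test hit counts: overall, in spam, and in non-spam."""
    overall, spam, ham = Counter(), Counter(), Counter()
    for line in log_lines:
        flag, _score, _source, tests = line.split(None, 3)
        bucket = spam if flag == "Y" else ham  # "Y" = spam, "." = ham
        for test in tests.split(","):
            overall[test] += 1
            bucket[test] += 1
    return overall, spam, ham

overall, spam, ham = hit_frequencies([
    "Y 5 spam/1 TEST_A,TEST_B",
    ". 0 ham/1 TEST_A",
])
```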

mk-baseline-results :

  Computes results for the baseline scores (read from ../rules/*).  If you
  provide the name of a config directory as the first argument, it'll use that
  instead.

  It will output statistics on the current ruleset to ../rules/STATISTICS.txt,
  suitable for a release build of SpamAssassin.


perceptron.c :

  Perceptron learner by Henry Stern.  See "README.perceptron" for details.


-- EOF

README.perceptron

Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS  Canada
B3H 1W5
henry@stern.ca

January 8, 2004

1.  WHAT IS IT?

This program is used to compute scores for SpamAssassin rules.  It makes
use of data files generated by the suite of scripts in
spamassassin/masses.  The program outputs the generated scores in a file
titled 'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is that while the GA
requires several hours to run on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.

This makes incremental updates and score personalization practical for the
end-user, and gives developers a better idea of just how useful a new rule is.

2.  OPTIONS

There are four options that can be passed to the perceptron program.

  -p ham_preference

        This increases the number of non-spam messages in the training
        set.  It does this by adding 1 + (number of tests hit) *
        ham_preference instances of non-spam messages to the training set.
        This is intended to reduce false positives by encouraging the
        training program to look at the harder-to-classify ham messages
        more often.  By default, this parameter is 2.0.

  -e num_epochs

        This parameter sets how many passes the perceptron will make
        through the training set before terminating.  On each pass, the
        training set is shuffled and then iterated through.  By default,
        it will make 15 passes.

  -l learning_rate

        This parameter modifies the learning rate of the perceptron.  The
        error gradient is computed for each instance.  This program uses a
        logsig activation function y = (1/(1+exp(-x))), so the error
        gradient is computed as E(x) = y*(1-y)*(is_spam - y).  For
        each instance and score hit in the training set, the scores are
        modified by adding E(x) / (number of tests hit + 1) *
        learning_rate.  The default value for this is 2.0, but it can be
        whatever you want.

  -w weight_decay

        To prevent the scores from getting too high (or even to force
        them down if you want), before each epoch the scores and network
        bias are multiplied by the weight decay.  This is off by default
        (1.0), but it can be useful.

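Taking the -p description above literally, the number of copies of a given
ham message that end up in the training set can be sketched as (hypothetical
Python; perceptron.c does this in C):

```python
def ham_copies(num_tests_hit, ham_preference=2.0):
    # Per the -p description: 1 + (number of tests hit) * ham_preference
    # copies of each non-spam message are added to the training set.
    return int(1 + num_tests_hit * ham_preference)
```

So with the default -p 2.0, a ham message that hits 3 rules is over-sampled
7-fold, pushing the learner to get the hard ham right.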
3.  HOW DOES IT WORK?

This program implements the "stochastic gradient descent" method of
training a neural network.  It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network.  It consists of a
transfer function and an activation function.  Together, they simulate the
average firing rate of a biological neuron over time.

The transfer function is the sum of the products of weights and inputs.
It simulates the membrane potential of a neuron.  When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon.  This implementation uses a
linear transfer function:

[1]     f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin score system works.  If you
set the bias to 0, w_i to be the score for rule i, and x_i to be whether
or not rule i is activated by a given message, the transfer function will
return the score of the message.

The activation function simulates the electrical spike travelling down the
axon.  It takes the output of the transfer function as input and applies
some sort of transformation to it.  This implementation uses a logsig
activation function:

[2]     y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer function
to between 0 and 1.  When plotted, it looks somewhat S-shaped, with
horizontal asymptotes at y=0 as x approaches -infinity and y=1 as x
approaches infinity.  The slope of the function is greatest at x=0 and
tapers off as it approaches the asymptotes.
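
Equations [1] and [2] are short enough to sketch directly (illustrative
Python; the actual implementation is perceptron.c):

```python
import math

def transfer(bias, weights, inputs):
    # [1] f(x) = bias + sum_{i=1}^{n} w_i * x_i
    # With bias = 0, this is exactly the SpamAssassin score of a message.
    return bias + sum(w * x for w, x in zip(weights, inputs))

def logsig(f):
    # [2] y = 1 / (1 + exp(-f)), squashing any score into (0, 1)
    return 1.0 / (1.0 + math.exp(-f))

y = logsig(transfer(0.0, [2.5, 1.0], [1, 0]))  # one rule hit, scored 2.5
```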

Lastly, the performance of the perceptron needs to be measured using an
error function.  Two error functions are commonly used: mean squared
error and entropic error.  By default, this implementation uses mean
squared error, but entropic error may be substituted by adding a compiler
directive.

The most common method of training neural networks is called gradient
descent.  It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases.  This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.

If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias.  Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:

[3]     E(x) = y(x) * (1 - y(x)) * (y_expected - y(x))

The weights are updated using the function:

[4]     w_i = w_i + E(x) * x_i * learning_rate

Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical.  This implementation uses a variation called
"stochastic gradient descent": instead of doing one batch update per
epoch, the training set is randomly walked through, doing incremental
updates.  In addition, in this implementation, the learning rate is
modified by the number of rule hits for a given training instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format.  Weights are converted
to scores using this function:

[5]     score(weight) = -threshold * weight / bias

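Put together, one epoch of stochastic updates ([3] and [4]) plus the final
conversion [5] might look like this sketch (hypothetical Python; the helper
names and the instance format are illustrative, not taken from perceptron.c):

```python
import math
import random

def logsig(f):
    return 1.0 / (1.0 + math.exp(-f))

def train_epoch(weights, bias, instances, learning_rate=2.0):
    # instances: list of (is_spam, hit_rule_indices); rule hits are sparse,
    # so each instance only touches the weights of the rules that fired.
    random.shuffle(instances)               # stochastic: random walk order
    for is_spam, hits in instances:
        y = logsig(bias + sum(weights[i] for i in hits))
        grad = y * (1.0 - y) * (float(is_spam) - y)    # [3]
        # [4], with the per-instance scaling described under -l:
        step = grad * learning_rate / (len(hits) + 1)
        for i in hits:
            weights[i] += step
        bias += step
    return weights, bias

def weight_to_score(weight, bias, threshold=5.0):
    # [5] score(weight) = -threshold * weight / bias
    return -threshold * weight / bias

weights, bias = train_epoch([0.0], 0.0, [(True, [0])])
```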
4.  ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not getting
mad at me while I worked on this.  I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm.  I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.

--
hs
8/1/2004


Updates
2007-07-01: http://old.nabble.com/Re%3A-Spam-research-tp11386525p11386525.html
2010-04-24: http://old.nabble.com/hardware-on-ruleqa-tp28352058p28353585.html
See also http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376