| Name | Date | Size | #Lines | LOC |
| .. | 03-May-2022 | - | | |
| bayes-testing/ | 20-Dec-2021 | - | 2,351 | 1,598 |
| contrib/automasscheck-minimal/ | 20-Dec-2021 | - | 227 | 177 |
| corpora/ | 20-Dec-2021 | - | 1,334 | 966 |
| evolve_metarule/ | 20-Dec-2021 | - | 598 | 389 |
| graphs/ | 20-Dec-2021 | - | 31 | 20 |
| plugins/ | 20-Dec-2021 | - | 298 | 195 |
| rule-dev/ | 20-Dec-2021 | - | 1,752 | 1,055 |
| rule-qa/ | 20-Dec-2021 | - | 5,695 | 4,110 |
| rule-update-score-gen/ | 20-Dec-2021 | - | 1,066 | 638 |
| tenpass/ | 20-Dec-2021 | - | 383 | 264 |
| CORPUS_POLICY | 20-Dec-2021 | 1.9 KiB | 46 | 33 |
| CORPUS_SUBMIT | 20-Dec-2021 | 1.6 KiB | 33 | 26 |
| CORPUS_SUBMIT_NIGHTLY | 20-Dec-2021 | 921 B | 30 | 19 |
| Makefile | 20-Dec-2021 | 1.5 KiB | 57 | 35 |
| README | 20-Dec-2021 | 3.1 KiB | 101 | 60 |
| README.perceptron | 20-Dec-2021 | 6.3 KiB | 160 | 119 |
| compare-models | 20-Dec-2021 | 4.6 KiB | 166 | 102 |
| config | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set0 | 20-Dec-2021 | 60 B | 6 | 5 |
| config.set1 | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set2 | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set3 | 20-Dec-2021 | 61 B | 6 | 5 |
| cpucount | 20-Dec-2021 | 1.4 KiB | 55 | 51 |
| enable-all-evolved-rules | 20-Dec-2021 | 1.9 KiB | 62 | 27 |
| extract-results | 20-Dec-2021 | 682 B | 32 | 18 |
| find-extremes | 20-Dec-2021 | 10.2 KiB | 355 | 281 |
| force-publish-active-rules | 20-Dec-2021 | 703 B | 33 | 23 |
| fp-fn-statistics | 20-Dec-2021 | 7.1 KiB | 241 | 127 |
| fp-fn-to-tcr | 20-Dec-2021 | 3.1 KiB | 89 | 48 |
| freqdiff | 20-Dec-2021 | 4.6 KiB | 199 | 155 |
| garescorer.c | 20-Dec-2021 | 36.8 KiB | 1,314 | 1,043 |
| generate-corpus | 20-Dec-2021 | 521 B | 12 | 10 |
| generate-translation | 20-Dec-2021 | 5.9 KiB | 218 | 160 |
| hit-frequencies | 20-Dec-2021 | 28.3 KiB | 1,036 | 672 |
| lint-rules-from-freqs | 20-Dec-2021 | 10.2 KiB | 339 | 247 |
| log-grep-recent | 20-Dec-2021 | 2.3 KiB | 89 | 51 |
| logdiff | 20-Dec-2021 | 1.2 KiB | 61 | 49 |
| logrulediff | 20-Dec-2021 | 1.4 KiB | 64 | 48 |
| logs-to-c | 20-Dec-2021 | 15.1 KiB | 595 | 407 |
| logs-to-corpus-report | 20-Dec-2021 | 6.5 KiB | 237 | 173 |
| mass-check | 20-Dec-2021 | 84.7 KiB | 2,679 | 1,869 |
| mass-check.cf | 20-Dec-2021 | 57 B | 4 | 3 |
| mboxget | 20-Dec-2021 | 4.6 KiB | 185 | 86 |
| mk-baseline-results | 20-Dec-2021 | 1.4 KiB | 55 | 37 |
| mk-roc-graphs | 20-Dec-2021 | 7.6 KiB | 283 | 184 |
| model-statistics | 20-Dec-2021 | 1.5 KiB | 64 | 40 |
| overlap | 20-Dec-2021 | 3.6 KiB | 131 | 63 |
| perceptron.c | 20-Dec-2021 | 12.9 KiB | 481 | 322 |
| perceptron.pod | 20-Dec-2021 | 790 B | 31 | 21 |
| post-ga-analysis.pl | 20-Dec-2021 | 2.4 KiB | 110 | 92 |
| remove-ids-from-mclog | 20-Dec-2021 | 1.7 KiB | 64 | 30 |
| rewrite-cf-with-new-scores | 20-Dec-2021 | 13.5 KiB | 522 | 326 |
| runGA | 20-Dec-2021 | 5.0 KiB | 184 | 108 |
| runPerceptron | 20-Dec-2021 | 2.9 KiB | 104 | 68 |
| score-ranges-from-freqs | 20-Dec-2021 | 4.5 KiB | 166 | 111 |
| uniq-scores | 20-Dec-2021 | 950 B | 29 | 9 |
| validate-model | 20-Dec-2021 | 3.2 KiB | 131 | 95 |
README

RESCORING SURVEY: HOW TO TAKE PART
----------------------------------

The tools in this directory are used to optimise the scoring system used for
incoming mails, using a genetic algorithm to search for optimal values.

Since this works best with a very large dataset, it would be *great* if you
(as a user) could run this and submit the results.

The analysis script will not include text from the mails themselves, so
it will not give away private details from your mail spool. The only
details you'll give away are your email address (and I promise *NEVER*
to give that out or use it for spammy stuff) -- and how many mails you
have sitting around in folders!


CONDITIONS
----------

1. First of all, you must be running it on a UNIX system; it's not portable
to other OSes yet. Also, it currently only reads UNIX mailbox format files
or MH spool directories.

2. This will not work unless you have separated the mail messages you'll be
analysing into separate "spam" and "non-spam" piles. It doesn't matter how
many mailboxes contain spam, or how many mailboxes contain non-spam; you just
need to be sure you know which set is which!

The latter point is the most important. If you have occasional spams
scattered through your mailboxes, or occasional non-spam messages in your
trapped-spam folder, the analysis will be useless.

See the CORPUS_POLICY file for more details.


HOW TO SUBMIT RESULTS BACK TO US
--------------------------------

See the file CORPUS_SUBMIT in this directory.


HOW IT WORKS
------------

If you're interested, here's a quick description of the other tools in
this directory and what they do:

mass-check :

  This script is used to perform "mass checks" of a set of mailboxes, Cyrus
  folders, and/or MH mail spools. It generates summary lines like this:

    Y 7 /home/jm/Mail/Spam/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS

  or, for mailboxes:

    . 1 /path/to/mbox:<5.1.0.14.2.20011004073932.05f4fd28@localhost> TRACKER_ID,BALANCE_FOR_LONG

  listing whether the message is spam ("Y") or not ("."), its score, the
  path to the message or its message ID, and the tests that triggered on
  that mail.

  Using this info and a score optimization tool, I can figure out which tests
  get good hits with few false positives, etc., and re-score the tests to
  optimise the ratio.

  This script relies on the spamassassin distribution directory living in "..".


logs-to-c :

  Takes the "spam.log" and "nonspam.log" files and converts them into C
  source files and simplified data files for use by the C score optimization
  algorithm. (Called by "make" when you build the perceptron, so generally
  you won't need to run it yourself.)


hit-frequencies :

  Analyses the log files and computes how often each test hits: overall,
  for spam mails, and for non-spam.


mk-baseline-results :

  Computes results for the baseline scores (read from ../rules/*). If you
  provide the name of a config directory as the first argument, it'll use
  that instead.

  It will output statistics on the current ruleset to ../rules/STATISTICS.txt,
  suitable for a release build of SpamAssassin.


perceptron.c :

  Perceptron learner by Henry Stern. See "README.perceptron" for details.


-- EOF

README.perceptron

Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca

January 8, 2004

1. WHAT IS IT?

This program computes scores for SpamAssassin rules. It makes use of data
files generated by the suite of scripts in spamassassin/masses, and outputs
the generated scores in a file titled 'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA) implementation
in spamassassin/masses/craig_evolve.c is speed: while the GA requires several
hours to run on high-end machines, the perceptron requires only about 15
seconds of CPU time on an Athlon XP 1700+ system.

This makes incremental updates and score personalization practical for the
end-user, and gives developers a better idea of just how useful a new rule is.

2. OPTIONS

There are four options that can be passed to the perceptron program.

 -p ham_preference

	This increases the number of non-spam messages in the training
	set. It does this by adding 1 + (number of tests hit) *
	ham_preference instances of each non-spam message to the training
	set. This is intended to reduce false positives by encouraging
	the training program to look at the harder-to-classify ham
	messages more often. By default, this parameter is 2.0.

 -e num_epochs

	This parameter sets how many passes the perceptron will make
	through the training set before terminating. On each pass, the
	training set is shuffled and then iterated through. By default,
	it will make 15 passes.

 -l learning_rate

	This parameter modifies the learning rate of the perceptron. The
	error gradient is computed for each instance. This program uses a
	logsig activation function y = 1/(1+exp(-x)), so the error
	gradient is computed as E(x) = y*(1-y)*(is_spam - y). For
	each instance and score hit in the training set, the scores are
	modified by adding E(x) / (number of tests hit + 1) *
	learning_rate. The default value for this is 2.0, but it can be
	set to whatever you want.

 -w weight_decay

	To prevent the scores from getting too high (or even to force
	them down, if you want), the scores and network bias are
	multiplied by the weight decay before each epoch. This is off by
	default (a decay of 1.0), but it can be useful.

3. HOW DOES IT WORK?

This program implements the "Stochastic Gradient Descent" method of
training a neural network. It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network. It consists of a
transfer function and an activation function. Together, they simulate the
average firing rate of a biological neuron over time.

The transfer function is the sum of the products of weights and inputs.
It simulates the membrane potential of a neuron. When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon. This implementation uses a
linear transfer function:

[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin score system works. If you
set the bias to 0, w_i to the score for rule i, and x_i to whether or
not rule i is activated by a given message, the transfer function will
return the score of the message.

The activation function simulates the electrical spike travelling down the
axon. It takes the output of the transfer function as input and applies
some sort of transformation to it. This implementation uses a logsig
activation function:

[2] y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer function
to between 0 and 1. When plotted, it looks somewhat S-shaped, with
horizontal asymptotes: y approaches 0 as x approaches -infinity and 1 as
x approaches infinity. The slope of the function is greatest at x=0 and
tapers off as it approaches the asymptotes.

Lastly, the performance of the perceptron needs to be measured using an
error function. Two error functions are commonly used: mean squared
error and entropic error. By default, this implementation uses mean
squared error, but entropic error may be substituted by adding a compiler
directive.

The most common method of training neural networks is called gradient
descent. It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases. This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.

If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias. Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:

[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))

The weights are updated using the function:

[4] w_i = w_i + E(x) * x_i * learning_rate

Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical. This implementation uses a variation called
"stochastic gradient descent": instead of doing one batch update per
epoch, the training set is randomly walked through, doing incremental
updates. In addition, in this implementation, the learning rate is
modified by the number of rule hits for a given training instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format. Weights are converted
to scores using this function:

[5] score(weight) = -threshold * weight / bias

4. ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not getting
mad at me while I worked on this. I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm. I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.

--
hs
8/1/2004


Updates
2007-07-01: http://old.nabble.com/Re%3A-Spam-research-tp11386525p11386525.html
2010-04-24: http://old.nabble.com/hardware-on-ruleqa-tp28352058p28353585.html
See also http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376