| Name | Date | Size | #Lines | LOC |
| .. | 03-May-2022 | - | | |
| bayes-testing/ | 20-Dec-2021 | - | 2,351 | 1,598 |
| contrib/automasscheck-minimal/ | 20-Dec-2021 | - | 227 | 177 |
| corpora/ | 20-Dec-2021 | - | 1,334 | 966 |
| evolve_metarule/ | 20-Dec-2021 | - | 598 | 389 |
| graphs/ | 20-Dec-2021 | - | 31 | 20 |
| plugins/ | 20-Dec-2021 | - | 298 | 195 |
| rule-dev/ | 20-Dec-2021 | - | 1,752 | 1,055 |
| rule-qa/ | 20-Dec-2021 | - | 5,695 | 4,110 |
| rule-update-score-gen/ | 20-Dec-2021 | - | 1,066 | 638 |
| tenpass/ | 20-Dec-2021 | - | 383 | 264 |
| CORPUS_POLICY | 20-Dec-2021 | 1.9 KiB | 46 | 33 |
| CORPUS_SUBMIT | 20-Dec-2021 | 1.6 KiB | 33 | 26 |
| CORPUS_SUBMIT_NIGHTLY | 20-Dec-2021 | 921 B | 30 | 19 |
| Makefile | 20-Dec-2021 | 1.5 KiB | 57 | 35 |
| README | 20-Dec-2021 | 3.1 KiB | 101 | 60 |
| README.perceptron | 20-Dec-2021 | 6.3 KiB | 160 | 119 |
| compare-models | 20-Dec-2021 | 4.6 KiB | 166 | 102 |
| config | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set0 | 20-Dec-2021 | 60 B | 6 | 5 |
| config.set1 | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set2 | 20-Dec-2021 | 61 B | 6 | 5 |
| config.set3 | 20-Dec-2021 | 61 B | 6 | 5 |
| cpucount | 20-Dec-2021 | 1.4 KiB | 55 | 51 |
| enable-all-evolved-rules | 20-Dec-2021 | 1.9 KiB | 62 | 27 |
| extract-results | 20-Dec-2021 | 682 B | 32 | 18 |
| find-extremes | 20-Dec-2021 | 10.2 KiB | 355 | 281 |
| force-publish-active-rules | 20-Dec-2021 | 703 B | 33 | 23 |
| fp-fn-statistics | 20-Dec-2021 | 7.1 KiB | 241 | 127 |
| fp-fn-to-tcr | 20-Dec-2021 | 3.1 KiB | 89 | 48 |
| freqdiff | 20-Dec-2021 | 4.6 KiB | 199 | 155 |
| garescorer.c | 20-Dec-2021 | 36.8 KiB | 1,314 | 1,043 |
| generate-corpus | 20-Dec-2021 | 521 B | 12 | 10 |
| generate-translation | 20-Dec-2021 | 5.9 KiB | 218 | 160 |
| hit-frequencies | 20-Dec-2021 | 28.3 KiB | 1,036 | 672 |
| lint-rules-from-freqs | 20-Dec-2021 | 10.2 KiB | 339 | 247 |
| log-grep-recent | 20-Dec-2021 | 2.3 KiB | 89 | 51 |
| logdiff | 20-Dec-2021 | 1.2 KiB | 61 | 49 |
| logrulediff | 20-Dec-2021 | 1.4 KiB | 64 | 48 |
| logs-to-c | 20-Dec-2021 | 15.1 KiB | 595 | 407 |
| logs-to-corpus-report | 20-Dec-2021 | 6.5 KiB | 237 | 173 |
| mass-check | 20-Dec-2021 | 84.7 KiB | 2,679 | 1,869 |
| mass-check.cf | 20-Dec-2021 | 57 B | 4 | 3 |
| mboxget | 20-Dec-2021 | 4.6 KiB | 185 | 86 |
| mk-baseline-results | 20-Dec-2021 | 1.4 KiB | 55 | 37 |
| mk-roc-graphs | 20-Dec-2021 | 7.6 KiB | 283 | 184 |
| model-statistics | 20-Dec-2021 | 1.5 KiB | 64 | 40 |
| overlap | 20-Dec-2021 | 3.6 KiB | 131 | 63 |
| perceptron.c | 20-Dec-2021 | 12.9 KiB | 481 | 322 |
| perceptron.pod | 20-Dec-2021 | 790 B | 31 | 21 |
| post-ga-analysis.pl | 20-Dec-2021 | 2.4 KiB | 110 | 92 |
| remove-ids-from-mclog | 20-Dec-2021 | 1.7 KiB | 64 | 30 |
| rewrite-cf-with-new-scores | 20-Dec-2021 | 13.5 KiB | 522 | 326 |
| runGA | 20-Dec-2021 | 5.0 KiB | 184 | 108 |
| runPerceptron | 20-Dec-2021 | 2.9 KiB | 104 | 68 |
| score-ranges-from-freqs | 20-Dec-2021 | 4.5 KiB | 166 | 111 |
| uniq-scores | 20-Dec-2021 | 950 B | 29 | 9 |
| validate-model | 20-Dec-2021 | 3.2 KiB | 131 | 95 |
README

RESCORING SURVEY: HOW TO TAKE PART
----------------------------------

The tools in this directory are used to optimise the scoring system used for
incoming mails, using a genetic algorithm to search for optimal values.

Since this works best with a very large dataset, it would be *great* if you
(as a user) could run this and submit the results.

The analysis script will not include text from the mails themselves, so
it will not give away private details from your mail spool. The only
details you'll give away are your email address (and I promise *NEVER*
to give that out or use it for spammy stuff) -- and how many mails you
have sitting around in folders!


CONDITIONS
----------

1. First of all, you must be running it on a UNIX system; it's not portable
to other OSes yet. Also, it currently only reads UNIX mailbox format files
or MH spool directories.

2. This will not work unless you have separated the mail messages you'll be
analysing into separate "spam" and "non-spam" piles. It doesn't matter how
many mailboxes contain spam, or how many mailboxes contain non-spam; you just
need to be sure you know which set is which!

The latter point is the most important. If you have occasional spams
scattered through your mailboxes, or occasional non-spam messages in your
trapped-spam folder, the analysis will be useless.

See the CORPUS_POLICY file for more details.


HOW TO SUBMIT RESULTS BACK TO US
--------------------------------

See the file CORPUS_SUBMIT in this directory.


HOW IT WORKS
------------

If you're interested, here's a quick description of the other tools in
this directory and what they do:

mass-check :

  This script is used to perform "mass checks" of a set of mailboxes, Cyrus
  folders, and/or MH mail spools. It generates summary lines like this:

    Y 7 /home/jm/Mail/Spam/1382 SUBJ_ALL_CAPS,SUPERLONG_LINE,SUBJ_FULL_OF_8BITS

  or, for mailboxes:

    . 1 /path/to/mbox:<5.1.0.14.2.20011004073932.05f4fd28@localhost> TRACKER_ID,BALANCE_FOR_LONG

  listing whether the message is spam ("Y") or not ("."), its score, the
  path to the message or its message ID, and the tests that triggered on
  that mail.

  Using this info and a score optimization tool, I can figure out which tests
  get good hits with few false positives, etc., and re-score the tests to
  optimise the ratio.

  This script relies on the spamassassin distribution directory living in "..".


logs-to-c :

  Takes the "spam.log" and "nonspam.log" files and converts them into C
  source files and simplified data files for use by the C score optimization
  algorithm. (Called by "make" when you build the perceptron, so generally
  you won't need to run it yourself.)


hit-frequencies :

  Analyses the log files and computes how often each test hits: overall,
  for spam mails, and for non-spam.


mk-baseline-results :

  Computes results for the baseline scores (read from ../rules/*). If you
  provide the name of a config directory as the first argument, it'll use
  that instead.

  It will output statistics on the current ruleset to ../rules/STATISTICS.txt,
  suitable for a release build of SpamAssassin.


perceptron.c :

  Perceptron learner by Henry Stern. See "README.perceptron" for details.


-- EOF

README.perceptron

Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS Canada
B3H 1W5
henry@stern.ca

January 8, 2004

1. WHAT IS IT?

This program computes scores for SpamAssassin rules. It makes use of data
files generated by the suite of scripts in spamassassin/masses, and outputs
the generated scores in a file titled 'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA) implementation
in spamassassin/masses/craig_evolve.c is speed: while the GA requires several
hours to run on high-end machines, the perceptron requires only about 15
seconds of CPU time on an Athlon XP 1700+ system.

This makes incremental updates and score personalization practical for the
end-user, and gives developers a better idea of just how useful a new rule is.

2. OPTIONS

There are four options that can be passed to the perceptron program.

 -p ham_preference

	This increases the number of non-spam messages in the training
	set. It does this by adding 1 + (number of tests hit) *
	ham_preference instances of each non-spam message to the training
	set. This is intended to reduce false positives by encouraging
	the training program to look at the harder-to-classify ham
	messages more often. By default, this parameter is 2.0.

 -e num_epochs

	This parameter sets how many passes the perceptron will make
	through the training set before terminating. On each pass, the
	training set is shuffled and then iterated through. By default,
	it will make 15 passes.

 -l learning_rate

	This parameter modifies the learning rate of the perceptron. The
	error gradient is computed for each instance. This program uses a
	logsig activation function y = 1/(1+exp(-x)), so the error
	gradient is computed as E(x) = y*(1-y)*(is_spam - y). For
	each instance and score hit in the training set, the scores are
	modified by adding E(x) / (number of tests hit + 1) *
	learning_rate. The default value for this is 2.0, but it can be
	set to whatever you want.

 -w weight_decay

	To prevent the scores from getting too high (or even to force
	them down, if you want), the scores and network bias are
	multiplied by the weight decay before each epoch. This is off by
	default (a decay of 1.0), but it can be useful.

3. HOW DOES IT WORK?

This program implements the "Stochastic Gradient Descent" method of
training a neural network. It uses a single perceptron with a logsig
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network. It consists of a
transfer function and an activation function. Together, they simulate the
average firing rate of a biological neuron over time.

The transfer function is the sum of the products of weights and inputs.
It simulates the membrane potential of a neuron. When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon. This implementation uses a
linear transfer function:

[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin score system works. If you
set the bias to 0, w_i to the score for rule i, and x_i to whether or
not rule i is activated by a given message, the transfer function will
return the score of the message.

The activation function simulates the electrical spike travelling down the
axon. It takes the output of the transfer function as input and applies
some sort of transformation to it. This implementation uses a logsig
activation function:

[2] y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer function
to between 0 and 1. When plotted, it looks somewhat S-shaped, with
horizontal asymptotes: y approaches 0 as x approaches -infinity and 1 as
x approaches infinity. The slope of the function is greatest at x=0 and
tapers off as it approaches the asymptotes.

Lastly, the performance of the perceptron needs to be measured using an
error function. Two error functions are commonly used: mean squared
error and entropic error. By default, this implementation uses mean
squared error, but entropic error may be substituted by adding a compiler
directive.

The most common method of training neural networks is called gradient
descent. It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases. This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.

If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias. Without getting into calculus, the error gradient
for a perceptron with a linear transfer function, logsig activation
function and mean squared error function is:

[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))

The weights are updated using the function:

[4] w_i = w_i + E(x) * x_i * learning_rate

Since the SpamAssassin rule hits are sparse, the basic gradient descent
algorithm is impractical. This implementation uses a variation called
"stochastic gradient descent": instead of doing one batch update per
epoch, the training set is randomly walked through, doing incremental
updates. In addition, in this implementation, the learning rate is
modified by the number of rule hits for a given training instance.
Together, these allow good weights to be computed for
infrequently-occurring rules.

Once the perceptron has finished running, the weights are converted to
scores and exported using the familiar file format. Weights are converted
to scores using this function:

[5] score(weight) = -threshold * weight / bias

4. ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not getting
mad at me while I worked on this. I'd also like to thank Thomas
Trappenberg for his invaluable assistance while I was tweaking the
performance of the learning algorithm. I would like to thank Daniel
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
constructive criticism.

--
hs
8/1/2004


Updates
2007-07-01: http://old.nabble.com/Re%3A-Spam-research-tp11386525p11386525.html
2010-04-24: http://old.nabble.com/hardware-on-ruleqa-tp28352058p28353585.html
See also http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5376