Received: from usw-sf-list2.yyyyyyyyyyyy.net (usw-sf-fw2.yyyyyyyyyyyy.net
     [216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id
     g7HFlZ603002 for <zzzzzz-sa@zzzzzz.org>; Sat, 17 Aug 2002 16:47:35 +0100
Received: from usw-sf-list1-b.yyyyyyyyyyyy.net ([10.3.1.13]
     helo=usw-sf-list1.yyyyyyyyyyyy.net) by usw-sf-list2.yyyyyyyyyyyy.net with
     esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17g5m8-000654-00; Sat,
     17 Aug 2002 08:46:04 -0700
Received: from dogma.slashnull.org ([212.17.35.15]) by
     usw-sf-list1.yyyyyyyyyyyy.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id
     17g5lM-0005xL-00 for <SpamAssassin-talk@lists.yyyyyyyyyyyy.net>;
     Sat, 17 Aug 2002 08:45:16 -0700
Received: (from apache@localhost) by dogma.slashnull.org (8.11.6/8.11.6)
     id g7HFj8h02977; Sat, 17 Aug 2002 16:45:08 +0100
X-Authentication-Warning: dogma.slashnull.org: apache set sender to
     zzzzzz@zzzzzz.org using -f
Received: from 194.125.173.146 (SquirrelMail authenticated user zzzzzz) by
     zzzzzz.org with HTTP; Sat, 17 Aug 2002 16:45:08 +0100 (IST)
Message-Id: <33025.194.125.173.146.1029599108.squirrel@zzzzzz.org>
From: "Justin Mason" <zzzzzz@zzzzzz.org>
To: SpamAssassin-talk@lists.yyyyyyyyyyyy.net
X-Mailer: SquirrelMail (version 1.0.6)
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Subject: [SAtalk] spam-phrases existing algo
Sender: spamassassin-talk-admin@lists.yyyyyyyyyyyy.net
Errors-To: spamassassin-talk-admin@lists.yyyyyyyyyyyy.net
X-Beenthere: spamassassin-talk@lists.yyyyyyyyyyyy.net
X-Mailman-Version: 2.0.9-sf.net
Precedence: bulk
List-Help: <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=help>
List-Post: <mailto:spamassassin-talk@lists.yyyyyyyyyyyy.net>
List-Subscribe: <https://lists.yyyyyyyyyyyy.net/lists/listinfo/spamassassin-talk>,
     <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=subscribe>
List-Id: Talk about SpamAssassin <spamassassin-talk.lists.yyyyyyyyyyyy.net>
List-Unsubscribe: <https://lists.yyyyyyyyyyyy.net/lists/listinfo/spamassassin-talk>,
     <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=unsubscribe>
List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-talk>
X-Original-Date: Sat, 17 Aug 2002 16:45:08 +0100 (IST)
Date: Sat, 17 Aug 2002 16:45:08 +0100 (IST)

BTW, I should note that the algorithm Paul Graham uses is
very close to what we've got in the spam-phrases code already.

To turn it into pcode:

  mass-check for spamphrases:

    - get mail body, strip HTML, attachments and mail formatting
    - strip stopwords ("to", "of", "a" etc.)
    - find pairs of 3-20 letter words
    - foreach pair:
      - skip pair if either word is in the stoplist of common terms
      - ++ the frequency of that word-pair
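A rough Python sketch of the counting step above, not the actual mass-check code. It assumes the body has already been stripped of HTML, attachments and mail formatting, and that a "pair" means two adjacent words once stopwords are removed. STOPWORDS, COMMON_TERMS, word_pairs and count_pairs are made-up names and lists for illustration only.

```python
import re
from collections import Counter

# Hypothetical lists: the post only gives "to", "of", "a" as examples.
STOPWORDS = {"to", "of", "a", "an", "in", "is"}
COMMON_TERMS = {"the", "and", "you", "for"}

WORD_RE = re.compile(r"[a-z]+")


def word_pairs(body):
    """Yield adjacent pairs of 3-20 letter words from already-cleaned text."""
    words = [w for w in WORD_RE.findall(body.lower()) if w not in STOPWORDS]
    for first, second in zip(words, words[1:]):
        # Assumption: both words of a pair must be 3-20 letters long.
        if not (3 <= len(first) <= 20 and 3 <= len(second) <= 20):
            continue
        # Skip the pair if either word is a common term.
        if first in COMMON_TERMS or second in COMMON_TERMS:
            continue
        yield (first, second)


def count_pairs(bodies):
    """Count word-pair frequencies over a corpus of message bodies."""
    counts = Counter()
    for body in bodies:
        counts.update(word_pairs(body))
    return counts
```

Running count_pairs once over the spam corpus and once over the nonspam corpus would give the two frequency tables that the next step reads.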

  settle-phrases -- turn mass-check results into a spamphrases file

    - read all spam word-pairs, let NS = number of word-pairs
    - read all nonspam word-pairs, let NN = number of word-pairs
    - let bias = NS / NN (compensates for different corpus size)
    - foreach nonspam word-pair:
      - wpfreq = (freq in spam) - (freq in nonspam * bias)
    - foreach spam word-pair:
      - if (wordpair was not found in nonspam):
        - wpfreq *= 10
    - note the highest wpfreq of all word-pairs (highest_score_of_all_rules)
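Here is how the settle step might look in Python, as a sketch only. It takes NS and NN to be total pair occurrences (the pcode does not say whether it means that or distinct pairs), and it assumes a spam-only pair starts from its raw spam frequency before the x10 boost. settle_phrases and its arguments are hypothetical names.

```python
def settle_phrases(spam_counts, nonspam_counts):
    """Turn two {pair: frequency} tables into ({pair: wpfreq}, highest)."""
    ns = sum(spam_counts.values())      # NS (assumed: total occurrences)
    nn = sum(nonspam_counts.values())   # NN (assumed: total occurrences)
    if nn == 0:
        raise ValueError("nonspam corpus is empty, cannot compute bias")
    bias = ns / nn  # compensates for different corpus sizes

    wpfreq = {}
    # Pairs seen in nonspam: spam frequency minus the scaled nonspam frequency.
    for pair, nonspam_freq in nonspam_counts.items():
        wpfreq[pair] = spam_counts.get(pair, 0) - nonspam_freq * bias
    # Pairs seen only in spam get a x10 boost (assumed to start from the
    # raw spam frequency, which the pcode leaves implicit).
    for pair, spam_freq in spam_counts.items():
        if pair not in nonspam_counts:
            wpfreq[pair] = spam_freq * 10

    highest = max(wpfreq.values(), default=0)
    return wpfreq, highest
```

Note that pairs seen only in nonspam end up with a negative wpfreq here, since the sketch follows the pcode literally.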

  scoring of an incoming message:

    - get mail body, strip HTML, attachments and mail formatting
    - strip stopwords ("to", "of", "a" etc.)
    - find pairs of 3-20 letter words
    - foreach pair:
      - score += ((wpfreq*10) / highest_score_of_all_rules)
    - foreach "!" found in text:
      - score++
    - return result as "spam phrase score".
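The scoring loop could be sketched in Python roughly like this. It assumes the word-pairs have already been extracted from the cleaned body, that pairs missing from the phrase table contribute nothing, and that the highest score is positive. spam_phrase_score and its parameters are hypothetical names, not the real SpamAssassin API.

```python
def spam_phrase_score(pairs, text, wpfreq, highest):
    """Return the "spam phrase score" for one message.

    pairs   -- word-pairs found in the cleaned message body
    text    -- the cleaned message text (used for counting "!")
    wpfreq  -- {pair: wpfreq} table from the settle step
    highest -- highest wpfreq in that table (assumed > 0)
    """
    score = 0.0
    for pair in pairs:
        # Assumption: unknown pairs add 0 to the score.
        score += (wpfreq.get(pair, 0.0) * 10) / highest
    # Each "!" in the text adds one point.
    score += text.count("!")
    return score
```

With this normalisation the strongest phrase in the table adds 10 points per occurrence and weaker phrases add proportionally less.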

So it's quite close to PG's algo, but he also tracks the non-spam
word-pairs -- which we don't do for SpamAssassin, because they
overfit to the mass-checker's nonspam mail corpus (generally
names of friends, etc.)

--j.