Received: from usw-sf-list2.yyyyyyyyyyyy.net (usw-sf-fw2.yyyyyyyyyyyy.net
    [216.136.171.252]) by dogma.slashnull.org (8.11.6/8.11.6) with ESMTP id
    g7HFlZ603002 for <zzzzzz-sa@zzzzzz.org>; Sat, 17 Aug 2002 16:47:35 +0100
Received: from usw-sf-list1-b.yyyyyyyyyyyy.net ([10.3.1.13]
    helo=usw-sf-list1.yyyyyyyyyyyy.net) by usw-sf-list2.yyyyyyyyyyyy.net with
    esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id 17g5m8-000654-00; Sat,
    17 Aug 2002 08:46:04 -0700
Received: from dogma.slashnull.org ([212.17.35.15]) by
    usw-sf-list1.yyyyyyyyyyyy.net with esmtp (Exim 3.31-VA-mm2 #1 (Debian)) id
    17g5lM-0005xL-00 for <SpamAssassin-talk@lists.yyyyyyyyyyyy.net>;
    Sat, 17 Aug 2002 08:45:16 -0700
Received: (from apache@localhost) by dogma.slashnull.org (8.11.6/8.11.6)
    id g7HFj8h02977; Sat, 17 Aug 2002 16:45:08 +0100
X-Authentication-Warning: dogma.slashnull.org: apache set sender to
    zzzzzz@zzzzzz.org using -f
Received: from 194.125.173.146 (SquirrelMail authenticated user zzzzzz) by
    zzzzzz.org with HTTP; Sat, 17 Aug 2002 16:45:08 +0100 (IST)
Message-Id: <33025.194.125.173.146.1029599108.squirrel@zzzzzz.org>
From: "Justin Mason" <zzzzzz@zzzzzz.org>
To: SpamAssassin-talk@lists.yyyyyyyyyyyy.net
X-Mailer: SquirrelMail (version 1.0.6)
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Subject: [SAtalk] spam-phrases existing algo
Sender: spamassassin-talk-admin@lists.yyyyyyyyyyyy.net
Errors-To: spamassassin-talk-admin@lists.yyyyyyyyyyyy.net
X-Beenthere: spamassassin-talk@lists.yyyyyyyyyyyy.net
X-Mailman-Version: 2.0.9-sf.net
Precedence: bulk
List-Help: <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=help>
List-Post: <mailto:spamassassin-talk@lists.yyyyyyyyyyyy.net>
List-Subscribe: <https://lists.yyyyyyyyyyyy.net/lists/listinfo/spamassassin-talk>,
    <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=subscribe>
List-Id: Talk about SpamAssassin
    <spamassassin-talk.lists.yyyyyyyyyyyy.net>
List-Unsubscribe: <https://lists.yyyyyyyyyyyy.net/lists/listinfo/spamassassin-talk>,
    <mailto:spamassassin-talk-request@lists.yyyyyyyyyyyy.net?subject=unsubscribe>
List-Archive: <http://www.geocrawler.com/redir-sf.php3?list=spamassassin-talk>
X-Original-Date: Sat, 17 Aug 2002 16:45:08 +0100 (IST)
Date: Sat, 17 Aug 2002 16:45:08 +0100 (IST)

BTW, I should note that the algorithm Paul Graham uses is
very close to what we've got in the spam-phrases code already.

To turn it into pseudocode:

  mass-check for spamphrases:

    - get mail body, strip HTML, attachments and mail formatting
    - strip stopwords ("to", "of", "a" etc.)
    - find pairs of 3-20 letter words
    - foreach pair:
      - skip pair if one word is in stoplist of common terms
      - ++ the frequency of that word-pair

  settle-phrases -- turn mass-check results into a spamphrases file

    - read all spam word-pairs, let NS = number of word-pairs
    - read all nonspam word-pairs, let NN = number of word-pairs
    - let bias = NS / NN (compensates for different corpus size)
    - foreach nonspam word-pair:
      - wpfreq = (freq in spam) - (freq in nonspam * bias)
    - foreach spam word-pair:
      - if (word-pair was not found in nonspam):
        - wpfreq *= 10
    - note the highest score of all rules

  scoring of an incoming message:

    - get mail body, strip HTML, attachments and mail formatting
    - strip stopwords ("to", "of", "a" etc.)
    - find pairs of 3-20 letter words
    - foreach pair:
      - score += ((wpfreq*10) / highest_score_of_all_rules)
    - foreach "!" found in text:
      - score++
    - return result as "spam phrase score".

So it's quite close to PG's algo, but he also tracks the non-spam
word-pairs -- which we don't do for SpamAssassin, because they
overfit to the mass-checker's nonspam mail corpus (generally
names of friends, etc.)

--j.
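
The three steps above can be sketched in Python roughly as follows. This is
only an illustration under stated assumptions, not the actual SpamAssassin
code: the word lists STOPWORDS and COMMON_TERMS are tiny placeholders, all
function names are made up, HTML stripping is a crude regex and
attachment/mail-format stripping is omitted, NS and NN are taken to be total
pair occurrences, a pair seen only in spam is assumed to start from its raw
spam frequency before the x10 boost, and pairs whose wpfreq is not positive
are dropped (since nonspam pairs are not tracked).

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; the real lists are not given in the post.
STOPWORDS = {"to", "of", "a", "the", "and", "in"}
COMMON_TERMS = {"please", "email", "message"}


def word_pairs(body):
    """Yield adjacent word pairs from a mail body (ASCII letters only)."""
    text = re.sub(r"<[^>]+>", " ", body).lower()   # crude HTML strip
    words = [w for w in re.findall(r"[a-z]+", text) if w not in STOPWORDS]
    for first, second in zip(words, words[1:]):
        if not (3 <= len(first) <= 20 and 3 <= len(second) <= 20):
            continue                               # only 3-20 letter words
        if first in COMMON_TERMS or second in COMMON_TERMS:
            continue                               # stoplist of common terms
        yield (first, second)


def mass_check(bodies):
    """Count word-pair frequencies over a corpus of mail bodies."""
    freq = Counter()
    for body in bodies:
        freq.update(word_pairs(body))
    return freq


def settle_phrases(spam_freq, nonspam_freq):
    """Turn the two frequency tables into (wpfreq, highest_score)."""
    ns = sum(spam_freq.values())                   # assumption: total occurrences
    nn = sum(nonspam_freq.values())
    bias = ns / nn if nn else 1.0                  # compensates for corpus size
    wpfreq = {}
    for pair, count in spam_freq.items():
        if pair in nonspam_freq:
            value = count - nonspam_freq[pair] * bias
        else:
            value = count * 10                     # assumption: boost raw spam freq
        if value > 0:                              # assumption: drop nonspam-leaning pairs
            wpfreq[pair] = value
    highest = max(wpfreq.values(), default=1.0)
    return wpfreq, highest


def spam_phrase_score(body, wpfreq, highest):
    """Score an incoming message: weighted pair hits plus one per '!'."""
    score = sum(wpfreq.get(pair, 0.0) * 10 / highest for pair in word_pairs(body))
    return score + body.count("!")


spam = mass_check(["Buy cheap pills now!", "cheap pills online"])
ham = mass_check(["Lunch meeting tomorrow", "cheap lunch tomorrow"])
wpfreq, highest = settle_phrases(spam, ham)
print(spam_phrase_score("cheap pills!", wpfreq, highest))  # prints 11.0
```

With this normalisation the strongest phrase always contributes 10 points
per hit, so the absolute corpus size drops out of the final score.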