[14/Jan/2007] Version 2.0.4
o Changes to osbf module
  - Removed unnecessary linking of liblua.a, which caused segfaults on
    IRIX 6.5.30. This fix also reduced the size of the module by a
    factor of 5 or more. Problem detected and fixed by Holger Weiss.
  - Fixed the number of args returned by osbf.classify in case of error.

o Changes to spamfilter.lua - version 2.0.3
  - Added --help option;
  - Extended syntax to read from file passed as arg in command line.
    If no file is given it uses standard input, as usual;
  - Better error handling;
  - Fixed optind in getopt.lua.

o Fixed a date parsing error in cache_report.lua, caused mainly by
  ill-formed date fields in spam messages;

o The scripts classify.lua and train.lua were renamed to classify.sample
  and train.sample, because they are meant more as samples, starting
  points for customized scripts, than for real use. spamfilter.lua should
  be used for real classifications and trainings.

o Added the file COPYRIGHT_AGREEMENT which states the dual-license
  agreement between Fidelis Assis and William Yerazunis.

[17/Nov/2006] Version 2.0.3
o When a SFID is not found in the cache it's now added to the report
  message, for reference purposes;
o New config option, osbf.cfg_mail_cmd, to specify the mail command used by
  the spamfilter;
o Fixes and improvements to cache_report.lua;
o Minor fixes and improvements to spamfilter.lua and spamfilter_commands.lua;
o More flexible config.

[15/Oct/2006] Version 2.0.2
o Added a new script, cache_report.lua. It sends an email with an HTML form
  that makes training really easy. The form is an HTML table with Date, From,
  Subject and a drop down menu with the possible actions: Train as spam,
  Train as non-spam, Add 'From:' to whitelist, etc. This training mechanism
  requires the new OSBF module v2.0.2 and that the email client supports HTML
  messages with "mailto" form action.
  It works fine with Mozilla Thunderbird and Microsoft Outlook but was not
  tested with other clients. This script is typically launched from a cron
  job. Read the text at the top of the script to learn how to use it.

o Changes to osbf module
  - Added the function osbf.dir, a directory iterator presented in the PIL
    book, to support the new training mechanism mentioned above;
  - Replaced the call to luaL_openlib with the new luaL_register;
  - osbf.create_db and osbf.remove_db now check if the first arg is a table
    and osbf.create_db returns an error if the file already exists;
  - osbf.classify now returns an additional value which is a Lua table with
    the number of trainings for each class. See the manual for details;
  - Added an optional second argument to osbf.stats to specify full
    (default) or fast statistics;
  - Fixes to white and blacklist handling;
  - Added PREFIX to makefile config, for easier local installation - patch
    sent by Christian Siefkes.

o Changes to spamfilter.lua - version 2.0
  - New subject-line command: batch_train <pwd>. This command allows training
    in batch, that is, many sfids can be sent in the body of the message,
    along with the right class. Ex:

      sfid-+20060924-215225-+005.65-1@spamfilter.osbf.lua=spam
      sfid-+20060924-215238-+001.53-1@spamfilter.osbf.lua=nonspam
      ...

    It can be used manually but its main purpose is to allow the new
    semi-automated batch training mechanism used by cache_report.lua.
  - New subject-line command train_form <password>. This command executes the
    script cache_report.lua which sends a mail with a training form to the
    user.
  - Improved handling of white and blacklists.

o Minor fixes to the docs.
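
Note on the batch_train body format listed above: each line pairs a sfid
with a class, separated by "=". The sketch below shows one way such a body
could be split into (sfid, class) pairs. It is an illustration only:
parse_batch is a hypothetical helper, not a function of spamfilter.lua, and
it assumes that "spam" and "nonspam" are the only class names, as in the
example above.

```lua
-- Hypothetical helper (not part of spamfilter.lua): split a batch_train
-- message body into a list of { sfid = ..., class = ... } entries.
-- Assumption: valid lines look like "<sfid>=spam" or "<sfid>=nonspam".
local function parse_batch(body)
  local entries = {}
  for line in body:gmatch("[^\r\n]+") do
    -- The sfid is everything up to "=", without spaces; the class follows.
    local sfid, class = line:match("^%s*(sfid%-[^=%s]+)=(%a+)%s*$")
    if sfid and (class == "spam" or class == "nonspam") then
      entries[#entries + 1] = { sfid = sfid, class = class }
    end
  end
  return entries
end

local body = [[
sfid-+20060924-215225-+005.65-1@spamfilter.osbf.lua=spam
sfid-+20060924-215238-+001.53-1@spamfilter.osbf.lua=nonspam
this line is not a training instruction and is skipped
]]

for _, e in ipairs(parse_batch(body)) do
  print(e.sfid, e.class)
end
```

Lines that do not match the pattern are ignored rather than reported, which
keeps the sketch short; a real script would probably want to report them.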

[02/Sep/2006] Version 2.0.1
o This version incorporates all changes in version 2.0;
o Changes to the osbf module:
  - Changed the function osbf.import to read from a .cfc file instead of from
    a .csv one;
o Improvements and fixes to spamfilter.lua:
  - Changed the tags [s] and [h], in the X-OSBF-Lua-Score header to [-] and
    [+], respectively, because some email client filters are case insensitive
    and can't distinguish between [s] and [S]. This is useful for those who
    prefer not to tag subject lines and filter using the information in the
    X-OSBF-Lua-Score header;
  - Fixed a bug that caused messages with score below
    osbf.cfg_remove_body_threshold to have their bodies removed even when
    whitelisted;
  - Added a new command-line option, --source = <message_source>, to specify
    the source of the message to be used for training. The possible values
    for <message_source> are:
      + stdin - the message is read directly from stdin. This is the default.
      + sfid  - the message is recovered from the cache, using the sfid
                present in the header of the message read from stdin. The
                message read from stdin must have been classified
                previously, in order to have a sfid in the header.
      + body  - the message to be trained with is the body of the message
                read from stdin.
    These options are valid only in conjunction with one of the commands
    --learn or --unlearn.
  - Added a new command-line option, --output, to determine what is written
    to stdout after training a message (suggested by Steve Pellegrin):
      --output=report  => a report message is sent to stdout. This is the
                          default action.
      --output=message => the original message, classified as spam or ham,
                          according to the training command, is written
                          to stdout.
  - New config option, osbf.cfg_insert_sfid_in, to define where the sfid
    will be inserted;
  - Now, trained messages have their cached name changed to reflect the new
    state: learned as spam or learned as ham. The changed names can be used
    for automatic retraining or for rebuilding the databases. The change in
    name also prevents training a message more than once or unlearning a
    message that was not learned before;
  - Fixed a bug in the error handling of invalid command-line options;
  - There's a new config option, osbf.cfg_insert_sfid_in, to determine where
    the SFID will be inserted when an incoming message is classified:
    "references", "message-id" or "both". The default is now to insert in
    both, References and Message-ID headers, because some email clients
    don't follow RFC2822 strictly and reinsert only one of them in a reply;
  - Old SFIDs are now removed when an incoming message is classified, right
    before the new one is inserted.
o Updated the training method in toer.lua to the same one introduced in
  spamfilter.lua version 2.0. As of this version, toer.lua uses the
  TREC format for both corpora and result files.

[11/Feb/2006] Version 2.0
o This version was used for TREC 2006 tests only and was not released;
o Improvements and fixes to the osbf module
  - Adjustments to the EDDC formula and better tuning of the intrinsic
    OSB-bigram weights for improved AUC;
  - Added specific counters for classification, mistake and extra
    learning, besides the existing learning counter.
  - Bug fixes;
o Improvements and fixes to spamfilter
  - New training method, a variant of TOER (see toer.lua), where extra
    trainings using exclusively the header are done if the first one, with
    the full message, was not enough to change the score to an acceptable
    value.
    In many tests, with different corpora, this new method resulted
    in improved Area Under the ROC Curve (AUC);
  - The messages cached for later training are now saved under the directory
    "cache", parallel to the previous "log". You must create the directory
    "cache" before using the filter;
  - New option for caching the messages in a subdir structure formed by
    "DD/HH", under the cache dir, to avoid excessive messages per directory;
  - Added accuracy statistics to the stats command, based on the new counters;
  - The DSTTT method is not used any more;
  - Added many command line options - check the file spamfilter.help;
  - Bug fixes.

OBS: Versions after 1.5.6b and before 2.0.1 were experimental and not
     released.

[20/Feb/2006] Version 1.5.6b
o Added a new option to osbf.config: limit_token_size, which toggles
  token size limitation on when different from 0. The default value
  is 0 and restores the traditional behavior broken in 1.5.5b;
o Fixed a bad side effect in get_next_hash introduced in v1.5.5b
  - long sequences of long tokens were not being collapsed any more;

[19/Feb/2006] Version 1.5.5b
o Fixed a memory leak in osbf.classify;
o Two new options to osbf.config: max_token_size and max_long_tokens.
  For testing and special tuning purposes;
o Added train.lua, a script for training from stdin;
o Added getopt.lua, a Lua function useful for handling command line
  arguments, similar to the C getopt_long;
o Minor change to toer.lua: it now stops without an error message if
  there are fewer index files than expected in the for loop.
  It prints an error message if none is found, though.

[21/Jan/2006] Version 1.5.4b
o Now we have that nice logo at the top, sent by Alessandro Martins
  <alessandro@martins.eng.br>;
o Added a new function to the osbf module: osbf.import("file.cfc",
  "file.csv").
  This function is similar to osbf.restore but, contrary to
  that, file.cfc must already exist before the importing and, instead of
  restoring the original .cfc, the buckets in file.csv will be imported
  into the existing file.cfc, which can have more or fewer buckets than
  the original .cfc. Its main use is to create a larger database from an
  older and full one, preserving the contents.
o Better separation of lib and bind codes, which will make it easier to
  adapt the module to other languages;
o Doc files moved to the new docs dir.

[08/Jan/2006] Version 1.5.3b
o Fixes to the osbf module
o Fixed the database restore function - osbf.restore;
o Changed the osbf.so link from absolute to relative to make it simpler
  to generate the Slackware package - suggested by Alessandro Martins
  <alessandro@martins.eng.br>.
o Improvements and fixes to spamfilter (v1.1.3):
  - Better detection of the "Subject:" header line;
  - Improved scan for a command in the subject line. Now it'll detect a
    command even if another filter in the middle has mistakenly added a
    tag to the beginning of the subject line. Problem pointed out by
    Pavel Kolar.

[01/Jan/2006] Version 1.5.2b
o Improvements and fixes to spamfilter:
  - The recover command now sends the recovered message as an attachment;
  - Added a new config option, osbf.cfg_remove_body_threshold, to remove
    the body of spam messages. Setting osbf.cfg_remove_body_threshold = 20
    in spamfilter_config.lua removes the body of all spam messages with
    score greater than 20. The original message is still available with
    the recover command, if needed;
o Fixed a problem that occurred when a command-message was sent in HTML
  format. Because of the Content-Type header in the original message,
  the answer, in plain text format, was not visible;
o Fixed a bug in the password parsing.
  An invalid password was accepted
  as OK if it started with the valid password as a substring and was the
  last string in the command.
o Improvements to the lib
  - New function added, osbf.config, to allow internal parameter
    adjustments. This function is intended mainly for experiments and
    debugging.

[15/Nov/2005] Version 1.5.1b
o Improvements and fixes to spamfilter, toer.lua and docs:
  - All X-OSBF headers were merged into a single one as suggested by Pavel
    Kolar <kolar@fzu.cz>. Ex: X-OSBF-Lua-Score: 33.63/0.00 [H] (v1.5.1b,
    Spamfilter v1.1)
  - White and blacklisted messages are now classified too, so that the
    score in the header X-OSBF-Lua-Score is the real one, as if they
    hadn't been listed - suggested by Pavel Kolar. The subject tags for
    blacklisted and whitelisted messages are the same as configured for
    spam and ham in the config file, respectively;
  - The tags in the X-OSBF-Lua-Score header don't follow the subject tags
    defined in the config file any more. They're now fixed: [B], [S], [s],
    [h], [H], [W] for blacklisted, spam, spam reinforcement, ham
    reinforcement, ham and whitelisted, according to the classification;
  - White and black lists don't use Lua regex by default any more. There's
    a new option in the config file to turn regex on or off:
    osbf.cfg_lists_use_regex;
  - Removed the trailing spaces from the subject tags in the config file.
    They're now added internally;
  - Removed duplicate database info shown by the stats <pwd> command;
  - The var unlearn_threshold in spamfilter_commands.lua is now an option
    in the config file, as it should be: osbf.cfg_unlearn_threshold;
  - More consistent threshold checking in toer.lua;
  - DSTTT is now the default training method in toer.lua.
  - Added the script roc.lua, which calculates 1-ROCAC%, a measure of the
    quality of the classifier.
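
Note on the merged X-OSBF-Lua-Score header described above: for those who
filter on the header rather than on subject tags, the sketch below shows how
its fields could be pulled apart with Lua patterns. It is an illustration
only: parse_score_header is a hypothetical helper, not part of spamfilter,
and the pattern is inferred solely from the example header shown above. This
changelog does not say what the number after the slash means, so it is
returned under the neutral name "second". The tag table is the 1.5.1b set;
version 2.0.1 renamed [s] and [h] to [-] and [+].

```lua
-- Tag meanings as listed for version 1.5.1b.
local TAGS = {
  B = "blacklisted", S = "spam", s = "spam reinforcement",
  h = "ham reinforcement", H = "ham", W = "whitelisted",
}

-- Hypothetical helper (not part of spamfilter): parse one header line.
-- Returns nil if the line does not look like the example header.
local function parse_score_header(line)
  local first, second, tag = line:match(
    "^X%-OSBF%-Lua%-Score:%s*(%-?[%d.]+)/(%-?[%d.]+)%s*%[(.)%]")
  if not first then return nil end
  return {
    score   = tonumber(first),
    second  = tonumber(second), -- meaning not stated in this changelog
    tag     = tag,
    meaning = TAGS[tag],        -- nil for tags outside the 1.5.1b set
  }
end

local r = parse_score_header(
  "X-OSBF-Lua-Score: 33.63/0.00 [H] (v1.5.1b, Spamfilter v1.1)")
print(r.score, r.second, r.tag, r.meaning)
```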

[06/Nov/2005] Version 1.5b - first public release
o Re-tuning of internal parameters, after the chain rule fix, resulting
  in improved accuracy.
o Docs and example scripts updated.

[30/Sep/2005] Version 1.4b - internal use only
o Changed seen_features and other flags data structure to a separate array
  of unsigned chars, in the learn function.

[25/Sep/2005] Version 1.3b - internal use only
o C and Lua codes updated for lua-5.1-alpha
o No more captures in string.find
o Code changed to use new Lua function string.match

[08/Sep/2005] Version 1.2b - internal use only
o Fixed an old bug in the chain rule that caused bad accuracy with some
  corpora. It sometimes would also cause unexpected worse scores after
  training, as if one had done an "unlearn";
o Fixed a bug in the "unlearn" code that caused broken chains in the
  databases;
o Implemented a new training method acting on both databases, spam and
  ham, simultaneously, doing a "learn" on the right database and an
  "unlearn" on the opposite if the score improvement was not enough. Now,
  both toer.lua and spamfilter.lua use this new method;

[25/Aug/2005] Version 1.1b - internal use only
o Changed the training method used by the spamfilter. Now the original
  message is saved under a unique SpamFilter ID (SFID) on the server and
  the original message is sent to the user with the SFID added as a
  comment to its "Message-ID" header. The original message is recovered,
  using the SFID sent back by the user's mail client, in the "In-Reply-To"
  or "References" header, when the user does a "Reply" for training.

[13/May/2005] Version 1.0b18 - internal use only
[16/Mar/2005] Version 1.0b12 - internal use only
[28/Jan/2005] Version 1.0b1 - internal use only