Snowball 2.0.0 (2019-10-02)
===========================

C/C++
-----

* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
  sequences of any length, but commands which look at the character value only
  handled sequences up to length 3. Fixes #89.

* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.

Java
----

* TestApp.java:

  - Always use UTF-8 for I/O. Patch from David Corbett (#80).

  - Allow reading input from stdin.

  - Remove rather pointless "stem n times" feature.

  - Only lower case ASCII to match stemwords.c.

  - Stem empty lines too to match stemwords.c.

Code Quality Improvements
-------------------------

* Fix various warnings from newer compilers.

* Improve use of `const`.

* Share common functions between compiler backends rather than having multiple
  copies of the same code.

* Assorted code clean-up.

* Initialise line_labelled member of struct generator to 0. Previously we were
  invoking undefined behaviour, though in practice it'll be zero initialised on
  most platforms.

New Code Generators
-------------------

* Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
  additional updates by Dmitry Shachnev.

* Add Javascript generator. Based on JSX generator (#26) written by Yoshiki
  Shibukawa.

* Add Rust generator from Jakob Demler (#51).

* Add Go generator from Marty Schoch (#57).

* Add C# generator. Based on patch from Cesar Souza (#16, #17).

* Add Pascal generator. Based on Delphi backend from stemming.zip file on old
  website (#75).

New Language Features
---------------------

* Add `len` and `lenof` to measure Unicode length.
  These are similar to `size`
  and `sizeof` (respectively), but `size` and `sizeof` return the length in
  bytes under `-utf8`, whereas these new commands give the same result whether
  using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
  the length of the string). For compatibility with existing code which might
  use these as variable or function names, they stop being treated as tokens if
  declared to be a variable or function.

* New `{U+1234}` stringdef notation for Unicode codepoints.

* More versatile integer tests. Now you can compare any two arithmetic
  expressions with a relational operator in parentheses after the `$`, so for
  example `$(len > 3)` can now be used when previously a temporary variable was
  required: `$tmp = len $tmp > 3`

Code generation improvements
----------------------------

* General:

  + Avoid unnecessary saving and restoring of the cursor for more commands -
    `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
    restore its value, and for C `booltest` (which other languages already
    handled).

  + Special case handling for `setlimit tomark AE`. All uses of setlimit in
    the current stemmers we ship follow this pattern, and by special-casing we
    can avoid having to save and restore the cursor (#74).

  + Merge duplicate actions in the same `among`. This reduces the size of the
    switch/if-chain in the generated code which dispatches the `among` for
    many of the stemmers.

  + Generate simpler code for `among`. We always check for a zero return value
    when we call the among, so there's no point also checking for that in the
    switch/if-chain. We can also avoid the switch/if-chain entirely when
    there's only one possible outcome (besides the zero return).

  + Optimise code generated for `do <function call>`. This speeds up "make
    check_python" by about 2%, and should speed up other interpreted languages
    too (#110).

  + Generate more and better comments referencing snowball source.

  + Add homepage URL and compiler version as comments in generated files.

* C/C++:

  + Fix `size` and `sizeof` to not report one too high (reported by Assem
    Chelli in #32).

  + If signal `f` from a function call would lead to return from the current
    function then handle both this and bailing out on an error with a single
    simple `if (ret <= 0) return ret;`

  + Inline testing for single character literals.

  + Avoid generating `|| 0` in corner case - this can result in a compiler
    warning when building the generated code.

  + Implement `insert_v()` in terms of `insert_s()`.

  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
    code. Closes #90, reported by vvarma.

* Java:

  + Fix functions in `among` to work in Java. We seem to need to make the
    methods called from among `public` instead of `private`, and to call them
    on `this` instead of the `methodObject` (which is cleaner anyway). No
    revision in version control seems to generate working code for this case,
    but Richard says it definitely used to work - possibly older JVMs failed to
    correctly enforce the access controls when methods were invoked by
    reflection.

  + Code after handling `f` by returning from the current function is
    unreachable too.

  + Previously we incorrectly decided that code after an `or` was
    unreachable in certain cases. None of the current stemmers in the
    distribution triggered this, but Martin Porter's snowball version
    of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
    Myltsev.

  + The reachability logic was failing to consider reachability from
    the final command in an `or`. Fixes #82, reported by David Corbett.

  + Fix `maxint` and `minint`. Patch from David Corbett in #31.

  + Fix `$` on strings.
    The previous generated code was just wrong. This
    doesn't affect any of the included algorithms, but for example breaks
    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
    Issue noted by Jakob Demler while working on the Rust backend in #51,
    and reported in the Schinke's Latin Stemmer by Alexander Myltsev
    in #58.

  + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.

  + Eliminate range-check implementation for groupings. This was removed from
    the C generator 10 years earlier, isn't used for any of the existing
    algorithms, and it doesn't seem likely it would be - the grouping would
    have to consist entirely of a contiguous block of Unicode code-points.

  + Simplify code generated for `repeat` and `atleast`.

  + Eliminate unused return values and variables from runtime functions.

  + Only import the `among` and `SnowballProgram` classes if they're actually
    used.

  + Only generate `copy_from()` method if it's used.

  + Merge runtime functions `eq_s` and `eq_v`.

  + Java arrays know their own length so stop storing it separately.

  + Escape char 127 (DEL) in generated Java code. It's unlikely that this
    character would actually be used in a real stemmer, so this was more of a
    theoretical bug.

  + Drop unused import of InvocationTargetException from SnowballStemmer.
    Reported by GerritDeMeulder in #72.

  + Fix lint check issues in generated Java code. The stemmer classes are only
    referenced in the example app via reflection, so add
    @SuppressWarnings("unused") for them. The stemmer classes override
    equals() and hashCode() methods from the standard Java Object class, so
    mark these with @Override. Both suggested by GerritDeMeulder in #72.

  + Declare Java variables at point of use in generated code.
    Putting all
    declarations at the top of the function was adding unnecessary complexity
    to the Java generator code for no benefit.

  + Improve formatting of generated code.

New stemming algorithms
-----------------------

* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).

* Add Arabic stemmer from Assem Chelli (#32, #50).

* Add Irish stemmer from Jim O'Regan (#48).

* Add Nepali stemmer from Arthur Zakirov (#70).

* Add Indonesian stemmer from Olly Betts (#71).

* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.

* Add Lithuanian stemmer from Dainius Jocas (#22, #76).

* Add Greek stemmer from Oleg Smirnov (#44).

* Add Catalan and Basque stemmers from Israel Olalla (#104).

Behavioural changes to existing algorithms
------------------------------------------

* Portuguese:

  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).

* French:

  + The MSDOS CP850 version of the French algorithm was missing changes present
    in the ISO8859-1 and Unicode versions. There's now a single version of
    each algorithm which was based on the Unicode version.

  + Recognize French suffixes even when they begin with diaereses. Patch from
    David Corbett in #78.

* Russian:

  + We now normalise 'ё' to 'е' before stemming. The documentation has long
    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
    the stemmer to actually perform this normalisation. This change has no
    effect if the caller is already normalising as we recommend. It's a change
    in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
    instances in our test vocabulary) and this improves behaviour when it does
    occur. Patch from Eugene Mirotin (#65, #68).

* Finnish:

  + Adjust the Finnish algorithm not to mangle numbers.
    This change also
    means it tends to leave foreign words alone. Fixes #66.

* Danish:

  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
    space1999) are no longer mangled. See #81.

Optimisations to existing algorithms
------------------------------------

* Turkish:

  + Simplify uses of `test` in stemmer code.

  + Check for 'ad' or 'soyad' more efficiently, and without needing the
    strlen variable. This speeds up "make check_utf8_turkish" by 11%
    on x86 Linux.

* Kraaij-Pohlmann:

  + Eliminate variable x - `$p1 <= cursor` is simpler and a little more
    efficient than `setmark x $x >= p1`.

Code clarity improvements to existing algorithms
------------------------------------------------

* Turkish:

  + Use `,` for cedilla to match the conventions used in other stemmers.

* Kraaij-Pohlmann:

  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
    `[substring] among (` ... `)` construct we do in other stemmers.

Compiler
--------

* Support conventional --help and --version options.

* Warn if -r or -ep used with backend other than C/C++.

* Warn if encoding command line options are specified when generating code in a
  language with a fixed encoding.

* The default classname is now set based on the output filename, so `-n` is now
  often no longer needed. Fixes #64.

* Avoid potential one byte buffer over-read when parsing snowball code.

* Avoid comparing with uninitialised array element during compilation.

* Improve `-syntax` output for `setlimit L for C`.

* Optimise away double negation so generators don't have to worry about
  generating `--` (decrement operator in many languages). Fixes #52, reported
  by David Corbett.

* Improved compiler error and warning messages:

  - We now report FILE:LINE: before each diagnostic message.

  - Improve warnings for unused declarations/definitions.

  - Warn for variables which are used, but either never initialised
    or never read.

  - Flag non-ASCII literal strings. This is an error for wide Unicode, but
    only a warning for single-byte and UTF-8 which work so long as the source
    encoding matches the encoding used in the generated stemmer code.

  - Improve error recovery after an undeclared `define`. We now sniff the
    token after the identifier and if it is `as` we parse as a routine,
    otherwise we parse as a grouping. Previously we always just assumed it was
    a routine, which gave a confusing second error if it was a grouping.

  - Improve error recovery after an unexpected token in `among`. Previously
    we acted as if the unexpected token closed the `among` (this probably
    wasn't intended but just a missing `break;` in a switch statement). Now we
    issue an error and try the next token.

* Report error instead of silently truncating character values (e.g. `hex 123`
  previously silently became byte 0x23 which is `#` rather than a
  g-with-cedilla).

* Enlarge the initial input buffer size to 8192 bytes and double each time we
  hit the end. Snowball programs are typically a few KB in size (with the
  current largest we ship being the Greek stemmer at 27KB) so the previous
  approach of starting with a 10 byte input buffer and increasing its size by
  50% plus 40 bytes each time it filled was inefficient, needing up to 15
  reallocations to load greek.sbl.

* Identify variables only used by one `routine`/`external`. This information
  isn't yet used, but such variables which are also always written to before
  being read can be emitted as local variables in most target languages.

* We now allow multiple source files on command line, and allow them to be
  after (or even interspersed with) options to better match modern Unix
  conventions. Support for multiple source files allows specifying a single
  byte character set mapping via a source file of `stringdef`.

* Avoid infinite recursion in compiler when optimising a recursive snowball
  function. Recursive functions aren't typical in snowball programs, but
  the compiler shouldn't crash for any input, especially not a valid one.
  We now simply limit how deep the compiler will recurse and make the
  pessimistic assumption in the unlikely event we hit this limit.

Build system
------------

* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  (#59)

* Fix all the places which previously had to have a list of stemmers to work
  dynamically or be generated, so now only modules.txt needs updating to add
  a new stemmer.

* Add check_java make target which runs tests for java.

* Support gzipped test data (the uncompressed arabic test data is too big for
  github).

* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  invocations for Java - these are only meaningful when generating C code.

* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
  facilitates use of tools such as ASan. Fixes #84, reported by Thomas
  Pointhuber.

* Add CI builds with -std=c90 to check compiler and generated code are C90
  (#54).

libstemmer stuff
----------------

* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.

* Add -O2 to CFLAGS.

* Make generated tables of encodings and modules const.

* Fix clang static analyzer memory leak warning (in practice this code path
  can never actually be taken). Patch from Patrick O. Perry (#56).

Documentation
-------------

* Added copyright and licensing details (#10).

* Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
  and romanian are available in ISO_8859_2.

* Remove documentation falsely claiming that libstemmer supports CP850
  encoding.

* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  new language backends.

* Overhaul libstemmer_python_README. Most notably, replace the benchmark data
  which was very out of date.
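
The input buffer change listed under Compiler can be checked with a little
arithmetic. The sketch below is illustrative only and is not taken from the
compiler source: the function names are hypothetical, the old growth rule is
assumed to be `size + size // 2 + 40` (integer division), and "27KB" is assumed
to mean 27 * 1024 bytes.

```python
def count_reallocations(initial_size, grow, target_size):
    """Count how many times a buffer must grow before it holds target_size bytes."""
    size = initial_size
    reallocations = 0
    while size < target_size:
        size = grow(size)
        reallocations += 1
    return reallocations


def old_growth(size):
    # Assumed reading of "50% plus 40 bytes" (hypothetical formula).
    return size + size // 2 + 40


def new_growth(size):
    # Double the buffer each time the end is hit.
    return size * 2


# Assumption: the Greek stemmer source is taken to be 27 * 1024 bytes.
GREEK_SBL_SIZE = 27 * 1024

old_count = count_reallocations(10, old_growth, GREEK_SBL_SIZE)
new_count = count_reallocations(8192, new_growth, GREEK_SBL_SIZE)
print(old_count, new_count)
```

Under these assumptions the old scheme needs 15 reallocations to reach 27KB,
which agrees with the figure quoted above, while starting at 8192 bytes and
doubling needs only 2.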