Snowball 2.0.0 (2019-10-02)
===========================

C/C++
-----

* Fully handle 4-byte UTF-8 sequences.  Previously `hop` and `next` handled
  sequences of any length, but commands which look at the character value only
  handled sequences up to length 3.  Fixes #89.

* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
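
To illustrate what handling 4-byte sequences involves, here is a minimal Python sketch that decodes one UTF-8 sequence of 1 to 4 bytes.  It is not the Snowball runtime code: the function name `decode_utf8_at` is hypothetical, and the sketch assumes well-formed input and does no validation.

```python
from typing import Tuple


def decode_utf8_at(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (code point, sequence length) for the UTF-8 sequence at pos.

    Hypothetical helper for illustration only; assumes well-formed UTF-8.
    """
    b0 = data[pos]
    if b0 < 0x80:
        # 1 byte: 0xxxxxxx
        return b0, 1
    if b0 < 0xE0:
        # 2 bytes: 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (data[pos + 1] & 0x3F), 2
    if b0 < 0xF0:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return (
            ((b0 & 0x0F) << 12)
            | ((data[pos + 1] & 0x3F) << 6)
            | (data[pos + 2] & 0x3F)
        ), 3
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (
        ((b0 & 0x07) << 18)
        | ((data[pos + 1] & 0x3F) << 12)
        | ((data[pos + 2] & 0x3F) << 6)
        | (data[pos + 3] & 0x3F)
    ), 4


# U+1F600 needs 4 bytes in UTF-8 (F0 9F 98 80).
print(decode_utf8_at("\U0001F600".encode("utf-8"), 0))  # (128512, 4)
```

A decoder that stops at the 3-byte case would misread the lead byte of such a character, which is the kind of gap the first item above describes.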

Java
----

* TestApp.java:

  - Always use UTF-8 for I/O.  Patch from David Corbett (#80).

  - Allow reading input from stdin.

  - Remove rather pointless "stem n times" feature.

  - Only convert ASCII letters to lower case, to match stemwords.c.

  - Stem empty lines too, to match stemwords.c.

Code Quality Improvements
-------------------------

* Fix various warnings from newer compilers.

* Improve use of `const`.

* Share common functions between compiler backends rather than having multiple
  copies of the same code.

* Assorted code clean-up.

* Initialise line_labelled member of struct generator to 0.  Previously we were
  invoking undefined behaviour, though in practice it'll be zero initialised on
  most platforms.

New Code Generators
-------------------

* Add Python generator (#24).  Originally written by Yoshiki Shibukawa, with
  additional updates by Dmitry Shachnev.

* Add JavaScript generator.  Based on JSX generator (#26) written by Yoshiki
  Shibukawa.

* Add Rust generator from Jakob Demler (#51).

* Add Go generator from Marty Schoch (#57).

* Add C# generator.  Based on patch from Cesar Souza (#16, #17).

* Add Pascal generator.  Based on Delphi backend from stemming.zip file on old
  website (#75).

New Language Features
---------------------

* Add `len` and `lenof` to measure Unicode length.  These are similar to `size`
  and `sizeof` (respectively), but `size` and `sizeof` return the length in
  bytes under `-utf8`, whereas these new commands give the same result whether
  using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
  the length of the string).  For compatibility with existing code which might
  use these as variable or function names, they stop being treated as tokens if
  declared to be a variable or function.

* New `{U+1234}` stringdef notation for Unicode codepoints.

* More versatile integer tests.  Now you can compare any two arithmetic
  expressions with a relational operator in parentheses after the `$`, so for
  example `$(len > 3)` can now be used when previously a temporary variable was
  required: `$tmp = len $tmp > 3`
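
The difference between a byte length and a Unicode length described for `len` and `size` above can be sketched in Python.  This is only an analogy, not Snowball code: the function names are hypothetical, and counting non-continuation bytes is just one plausible way to get a character count from UTF-8, shown here because it makes the O(n) cost visible.

```python
def size_in_bytes(word: str) -> int:
    """Byte length of the UTF-8 encoding (analogous to `size` under -utf8)."""
    return len(word.encode("utf-8"))


def length_in_code_points(word: str) -> int:
    """Code point count obtained by scanning the UTF-8 bytes.

    Every byte is inspected, so the cost is linear in the string length.
    Continuation bytes have the bit pattern 10xxxxxx and are skipped.
    """
    data = word.encode("utf-8")
    return sum(1 for b in data if (b & 0xC0) != 0x80)


word = "na\u00efve"  # "naive" with a diaeresis on the i
print(size_in_bytes(word))          # 6
print(length_in_code_points(word))  # 5
```

For pure ASCII input the two functions agree, which is why the distinction only matters once non-ASCII characters are involved.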

Code generation improvements
----------------------------

* General:

  + Avoid unnecessarily saving and restoring the cursor for more commands -
    `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
    restore its value, and for C so does `booltest` (which other languages
    already handled).

  + Special case handling for `setlimit tomark AE`.  All uses of setlimit in
    the current stemmers we ship follow this pattern, and by special-casing we
    can avoid having to save and restore the cursor (#74).

  + Merge duplicate actions in the same `among`.  This reduces the size of the
    switch/if-chain in the generated code which dispatches the among for many
    of the stemmers.

  + Generate simpler code for `among`.  We always check for a zero return value
    when we call the among, so there's no point also checking for that in the
    switch/if-chain.  We can also avoid the switch/if-chain entirely when
    there's only one possible outcome (besides the zero return).

  + Optimise code generated for `do <function call>`.  This speeds up "make
    check_python" by about 2%, and should speed up other interpreted languages
    too (#110).

  + Generate more and better comments referencing snowball source.

  + Add homepage URL and compiler version as comments in generated files.

* C/C++:

  + Fix `size` and `sizeof` to not report one too high (reported by Assem
    Chelli in #32).

  + If signal `f` from a function call would lead to return from the current
    function then handle both this and bailing out on an error with a single
    simple `if (ret <= 0) return ret;`

  + Inline testing for single character literals.

  + Avoid generating `|| 0` in a corner case - this can result in a compiler
    warning when building the generated code.

  + Implement `insert_v()` in terms of `insert_s()`.

  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
    code.  Closes #90, reported by vvarma.

* Java:

  + Fix functions in `among` to work in Java.  We seem to need to make the
    methods called from among `public` instead of `private`, and to call them
    on `this` instead of the `methodObject` (which is cleaner anyway).  No
    revision in version control seems to generate working code for this case,
    but Richard says it definitely used to work - possibly older JVMs failed to
    correctly enforce the access controls when methods were invoked by
    reflection.

  + Code after handling `f` by returning from the current function is
    unreachable too.

  + Previously we incorrectly decided that code after an `or` was
    unreachable in certain cases.  None of the current stemmers in the
    distribution triggered this, but Martin Porter's snowball version
    of the Schinke Latin stemmer does.  Fixes #58, reported by Alexander
    Myltsev.

  + The reachability logic was failing to consider reachability from
    the final command in an `or`.  Fixes #82, reported by David Corbett.

  + Fix `maxint` and `minint`.  Patch from David Corbett in #31.

  + Fix `$` on strings.  The previous generated code was just wrong.  This
    doesn't affect any of the included algorithms, but for example breaks
    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
    Issue noted by Jakob Demler while working on the Rust backend in #51,
    and reported for Schinke's Latin Stemmer by Alexander Myltsev in #58.

  + Make SnowballProgram objects serializable.  Patch from Oleg Smirnov in #43.

  + Eliminate range-check implementation for groupings.  This was removed from
    the C generator 10 years earlier, isn't used for any of the existing
    algorithms, and it doesn't seem likely it would be - the grouping would
    have to consist entirely of a contiguous block of Unicode code-points.

  + Simplify code generated for `repeat` and `atleast`.

  + Eliminate unused return values and variables from runtime functions.

  + Only import the `among` and `SnowballProgram` classes if they're actually
    used.

  + Only generate `copy_from()` method if it's used.

  + Merge runtime functions `eq_s` and `eq_v`.

  + Java arrays know their own length so stop storing it separately.

  + Escape char 127 (DEL) in generated Java code.  It's unlikely that this
    character would actually be used in a real stemmer, so this was more of a
    theoretical bug.

  + Drop unused import of InvocationTargetException from SnowballStemmer.
    Reported by GerritDeMeulder in #72.

  + Fix lint check issues in generated Java code.  The stemmer classes are only
    referenced in the example app via reflection, so add
    @SuppressWarnings("unused") for them.  The stemmer classes override
    equals() and hashCode() methods from the standard java Object class, so
    mark these with @Override.  Both suggested by GerritDeMeulder in #72.

  + Declare Java variables at point of use in generated code.  Putting all
    declarations at the top of the function was adding unnecessary complexity
    to the Java generator code for no benefit.

  + Improve formatting of generated code.

New stemming algorithms
-----------------------

* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).

* Add Arabic stemmer from Assem Chelli (#32, #50).

* Add Irish stemmer from Jim O'Regan (#48).

* Add Nepali stemmer from Arthur Zakirov (#70).

* Add Indonesian stemmer from Olly Betts (#71).

* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.

* Add Lithuanian stemmer from Dainius Jocas (#22, #76).

* Add Greek stemmer from Oleg Smirnov (#44).

* Add Catalan and Basque stemmers from Israel Olalla (#104).

Behavioural changes to existing algorithms
------------------------------------------

* Portuguese:

  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).

* French:

  + The MSDOS CP850 version of the French algorithm was missing changes present
    in the ISO8859-1 and Unicode versions.  There's now a single version of
    each algorithm which was based on the Unicode version.

  + Recognize French suffixes even when they begin with diaereses.  Patch from
    David Corbett in #78.

* Russian:

  + We now normalise 'ё' to 'е' before stemming.  The documentation has long
    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
    the stemmer to actually perform this normalisation.  This change has no
    effect if the caller is already normalising as we recommend.  It's a change
    in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
    instances in our test vocabulary) and this improves behaviour when it does
    occur.  Patch from Eugene Mirotin (#65, #68).

* Finnish:

  + Adjust the Finnish algorithm not to mangle numbers.  This change also
    means it tends to leave foreign words alone.  Fixes #66.

* Danish:

  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
    space1999) are no longer mangled.  See #81.
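
For callers who want to see what the Russian normalisation mentioned above amounts to, here is a minimal Python sketch of mapping 'ё' (U+0451) to 'е' (U+0435) before stemming.  The function name `normalise_yo` is hypothetical and this is not the stemmer's own code; it only shows the mapping the entry describes, and that applying it twice changes nothing.

```python
YO = "\u0451"  # CYRILLIC SMALL LETTER IO
YE = "\u0435"  # CYRILLIC SMALL LETTER IE


def normalise_yo(word: str) -> str:
    """Map every lower case IO to IE; all other characters are untouched."""
    return word.replace(YO, YE)


once = normalise_yo("\u0451\u043b\u043a\u0430")  # a word starting with IO
twice = normalise_yo(once)

# Normalising already-normalised text is a no-op, which is why the change
# has no effect for callers who were already doing this themselves.
print(once == twice)  # True
```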

Optimisations to existing algorithms
------------------------------------

* Turkish:

  + Simplify uses of `test` in stemmer code.

  + Check for 'ad' or 'soyad' more efficiently, and without needing the
    strlen variable.  This speeds up "make check_utf8_turkish" by 11%
    on x86 Linux.

* Kraaij-Pohlmann:

  + Eliminate variable x - `$p1 <= cursor` is simpler and a little more
    efficient than `setmark x $x >= p1`.

Code clarity improvements to existing algorithms
------------------------------------------------

* Turkish:

  + Use , for cedilla to match the conventions used in other stemmers.

* Kraaij-Pohlmann:

  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
    `[substring] among (` ... `)` construct we do in other stemmers.

Compiler
--------

* Support conventional --help and --version options.

* Warn if -r or -ep used with backend other than C/C++.

* Warn if encoding command line options are specified when generating code in a
  language with a fixed encoding.

* The default classname is now set based on the output filename, so `-n` is now
  often no longer needed.  Fixes #64.

* Avoid potential one byte buffer over-read when parsing snowball code.

* Avoid comparing with uninitialised array element during compilation.

* Improve `-syntax` output for `setlimit L for C`.

* Optimise away double negation so generators don't have to worry about
  generating `--` (decrement operator in many languages).  Fixes #52, reported
  by David Corbett.

* Improved compiler error and warning messages:

  - We now report FILE:LINE: before each diagnostic message.

  - Improve warnings for unused declarations/definitions.

  - Warn for variables which are used, but either never initialised
    or never read.

  - Flag non-ASCII literal strings.  This is an error for wide Unicode, but
    only a warning for single-byte and UTF-8 which work so long as the source
    encoding matches the encoding used in the generated stemmer code.

  - Improve error recovery after an undeclared `define`.  We now sniff the
    token after the identifier and if it is `as` we parse as a routine,
    otherwise we parse as a grouping.  Previously we always just assumed it was
    a routine, which gave a confusing second error if it was a grouping.

  - Improve error recovery after an unexpected token in `among`.  Previously
    we acted as if the unexpected token closed the `among` (this probably
    wasn't intended but just a missing `break;` in a switch statement).  Now we
    issue an error and try the next token.

* Report error instead of silently truncating character values (e.g. `hex 123`
  previously silently became byte 0x23 which is `#` rather than a
  g-with-cedilla).

* Enlarge the initial input buffer size to 8192 bytes and double each time we
  hit the end.  Snowball programs are typically a few KB in size (with the
  current largest we ship being the Greek stemmer at 27KB) so the previous
  approach of starting with a 10 byte input buffer and increasing its size by
  50% plus 40 bytes each time it filled was inefficient, needing up to 15
  reallocations to load greek.sbl.

* Identify variables only used by one `routine`/`external`.  This information
  isn't yet used, but such variables which are also always written to before
  being read can be emitted as local variables in most target languages.

* We now allow multiple source files on the command line, and allow them to be
  after (or even interspersed with) options to better match modern Unix
  conventions.  Support for multiple source files allows specifying a single
  byte character set mapping via a source file of `stringdef`.

* Avoid infinite recursion in compiler when optimising a recursive snowball
  function.  Recursive functions aren't typical in snowball programs, but
  the compiler shouldn't crash for any input, especially not a valid one.
  We now simply limit how deep the compiler will recurse and make the
  pessimistic assumption in the unlikely event we hit this limit.
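
The reallocation count quoted for the input buffer above can be checked with a short Python simulation.  This is a sketch under stated assumptions, not the compiler's code: "50% plus 40 bytes" is read as `n + n // 2 + 40` with integer division, 27KB is taken as 27 * 1024 bytes since the exact size of greek.sbl isn't given, and the helper names are hypothetical.

```python
from typing import Callable

TARGET = 27 * 1024  # assumed stand-in for the 27KB Greek stemmer source


def old_growth(n: int) -> int:
    """Assumed old rule: grow by 50% plus 40 bytes."""
    return n + n // 2 + 40


def new_growth(n: int) -> int:
    """New rule: double the buffer."""
    return n * 2


def count_reallocations(initial: int, grow: Callable[[int], int], target: int) -> int:
    """Count how many times the buffer must grow before it can hold target bytes."""
    size, count = initial, 0
    while size < target:
        size = grow(size)
        count += 1
    return count


print(count_reallocations(10, old_growth, TARGET))    # 15
print(count_reallocations(8192, new_growth, TARGET))  # 2
```

Under these assumptions the old scheme needs 15 reallocations, matching the figure above, while starting at 8192 bytes and doubling needs 2.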

Build system
------------

* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  (#59)

* Fix all the places which previously had to have a list of stemmers to work
  dynamically or be generated, so now only modules.txt needs updating to add
  a new stemmer.

* Add check_java make target which runs tests for Java.

* Support gzipped test data (the uncompressed Arabic test data is too big for
  GitHub).

* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  invocations for Java - these are only meaningful when generating C code.

* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
  facilitates use of tools such as ASan.  Fixes #84, reported by Thomas
  Pointhuber.

* Add CI builds with -std=c90 to check compiler and generated code are C90
  (#54).

libstemmer stuff
----------------

* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.

* Add -O2 to CFLAGS.

* Make generated tables of encodings and modules const.

* Fix clang static analyzer memory leak warning (in practice this code path
  can never actually be taken).  Patch from Patrick O. Perry (#56).

Documentation
-------------

* Added copyright and licensing details (#10).

* Document that libstemmer supports ISO_8859_2 encoding.  Currently Hungarian
  and Romanian are available in ISO_8859_2.

* Remove documentation falsely claiming that libstemmer supports CP850
  encoding.

* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  new language backends.

* Overhaul libstemmer_python_README.  Most notably, replace the benchmark data
  which was very out of date.