========================
Whoosh 1.x release notes
========================

Whoosh 1.8.3
============

Whoosh 1.8.3 contains important bugfixes and new functionality. Thanks to all
the mailing list and BitBucket users who helped with the fixes!

Fixed a bad ``Collector`` bug where the docset of a Results object did not match
the actual results.

You can now pass a sequence of objects to a keyword argument in ``add_document``
and ``update_document`` (currently this will not work for unique fields in
``update_document``). This is useful for non-text fields such as ``DATETIME``
and ``NUMERIC``, allowing you to index multiple dates/numbers for a document::

    writer.add_document(shoe=u"Saucony Kinvara", sizes=[10.0, 9.5, 12])

This version reverts to using the CDB hash function for hash files instead of
Python's ``hash()`` because the latter is not meant to be stored externally.
This change maintains backwards compatibility with old files.
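
The distinction matters because the values returned by the built-in ``hash()``
can differ between Python builds and, in newer Python versions, between runs,
while a stored hash must depend on nothing but the bytes of the key. As a rough
illustration, the widely published CDB hash fits in a few lines. This is a
sketch of the general algorithm, not Whoosh's own code, and ``cdb_hash`` is a
name chosen for this example::

    def cdb_hash(data):
        """Return the 32-bit CDB hash of a byte string."""
        h = 5381
        for byte in bytearray(data):
            # Multiply by 33, XOR in the next byte, truncate to 32 bits
            h = (((h << 5) + h) ^ byte) & 0xFFFFFFFF
        return h

    # The result depends only on the input bytes
    assert cdb_hash(b"") == 5381
    assert cdb_hash(b"a") == 177604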

The ``Searcher.search`` method now takes a ``mask`` keyword argument. This is
the opposite of the ``filter`` argument. Where the ``filter`` specifies the
set of documents that can appear in the results, the ``mask`` specifies a
set of documents that must not appear in the results.
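
The relationship between the two arguments can be pictured as plain set
operations on document numbers. The sketch below is only a model of the
semantics described above; ``restrict``, ``filter_docs`` and ``mask_docs`` are
hypothetical names and this is not how the collector is implemented::

    def restrict(matched, filter_docs=None, mask_docs=None):
        """Return the matched docnums allowed by a filter and a mask."""
        allowed = set(matched)
        if filter_docs is not None:
            allowed &= set(filter_docs)   # keep only documents in the filter
        if mask_docs is not None:
            allowed -= set(mask_docs)     # drop every document in the mask
        return allowed

    matched = {1, 2, 3, 4, 5}
    assert restrict(matched, filter_docs={2, 3, 4}) == {2, 3, 4}
    assert restrict(matched, mask_docs={2, 3, 4}) == {1, 5}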

Fixed performance problems in ``Searcher.more_like``. This method now also
takes a ``filter`` keyword argument like ``Searcher.search``.

Improved documentation.


Whoosh 1.8.2
============

Whoosh 1.8.2 fixes some bugs, including a mistyped signature in
``Searcher.more_like`` and a bad bug in ``Collector`` that could screw up the
ordering of results given certain parameters.


Whoosh 1.8.1
============

Whoosh 1.8.1 includes a few recent bugfixes/improvements:

- ``ListMatcher.skip_to_quality()`` wasn't returning an integer, resulting
  in a "None + int" error.

- Fixed locking and memcache sync bugs in the Google App Engine storage
  object.

- ``MultifieldPlugin`` wasn't working correctly with groups.

- The binary matcher trees of ``Or`` and ``And`` are now generated using a
  Huffman-like algorithm instead of being perfectly balanced. This gives a
  noticeable speed improvement because less information has to be passed
  up/down the tree.
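
To give a feel for what a Huffman-like construction looks like, the sketch
below repeatedly joins the two smallest subtrees, so the largest inputs end up
closest to the root. It assumes each leaf carries an estimated size and at
least one leaf is given; the names and the tuple representation are
illustrative only and do not reflect Whoosh's matcher classes::

    import heapq
    from itertools import count

    def huffman_tree(leaves):
        """Combine (size, name) leaves into a nested binary tree."""
        tiebreak = count()  # keeps heap entries comparable when sizes tie
        heap = [(size, next(tiebreak), name) for size, name in leaves]
        heapq.heapify(heap)
        while len(heap) > 1:
            # Join the two smallest subtrees under a new parent node
            size_a, _, a = heapq.heappop(heap)
            size_b, _, b = heapq.heappop(heap)
            heapq.heappush(heap, (size_a + size_b, next(tiebreak), (a, b)))
        return heap[0][2]

    tree = huffman_tree([(1000, "the"), (10, "whoosh"),
                         (5, "huffman"), (200, "index")])
    assert tree == ((("huffman", "whoosh"), "index"), "the")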


Whoosh 1.8
==========

This release relicensed the Whoosh source code under the Simplified BSD (A.K.A.
"two-clause" or "FreeBSD") license. See LICENSE.txt for more information.


Whoosh 1.7.7
============

Setting a ``TEXT`` field to store term vectors is now much easier. Instead of
having to pass an instantiated ``whoosh.formats.Format`` object to the
``vector=`` keyword argument, you can pass ``True`` to automatically use the
same format and analyzer as the inverted index. Alternatively, you can pass a
``Format`` subclass and Whoosh will instantiate it for you.

For example, to store term vectors using the same settings as the inverted
index (Positions format and StandardAnalyzer)::

    from whoosh.fields import Schema, TEXT

    schema = Schema(content=TEXT(vector=True))

To store term vectors that use the same analyzer as the inverted index
(StandardAnalyzer by default) but only store term frequency::

    from whoosh.formats import Frequency

    schema = Schema(content=TEXT(vector=Frequency))

Note that currently the only place term vectors are used in Whoosh is keyword
extraction/more like this, but they can be useful for expert users with custom
code.

Added :meth:`whoosh.searching.Searcher.more_like` and
:meth:`whoosh.searching.Hit.more_like_this` methods, as shortcuts for doing
keyword extraction yourself. Both methods return a Results object.

"python setup.py test" works again, as long as you have nose installed.

The :meth:`whoosh.searching.Searcher.sort_query_using` method lets you sort
documents matching a given query using an arbitrary function. Note that, like
"complex" searching with the Sorter object, this can be slow on large
multi-segment indexes.


Whoosh 1.7
==========

You can once again perform complex sorting of search results (that is, a sort
with some fields ascending and some fields descending).

You can still use the ``sortedby`` keyword argument to
:meth:`whoosh.searching.Searcher.search` to do a simple sort (where all fields
are sorted in the same direction), or you can use the new
:class:`~whoosh.sorting.Sorter` class to do a simple or complex sort::

    searcher = myindex.searcher()
    sorter = searcher.sorter()
    # Sort first by the group field, ascending
    sorter.add_field("group")
    # Then by the price field, descending
    sorter.add_field("price", reverse=True)
    # Get the Results
    results = sorter.sort_query(myquery)

See the documentation for the :class:`~whoosh.sorting.Sorter` class for more
information. Bear in mind that complex sorts will be much slower on large
indexes because they can't use the per-segment field caches.
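
If the meaning of a mixed-direction sort is unclear, it may help to see the
same ordering expressed on plain Python dictionaries. This is only a model of
the result order (group ascending, then price descending) using made-up data;
it says nothing about how the ``Sorter`` class works internally::

    from operator import itemgetter

    docs = [
        {"group": "b", "price": 5},
        {"group": "a", "price": 3},
        {"group": "a", "price": 9},
        {"group": "b", "price": 7},
    ]

    # Python's sort is stable, so sort by the least significant key first
    ordered = sorted(docs, key=itemgetter("price"), reverse=True)
    ordered = sorted(ordered, key=itemgetter("group"))

    assert [(d["group"], d["price"]) for d in ordered] == [
        ("a", 9), ("a", 3), ("b", 7), ("b", 5)]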

You can now get highlighted snippets for a hit automatically using
:meth:`whoosh.searching.Hit.highlights`::

    results = searcher.search(myquery, limit=20)
    for hit in results:
        print(hit["title"])
        print(hit.highlights("content"))

See :meth:`whoosh.searching.Hit.highlights` for more information.

Added the ability to filter search results so that only hits in a Results
set, a set of docnums, or matching a query are returned. The filter is
cached on the searcher::

    from whoosh import query

    # Search within previous results
    newresults = searcher.search(newquery, filter=oldresults)

    # Search within the "basics" chapter
    results = searcher.search(userquery, filter=query.Term("chapter", "basics"))

You can now specify a time limit for a search. If the search does not finish
in the given time, a :class:`whoosh.searching.TimeLimit` exception is raised,
but you can still retrieve the partial results from the collector. See the
``timelimit`` and ``greedy`` arguments in the
:class:`whoosh.searching.Collector` documentation.

Added back the ability to set :class:`whoosh.analysis.StemFilter` to use an
unlimited cache. This is useful for one-shot batch indexing (see
:doc:`../batch`).

The ``normalize()`` method of the ``And`` and ``Or`` queries now merges
overlapping range queries for more efficient queries.
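
Conceptually, merging overlapping ranges under an ``Or`` amounts to taking the
union of the intervals. The sketch below shows that idea on inclusive
``(start, end)`` tuples; ``merge_ranges`` is a hypothetical helper for
illustration, covers only the union case, and is not the code used by
``normalize()``::

    def merge_ranges(ranges):
        """Merge overlapping inclusive (start, end) ranges."""
        merged = []
        for start, end in sorted(ranges):
            if merged and start <= merged[-1][1]:
                # Overlaps the previous range: extend it in place
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged

    assert merge_ranges([(10, 20), (15, 30), (40, 50)]) == [(10, 30), (40, 50)]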

Query objects now have ``__hash__`` methods allowing them to be used as
dictionary keys.

The API of the highlight module has changed slightly. Most of the functions
in the module have been converted to classes. However, most old code should
still work. The ``NullFragmeter`` is now called ``WholeFragmenter``, but the
old name is still available as an alias.

Fixed MultiPool so it won't fill up the temp directory with job files.

Fixed a bug where Phrase query objects did not use their boost factor.

Fixed a bug where a fieldname after an open parenthesis wasn't parsed
correctly. The change alters the semantics of certain parsing "corner cases"
(such as ``a:b:c:d``).


Whoosh 1.6
==========

The ``whoosh.writing.BatchWriter`` class is now called
:class:`whoosh.writing.BufferedWriter`. It is similar to the old ``BatchWriter``
class but allows you to search and update the buffered documents as well as the
documents that have been flushed to disk::

    from whoosh import writing

    writer = writing.BufferedWriter(myindex)

    # You can update (replace) documents in RAM without having to commit them
    # to disk
    writer.add_document(path="/a", text="Hi there")
    writer.update_document(path="/a", text="Hello there")

    # Search committed and uncommitted documents by getting a searcher from the
    # writer instead of the index
    searcher = writer.searcher()

(BatchWriter is still available as an alias for backwards compatibility.)

The :class:`whoosh.qparser.QueryParser` initialization method now requires a
schema as the second argument. Previously the default was to create a
``QueryParser`` without a schema, which was confusing::

    from whoosh import qparser

    qp = qparser.QueryParser("content", myindex.schema)

The :meth:`whoosh.searching.Searcher.search` method now takes a ``scored``
keyword. If you search with ``scored=False``, the results will be in "natural"
order (the order the documents were added to the index). This is useful when
you don't need scored results but want the convenience of the Results object.

Added the :class:`whoosh.qparser.GtLtPlugin` parser plugin to allow greater
than/less than as an alternative syntax for ranges::

    count:>100 tag:<=zebra date:>='29 march 2001'

Added the ability to define schemas declaratively, similar to Django models::

    from whoosh import index
    from whoosh.fields import SchemaClass, ID, KEYWORD, STORED, TEXT

    class MySchema(SchemaClass):
        uuid = ID(stored=True, unique=True)
        path = STORED
        tags = KEYWORD(stored=True)
        content = TEXT

    index.create_in("indexdir", MySchema)

Whoosh 1.6.2: Added :class:`whoosh.searching.TermTrackingCollector` which tracks
which part of the query matched which documents in the final results.

Replaced the unbounded cache in :class:`whoosh.analysis.StemFilter` with a
bounded LRU (least recently used) cache. This will make stemming analysis
slightly slower but prevent it from eating up too much memory over time.
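
The trade-off can be seen with any bounded LRU cache. The sketch below uses the
standard library's ``functools.lru_cache`` (Python 3.2 and later) rather than
Whoosh's own cache, and ``slow_stem`` is a toy stand-in for a real stemming
function, so treat it purely as an illustration of the caching behaviour::

    from functools import lru_cache

    def slow_stem(word):
        """Toy stemmer: strip a trailing "s" (an assumption for this sketch)."""
        return word[:-1] if word.endswith("s") else word

    # At most 1000 results are kept; the least recently used one is evicted
    cached_stem = lru_cache(maxsize=1000)(slow_stem)

    assert cached_stem("indexes") == "indexe"
    assert cached_stem("indexes") == "indexe"   # served from the cache
    assert cached_stem.cache_info().hits == 1

With ``maxsize=None`` the cache would grow without limit, which corresponds to
the earlier unbounded behaviour.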

Added a simple :class:`whoosh.analysis.PyStemmerFilter` that works when the
py-stemmer library is installed::

    from whoosh.analysis import PyStemmerFilter, RegexTokenizer

    ana = RegexTokenizer() | PyStemmerFilter("spanish")

The estimation of memory usage for the ``limitmb`` keyword argument to
``FileIndex.writer()`` is more accurate, which should help keep memory usage
by the sorting pool closer to the limit.

The ``whoosh.ramdb`` package was removed and replaced with a single
``whoosh.ramindex`` module.

Miscellaneous bug fixes.


Whoosh 1.5
==========

.. note::
    Whoosh 1.5 is incompatible with previous indexes. You must recreate
    existing indexes with Whoosh 1.5.

Fixed a bug where postings were not portable across different endian platforms.

New generalized field cache system, using per-reader caches, for much faster
sorting and faceting of search results, as well as much faster multi-term (e.g.
prefix and wildcard) and range queries, especially for large indexes and/or
indexes with multiple segments.

Changed the faceting API. See :doc:`../facets`.

Faster storage and retrieval of posting values.

Added per-field ``multitoken_query`` attribute to control how the query parser
deals with a "term" that when analyzed generates multiple tokens. The default
value is ``"first"`` which throws away all but the first token (the previous
behavior). Other possible values are ``"and"``, ``"or"``, or ``"phrase"``.
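
As a rough model of the four settings, the sketch below describes the query
that would result from a list of tokens. The function name and the tuples
standing in for query objects are hypothetical; the real parser builds
``whoosh.query`` objects and its internals are not shown here::

    def combine_tokens(fieldname, tokens, multitoken_query="first"):
        """Describe the query built from the tokens of one parsed "term"."""
        if multitoken_query == "first":
            # Previous behavior: discard all but the first token
            return ("term", fieldname, tokens[0])
        if multitoken_query in ("and", "or", "phrase"):
            return (multitoken_query, fieldname, list(tokens))
        raise ValueError("unknown multitoken_query: %r" % multitoken_query)

    tokens = ["wi", "fi"]   # e.g. what an analyzer might produce for "wi-fi"
    assert combine_tokens("content", tokens) == ("term", "content", "wi")
    assert combine_tokens("content", tokens, "phrase") == (
        "phrase", "content", ["wi", "fi"])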

Added :class:`whoosh.analysis.DoubleMetaphoneFilter`,
:class:`whoosh.analysis.SubstitutionFilter`, and
:class:`whoosh.analysis.ShingleFilter`.

Added :class:`whoosh.qparser.CopyFieldPlugin`.

Added :class:`whoosh.query.Otherwise`.

Generalized parsing of operators (such as OR, AND, NOT, etc.) in the query
parser to make it easier to add new operators. I intend to add a better API
for this in a future release.

Switched NUMERIC and DATETIME fields to use more compact on-disk
representations of numbers.

Fixed a bug in the porter2 stemmer when stemming the string ``"y"``.

Added methods to :class:`whoosh.searching.Hit` to make it more like a ``dict``.

Short posting lists (by default, single postings) are inline in the term file
instead of written to the posting file for faster retrieval and a small saving
in disk space.


Whoosh 1.3
==========

Whoosh 1.3 adds a more efficient DATETIME field based on the new tiered NUMERIC
field, and the DateParserPlugin. See :doc:`../dates`.


Whoosh 1.2
==========

Whoosh 1.2 adds tiered indexing for NUMERIC fields, resulting in much faster
range queries on numeric fields.
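
The general idea behind tiered numeric indexing is to index each number at
several precisions, so a range can be covered by a few coarse terms instead of
one term per value. The sketch below shows only that general idea; the
``bits`` and ``step`` values and the ``(shift, prefix)`` representation are
assumptions for illustration and do not describe Whoosh's actual encoding::

    def tier_terms(value, bits=16, step=4):
        """Return (shift, prefix) terms for a value at each precision tier."""
        return [(shift, value >> shift) for shift in range(0, bits, step)]

    assert tier_terms(0x1234) == [(0, 0x1234), (4, 0x123), (8, 0x12), (12, 0x1)]

    # Every value in 0x1230..0x123F shares the term (4, 0x123), so a range
    # query over those 16 values only needs to look up that single term.
    assert all((4, 0x123) in tier_terms(v) for v in range(0x1230, 0x1240))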


Whoosh 1.0
==========

Whoosh 1.0 is a major milestone release with vastly improved performance and
several useful new features.

*The index format of this version is not compatible with indexes created by
previous versions of Whoosh*. You will need to reindex your data to use this
version.

Orders of magnitude faster searches for common terms. Whoosh now uses
optimizations similar to those in Xapian to skip reading low-scoring postings.
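
One common form of this kind of optimization is to record the best possible
score of each block of postings and skip any block that cannot beat the current
top results. The sketch below illustrates that general technique on made-up
data; the block layout and the ``top_k`` function are assumptions for this
example, not a description of Whoosh's matchers::

    import heapq

    def top_k(blocks, k):
        """Return the k best (score, docnum) pairs and the blocks read.

        Each block is (max_score, [(docnum, score), ...]).
        """
        best = []         # min-heap of the k best (score, docnum) pairs
        blocks_read = 0
        for max_score, postings in blocks:
            if len(best) == k and max_score <= best[0][0]:
                continue  # nothing in this block can enter the top k
            blocks_read += 1
            for docnum, score in postings:
                if len(best) < k:
                    heapq.heappush(best, (score, docnum))
                elif score > best[0][0]:
                    heapq.heapreplace(best, (score, docnum))
        return sorted(best, reverse=True), blocks_read

    blocks = [
        (9.0, [(1, 9.0), (2, 4.0)]),
        (3.0, [(3, 3.0), (4, 1.0)]),   # skipped: 3.0 cannot beat 4.0
        (8.0, [(5, 8.0), (6, 2.0)]),
    ]
    assert top_k(blocks, 2) == ([(9.0, 1), (8.0, 5)], 2)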

Faster indexing and ability to use multiple processors (via the
``multiprocessing`` module) to speed up indexing.

Flexible Schema: you can now add and remove fields in an index with the
:meth:`whoosh.writing.IndexWriter.add_field` and
:meth:`whoosh.writing.IndexWriter.remove_field` methods.

New hand-written query parser based on plug-ins. Less brittle, more robust,
more flexible, and easier to fix/improve than the old pyparsing-based parser.

On-disk formats now use 64-bit disk pointers allowing files larger than 4 GB.
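
The 4 GB figure follows from the range of an unsigned 32-bit integer. The
sketch below uses the standard ``struct`` module to show an offset that does
not fit in 32 bits but does fit in 64; the format codes are chosen for
illustration and are not taken from Whoosh's file formats::

    import struct

    offset = 5 * 1024 ** 3   # a position 5 GB into a file

    # An unsigned 32-bit field tops out just below 4 GB
    try:
        struct.pack("!I", offset)
        fits_in_32_bits = True
    except struct.error:
        fits_in_32_bits = False

    # An unsigned 64-bit field holds the same offset in 8 bytes
    packed = struct.pack("!Q", offset)
    assert not fits_in_32_bits
    assert len(packed) == 8
    assert struct.unpack("!Q", packed)[0] == offset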

New :class:`whoosh.searching.Facets` class efficiently sorts results into
facets based on any criteria that can be expressed as queries, for example
tags or price ranges.

New :class:`whoosh.writing.BatchWriter` class automatically batches up
individual ``add_document`` and/or ``delete_document`` calls until a certain
number of calls or a certain amount of time passes, then commits them all at
once.

New :class:`whoosh.analysis.BiWordFilter` lets you create bi-word indexed
fields, a possible alternative to phrase searching.

Fixed bug where files could be deleted before a reader could open them in
threaded situations.

New :class:`whoosh.analysis.NgramFilter` filter,
:class:`whoosh.analysis.NgramWordAnalyzer` analyzer, and
:class:`whoosh.fields.NGRAMWORDS` field type allow producing n-grams from
tokenized text.

Errors in query parsing now raise a specific ``whoosh.qparser.QueryParserError``
exception instead of a generic exception.

Previously, the query string ``*`` was optimized to a
:class:`whoosh.query.Every` query which matched every document. Now the
``Every`` query only matches documents that actually have an indexed term from
the given field, to better match the intuitive sense of what a query string like
``tag:*`` should do.

New :meth:`whoosh.searching.Searcher.key_terms_from_text` method lets you
extract key words from arbitrary text instead of documents in the index.

Previously the :meth:`whoosh.searching.Searcher.key_terms` and
:meth:`whoosh.searching.Results.key_terms` methods required that the given
field store term vectors. They now also work if the given field is stored
instead. They will analyze the stored string into a term vector on-the-fly.
The field must still be indexed.


User API changes
================

The default for the ``limit`` keyword argument to
:meth:`whoosh.searching.Searcher.search` is now ``10``. To return all results
in a single ``Results`` object, use ``limit=None``.

The ``Index`` object no longer represents a snapshot of the index at the time
the object was instantiated. Instead it always represents the index in the
abstract. ``Searcher`` and ``IndexReader`` objects obtained from the
``Index`` object still represent the index as it was at the time they were
created.

Because the ``Index`` object no longer represents the index at a specific
version, several methods such as ``up_to_date`` and ``refresh`` were removed
from its interface. The Searcher object now has
:meth:`~whoosh.searching.Searcher.last_modified`,
:meth:`~whoosh.searching.Searcher.up_to_date`, and
:meth:`~whoosh.searching.Searcher.refresh` methods similar to those that used to
be on ``Index``.

The document deletion and field add/remove methods on the ``Index`` object now
create a writer behind the scenes to accomplish each call. This means they write
to the index immediately, so you don't need to call ``commit`` on the ``Index``.
However, if you need to call these methods multiple times, it will be much
faster to create your own writer instead::

    # Don't do this
    for id in my_list_of_ids_to_delete:
        myindex.delete_by_term("id", id)
    myindex.commit()

    # Instead do this
    writer = myindex.writer()
    for id in my_list_of_ids_to_delete:
        writer.delete_by_term("id", id)
    writer.commit()

The ``postlimit`` argument to ``Index.writer()`` has been changed to
``postlimitmb`` and is now expressed in megabytes instead of bytes::

    writer = myindex.writer(postlimitmb=128)

Instead of having to import ``whoosh.filedb.filewriting.NO_MERGE`` or
``whoosh.filedb.filewriting.OPTIMIZE`` to use as arguments to ``commit()``, you
can now simply do the following::

    # Do not merge segments
    writer.commit(merge=False)

    # or

    # Merge all segments
    writer.commit(optimize=True)

The ``whoosh.postings`` module is gone. The ``whoosh.matching`` module contains
classes for posting list readers.

Whoosh no longer maps field names to numbers for internal use or writing to
disk. Any low-level method that accepted field numbers now accepts field names
instead.

Custom Weighting implementations that use the ``final()`` method must now
set the ``use_final`` attribute to ``True``::

    from whoosh.scoring import BM25F

    class MyWeighting(BM25F):
        use_final = True

        def final(self, searcher, docnum, score):
            return score + docnum * 10

This disables the new optimizations, forcing Whoosh to score every matching
document.

:class:`whoosh.writing.AsyncWriter` now takes a :class:`whoosh.index.Index`
object as its first argument, not a callable. Also, the keyword arguments to
pass to the index's ``writer()`` method should now be passed as a dictionary
using the ``writerargs`` keyword argument.

Whoosh now stores per-document field length using an approximation rather than
exactly. For low numbers the approximation is perfectly accurate, while high
numbers will be approximated less accurately.
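
To illustrate how an encoding can be exact for small values and approximate
for large ones, the sketch below squeezes a length into a single byte. The
threshold of 128 and the logarithmic base of 1.05 are arbitrary choices made
for this example; Whoosh's real encoding is different and is not shown here::

    import math

    def length_to_byte(length):
        """Squeeze a field length into one byte, exact up to 127."""
        if length < 128:
            return length
        # Above 127, step logarithmically so each code covers a wider span
        return min(255, 128 + int(math.log(length / 128.0, 1.05)))

    def byte_to_length(code):
        """Return the (approximate) length represented by a byte code."""
        if code < 128:
            return code
        return int(128 * 1.05 ** (code - 128))

    # Short fields round-trip exactly; long ones come back slightly low
    assert byte_to_length(length_to_byte(42)) == 42
    assert 950 <= byte_to_length(length_to_byte(1000)) <= 1000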

The ``doc_field_length`` method on searchers and readers now takes a second
argument representing the default to return if the given document and field
do not have a length (i.e. the field is not scored or the field was not
provided for the given document).

The :class:`whoosh.analysis.StopFilter` now has a ``maxsize`` argument as well
as a ``minsize`` argument to its initializer. Analyzers that use the
``StopFilter`` have the ``maxsize`` argument in their initializers now also.

The interface of :class:`whoosh.writing.AsyncWriter` has changed.


Misc
====

* Because the file backend now writes 64-bit disk pointers and field names
  instead of numbers, the size of an index on disk will grow compared to
  previous versions.

* Unit tests should no longer leave directories and files behind.