========================
Whoosh 1.x release notes
========================

Whoosh 1.8.3
============

Whoosh 1.8.3 contains important bugfixes and new functionality. Thanks to all
the mailing list and BitBucket users who helped with the fixes!

Fixed a bad ``Collector`` bug where the docset of a Results object did not match
the actual results.

You can now pass a sequence of objects to a keyword argument in ``add_document``
and ``update_document`` (currently this will not work for unique fields in
``update_document``). This is useful for non-text fields such as ``DATETIME``
and ``NUMERIC``, allowing you to index multiple dates/numbers for a document::

    writer.add_document(shoe=u"Saucony Kinvara", sizes=[10.0, 9.5, 12])

This version reverts to using the CDB hash function for hash files instead of
Python's ``hash()`` because the latter is not meant to be stored externally.
This change maintains backwards compatibility with old files.

The ``Searcher.search`` method now takes a ``mask`` keyword argument. This is
the opposite of the ``filter`` argument. Where the ``filter`` specifies the
set of documents that can appear in the results, the ``mask`` specifies a
set of documents that must not appear in the results.

Fixed performance problems in ``Searcher.more_like``. This method now also
takes a ``filter`` keyword argument like ``Searcher.search``.

Improved documentation.


Whoosh 1.8.2
============

Whoosh 1.8.2 fixes some bugs, including a mistyped signature in
Searcher.more_like and a bad bug in Collector that could screw up the
ordering of results given certain parameters.


Whoosh 1.8.1
============

Whoosh 1.8.1 includes a few recent bugfixes/improvements:

- ListMatcher.skip_to_quality() wasn't returning an integer, resulting
  in a "None + int" error.

- Fixed locking and memcache sync bugs in the Google App Engine storage
  object.

- MultifieldPlugin wasn't working correctly with groups.

- The binary matcher trees of Or and And are now generated using a
  Huffman-like algorithm instead of being perfectly balanced. This gives a
  noticeable speed improvement because less information has to be passed
  up/down the tree.


Whoosh 1.8
==========

This release relicensed the Whoosh source code under the Simplified BSD (A.K.A.
"two-clause" or "FreeBSD") license. See LICENSE.txt for more information.


Whoosh 1.7.7
============

Setting a TEXT field to store term vectors is now much easier. Instead of
having to pass an instantiated whoosh.formats.Format object to the vector=
keyword argument, you can pass True to automatically use the same format and
analyzer as the inverted index. Alternatively, you can pass a Format subclass
and Whoosh will instantiate it for you.

For example, to store term vectors using the same settings as the inverted
index (Positions format and StandardAnalyzer)::

    from whoosh.fields import Schema, TEXT

    schema = Schema(content=TEXT(vector=True))

To store term vectors that use the same analyzer as the inverted index
(StandardAnalyzer by default) but only store term frequency::

    from whoosh.fields import Schema, TEXT
    from whoosh.formats import Frequency

    schema = Schema(content=TEXT(vector=Frequency))

Note that currently the only place term vectors are used in Whoosh is keyword
extraction/more like this, but they can be useful for expert users with custom
code.

Added :meth:`whoosh.searching.Searcher.more_like` and
:meth:`whoosh.searching.Hit.more_like_this` methods, as shortcuts for doing
keyword extraction yourself. Both return a Results object.

"python setup.py test" works again, as long as you have nose installed.

The :meth:`whoosh.searching.Searcher.sort_query_using` method lets you sort
documents matching a given query using an arbitrary function.
Note that like "complex" searching with the Sorter object, this can be slow on
large multi-segment indexes.


Whoosh 1.7
==========

You can once again perform complex sorting of search results (that is, a sort
with some fields ascending and some fields descending).

You can still use the ``sortedby`` keyword argument to
:meth:`whoosh.searching.Searcher.search` to do a simple sort (where all fields
are sorted in the same direction), or you can use the new
:class:`~whoosh.sorting.Sorter` class to do a simple or complex sort::

    searcher = myindex.searcher()
    sorter = searcher.sorter()
    # Sort first by the group field, ascending
    sorter.add_field("group")
    # Then by the price field, descending
    sorter.add_field("price", reverse=True)
    # Get the Results
    results = sorter.sort_query(myquery)

See the documentation for the :class:`~whoosh.sorting.Sorter` class for more
information. Bear in mind that complex sorts will be much slower on large
indexes because they can't use the per-segment field caches.

You can now get highlighted snippets for a hit automatically using
:meth:`whoosh.searching.Hit.highlights`::

    results = searcher.search(myquery, limit=20)
    for hit in results:
        print(hit["title"])
        print(hit.highlights("content"))

See :meth:`whoosh.searching.Hit.highlights` for more information.

Added the ability to filter search results so that only hits in a Results
set, a set of docnums, or matching a query are returned. The filter is
cached on the searcher::

    # Search within previous results
    newresults = searcher.search(newquery, filter=oldresults)

    # Search within the "basics" chapter
    results = searcher.search(userquery, filter=query.Term("chapter", "basics"))

You can now specify a time limit for a search.
If the search does not finish
in the given time, a :class:`whoosh.searching.TimeLimit` exception is raised,
but you can still retrieve the partial results from the collector. See the
``timelimit`` and ``greedy`` arguments in the
:class:`whoosh.searching.Collector` documentation.

Added back the ability to set :class:`whoosh.analysis.StemFilter` to use an
unlimited cache. This is useful for one-shot batch indexing (see
:doc:`../batch`).

The ``normalize()`` method of the ``And`` and ``Or`` queries now merges
overlapping range queries for more efficient queries.

Query objects now have ``__hash__`` methods allowing them to be used as
dictionary keys.

The API of the highlight module has changed slightly. Most of the functions
in the module have been converted to classes. However, most old code should
still work. The ``NullFragmeter`` is now called ``WholeFragmenter``, but the
old name is still available as an alias.

Fixed MultiPool so it won't fill up the temp directory with job files.

Fixed a bug where Phrase query objects did not use their boost factor.

Fixed a bug where a fieldname after an open parenthesis wasn't parsed
correctly. The change alters the semantics of certain parsing "corner cases"
(such as ``a:b:c:d``).


Whoosh 1.6
==========

The ``whoosh.writing.BatchWriter`` class is now called
:class:`whoosh.writing.BufferedWriter`.
It is similar to the old ``BatchWriter``
class but allows you to search and update the buffered documents as well as the
documents that have been flushed to disk::

    from whoosh import writing

    writer = writing.BufferedWriter(myindex)

    # You can update (replace) documents in RAM without having to commit them
    # to disk
    writer.add_document(path="/a", text="Hi there")
    writer.update_document(path="/a", text="Hello there")

    # Search committed and uncommitted documents by getting a searcher from the
    # writer instead of the index
    searcher = writer.searcher()

(BatchWriter is still available as an alias for backwards compatibility.)

The :class:`whoosh.qparser.QueryParser` initialization method now requires a
schema as the second argument. Previously the default was to create a
``QueryParser`` without a schema, which was confusing::

    from whoosh import qparser

    qp = qparser.QueryParser("content", myindex.schema)

The :meth:`whoosh.searching.Searcher.search` method now takes a ``scored``
keyword. If you search with ``scored=False``, the results will be in "natural"
order (the order the documents were added to the index). This is useful when
you don't need scored results but want the convenience of the Results object.

Added the :class:`whoosh.qparser.GtLtPlugin` parser plugin to allow greater
than/less than as an alternative syntax for ranges::

    count:>100 tag:<=zebra date:>='29 march 2001'

Added the ability to define schemas declaratively, similar to Django models::

    from whoosh import index
    from whoosh.fields import SchemaClass, ID, KEYWORD, STORED, TEXT

    class MySchema(SchemaClass):
        uuid = ID(stored=True, unique=True)
        path = STORED
        tags = KEYWORD(stored=True)
        content = TEXT

    index.create_in("indexdir", MySchema)

Whoosh 1.6.2: Added :class:`whoosh.searching.TermTrackingCollector` which tracks
which part of the query matched which documents in the final results.
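As a rough illustration of what term tracking means, the sketch below records,
for each matching document, which query terms matched it. This is not Whoosh's
implementation: the toy ``postings`` mapping and the ``track_terms`` helper are
hypothetical names invented for this example.

```python
# Toy inverted index: term -> set of document numbers containing it.
# Hypothetical data for illustration only.
postings = {
    "render": {0, 2, 3},
    "shade": {1, 2},
    "texture": {3},
}


def track_terms(query_terms, postings):
    """Return {docnum: sorted list of the query terms that matched it}."""
    matched = {}
    for term in query_terms:
        for docnum in postings.get(term, ()):
            matched.setdefault(docnum, []).append(term)
    return {docnum: sorted(terms) for docnum, terms in matched.items()}


# Treat the two terms as an OR query: a document matching either is a hit.
tracked = track_terms(["render", "shade"], postings)
for docnum in sorted(tracked):
    print(docnum, tracked[docnum])
```

Document 2 is reported with both terms, while documents 0, 1 and 3 are each
reported with only the single term that matched them.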

Replaced the unbounded cache in :class:`whoosh.analysis.StemFilter` with a
bounded LRU (least recently used) cache. This will make stemming analysis
slightly slower but prevent it from eating up too much memory over time.

Added a simple :class:`whoosh.analysis.PyStemmerFilter` that works when the
py-stemmer library is installed::

    from whoosh.analysis import RegexTokenizer, PyStemmerFilter

    ana = RegexTokenizer() | PyStemmerFilter("spanish")

The estimation of memory usage for the ``limitmb`` keyword argument to
``FileIndex.writer()`` is more accurate, which should help keep memory usage
by the sorting pool closer to the limit.

The ``whoosh.ramdb`` package was removed and replaced with a single
``whoosh.ramindex`` module.

Miscellaneous bug fixes.


Whoosh 1.5
==========

.. note::
   Whoosh 1.5 is incompatible with previous indexes. You must recreate
   existing indexes with Whoosh 1.5.

Fixed a bug where postings were not portable across different endian platforms.

New generalized field cache system, using per-reader caches, for much faster
sorting and faceting of search results, as well as much faster multi-term (e.g.
prefix and wildcard) and range queries, especially for large indexes and/or
indexes with multiple segments.

Changed the faceting API. See :doc:`../facets`.

Faster storage and retrieval of posting values.

Added per-field ``multitoken_query`` attribute to control how the query parser
deals with a "term" that when analyzed generates multiple tokens. The default
value is ``"first"`` which throws away all but the first token (the previous
behavior). Other possible values are ``"and"``, ``"or"``, or ``"phrase"``.

Added :class:`whoosh.analysis.DoubleMetaphoneFilter`,
:class:`whoosh.analysis.SubstitutionFilter`, and
:class:`whoosh.analysis.ShingleFilter`.

Added :class:`whoosh.qparser.CopyFieldPlugin`.

Added :class:`whoosh.query.Otherwise`.
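To make the ``multitoken_query`` values listed in these 1.5 notes more
concrete, here is a minimal sketch of how the four modes could combine the
tokens an analyzer produces from a single query "term". It does not use
Whoosh's query classes; the ``combine_tokens`` helper and the tuple-based query
representation are assumptions made purely for illustration.

```python
MODES = ("first", "and", "or", "phrase")


def combine_tokens(fieldname, tokens, mode="first"):
    """Combine analyzed tokens into a toy query tuple according to mode."""
    if mode not in MODES:
        raise ValueError("unknown multitoken_query mode: %r" % (mode,))
    if not tokens:
        return None
    if mode == "first" or len(tokens) == 1:
        # Keep only the first token and discard the rest.
        return ("term", fieldname, tokens[0])
    if mode == "phrase":
        # The tokens must appear next to each other, in order.
        return ("phrase", fieldname, list(tokens))
    # "and" / "or": each token becomes its own term query.
    return (mode, [("term", fieldname, t) for t in tokens])


# Suppose an analyzer splits the query text "wi-fi" into two tokens.
tokens = ["wi", "fi"]
for mode in MODES:
    print(mode, combine_tokens("content", tokens, mode))
```

With ``"first"`` the second token is silently dropped, which is why the other
modes can be preferable for fields whose analyzer splits compound words.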

Generalized parsing of operators (such as OR, AND, NOT, etc.) in the query
parser to make it easier to add new operators. I intend to add a better API
for this in a future release.

Switched NUMERIC and DATETIME fields to use more compact on-disk
representations of numbers.

Fixed a bug in the porter2 stemmer when stemming the string ``"y"``.

Added methods to :class:`whoosh.searching.Hit` to make it more like a ``dict``.

Short posting lists (by default, single postings) are inline in the term file
instead of written to the posting file for faster retrieval and a small saving
in disk space.


Whoosh 1.3
==========

Whoosh 1.3 adds a more efficient DATETIME field based on the new tiered NUMERIC
field, and the DateParserPlugin. See :doc:`../dates`.


Whoosh 1.2
==========

Whoosh 1.2 adds tiered indexing for NUMERIC fields, resulting in much faster
range queries on numeric fields.


Whoosh 1.0
==========

Whoosh 1.0 is a major milestone release with vastly improved performance and
several useful new features.

*The index format of this version is not compatible with indexes created by
previous versions of Whoosh*. You will need to reindex your data to use this
version.

Orders of magnitude faster searches for common terms. Whoosh now uses
optimizations similar to those in Xapian to skip reading low-scoring postings.

Faster indexing and ability to use multiple processors (via ``multiprocessing``
module) to speed up indexing.

Flexible Schema: you can now add and remove fields in an index with the
:meth:`whoosh.writing.IndexWriter.add_field` and
:meth:`whoosh.writing.IndexWriter.remove_field` methods.

New hand-written query parser based on plug-ins. Less brittle, more robust,
more flexible, and easier to fix/improve than the old pyparsing-based parser.
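The idea behind skipping low-scoring postings, mentioned earlier in these 1.0
notes, can be sketched in a few lines. This is a simplified illustration for a
single term, not the matcher code used by Whoosh or Xapian. The block layout
(each block stores the best score it contains) and the ``top_k`` function are
assumptions made for the example.

```python
import heapq


def top_k(blocks, k):
    """Return the k best (score, docnum) pairs and the number of blocks read.

    ``blocks`` is a list of (max_score, postings) pairs, where postings is a
    list of (docnum, score) tuples and max_score is the best score in them.
    """
    if k < 1:
        return [], 0
    heap = []        # min-heap holding the k best (score, docnum) pairs so far
    blocks_read = 0
    for max_score, postings in blocks:
        # Once k hits are collected, a block whose best possible score cannot
        # beat the current k-th best hit is skipped without being read.
        if len(heap) == k and max_score <= heap[0][0]:
            continue
        blocks_read += 1
        for docnum, score in postings:
            if len(heap) < k:
                heapq.heappush(heap, (score, docnum))
            elif score > heap[0][0]:
                heapq.heapreplace(heap, (score, docnum))
    return sorted(heap, reverse=True), blocks_read


blocks = [
    (9.0, [(0, 9.0), (1, 2.5)]),
    (1.0, [(2, 1.0), (3, 0.5)]),   # cannot beat the hits already collected
    (7.0, [(4, 7.0), (5, 3.0)]),
]
hits, blocks_read = top_k(blocks, k=2)
print(hits, blocks_read)
```

Only two of the three blocks are read here. For a common term with many
low-scoring postings, most blocks can be skipped in this way, which is where
the speedup for common terms comes from.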

On-disk formats now use 64-bit disk pointers allowing files larger than 4 GB.

New :class:`whoosh.searching.Facets` class efficiently sorts results into
facets based on any criteria that can be expressed as queries, for example
tags or price ranges.

New :class:`whoosh.writing.BatchWriter` class automatically batches up
individual ``add_document`` and/or ``delete_document`` calls until a certain
number of calls or a certain amount of time passes, then commits them all at
once.

New :class:`whoosh.analysis.BiWordFilter` lets you create bi-word indexed
fields, a possible alternative to phrase searching.

Fixed bug where files could be deleted before a reader could open them in
threaded situations.

New :class:`whoosh.analysis.NgramFilter` filter,
:class:`whoosh.analysis.NgramWordAnalyzer` analyzer, and
:class:`whoosh.fields.NGRAMWORDS` field type allow producing n-grams from
tokenized text.

Errors in query parsing now raise a specific ``whoosh.qparser.QueryParserError``
exception instead of a generic exception.

Previously, the query string ``*`` was optimized to a
:class:`whoosh.query.Every` query which matched every document. Now the
``Every`` query only matches documents that actually have an indexed term from
the given field, to better match the intuitive sense of what a query string like
``tag:*`` should do.

New :meth:`whoosh.searching.Searcher.key_terms_from_text` method lets you
extract key words from arbitrary text instead of documents in the index.

Previously the :meth:`whoosh.searching.Searcher.key_terms` and
:meth:`whoosh.searching.Results.key_terms` methods required that the given
field store term vectors. They now also work if the given field is stored
instead. They will analyze the stored string into a term vector on-the-fly.
The field must still be indexed.
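Analyzing a stored string into a term vector on the fly amounts to tokenizing
the text and counting the terms. The sketch below shows that step with the
standard library only. The regular-expression tokenizer stands in for a real
analyzer, and the ``term_vector`` and ``key_terms`` functions are hypothetical
helpers, not Whoosh's API. Real key term extraction also weighs terms by how
rare they are in the index, which this example leaves out.

```python
import re
from collections import Counter


def term_vector(text):
    """Analyze text into {term: frequency}, a stand-in for a term vector."""
    # Lowercase and split on word characters; a real analyzer may also
    # remove stop words or stem the tokens.
    return Counter(re.findall(r"\w+", text.lower()))


def key_terms(text, numterms=3):
    """Return the numterms most frequent terms with their counts."""
    return term_vector(text).most_common(numterms)


stored = "The quick fox and the lazy dog saw the other fox"
print(key_terms(stored, numterms=2))
```

Because a plain frequency count favors words such as "the", an analyzer with
stop word removal gives far more useful key terms in practice.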


User API changes
================

The default for the ``limit`` keyword argument to
:meth:`whoosh.searching.Searcher.search` is now ``10``. To return all results
in a single ``Results`` object, use ``limit=None``.

The ``Index`` object no longer represents a snapshot of the index at the time
the object was instantiated. Instead it always represents the index in the
abstract. ``Searcher`` and ``IndexReader`` objects obtained from the
``Index`` object still represent the index as it was at the time they were
created.

Because the ``Index`` object no longer represents the index at a specific
version, several methods such as ``up_to_date`` and ``refresh`` were removed
from its interface. The Searcher object now has
:meth:`~whoosh.searching.Searcher.last_modified`,
:meth:`~whoosh.searching.Searcher.up_to_date`, and
:meth:`~whoosh.searching.Searcher.refresh` methods similar to those that used to
be on ``Index``.

The document deletion and field add/remove methods on the ``Index`` object now
create a writer behind the scenes to accomplish each call. This means they write
to the index immediately, so you don't need to call ``commit`` on the ``Index``.
Also, if you need to call them multiple times, it will be much faster to create
your own writer instead::

    # Don't do this
    for id in my_list_of_ids_to_delete:
        myindex.delete_by_term("id", id)
    myindex.commit()

    # Instead do this
    writer = myindex.writer()
    for id in my_list_of_ids_to_delete:
        writer.delete_by_term("id", id)
    writer.commit()

The ``postlimit`` argument to ``Index.writer()`` has been changed to
``postlimitmb`` and is now expressed in megabytes instead of bytes::

    writer = myindex.writer(postlimitmb=128)

Instead of having to import ``whoosh.filedb.filewriting.NO_MERGE`` or
``whoosh.filedb.filewriting.OPTIMIZE`` to use as arguments to ``commit()``, you
can now simply do the following::

    # Do not merge segments
    writer.commit(merge=False)

    # or

    # Merge all segments
    writer.commit(optimize=True)

The ``whoosh.postings`` module is gone. The ``whoosh.matching`` module contains
classes for posting list readers.

Whoosh no longer maps field names to numbers for internal use or writing to
disk. Any low-level method that accepted field numbers now accepts field names
instead.

Custom Weighting implementations that use the ``final()`` method must now
set the ``use_final`` attribute to ``True``::

    from whoosh.scoring import BM25F

    class MyWeighting(BM25F):
        use_final = True

        def final(self, searcher, docnum, score):
            return score + docnum * 10

This disables the new optimizations, forcing Whoosh to score every matching
document.

:class:`whoosh.writing.AsyncWriter` now takes an :class:`whoosh.index.Index`
object as its first argument, not a callable. Also, the keyword arguments to
pass to the index's ``writer()`` method should now be passed as a dictionary
using the ``writerargs`` keyword argument.

Whoosh now stores per-document field length using an approximation rather than
exactly. For low numbers the approximation is perfectly accurate, while high
numbers will be approximated less accurately.

The ``doc_field_length`` method on searchers and readers now takes a second
argument representing the default to return if the given document and field
do not have a length (i.e. the field is not scored or the field was not
provided for the given document).

The :class:`whoosh.analysis.StopFilter` now has a ``maxsize`` argument as well
as a ``minsize`` argument to its initializer. Analyzers that use the
``StopFilter`` now also take a ``maxsize`` argument in their initializers.

The interface of :class:`whoosh.writing.AsyncWriter` has changed.


Misc
====

* Because the file backend now writes 64-bit disk pointers and field names
  instead of numbers, the size of an index on disk will grow compared to
  previous versions.

* Unit tests should no longer leave directories and files behind.
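The trade-off behind the 64-bit disk pointers in the first item above can be
shown with the standard ``struct`` module: a 32-bit pointer takes half the
space but cannot address an offset past 4 GB. This is only an illustration.
The network byte order used here is an assumption for the example, not a
description of Whoosh's on-disk format.

```python
import struct

offset = 5 * 2 ** 30       # a position 5 GB into a file

# An unsigned 32-bit pointer takes 4 bytes but cannot hold this offset.
try:
    struct.pack("!I", offset)
    fits_in_32 = True
except struct.error:
    fits_in_32 = False

# An unsigned 64-bit pointer takes 8 bytes and round-trips the offset.
packed = struct.pack("!Q", offset)
(unpacked,) = struct.unpack("!Q", packed)

print(fits_in_32, len(packed), unpacked == offset)
```

Every stored pointer therefore costs four extra bytes, which is one reason the
index grows on disk.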