=====================================================
"Did you mean... ?" Correcting errors in user queries
=====================================================

Overview
========

Whoosh can quickly suggest replacements for mistyped words by returning
a list of words from the index (or a dictionary) that are close to the
mistyped word::

    # ix is an open index; mistyped_words is any iterable of strings
    with ix.searcher() as s:
        corrector = s.corrector("text")
        for mistyped_word in mistyped_words:
            print(corrector.suggest(mistyped_word, limit=3))

See the :meth:`whoosh.spelling.Corrector.suggest` method documentation
for information on the arguments.

Currently the suggestion engine is more like a "typo corrector" than a
real "spell checker", since it doesn't do the kind of sophisticated
phonetic matching or semantic/contextual analysis a good spell checker
might. However, it is still useful for catching common typing errors.

There are two main strategies for correcting words:

* Use the terms from an index field.

* Use words from a word list.


Pulling suggestions from an indexed field
=========================================

In Whoosh 2.7 and later, spelling suggestions are available on all fields.
However, if you have an analyzer that modifies the indexed words (such as
stemming), you can add ``spelling=True`` to a field to have it store separate
unmodified versions of the terms for spelling suggestions::

    from whoosh import analysis, fields

    ana = analysis.StemmingAnalyzer()
    schema = fields.Schema(text=fields.TEXT(analyzer=ana, spelling=True))

You can then use the :meth:`whoosh.searching.Searcher.corrector` method
to get a corrector for a field::

    corrector = searcher.corrector("text")

The advantage of using the contents of an index field is that when you
are spell checking queries on that index, the suggestions are tailored
to the contents of the index. The disadvantage is that if the indexed
documents contain spelling errors, then the spelling suggestions will
also be erroneous.
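
The trade-off described above can be illustrated without Whoosh. The
following is a minimal pure-Python sketch, not Whoosh's implementation: it
builds a vocabulary from a few documents and suggests words within a small
edit distance of the input. The names ``edit_distance``, ``suggest``,
``docs``, and ``vocabulary`` are hypothetical and exist only for this
illustration. Because one document contains the misspelling "recieve", that
misspelling is offered alongside the correct word.

```python
def edit_distance(a, b):
    """Return the Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete ca
                           cur[j - 1] + 1,       # insert cb
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def suggest(word, vocabulary, limit=3, maxdist=2):
    """Return up to `limit` vocabulary words within `maxdist` edits of `word`.

    Results are ordered by distance, then alphabetically.
    """
    scored = []
    for candidate in vocabulary:
        dist = edit_distance(word, candidate)
        if dist <= maxdist:
            scored.append((dist, candidate))
    scored.sort()
    return [candidate for _, candidate in scored[:limit]]


# Hypothetical "indexed" documents; the second one contains a typo.
docs = [
    "please receive the parcel",
    "we did not recieve the letter",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Both the correct spelling and the indexed typo are one edit away.
print(suggest("recive", vocabulary))
```

A curated word list avoids this problem, at the cost of suggestions that
may not match anything in the index.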


Pulling suggestions from a word list
====================================

There are plenty of word lists available on the internet that you can use to
populate the spelling dictionary.

(In the following examples, ``word_list`` can be a list of unicode
strings, or a file object with one word on each line.)

To create a :class:`whoosh.spelling.Corrector` object from a sorted word list::

    from whoosh.spelling import ListCorrector

    # word_list must be a sorted list of unicode strings
    corrector = ListCorrector(word_list)


Merging two or more correctors
==============================

You can combine suggestions from two sources (for example, the contents
of an index field and a word list) using a
:class:`whoosh.spelling.MultiCorrector`::

    from whoosh import spelling

    c1 = searcher.corrector("content")
    c2 = spelling.ListCorrector(word_list)
    corrector = spelling.MultiCorrector([c1, c2])


Correcting user queries
=======================

You can spell-check a user query using the
:meth:`whoosh.searching.Searcher.correct_query` method::

    from whoosh import qparser

    # Parse the user query string
    qp = qparser.QueryParser("content", myindex.schema)
    q = qp.parse(qstring)

    # Try correcting the query
    with myindex.searcher() as s:
        corrected = s.correct_query(q, qstring)
        if corrected.query != q:
            print("Did you mean:", corrected.string)

The ``correct_query`` method returns an object with the following
attributes:

``query``
    A corrected :class:`whoosh.query.Query` tree. You can test
    whether this is equal (``==``) to the original parsed query to
    check if the corrector actually changed anything.

``string``
    A corrected version of the user's query string.

``tokens``
    A list of corrected token objects representing the corrected
    terms. You can use this to reformat the user query (see below).


You can use a :class:`whoosh.highlight.Formatter` object to format the
corrected query string. For example, use the
:class:`~whoosh.highlight.HtmlFormatter` to format the corrected string
as HTML::

    from whoosh import highlight

    hf = highlight.HtmlFormatter()
    corrected = s.correct_query(q, qstring, formatter=hf)

See the documentation for
:meth:`whoosh.searching.Searcher.correct_query` for information on the
defaults and arguments.
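
As a rough illustration of what formatting a corrected query involves, the
sketch below rebuilds a query string from a list of corrections in plain
Python. It does not use Whoosh: ``Tok`` and ``mark_up`` are hypothetical
names, ``Tok`` is only a stand-in record holding a replacement word and the
character span it replaces, and the ``<strong>`` wrapper is an arbitrary
choice, not the markup ``HtmlFormatter`` actually emits.

```python
from collections import namedtuple
from html import escape

# Hypothetical stand-in for a corrected token: the replacement text plus the
# character span of the original word in the query string.
Tok = namedtuple("Tok", "text startchar endchar")


def mark_up(qstring, tokens, tag="strong"):
    """Rebuild qstring with each corrected span replaced and wrapped in <tag>."""
    parts = []
    pos = 0
    for tok in sorted(tokens, key=lambda t: t.startchar):
        parts.append(escape(qstring[pos:tok.startchar]))  # untouched text
        parts.append("<{0}>{1}</{0}>".format(tag, escape(tok.text)))
        pos = tok.endchar
    parts.append(escape(qstring[pos:]))  # tail after the last correction
    return "".join(parts)


qstring = "whoosh serch libary"
tokens = [Tok("search", 7, 12), Tok("library", 13, 19)]
print(mark_up(qstring, tokens))
```

Working from character offsets leaves the rest of the user's input
(operators, quoting, spacing) untouched and rewrites only the corrected
words.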