==============
Whoosh recipes
==============

General
=======

Get the stored fields for a document from the document number
--------------------------------------------------------------
::

    stored_fields = searcher.stored_fields(docnum)


Analysis
========

Eliminate words shorter/longer than N
-------------------------------------

Use a :class:`~whoosh.analysis.StopFilter` and the ``minsize`` and ``maxsize``
keyword arguments. If you just want to filter based on size and not common
words, set the ``stoplist`` to ``None``::

    sf = analysis.StopFilter(stoplist=None, minsize=2, maxsize=40)


Allow optional case-sensitive searches
--------------------------------------

A quick and easy way to do this is to index both the original and lowercased
versions of each word. If the user searches for an all-lowercase word, it acts
as a case-insensitive search, but if they search for a word with any uppercase
characters, it acts as a case-sensitive search::

    class CaseSensitivizer(analysis.Filter):
        def __call__(self, tokens):
            for t in tokens:
                yield t
                if t.mode == "index":
                    low = t.text.lower()
                    if low != t.text:
                        t.text = low
                        yield t

    ana = analysis.RegexTokenizer() | CaseSensitivizer()
    [t.text for t in ana("The new SuperTurbo 5000", mode="index")]
    # ["The", "the", "new", "SuperTurbo", "superturbo", "5000"]


Searching
=========

Find every document
-------------------
::

    myquery = query.Every()


iTunes-style search-as-you-type
-------------------------------

Use the :class:`whoosh.analysis.NgramWordAnalyzer` as the analyzer for the
field you want to search as the user types. You can save space in the index by
turning off positions in the field using ``phrase=False``, since phrase
searching on N-gram fields usually doesn't make much sense::

    # For example, to search the "title" field as the user types
    analyzer = analysis.NgramWordAnalyzer()
    title_field = fields.TEXT(analyzer=analyzer, phrase=False)
    schema = fields.Schema(title=title_field)

See the documentation for the :class:`~whoosh.analysis.NgramWordAnalyzer` class
for information on the available options.


Shortcuts
=========

Look up documents by a field value
----------------------------------
::

    # Single document (unique field value)
    stored_fields = searcher.document(id="bacon")

    # Multiple documents
    for stored_fields in searcher.documents(tag="cake"):
        ...


Sorting and scoring
===================

See :doc:`facets`.
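
For example, to sort results by the value of a field rather than by relevance
score, you can pass the ``sortedby`` keyword argument to ``search()``. This is
only a minimal sketch: the ``price`` field and ``myquery`` are placeholders,
and the field would need to be declared sortable in the schema::

    with myindex.searcher() as s:
        # Documents with the lowest "price" values come first;
        # pass reverse=True to flip the order
        results = s.search(myquery, sortedby="price")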


Score results based on the position of the matched term
--------------------------------------------------------

The following scoring function uses the position of the first occurrence of a
term in each document to calculate the score, so documents with the given term
earlier in the document will score higher::

    from whoosh import scoring

    def pos_score_fn(searcher, fieldname, text, matcher):
        poses = matcher.value_as("positions")
        return 1.0 / (poses[0] + 1)

    pos_weighting = scoring.FunctionWeighting(pos_score_fn)
    with myindex.searcher(weighting=pos_weighting) as s:
        ...


Results
=======

How many hits were there?
-------------------------

The number of *scored* hits::

    found = results.scored_length()

Depending on the arguments to the search, the exact total number of hits may be
known::

    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")

Usually, however, the exact number of documents that match the query is not
known, because the searcher can skip over blocks of documents it knows won't
show up in the "top N" list. If you call ``len(results)`` on a query where the
exact length is unknown, Whoosh will run an unscored version of the original
query to get the exact number. This is faster than the scored search, but may
still be noticeably slow on very large indexes or complex queries.

As an alternative, you might display the *estimated* total hits::

    found = results.scored_length()
    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
    else:
        low = results.estimated_min_length()
        high = results.estimated_length()

        print("Scored", found, "of between", low, "and", high, "documents")


Which terms matched in each hit?
--------------------------------
::

    # Use terms=True to record term matches for each hit
    results = searcher.search(myquery, terms=True)

    for hit in results:
        # Which terms matched in this hit?
        print("Matched:", hit.matched_terms())

        # Which terms from the query didn't match in this hit?
        print("Didn't match:", myquery.all_terms() - hit.matched_terms())


Global information
==================

How many documents are in the index?
------------------------------------
::

    # Including documents that are deleted but not yet optimized away
    numdocs = searcher.doc_count_all()

    # Not including deleted documents
    numdocs = searcher.doc_count()


What fields are in the index?
-----------------------------
::

    return myindex.schema.names()


Is term X in the index?
-----------------------
::

    return ("content", "wobble") in searcher


How many times does term X occur in the index?
----------------------------------------------
::

    # Number of times content:wobble appears in all documents
    freq = searcher.frequency("content", "wobble")

    # Number of documents containing content:wobble
    docfreq = searcher.doc_frequency("content", "wobble")


Is term X in document Y?
------------------------
::

    # Check if the "content" field of document 500 contains the term "wobble"

    # Without term vectors, skipping through list...
    postings = searcher.postings("content", "wobble")
    postings.skip_to(500)
    return postings.id() == 500

    # ...or the slower but easier way
    docset = set(searcher.postings("content", "wobble").all_ids())
    return 500 in docset

    # If the field has term vectors, skipping through list...
    vector = searcher.vector(500, "content")
    vector.skip_to("wobble")
    return vector.id() == "wobble"

    # ...or the slower but easier way
    wordset = set(searcher.vector(500, "content").all_ids())
    return "wobble" in wordset
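
The vector-based variants above only work if the field stores term vectors,
which is off by default. As a minimal sketch (assuming the default vector
format is acceptable), you would enable vectors when defining the schema::

    # Store a term vector for each document's "content" field
    schema = fields.Schema(content=fields.TEXT(vector=True))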