==============
Whoosh recipes
==============

General
=======

Get the stored fields for a document from the document number
--------------------------------------------------------------
::

    stored_fields = searcher.stored_fields(docnum)
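
If you don't already have a document number, it usually comes from a search
hit or from a unique-field lookup. A minimal sketch (``results`` is assumed to
be a previous search's results, and ``id`` stands for whatever unique field
your schema defines)::

    # The document number of a search hit
    docnum = results[0].docnum

    # ...or look it up by a unique field value (returns None if no match)
    docnum = searcher.document_number(id="bacon")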


Analysis
========

Eliminate words shorter/longer than N
-------------------------------------

Use a :class:`~whoosh.analysis.StopFilter` and the ``minsize`` and ``maxsize``
keyword arguments. If you just want to filter based on size and not common
words, set the ``stoplist`` to ``None``::

    sf = analysis.StopFilter(stoplist=None, minsize=2, maxsize=40)
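
A filter has to be chained after a tokenizer, so a sketch of a complete
analyzer and field using this filter (the ``content`` field name is
illustrative)::

    ana = analysis.RegexTokenizer() | analysis.LowercaseFilter() | sf
    schema = fields.Schema(content=fields.TEXT(analyzer=ana))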


Allow optional case-sensitive searches
--------------------------------------

A quick and easy way to do this is to index both the original and lowercased
versions of each word. If the user searches for an all-lowercase word, it acts
as a case-insensitive search, but if they search for a word with any uppercase
characters, it acts as a case-sensitive search::

    class CaseSensitivizer(analysis.Filter):
        def __call__(self, tokens):
            for t in tokens:
                yield t
                # Only emit the extra lowercased token when indexing,
                # not when analyzing a user's query
                if t.mode == "index":
                    low = t.text.lower()
                    if low != t.text:
                        t.text = low
                        yield t

    ana = analysis.RegexTokenizer() | CaseSensitivizer()
    [t.text for t in ana("The new SuperTurbo 5000", mode="index")]
    # ["The", "the", "new", "SuperTurbo", "superturbo", "5000"]
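
To use this at search time, build the field with this analyzer; queries then
behave as described because the extra lowercased tokens exist only in the
index. A minimal sketch (the ``title`` field name is illustrative)::

    schema = fields.Schema(title=fields.TEXT(analyzer=ana, stored=True))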


Searching
=========

Find every document
-------------------
::

    myquery = query.Every()
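
Note that a search normally returns only the top N scored hits, so to actually
retrieve every matching document, pass ``limit=None``::

    results = searcher.search(query.Every(), limit=None)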


iTunes-style search-as-you-type
-------------------------------

Use the :class:`whoosh.analysis.NgramWordAnalyzer` as the analyzer for the
field you want to search as the user types. You can save space in the index by
turning off positions in the field using ``phrase=False``, since phrase
searching on N-gram fields usually doesn't make much sense::

    # For example, to search the "title" field as the user types
    analyzer = analysis.NgramWordAnalyzer()
    title_field = fields.TEXT(analyzer=analyzer, phrase=False)
    schema = fields.Schema(title=title_field)

See the documentation for the :class:`~whoosh.analysis.NgramWordAnalyzer` class
for information on the available options.
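
At query time you can then simply re-run a search on that field each time the
user types a character. A rough sketch (``myindex`` and ``typed`` are
illustrative names)::

    from whoosh.qparser import QueryParser

    qp = QueryParser("title", schema=myindex.schema)
    with myindex.searcher() as s:
        q = qp.parse(typed)  # the text entered so far
        results = s.search(q, limit=10)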


Shortcuts
=========

Look up documents by a field value
----------------------------------
::

    # Single document (unique field value)
    stored_fields = searcher.document(id="bacon")

    # Multiple documents
    for stored_fields in searcher.documents(tag="cake"):
        ...


Sorting and scoring
===================

See :doc:`facets`.
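
For example, a simple sort by a single field might look like this (the
``price`` field name is illustrative; the field should be declared with
``sortable=True``)::

    results = searcher.search(myquery, sortedby="price")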


Score results based on the position of the matched term
--------------------------------------------------------

The following scoring function uses the position of the first occurrence of a
term in each document to calculate the score, so documents where the given
term occurs earlier will score higher::

    from whoosh import scoring

    def pos_score_fn(searcher, fieldname, text, matcher):
        # Positions of the term in the current matched document
        poses = matcher.value_as("positions")
        # The score is highest (1.0) when the first occurrence is at position 0
        return 1.0 / (poses[0] + 1)

    pos_weighting = scoring.FunctionWeighting(pos_score_fn)
    with myindex.searcher(weighting=pos_weighting) as s:
        ...


Results
=======

How many hits were there?
-------------------------

The number of *scored* hits::

    found = results.scored_length()

Depending on the arguments to the search, the exact total number of hits may be
known::

    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")

Usually, however, the exact number of documents that match the query is not
known, because the searcher can skip over blocks of documents it knows won't
show up in the "top N" list. If you call ``len(results)`` on a query where the
exact length is unknown, Whoosh will run an unscored version of the original
query to get the exact number. This is faster than the scored search, but may
still be noticeably slow on very large indexes or complex queries.

As an alternative, you might display the *estimated* total hits::

    found = results.scored_length()
    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
    else:
        low = results.estimated_min_length()
        high = results.estimated_length()

        print("Scored", found, "of between", low, "and", high, "documents")


Which terms matched in each hit?
--------------------------------
::

    # Use terms=True to record term matches for each hit
    results = searcher.search(myquery, terms=True)

    for hit in results:
        # Which terms matched in this hit?
        print("Matched:", hit.matched_terms())

        # Which terms from the query didn't match in this hit?
        print("Didn't match:", myquery.all_terms() - hit.matched_terms())


Global information
==================

How many documents are in the index?
------------------------------------
::

    # Including documents that are deleted but not yet optimized away
    numdocs = searcher.doc_count_all()

    # Not including deleted documents
    numdocs = searcher.doc_count()


What fields are in the index?
-----------------------------
::

    return myindex.schema.names()


Is term X in the index?
-----------------------
::

    return ("content", "wobble") in searcher


How many times does term X occur in the index?
----------------------------------------------
::

    # Number of times content:wobble appears in all documents
    freq = searcher.frequency("content", "wobble")

    # Number of documents containing content:wobble
    docfreq = searcher.doc_frequency("content", "wobble")


Is term X in document Y?
------------------------
::

    # Check if the "content" field of document 500 contains the term "wobble"

    # Without term vectors, skipping through the posting list...
    postings = searcher.postings("content", "wobble")
    postings.skip_to(500)
    return postings.id() == 500

    # ...or the slower but easier way
    docset = set(searcher.postings("content", "wobble").all_ids())
    return 500 in docset

    # If the field has term vectors, skipping through the vector...
    vector = searcher.vector(500, "content")
    vector.skip_to("wobble")
    return vector.id() == "wobble"

    # ...or the slower but easier way
    wordset = set(searcher.vector(500, "content").all_ids())
    return "wobble" in wordset