1===========================================
2Indexing and searching document hierarchies
3===========================================
4
5Overview
6========
7
8Whoosh's full-text index is essentially a flat database of documents. However,
9Whoosh supports two techniques for simulating the indexing and querying of
10hierarchical documents, that is, sets of documents that form a parent-child
11hierarchy, such as "Chapter - Section - Paragraph" or
12"Module - Class - Method".
13
14You can specify parent-child relationships *at indexing time*, by grouping
15documents in the same hierarchy, and then use the
16:class:`whoosh.query.NestedParent` and/or :class:`whoosh.query.NestedChildren`
17to find parents based on their children or vice-versa.
18
19Alternatively, you can use *query time joins*, essentially like external key
20joins in a database, where you perform one search to find a relevant document,
21then use a stored value on that document (for example, a ``parent`` field) to
22look up another document.
23
24Both methods have pros and cons.
25
26
27Using nested document indexing
28==============================
29
30Indexing
31--------
32
33This method works by indexing a "parent" document and all its "child" documents
34*as a "group"* so they are guaranteed to end up in the same segment. You can
35use the context manager returned by ``IndexWriter.group()`` to group
36documents::
37
38    with ix.writer() as w:
39        with w.group():
40            w.add_document(kind="class", name="Index")
41            w.add_document(kind="method", name="add document")
42            w.add_document(kind="method", name="add reader")
43            w.add_document(kind="method", name="close")
44        with w.group():
45            w.add_document(kind="class", name="Accumulator")
46            w.add_document(kind="method", name="add")
47            w.add_document(kind="method", name="get result")
48        with w.group():
49            w.add_document(kind="class", name="Calculator")
50            w.add_document(kind="method", name="add")
51            w.add_document(kind="method", name="add all")
52            w.add_document(kind="method", name="add some")
53            w.add_document(kind="method", name="multiply")
54            w.add_document(kind="method", name="close")
55        with w.group():
56            w.add_document(kind="class", name="Deleter")
57            w.add_document(kind="method", name="add")
58            w.add_document(kind="method", name="delete")
59
60Alternatively you can use the ``start_group()`` and ``end_group()`` methods::
61
62    with ix.writer() as w:
63        w.start_group()
64        w.add_document(kind="class", name="Index")
65        w.add_document(kind="method", name="add document")
66        w.add_document(kind="method", name="add reader")
67        w.add_document(kind="method", name="close")
68        w.end_group()
69
70Each level of the hierarchy should have a query that distinguishes it from
71other levels (for example, in the above index, you can use ``kind:class`` or
72``kind:method`` to match different levels of the hierarchy).
73
74Once you've indexed the hierarchy of documents, you can use two query types to
75find parents based on children or vice-versa.
76
77(There is currently no support in the default query parser for nested queries.)
78
79
80NestedParent query
81------------------
82
83The :class:`whoosh.query.NestedParent` query type lets you specify a query for
84child documents, but have the query return an "ancestor" document from higher
85in the hierarchy::
86
87    # First, we need a query that matches all the documents in the "parent"
88    # level we want of the hierarchy
89    all_parents = query.Term("kind", "class")
90
91    # Then, we need a query that matches the children we want to find
92    wanted_kids = query.Term("name", "close")
93
94    # Now we can make a query that will match documents where "name" is
95    # "close", but the query will return the "parent" documents of the matching
96    # children
97    q = query.NestedParent(all_parents, wanted_kids)
98    # results = Index, Calculator
99
100Note that in a hierarchy with more than two levels, you can specify a "parents"
101query that matches any level of the hierarchy, so you can return the top-level
102ancestors of the matching children, or the second level, third level, etc.
103
104The query works by first building a bit vector representing which documents are
105"parents"::
106
107     Index
108     |      Calculator
109     |      |
110     1000100100000100
111         |        |
112         |        Deleter
113         Accumulator
114
115Then for each match of the "child" query, it calculates the previous parent
116from the bit vector and returns it as a match (it only returns each parent once
117no matter how many children match). This parent lookup is very efficient::
118
119     1000100100000100
120        |
121     |<-+ close
122
123
124NestedChildren query
125--------------------
126
127The opposite of ``NestedParent`` is :class:`whoosh.query.NestedChildren`. This
128query lets you match parents but return their children. This is useful, for
129example, to search for an album title and return the songs in the album::
130
131    # Query that matches all documents in the "parent" level we want to match
132    # at
133    all_parents = query.Term("kind", "album")
134
135    # Parent documents we want to match
136    wanted_parents = query.Term("album_title", "heaven")
137
138    # Now we can make a query that will match parent documents where "album_title"
139    # contains "heaven", but the query will return the "child" documents of the
140    # matching parents
141    q1 = query.NestedChildren(all_parents, wanted_parents)
142
143You can then combine that query with an ``AND`` clause, for example to find
144songs with "hell" in the song title that occur on albums with "heaven" in the
145album title::
146
147    q2 = query.And([q1, query.Term("song_title", "hell")])
148
149
150Deleting and updating hierarchical documents
151--------------------------------------------
152
153The drawback of the index-time method is *updating and deleting*. Because the
154implementation of the queries depends on the parent and child documents being
155contiguous in the segment, you can't update/delete just one child document.
156You can only update/delete an entire top-level document at once (for example,
157if your hierarchy is "Chapter - Section - Paragraph", you can only update or
158delete entire chapters, not a section or paragraph). If the top-level of the
159hierarchy represents very large blocks of text, this can involve a lot of
160deleting and reindexing.
161
162Currently ``Writer.update_document()`` does not automatically work with nested
163documents. You must manually delete and re-add document groups to update them.
164
165To delete nested document groups, use the ``Writer.delete_by_query()``
166method with a ``NestedParent`` query::
167
168    # Delete the "Accumulator" class
169    all_parents = query.Term("kind", "class")
170    to_delete = query.Term("name", "Accumulator")
171    q = query.NestedParent(all_parents, to_delete)
172    with myindex.writer() as w:
173        w.delete_by_query(q)
174
175
176Using query-time joins
177======================
178
179A second technique for simulating hierarchical documents in Whoosh involves
180using a stored field on each document to point to its parent, and then using
181the value of that field at query time to find parents and children.
182
183For example, if we index a hierarchy of classes and methods using pointers
184to parents instead of nesting::
185
186    # Store a pointer to the parent on each "method" document
187    with ix.writer() as w:
188        w.add_document(kind="class", c_name="Index", docstring="...")
189        w.add_document(kind="method", m_name="add document", parent="Index")
190        w.add_document(kind="method", m_name="add reader", parent="Index")
191        w.add_document(kind="method", m_name="close", parent="Index")
192
193        w.add_document(kind="class", c_name="Accumulator", docstring="...")
194        w.add_document(kind="method", m_name="add", parent="Accumulator")
195        w.add_document(kind="method", m_name="get result", parent="Accumulator")
196
197        w.add_document(kind="class", c_name="Calculator", docstring="...")
198        w.add_document(kind="method", m_name="add", parent="Calculator")
199        w.add_document(kind="method", m_name="add all", parent="Calculator")
200        w.add_document(kind="method", m_name="add some", parent="Calculator")
201        w.add_document(kind="method", m_name="multiply", parent="Calculator")
202        w.add_document(kind="method", m_name="close", parent="Calculator")
203
204        w.add_document(kind="class", c_name="Deleter", docstring="...")
205        w.add_document(kind="method", m_name="add", parent="Deleter")
206        w.add_document(kind="method", m_name="delete", parent="Deleter")
207
208    # Now do manual joins at query time
209    with ix.searcher() as s:
210        # Tip: Searcher.document() and Searcher.documents() let you look up
211        # documents by field values more easily than using Searcher.search()
212
213        # Children to parents:
214        # Print the docstrings of classes on which "close" methods occur
215        for child_doc in s.documents(m_name="close"):
216            # Use the stored value of the "parent" field to look up the parent
217            # document
218            parent_doc = s.document(c_name=child_doc["parent"])
219            # Print the parent document's stored docstring field
220            print(parent_doc["docstring"])
221
222        # Parents to children:
223        # Find classes with "big" in the docstring and print their methods
224        q = query.Term("kind", "class") & query.Term("docstring", "big")
225        for hit in s.search(q, limit=None):
226            print("Class name=", hit["c_name"], "methods:")
227            for child_doc in s.documents(parent=hit["c_name"]):
228                print("  Method name=", child_doc["m_name"])
229
230This technique is more flexible than index-time nesting in that you can
231delete/update individual documents in the hierarchy piece by piece, although it
232doesn't support finding different parent levels as easily. It is also slower
233than index-time nesting (potentially much slower), since you must perform
234additional searches for each found document.
235
236Future versions of Whoosh may include "join" queries to make this process more
237efficient (or at least more automatic).
238
239