1=========================================== 2Indexing and searching document hierarchies 3=========================================== 4 5Overview 6======== 7 8Whoosh's full-text index is essentially a flat database of documents. However, 9Whoosh supports two techniques for simulating the indexing and querying of 10hierarchical documents, that is, sets of documents that form a parent-child 11hierarchy, such as "Chapter - Section - Paragraph" or 12"Module - Class - Method". 13 14You can specify parent-child relationships *at indexing time*, by grouping 15documents in the same hierarchy, and then use the 16:class:`whoosh.query.NestedParent` and/or :class:`whoosh.query.NestedChildren` 17to find parents based on their children or vice-versa. 18 19Alternatively, you can use *query time joins*, essentially like external key 20joins in a database, where you perform one search to find a relevant document, 21then use a stored value on that document (for example, a ``parent`` field) to 22look up another document. 23 24Both methods have pros and cons. 25 26 27Using nested document indexing 28============================== 29 30Indexing 31-------- 32 33This method works by indexing a "parent" document and all its "child" documents 34*as a "group"* so they are guaranteed to end up in the same segment. You can 35use the context manager returned by ``IndexWriter.group()`` to group 36documents:: 37 38 with ix.writer() as w: 39 with w.group(): 40 w.add_document(kind="class", name="Index") 41 w.add_document(kind="method", name="add document") 42 w.add_document(kind="method", name="add reader") 43 w.add_document(kind="method", name="close") 44 with w.group(): 45 w.add_document(kind="class", name="Accumulator") 46 w.add_document(kind="method", name="add") 47 w.add_document(kind="method", name="get result") 48 with w.group(): 49 w.add_document(kind="class", name="Calculator") 50 w.add_document(kind="method", name="add") 51 w.add_document(kind="method", name="add all") 52 w.add_document(kind="method", name="add some") 53 w.add_document(kind="method", name="multiply") 54 w.add_document(kind="method", name="close") 55 with w.group(): 56 w.add_document(kind="class", name="Deleter") 57 w.add_document(kind="method", name="add") 58 w.add_document(kind="method", name="delete") 59 60Alternatively you can use the ``start_group()`` and ``end_group()`` methods:: 61 62 with ix.writer() as w: 63 w.start_group() 64 w.add_document(kind="class", name="Index") 65 w.add_document(kind="method", name="add document") 66 w.add_document(kind="method", name="add reader") 67 w.add_document(kind="method", name="close") 68 w.end_group() 69 70Each level of the hierarchy should have a query that distinguishes it from 71other levels (for example, in the above index, you can use ``kind:class`` or 72``kind:method`` to match different levels of the hierarchy). 73 74Once you've indexed the hierarchy of documents, you can use two query types to 75find parents based on children or vice-versa. 76 77(There is currently no support in the default query parser for nested queries.) 78 79 80NestedParent query 81------------------ 82 83The :class:`whoosh.query.NestedParent` query type lets you specify a query for 84child documents, but have the query return an "ancestor" document from higher 85in the hierarchy:: 86 87 # First, we need a query that matches all the documents in the "parent" 88 # level we want of the hierarchy 89 all_parents = query.Term("kind", "class") 90 91 # Then, we need a query that matches the children we want to find 92 wanted_kids = query.Term("name", "close") 93 94 # Now we can make a query that will match documents where "name" is 95 # "close", but the query will return the "parent" documents of the matching 96 # children 97 q = query.NestedParent(all_parents, wanted_kids) 98 # results = Index, Calculator 99 100Note that in a hierarchy with more than two levels, you can specify a "parents" 101query that matches any level of the hierarchy, so you can return the top-level 102ancestors of the matching children, or the second level, third level, etc. 103 104The query works by first building a bit vector representing which documents are 105"parents":: 106 107 Index 108 | Calculator 109 | | 110 1000100100000100 111 | | 112 | Deleter 113 Accumulator 114 115Then for each match of the "child" query, it calculates the previous parent 116from the bit vector and returns it as a match (it only returns each parent once 117no matter how many children match). This parent lookup is very efficient:: 118 119 1000100100000100 120 | 121 |<-+ close 122 123 124NestedChildren query 125-------------------- 126 127The opposite of ``NestedParent`` is :class:`whoosh.query.NestedChildren`. This 128query lets you match parents but return their children. This is useful, for 129example, to search for an album title and return the songs in the album:: 130 131 # Query that matches all documents in the "parent" level we want to match 132 # at 133 all_parents = query.Term("kind", "album") 134 135 # Parent documents we want to match 136 wanted_parents = query.Term("album_title", "heaven") 137 138 # Now we can make a query that will match parent documents where "album_title" 139 # contains "heaven", but the query will return the "child" documents of the 140 # matching parents 141 q1 = query.NestedChildren(all_parents, wanted_parents) 142 143You can then combine that query with an ``AND`` clause, for example to find 144songs with "hell" in the song title that occur on albums with "heaven" in the 145album title:: 146 147 q2 = query.And([q1, query.Term("song_title", "hell")]) 148 149 150Deleting and updating hierarchical documents 151-------------------------------------------- 152 153The drawback of the index-time method is *updating and deleting*. Because the 154implementation of the queries depends on the parent and child documents being 155contiguous in the segment, you can't update/delete just one child document. 156You can only update/delete an entire top-level document at once (for example, 157if your hierarchy is "Chapter - Section - Paragraph", you can only update or 158delete entire chapters, not a section or paragraph). If the top-level of the 159hierarchy represents very large blocks of text, this can involve a lot of 160deleting and reindexing. 161 162Currently ``Writer.update_document()`` does not automatically work with nested 163documents. You must manually delete and re-add document groups to update them. 164 165To delete nested document groups, use the ``Writer.delete_by_query()`` 166method with a ``NestedParent`` query:: 167 168 # Delete the "Accumulator" class 169 all_parents = query.Term("kind", "class") 170 to_delete = query.Term("name", "Accumulator") 171 q = query.NestedParent(all_parents, to_delete) 172 with myindex.writer() as w: 173 w.delete_by_query(q) 174 175 176Using query-time joins 177====================== 178 179A second technique for simulating hierarchical documents in Whoosh involves 180using a stored field on each document to point to its parent, and then using 181the value of that field at query time to find parents and children. 182 183For example, if we index a hierarchy of classes and methods using pointers 184to parents instead of nesting:: 185 186 # Store a pointer to the parent on each "method" document 187 with ix.writer() as w: 188 w.add_document(kind="class", c_name="Index", docstring="...") 189 w.add_document(kind="method", m_name="add document", parent="Index") 190 w.add_document(kind="method", m_name="add reader", parent="Index") 191 w.add_document(kind="method", m_name="close", parent="Index") 192 193 w.add_document(kind="class", c_name="Accumulator", docstring="...") 194 w.add_document(kind="method", m_name="add", parent="Accumulator") 195 w.add_document(kind="method", m_name="get result", parent="Accumulator") 196 197 w.add_document(kind="class", c_name="Calculator", docstring="...") 198 w.add_document(kind="method", m_name="add", parent="Calculator") 199 w.add_document(kind="method", m_name="add all", parent="Calculator") 200 w.add_document(kind="method", m_name="add some", parent="Calculator") 201 w.add_document(kind="method", m_name="multiply", parent="Calculator") 202 w.add_document(kind="method", m_name="close", parent="Calculator") 203 204 w.add_document(kind="class", c_name="Deleter", docstring="...") 205 w.add_document(kind="method", m_name="add", parent="Deleter") 206 w.add_document(kind="method", m_name="delete", parent="Deleter") 207 208 # Now do manual joins at query time 209 with ix.searcher() as s: 210 # Tip: Searcher.document() and Searcher.documents() let you look up 211 # documents by field values more easily than using Searcher.search() 212 213 # Children to parents: 214 # Print the docstrings of classes on which "close" methods occur 215 for child_doc in s.documents(m_name="close"): 216 # Use the stored value of the "parent" field to look up the parent 217 # document 218 parent_doc = s.document(c_name=child_doc["parent"]) 219 # Print the parent document's stored docstring field 220 print(parent_doc["docstring"]) 221 222 # Parents to children: 223 # Find classes with "big" in the docstring and print their methods 224 q = query.Term("kind", "class") & query.Term("docstring", "big") 225 for hit in s.search(q, limit=None): 226 print("Class name=", hit["c_name"], "methods:") 227 for child_doc in s.documents(parent=hit["c_name"]): 228 print(" Method name=", child_doc["m_name"]) 229 230This technique is more flexible than index-time nesting in that you can 231delete/update individual documents in the hierarchy piece by piece, although it 232doesn't support finding different parent levels as easily. It is also slower 233than index-time nesting (potentially much slower), since you must perform 234additional searches for each found document. 235 236Future versions of Whoosh may include "join" queries to make this process more 237efficient (or at least more automatic). 238 239