src/backend/access/gin/README

Gin for PostgreSQL
==================

Gin was sponsored by jfg://networks (http://www.jfg-networks.com/)

Gin stands for Generalized Inverted Index and should be considered as a genie,
not a drink.

Generalized means that the index does not know which operation it accelerates.
It instead works with custom strategies, defined for specific data types (read
"Index Method Strategies" in the PostgreSQL documentation). In that sense, Gin
is similar to GiST and differs from btree indices, which have predefined,
comparison-based operations.

An inverted index is an index structure storing a set of (key, posting list)
pairs, where 'posting list' is a set of heap rows in which the key occurs.
(A text document would usually contain many keys.) The primary goal of
Gin indices is support for highly scalable, full-text search in PostgreSQL.

A Gin index consists of a B-tree index constructed over key values,
where each key is an element of some indexed item (an element of an array,
a lexeme for a tsvector) and where each tuple in a leaf page contains either
a pointer to a B-tree over item pointers (posting tree), or a simple list of
item pointers (posting list) if the list is small enough.

Note: There is no delete operation in the key (entry) tree. The reason for
this is that in our experience, the set of distinct words in a large corpus
changes very slowly. This greatly simplifies the code and concurrency
algorithms.

Core PostgreSQL includes built-in Gin support for one-dimensional arrays
(e.g. integer[], text[]). The following operations are available:

  * contains: value_array @> query_array
  * overlaps: value_array && query_array
  * is contained by: value_array <@ query_array

Synopsis
--------

=# create index txt_idx on aa using gin(a);

Features
--------

  * Concurrency
  * Write-Ahead Logging (WAL). (Recoverability from crashes.)
  * User-defined opclasses. (The scheme is similar to GiST.)
  * Optimized index creation (makes use of maintenance_work_mem to
    accumulate postings in memory)
  * Text search support via an opclass
  * Soft upper limit on the returned result set using a GUC variable:
    gin_fuzzy_search_limit

Gin Fuzzy Limit
---------------

There are often situations when a full-text search returns a very large set
of results. Since reading tuples from disk and sorting them could take a lot
of time, this is unacceptable in production. (Note that the search itself is
very fast.)

Such queries usually contain very frequent lexemes, so the results are not
very helpful. To facilitate execution of such queries, Gin has a configurable
soft upper limit on the size of the returned set, determined by the
'gin_fuzzy_search_limit' GUC variable. This is set to 0 by default (no
limit).

If a non-zero search limit is set, then the returned set is a subset of the
whole result set, chosen at random.

"Soft" means that the actual number of returned results could differ
from the specified limit, depending on the query and the quality of the
system's random number generator.

From experience, a value of 'gin_fuzzy_search_limit' in the thousands
(e.g. 5000-20000) works well. This means that 'gin_fuzzy_search_limit' will
have no effect for queries returning a result set with fewer tuples than this
number.
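The toy C program below illustrates why the limit is soft. It is an
illustration of the idea only, not GIN's actual code (the function and
variable names are ours, and the real implementation bases its drop
probability on the predicted result-set size): each candidate item survives
an independent random draw with probability limit/nitems, so the number of
surviving items only approximates the limit.

#include <stdio.h>
#include <stdlib.h>

/*
 * Toy illustration of a "soft" result-set limit; NOT GIN's actual code.
 * Each candidate item survives an independent random draw with
 * probability limit/nitems, so roughly `limit` items are kept, but the
 * exact count varies from run to run, hence "soft".
 */
static int
soft_limit_filter(int nitems, int limit)
{
    int     kept = 0;

    for (int i = 0; i < nitems; i++)
    {
        /* keep this item with probability limit/nitems */
        if (rand() < (double) RAND_MAX * limit / nitems)
            kept++;             /* GIN would return the item here */
    }
    return kept;
}

int
main(void)
{
    /* roughly 5000 of 100000 matches survive, give or take */
    printf("kept %d items\n", soft_limit_filter(100000, 5000));
    return 0;
}

Dropping items independently at random avoids any bias toward particular
parts of the result set, at the cost of an only-approximate result count.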
Index structure
---------------

The "items" that a GIN index indexes are composite values that contain
zero or more "keys". For example, an item might be an integer array, and
then the keys would be the individual integer values. The index actually
stores and searches for the key values, not the items per se. In the
pg_opclass entry for a GIN opclass, the opcintype is the data type of the
items, and the opckeytype is the data type of the keys. GIN is optimized
for cases where items contain many keys and the same key values appear
in many different items.
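The keys of an item are obtained by calling the opclass's extractValue
support function. As a concrete, simplified illustration, a GIN opclass
for int4[] could extract keys as sketched below. This is not the built-in
array opclass's actual code, just a minimal version of the same idea; the
function name my_int4_extract_value is ours.

#include "postgres.h"
#include "fmgr.h"
#include "utils/array.h"
#include "catalog/pg_type.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(my_int4_extract_value);

/*
 * Sketch of a GIN extractValue support function for int4[] items.
 * Given the item {10,20,10} it returns the keys 10, 20, 10; GIN sorts
 * and de-duplicates the keys itself, using the compare support function.
 */
Datum
my_int4_extract_value(PG_FUNCTION_ARGS)
{
    ArrayType  *array = PG_GETARG_ARRAYTYPE_P(0);
    int32      *nkeys = (int32 *) PG_GETARG_POINTER(1);
    bool      **nullFlags = (bool **) PG_GETARG_POINTER(2);
    Datum      *keys;

    /* split the array into one Datum per element ('i' = int alignment) */
    deconstruct_array(array, INT4OID, sizeof(int32), true, 'i',
                      &keys, nullFlags, nkeys);

    PG_RETURN_POINTER(keys);
}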
A GIN index contains a metapage, a btree of key entries, and possibly
"posting tree" pages, which hold the overflow when a key entry acquires
too many heap tuple pointers to fit in a btree page. Additionally, if the
fast-update feature is enabled, there can be "list pages" holding "pending"
key entries that haven't yet been merged into the main btree. The list
pages have to be scanned linearly when doing a search, so the pending
entries should be merged into the main btree before too many of them
accumulate. The advantage of the pending list is that bulk insertion of
a few thousand entries can be much faster than retail insertion. (The win
comes mainly from not having to do multiple searches/insertions when the
same key appears in multiple new heap tuples.)

Key entries are nominally of the same IndexTuple format as used in other
index types, but since a leaf key entry typically refers to multiple heap
tuples, there are significant differences. (See GinFormTuple, which works
by building a "normal" index tuple and then modifying it.) The points to
know are:

* In a single-column index, a key tuple just contains the key datum, but
in a multi-column index, a key tuple contains the pair (column number,
key datum) where the column number is stored as an int2. This is needed
to support different key data types in different columns. This much of
the tuple is built by index_form_tuple according to the usual rules.
The column number (if present) can never be null, but the key datum can
be, in which case a null bitmap is present as usual. (As usual for index
tuples, the size of the null bitmap is fixed at INDEX_MAX_KEYS.)

* If the key datum is null (i.e., IndexTupleHasNulls() is true), then
just after the nominal index data (i.e., at offset IndexInfoFindDataOffset
or IndexInfoFindDataOffset + sizeof(int2)) there is a byte indicating
the "category" of the null entry. These are the possible categories:
    1 = ordinary null key value extracted from an indexable item
    2 = placeholder for zero-key indexable item
    3 = placeholder for null indexable item
Placeholder null entries are inserted into the index because otherwise
there would be no index entry at all for an empty or null indexable item,
which would mean that full index scans couldn't be done and various corner
cases would give wrong answers. The different categories of null entries
are treated as distinct keys by the btree, but heap itempointers for the
same category of null entry are merged into one index entry just as happens
with ordinary key entries.

* In a key entry at the btree leaf level, at the next SHORTALIGN boundary,
there is a list of item pointers, in compressed format (see the Posting
List Compression section), pointing to the heap tuples for which the
indexable items contain this key. This is called the "posting list".

If the list would be too big for the index tuple to fit on an index page,
the ItemPointers are pushed out to a separate posting page or pages, and
none appear in the key entry itself. The separate pages are called a
"posting tree" (see below). Note that in either case, the ItemPointers
associated with a key can easily be read out in sorted order; this is
relied on by the scan algorithms.

* The index tuple header fields of a leaf key entry are abused as follows:

1) Posting list case:

* ItemPointerGetBlockNumber(&itup->t_tid) contains the offset from index
  tuple start to the posting list.
  Access macros: GinGetPostingOffset(itup) / GinSetPostingOffset(itup,n)

* ItemPointerGetOffsetNumber(&itup->t_tid) contains the number of elements
  in the posting list (number of heap itempointers).
  Access macros: GinGetNPosting(itup) / GinSetNPosting(itup,n)

* If IndexTupleHasNulls(itup) is true, the null category byte can be
  accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c)

* The posting list can be accessed with GinGetPosting(itup)

* If GinItupIsCompressed(itup), the posting list is stored in compressed
  format. Otherwise it is just an array of ItemPointers. New tuples are
  always stored in compressed format; uncompressed lists can be present if
  the database was migrated from a 9.3 or earlier version.

2) Posting tree case:

* ItemPointerGetBlockNumber(&itup->t_tid) contains the index block number
  of the root of the posting tree.
  Access macros: GinGetPostingTree(itup) / GinSetPostingTree(itup, blkno)

* ItemPointerGetOffsetNumber(&itup->t_tid) contains the magic number
  GIN_TREE_POSTING, which distinguishes this from the posting-list case
  (it's large enough that that many heap itempointers couldn't possibly
  fit on an index page). This value is inserted automatically by the
  GinSetPostingTree macro.

* If IndexTupleHasNulls(itup) is true, the null category byte can be
  accessed/set with GinGetNullCategory(itup,gs) / GinSetNullCategory(itup,gs,c)

* The posting list is not present and must not be accessed.

Use the macro GinIsPostingTree(itup) to determine which case applies; a
sketch of how a reader might dispatch on these fields appears below.

In both cases, itup->t_info & INDEX_SIZE_MASK contains the actual total
size of the tuple, and the INDEX_VAR_MASK and INDEX_NULL_MASK bits have
their normal meanings as set by index_form_tuple.
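For illustration, the sketch below shows how a reader might use these
access macros to find where a key's item pointers live. It is illustrative
backend-style code, not an excerpt from the GIN sources, and
count_posting_tree_items() is a hypothetical helper.

#include "postgres.h"
#include "access/gin_private.h"

static int  count_posting_tree_items(BlockNumber rootblkno);    /* hypothetical */

/*
 * Return the number of heap item pointers associated with a leaf key
 * entry.  Illustrative sketch, not actual GIN source code.
 */
static int
leaf_entry_item_count(IndexTuple itup)
{
    if (GinIsPostingTree(itup))
    {
        /* the items live in a separate posting tree; walk it from the root */
        BlockNumber rootblkno = GinGetPostingTree(itup);

        return count_posting_tree_items(rootblkno);
    }

    /*
     * In-line posting list: the item count is kept in t_tid's offset
     * number field, whether or not the list is compressed.
     */
    return GinGetNPosting(itup);
}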
The GIN_LIST_FULLROW page
flag bit tells whether entries for a given heap tuple are spread across
multiple pending-list pages. If GIN_LIST_FULLROW is set, the page contains
all the entries for one or more heap tuples. If GIN_LIST_FULLROW is clear,
the page contains entries for only one heap tuple, *and* they are not all
the entries for that tuple. (Thus, a heap tuple whose entries do not all
fit on one pending-list page must have those pages to itself, even if this
results in wasting much of the space on the preceding page and the last
page for the tuple.)

GIN packs downlinks and pivot keys into internal page tuples in a different
way than nbtree does. Lehman & Yao define the layout as follows:

P_0, K_1, P_1, K_2, P_2, ... , K_n, P_n, K_{n+1}

Here P_i is a downlink and K_i is a key. K_i splits the key space between
P_{i-1} and P_i (1 <= i <= n). K_{n+1} is the high key.

In an internal page, each tuple groups a key and a downlink together.
nbtree packs keys and downlinks into tuples as follows:

(K_{n+1}, None), (-Inf, P_0), (K_1, P_1), ... , (K_n, P_n)

Here tuples are shown in parentheses. So, the high key is stored
separately. P_i is grouped with K_i. P_0 is grouped with the -Inf key.

GIN packs keys and downlinks into tuples in a different way:

(P_0, K_1), (P_1, K_2), ... , (P_n, K_{n+1})

P_i is grouped with K_{i+1}. The -Inf key is not needed.

There are a couple of additional notes regarding the K_{n+1} key:
1) In the rightmost page of the entry tree, the key coupled with P_n
doesn't really matter; the high key is assumed to be infinity.
2) In a posting tree, the key coupled with P_n never matters; the high key
for non-rightmost pages is stored separately and accessed via
GinDataPageGetRightBound().

Posting tree
------------

If a posting list is too large to store in-line in a key entry, a posting
tree is created. A posting tree is a B-tree structure, where the
ItemPointer is used as the key.

Internal posting tree pages use the standard PageHeader and the same
"opaque" struct as other GIN pages, but do not contain regular index
tuples. Instead, the content of the page is an array of PostingItem
structs. Each PostingItem consists of the block number of the child page,
and the right bound of that child page, as an ItemPointer. The right bound
of the page is stored right after the page header, before the PostingItem
array.
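In C terms, a PostingItem has roughly the shape sketched below (see
ginblock.h for the authoritative definition; the struct name
PostingItemSketch and the find_child() helper are ours, for illustration).
Searching an internal posting tree page is then a binary search over this
array for the first child whose right bound is >= the sought item pointer.

#include "postgres.h"
#include "access/gin_private.h"

/*
 * Approximate shape of a PostingItem; the real definition stores the
 * block number as a BlockIdData to avoid struct padding.
 */
typedef struct PostingItemSketch
{
    BlockIdData     child_blkno;    /* block number of the child page */
    ItemPointerData key;            /* right bound of that child */
} PostingItemSketch;

/*
 * Find the child page that may contain 'item': the first PostingItem
 * whose right bound is >= item (illustrative logic, not GIN source).
 */
static BlockNumber
find_child(PostingItemSketch *items, int nitems, ItemPointer item)
{
    int     low = 0,
            high = nitems - 1;

    while (low < high)
    {
        int     mid = (low + high) / 2;

        if (ginCompareItemPointers(item, &items[mid].key) > 0)
            low = mid + 1;      /* item is past this child's right bound */
        else
            high = mid;
    }
    return BlockIdGetBlockNumber(&items[low].child_blkno);
}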
Posting tree leaf pages also use the standard PageHeader and opaque struct,
and the right bound of the page is stored right after the page header, but
the page content consists of a number of compressed posting lists. The
compressed posting lists are stored one after another, between the page
header and pd_lower. The space between pd_lower and pd_upper is unused,
which allows full-page images of posting tree leaf pages to skip the unused
space in the middle (buffer_std = true in XLogRecData).

The item pointers are stored in a number of independent compressed posting
lists (also called segments), instead of one big one, to make random access
to a given item pointer faster: to find an item in a compressed list, you
have to read the list from the beginning, but when the items are split into
multiple lists, you can first skip over to the list containing the item
you're looking for, and read only that segment. Also, an update only needs
to re-encode the affected segment.

Posting List Compression
------------------------

To fit as many item pointers on a page as possible, posting tree leaf pages
and posting lists stored inline in entry tree leaf tuples use a lightweight
form of compression. We take advantage of the fact that the item pointers
are stored in sorted order. Instead of storing the block and offset number
of each item pointer separately, we store the difference from the previous
item. That in itself doesn't do much, but it allows us to use so-called
varbyte encoding to compress them.

Varbyte encoding is a method to encode integers, allowing smaller numbers
to take less space at the cost of larger numbers. Each integer is
represented by a variable number of bytes, each carrying 7 data bits; the
high bit of each byte indicates whether the next byte is still part of the
same number. Therefore, to read a single varbyte-encoded number, you read
bytes until you find a byte with the high bit not set.

When encoding, the block and offset number forming the item pointer are
combined into a single integer. The offset number is stored in the 11 low
bits (see MaxHeapTuplesPerPageBits in ginpostinglist.c), and the block
number is stored in the higher bits. That requires 43 bits in total; at 7
data bits per byte, an encoded integer therefore takes at most 7 bytes,
and the deltas between consecutive item pointers are usually small enough
to need only one or two.

A compressed posting list is passed around and stored on disk in a
PackedPostingList struct. The first item in the list is stored uncompressed
as a regular ItemPointerData, followed by the length of the list in bytes,
followed by the packed items.
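To make the encoding concrete, here is a small, self-contained C program
implementing the same varbyte scheme on 64-bit integers. It is an
illustration; the backend's own encoder lives in ginpostinglist.c, and the
function and variable names below are ours.

#include <stdint.h>
#include <stdio.h>

/*
 * Pack block and offset number into one 43-bit integer: the offset
 * number occupies the 11 low bits, the block number the bits above.
 */
static uint64_t
pack_itemptr(uint32_t blkno, uint16_t offnum)
{
    return ((uint64_t) blkno << 11) | (offnum & 0x7FF);
}

/*
 * Varbyte-encode val into buf: 7 data bits per byte, high bit set means
 * "another byte follows".  Returns the number of bytes written.
 */
static int
encode_varbyte(uint64_t val, unsigned char *buf)
{
    int     n = 0;

    while (val > 0x7F)
    {
        buf[n++] = 0x80 | (val & 0x7F);
        val >>= 7;
    }
    buf[n++] = (unsigned char) val;     /* high bit clear: last byte */
    return n;
}

/* Decode one varbyte-encoded integer, advancing *ptr past it. */
static uint64_t
decode_varbyte(unsigned char **ptr)
{
    uint64_t    val = 0;
    int         shift = 0;
    unsigned char b;

    do
    {
        b = *(*ptr)++;
        val |= (uint64_t) (b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    return val;
}

int
main(void)
{
    /* encode the delta between (blk 1000, off 3) and (blk 1000, off 7) */
    uint64_t    prev = pack_itemptr(1000, 3);
    uint64_t    next = pack_itemptr(1000, 7);
    unsigned char buf[8];
    unsigned char *p = buf;
    int         len = encode_varbyte(next - prev, buf);

    printf("delta %llu encoded in %d byte(s), decodes to %llu\n",
           (unsigned long long) (next - prev), len,
           (unsigned long long) decode_varbyte(&p));
    return 0;
}

A delta of 4, as in this example, takes a single byte; item pointers on
nearby blocks rarely need more than two or three.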
Concurrency
-----------

The entry tree and each posting tree are B-trees, with right-links
connecting sibling pages at the same level. This is the same structure that
is used in the regular B-tree indexam (invented by Lehman & Yao), but we
don't support scanning GIN trees backwards, so we don't need left-links.
The entry tree leaves don't have dedicated high keys; instead, the greatest
leaf tuple serves as the high key. This works because tuples are never
deleted from the entry tree.

The algorithms used to operate the entry and posting trees are described
below.

### Locating the leaf page

When we search for a leaf page in a GIN btree to perform a read, we descend
from the root page to the leaf through downlinks, holding a pin and shared
lock on only one page at a time. That is, we release the pin and shared
lock on the previous page before acquiring them on the next page.

The picture below shows the tree state after finding the leaf page.
Lower-case letters depict tree pages; 'S' denotes a shared lock on the
page.

               a
           /   |   \
          b    c    d
        / | \  | \  | \
       eS f g  h i  j k

### Stepping right

Concurrent page splits move the keyspace to the right, so after following a
downlink, the page actually containing the key we're looking for might be
somewhere to the right of the page we landed on. In that case, we follow
the right-links until we find the page we're looking for.

While stepping right, we take the pin and shared lock on the right sibling
before releasing them on the current page. This protects against stepping
onto a deleted page: we move to the right sibling while still holding the
lock on the page whose rightlink points there, so it is guaranteed that
nobody can concurrently update that rightlink or delete the right sibling.

The picture below shows two pages locked at once during stepping right.

               a
           /   |   \
          b    c    d
        / | \  | \  | \
       eS fS g h i  j k
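In terms of the buffer manager API, the stepping-right protocol looks
roughly like the sketch below. This is a simplified illustration, not the
actual code (compare ginStepRight() in ginbtree.c; the function name
step_right is ours), with error handling and the differences between entry
and posting trees omitted.

#include "postgres.h"
#include "access/gin_private.h"
#include "storage/bufmgr.h"

/*
 * Sketch of lock coupling while stepping right.  The right sibling is
 * pinned and locked before the current page is released, so the
 * rightlink we follow cannot change under us and the sibling cannot be
 * deleted before we reach it.
 */
static Buffer
step_right(Relation index, Buffer buf)
{
    Page        page = BufferGetPage(buf);
    BlockNumber rightlink = GinPageGetOpaque(page)->rightlink;
    Buffer      rightbuf;

    rightbuf = ReadBuffer(index, rightlink);
    LockBuffer(rightbuf, BUFFER_LOCK_SHARE);    /* lock the new page first... */
    UnlockReleaseBuffer(buf);                   /* ...then let go of the old */

    return rightbuf;
}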
### Insert

To find the appropriate leaf for an insertion, we descend from the root to
the leaf in the same way, shared-locking one page at a time, but we do not
release the pins on the root and internal pages. Keeping the pins can save
lookups in the buffer hash table when a downlink has to be inserted later,
assuming the parents were not changed by concurrent splits in the meantime.
Once we reach the leaf, we re-lock the page in exclusive mode.

The picture below shows the leaf page locked in exclusive mode and ready
for insertion. 'P' and 'E' denote a pin and an exclusive lock,
respectively.

               aP
           /   |   \
          b    cP   d
        / | \  | \  | \
        e f g hE i  j k

If the insert causes a page split, the parent is locked in exclusive mode
before unlocking the left child. Thus, the insertion algorithm can hold
exclusive locks on both parent and child pages at once, acquiring the
child's lock first.

The picture below shows the tree state after a leaf page split. 'q' is the
new page produced by the split. Parent 'c' is about to have a downlink
inserted.

               aP
           /   |    \
          b    cE    d
        / | \ / | \  | \
        e f g hE q i j k

### Page deletion

Vacuum never deletes tuples or pages from the entry tree. It traverses
entry tree leaves in logical order by rightlinks and removes deletable TIDs
from posting lists. Posting trees are reached via links from entry tree
leaves and are vacuumed in two stages: in the first stage, deletable TIDs
are removed from the leaves, and if the first stage detected at least one
empty page, the second stage runs ginScanToDelete() to delete the empty
pages.

ginScanToDelete() traverses the whole tree in depth-first manner. It starts
by taking a super-exclusive lock on the tree root; this lock prevents all
concurrent insertions into the tree while we're deleting pages. However,
there might still be some in-progress readers that traversed the root
before we locked it.

The picture below shows the tree state after the page deletion algorithm
has traversed to the leftmost leaf of the tree.

               aE
           /   |   \
          bE   c    d
        / | \  | \  | \
       eE f g  h i  j k

The deletion algorithm keeps exclusive locks on the left siblings of the
pages comprising the currently investigated path. Thus, if the current page
is to be removed, all the pages required to remove both its downlink and
its rightlink are already locked. This avoids a potential right-to-left
page locking order, which could deadlock with concurrent stepping right.

A search concurrent to page deletion might already have read a pointer to
the page to be deleted, and might be just about to follow it. A page can be
reached via the right-link of its left sibling, or via its downlink in the
parent.

To prevent a backend from reaching a deleted page via a right-link, the
stepping-right algorithm doesn't release the lock on the current page until
the lock on the right page has been acquired.

The downlink is more tricky. A search descending the tree must release the
lock on the parent page before locking the child, or it could deadlock with
a concurrent split of the child page: a page split locks the parent while
already holding a lock on the child page. So, a deleted page cannot be
reclaimed immediately; instead, we have to wait for every transaction that
might be about to reference the page to finish. Such processes must observe
that the page is marked deleted and recover accordingly.

The picture below shows the tree state after the page deletion algorithm
has traversed further. The currently investigated path is 'a-c-h'. The left
siblings 'b' and 'g' of 'c' and 'h', respectively, are also exclusively
locked.

               aE
           /   |   \
          bE   cE   d
        / | \  | \  | \
        e f gE hE i j k

The next picture shows the tree state after page 'h' has been deleted. It
is marked with the 'deleted' flag and the newest xid that might still visit
it. The downlink from 'c' to 'h' is also deleted.

               aE
           /   |   \
          bE   cE   d
        / | \    \  | \
        e f gE hD iE j k

However, it's still possible that a concurrent reader saw the downlink from
'c' to 'h' before we deleted it. In that case, the reader will step right
from 'h' until it finds a non-deleted page. The xid marking of page 'h'
guarantees that the page won't be reused until all such readers are gone.
The next leaf page under investigation is 'i'; 'g' remains locked, as it
becomes the left sibling of 'i'.

The next picture shows the tree state after 'i' and 'c' have been deleted.
Internal page 'c' was deleted because it no longer had any downlinks. The
path under investigation is 'a-d-j'. Pages 'b' and 'g' are locked as the
left siblings of 'd' and 'j'.

               aE
           /        \
          bE   cD    dE
        / | \        | \
        e f gE hD iD jE k

During replay of a page deletion on a standby, the page's left sibling, the
target page, and its parent are locked in that order. This order guarantees
no deadlock with concurrent reads.

Compatibility
-------------

Compression of TIDs was introduced in 9.4. Some GIN indexes could remain in
the uncompressed format because of pg_upgrade from 9.3 or earlier versions.
For compatibility, the old uncompressed format is also supported. The
following rules are used to handle it:

* The GIN_ITUP_COMPRESSED flag marks index tuples that contain a compressed
posting list. This flag is stored in the high bit of
ItemPointerGetBlockNumber(&itup->t_tid). Use GinItupIsCompressed(itup) to
check the flag.

* Posting tree pages in the new format are marked with the GIN_COMPRESSED
flag. The macros GinPageIsCompressed(page) and GinPageSetCompressed(page)
are used to check and set this flag.

* All scan operations check the format of the posting list and use the
corresponding code to read its content.

* When updating an index tuple containing an uncompressed posting list, it
will be replaced with a new index tuple containing a compressed list.

* When an uncompressed posting tree leaf page is updated, it is compressed.

* If vacuum finds some dead TIDs in uncompressed posting lists, they are
converted into compressed posting lists. This assumes that the compressed
posting list fits in the space occupied by the uncompressed list. IOW, we
assume that the compressed version of the page, with the dead items
removed, takes less space than the old uncompressed version.

Limitations
-----------

  * Gin doesn't use scan->kill_prior_tuple & scan->ignore_killed_tuples
  * Gin searches entries only by equality matching, or simple range
    matching using the "partial match" feature.

TODO
----

Near future:

  * Opclasses for more types (no programming, just many catalog changes)

Distant future:

  * Replace the B-tree of entries with something like GiST

Authors
-------

Original work was done by Teodor Sigaev (teodor@sigaev.ru) and Oleg Bartunov
(oleg@sai.msu.su).