1<!-- doc/src/sgml/btree.sgml --> 2 3<chapter id="btree"> 4<title>B-Tree Indexes</title> 5 6 <indexterm> 7 <primary>index</primary> 8 <secondary>B-Tree</secondary> 9 </indexterm> 10 11<sect1 id="btree-intro"> 12 <title>Introduction</title> 13 14 <para> 15 <productname>PostgreSQL</productname> includes an implementation of the 16 standard <acronym>btree</acronym> (multi-way balanced tree) index data 17 structure. Any data type that can be sorted into a well-defined linear 18 order can be indexed by a btree index. The only limitation is that an 19 index entry cannot exceed approximately one-third of a page (after TOAST 20 compression, if applicable). 21 </para> 22 23 <para> 24 Because each btree operator class imposes a sort order on its data type, 25 btree operator classes (or, really, operator families) have come to be 26 used as <productname>PostgreSQL</productname>'s general representation 27 and understanding of sorting semantics. Therefore, they've acquired 28 some features that go beyond what would be needed just to support btree 29 indexes, and parts of the system that are quite distant from the 30 btree AM make use of them. 31 </para> 32 33</sect1> 34 35<sect1 id="btree-behavior"> 36 <title>Behavior of B-Tree Operator Classes</title> 37 38 <para> 39 As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator 40 class must provide five comparison operators, 41 <literal><</literal>, 42 <literal><=</literal>, 43 <literal>=</literal>, 44 <literal>>=</literal> and 45 <literal>></literal>. 46 One might expect that <literal><></literal> should also be part of 47 the operator class, but it is not, because it would almost never be 48 useful to use a <literal><></literal> WHERE clause in an index 49 search. (For some purposes, the planner treats <literal><></literal> 50 as associated with a btree operator class; but it finds that operator via 51 the <literal>=</literal> operator's negator link, rather than 52 from <structname>pg_amop</structname>.) 53 </para> 54 55 <para> 56 When several data types share near-identical sorting semantics, their 57 operator classes can be grouped into an operator family. Doing so is 58 advantageous because it allows the planner to make deductions about 59 cross-type comparisons. Each operator class within the family should 60 contain the single-type operators (and associated support functions) 61 for its input data type, while cross-type comparison operators and 62 support functions are <quote>loose</quote> in the family. It is 63 recommendable that a complete set of cross-type operators be included 64 in the family, thus ensuring that the planner can represent any 65 comparison conditions that it deduces from transitivity. 66 </para> 67 68 <para> 69 There are some basic assumptions that a btree operator family must 70 satisfy: 71 </para> 72 73 <itemizedlist> 74 <listitem> 75 <para> 76 An <literal>=</literal> operator must be an equivalence relation; that 77 is, for all non-null values <replaceable>A</replaceable>, 78 <replaceable>B</replaceable>, <replaceable>C</replaceable> of the 79 data type: 80 81 <itemizedlist> 82 <listitem> 83 <para> 84 <replaceable>A</replaceable> <literal>=</literal> 85 <replaceable>A</replaceable> is true 86 (<firstterm>reflexive law</firstterm>) 87 </para> 88 </listitem> 89 <listitem> 90 <para> 91 if <replaceable>A</replaceable> <literal>=</literal> 92 <replaceable>B</replaceable>, 93 then <replaceable>B</replaceable> <literal>=</literal> 94 <replaceable>A</replaceable> 95 (<firstterm>symmetric law</firstterm>) 96 </para> 97 </listitem> 98 <listitem> 99 <para> 100 if <replaceable>A</replaceable> <literal>=</literal> 101 <replaceable>B</replaceable> and <replaceable>B</replaceable> 102 <literal>=</literal> <replaceable>C</replaceable>, 103 then <replaceable>A</replaceable> <literal>=</literal> 104 <replaceable>C</replaceable> 105 (<firstterm>transitive law</firstterm>) 106 </para> 107 </listitem> 108 </itemizedlist> 109 </para> 110 </listitem> 111 112 <listitem> 113 <para> 114 A <literal><</literal> operator must be a strong ordering relation; 115 that is, for all non-null values <replaceable>A</replaceable>, 116 <replaceable>B</replaceable>, <replaceable>C</replaceable>: 117 118 <itemizedlist> 119 <listitem> 120 <para> 121 <replaceable>A</replaceable> <literal><</literal> 122 <replaceable>A</replaceable> is false 123 (<firstterm>irreflexive law</firstterm>) 124 </para> 125 </listitem> 126 <listitem> 127 <para> 128 if <replaceable>A</replaceable> <literal><</literal> 129 <replaceable>B</replaceable> 130 and <replaceable>B</replaceable> <literal><</literal> 131 <replaceable>C</replaceable>, 132 then <replaceable>A</replaceable> <literal><</literal> 133 <replaceable>C</replaceable> 134 (<firstterm>transitive law</firstterm>) 135 </para> 136 </listitem> 137 </itemizedlist> 138 </para> 139 </listitem> 140 141 <listitem> 142 <para> 143 Furthermore, the ordering is total; that is, for all non-null 144 values <replaceable>A</replaceable>, <replaceable>B</replaceable>: 145 146 <itemizedlist> 147 <listitem> 148 <para> 149 exactly one of <replaceable>A</replaceable> <literal><</literal> 150 <replaceable>B</replaceable>, <replaceable>A</replaceable> 151 <literal>=</literal> <replaceable>B</replaceable>, and 152 <replaceable>B</replaceable> <literal><</literal> 153 <replaceable>A</replaceable> is true 154 (<firstterm>trichotomy law</firstterm>) 155 </para> 156 </listitem> 157 </itemizedlist> 158 159 (The trichotomy law justifies the definition of the comparison support 160 function, of course.) 161 </para> 162 </listitem> 163 </itemizedlist> 164 165 <para> 166 The other three operators are defined in terms of <literal>=</literal> 167 and <literal><</literal> in the obvious way, and must act consistently 168 with them. 169 </para> 170 171 <para> 172 For an operator family supporting multiple data types, the above laws must 173 hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>, 174 <replaceable>C</replaceable> are taken from any data types in the family. 175 The transitive laws are the trickiest to ensure, as in cross-type 176 situations they represent statements that the behaviors of two or three 177 different operators are consistent. 178 As an example, it would not work to put <type>float8</type> 179 and <type>numeric</type> into the same operator family, at least not with 180 the current semantics that <type>numeric</type> values are converted 181 to <type>float8</type> for comparison to a <type>float8</type>. Because 182 of the limited accuracy of <type>float8</type>, this means there are 183 distinct <type>numeric</type> values that will compare equal to the 184 same <type>float8</type> value, and thus the transitive law would fail. 185 </para> 186 187 <para> 188 Another requirement for a multiple-data-type family is that any implicit 189 or binary-coercion casts that are defined between data types included in 190 the operator family must not change the associated sort ordering. 191 </para> 192 193 <para> 194 It should be fairly clear why a btree index requires these laws to hold 195 within a single data type: without them there is no ordering to arrange 196 the keys with. Also, index searches using a comparison key of a 197 different data type require comparisons to behave sanely across two 198 data types. The extensions to three or more data types within a family 199 are not strictly required by the btree index mechanism itself, but the 200 planner relies on them for optimization purposes. 201 </para> 202 203</sect1> 204 205<sect1 id="btree-support-funcs"> 206 <title>B-Tree Support Functions</title> 207 208 <para> 209 As shown in <xref linkend="xindex-btree-support-table"/>, btree defines 210 one required and four optional support functions. The five 211 user-defined methods are: 212 </para> 213 <variablelist> 214 <varlistentry> 215 <term><function>order</function></term> 216 <listitem> 217 <para> 218 For each combination of data types that a btree operator family 219 provides comparison operators for, it must provide a comparison 220 support function, registered in 221 <structname>pg_amproc</structname> with support function number 1 222 and 223 <structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield> 224 equal to the left and right data types for the comparison (i.e., 225 the same data types that the matching operators are registered 226 with in <structname>pg_amop</structname>). The comparison 227 function must take two non-null values 228 <replaceable>A</replaceable> and <replaceable>B</replaceable> and 229 return an <type>int32</type> value that is 230 <literal><</literal> <literal>0</literal>, 231 <literal>0</literal>, or <literal>></literal> 232 <literal>0</literal> when <replaceable>A</replaceable> 233 <literal><</literal> <replaceable>B</replaceable>, 234 <replaceable>A</replaceable> <literal>=</literal> 235 <replaceable>B</replaceable>, or <replaceable>A</replaceable> 236 <literal>></literal> <replaceable>B</replaceable>, 237 respectively. A null result is disallowed: all values of the 238 data type must be comparable. See 239 <filename>src/backend/access/nbtree/nbtcompare.c</filename> for 240 examples. 241 </para> 242 243 <para> 244 If the compared values are of a collatable data type, the 245 appropriate collation OID will be passed to the comparison 246 support function, using the standard 247 <function>PG_GET_COLLATION()</function> mechanism. 248 </para> 249 </listitem> 250 </varlistentry> 251 <varlistentry> 252 <term><function>sortsupport</function></term> 253 <listitem> 254 <para> 255 Optionally, a btree operator family may provide <firstterm>sort 256 support</firstterm> function(s), registered under support 257 function number 2. These functions allow implementing 258 comparisons for sorting purposes in a more efficient way than 259 naively calling the comparison support function. The APIs 260 involved in this are defined in 261 <filename>src/include/utils/sortsupport.h</filename>. 262 </para> 263 </listitem> 264 </varlistentry> 265 <varlistentry> 266 <term><function>in_range</function></term> 267 <listitem> 268 <indexterm> 269 <primary>in_range support functions</primary> 270 </indexterm> 271 272 <indexterm> 273 <primary>support functions</primary> 274 <secondary>in_range</secondary> 275 </indexterm> 276 <para> 277 Optionally, a btree operator family may provide 278 <firstterm>in_range</firstterm> support function(s), registered 279 under support function number 3. These are not used during btree 280 index operations; rather, they extend the semantics of the 281 operator family so that it can support window clauses containing 282 the <literal>RANGE</literal> <replaceable>offset</replaceable> 283 <literal>PRECEDING</literal> and <literal>RANGE</literal> 284 <replaceable>offset</replaceable> <literal>FOLLOWING</literal> 285 frame bound types (see <xref 286 linkend="syntax-window-functions"/>). Fundamentally, the extra 287 information provided is how to add or subtract an 288 <replaceable>offset</replaceable> value in a way that is 289 compatible with the family's data ordering. 290 </para> 291 292 <para> 293 An <function>in_range</function> function must have the signature 294<synopsis> 295in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool) 296returns bool 297</synopsis> 298 <replaceable>val</replaceable> and 299 <replaceable>base</replaceable> must be of the same type, which 300 is one of the types supported by the operator family (i.e., a 301 type for which it provides an ordering). However, 302 <replaceable>offset</replaceable> could be of a different type, 303 which might be one otherwise unsupported by the family. An 304 example is that the built-in <literal>time_ops</literal> family 305 provides an <function>in_range</function> function that has 306 <replaceable>offset</replaceable> of type <type>interval</type>. 307 A family can provide <function>in_range</function> functions for 308 any of its supported types and one or more 309 <replaceable>offset</replaceable> types. Each 310 <function>in_range</function> function should be entered in 311 <structname>pg_amproc</structname> with 312 <structfield>amproclefttype</structfield> equal to 313 <type>type1</type> and <structfield>amprocrighttype</structfield> 314 equal to <type>type2</type>. 315 </para> 316 317 <para> 318 The essential semantics of an <function>in_range</function> 319 function depend on the two Boolean flag parameters. It should 320 add or subtract <replaceable>base</replaceable> and 321 <replaceable>offset</replaceable>, then compare 322 <replaceable>val</replaceable> to the result, as follows: 323 <itemizedlist> 324 <listitem> 325 <para> 326 if <literal>!</literal><replaceable>sub</replaceable> and 327 <literal>!</literal><replaceable>less</replaceable>, return 328 <replaceable>val</replaceable> <literal>>=</literal> 329 (<replaceable>base</replaceable> <literal>+</literal> 330 <replaceable>offset</replaceable>) 331 </para> 332 </listitem> 333 <listitem> 334 <para> 335 if <literal>!</literal><replaceable>sub</replaceable> and 336 <replaceable>less</replaceable>, return 337 <replaceable>val</replaceable> <literal><=</literal> 338 (<replaceable>base</replaceable> <literal>+</literal> 339 <replaceable>offset</replaceable>) 340 </para> 341 </listitem> 342 <listitem> 343 <para> 344 if <replaceable>sub</replaceable> and 345 <literal>!</literal><replaceable>less</replaceable>, return 346 <replaceable>val</replaceable> <literal>>=</literal> 347 (<replaceable>base</replaceable> <literal>-</literal> 348 <replaceable>offset</replaceable>) 349 </para> 350 </listitem> 351 <listitem> 352 <para> 353 if <replaceable>sub</replaceable> and 354 <replaceable>less</replaceable>, return 355 <replaceable>val</replaceable> <literal><=</literal> 356 (<replaceable>base</replaceable> <literal>-</literal> 357 <replaceable>offset</replaceable>) 358 </para> 359 </listitem> 360 </itemizedlist> 361 Before doing so, the function should check the sign of 362 <replaceable>offset</replaceable>: if it is less than zero, raise 363 error 364 <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal> 365 (22013) with error text like <quote>invalid preceding or 366 following size in window function</quote>. (This is required by 367 the SQL standard, although nonstandard operator families might 368 perhaps choose to ignore this restriction, since there seems to 369 be little semantic necessity for it.) This requirement is 370 delegated to the <function>in_range</function> function so that 371 the core code needn't understand what <quote>less than 372 zero</quote> means for a particular data type. 373 </para> 374 375 <para> 376 An additional expectation is that <function>in_range</function> 377 functions should, if practical, avoid throwing an error if 378 <replaceable>base</replaceable> <literal>+</literal> 379 <replaceable>offset</replaceable> or 380 <replaceable>base</replaceable> <literal>-</literal> 381 <replaceable>offset</replaceable> would overflow. The correct 382 comparison result can be determined even if that value would be 383 out of the data type's range. Note that if the data type 384 includes concepts such as <quote>infinity</quote> or 385 <quote>NaN</quote>, extra care may be needed to ensure that 386 <function>in_range</function>'s results agree with the normal 387 sort order of the operator family. 388 </para> 389 390 <para> 391 The results of the <function>in_range</function> function must be 392 consistent with the sort ordering imposed by the operator family. 393 To be precise, given any fixed values of 394 <replaceable>offset</replaceable> and 395 <replaceable>sub</replaceable>, then: 396 <itemizedlist> 397 <listitem> 398 <para> 399 If <function>in_range</function> with 400 <replaceable>less</replaceable> = true is true for some 401 <replaceable>val1</replaceable> and 402 <replaceable>base</replaceable>, it must be true for every 403 <replaceable>val2</replaceable> <literal><=</literal> 404 <replaceable>val1</replaceable> with the same 405 <replaceable>base</replaceable>. 406 </para> 407 </listitem> 408 <listitem> 409 <para> 410 If <function>in_range</function> with 411 <replaceable>less</replaceable> = true is false for some 412 <replaceable>val1</replaceable> and 413 <replaceable>base</replaceable>, it must be false for every 414 <replaceable>val2</replaceable> <literal>>=</literal> 415 <replaceable>val1</replaceable> with the same 416 <replaceable>base</replaceable>. 417 </para> 418 </listitem> 419 <listitem> 420 <para> 421 If <function>in_range</function> with 422 <replaceable>less</replaceable> = true is true for some 423 <replaceable>val</replaceable> and 424 <replaceable>base1</replaceable>, it must be true for every 425 <replaceable>base2</replaceable> <literal>>=</literal> 426 <replaceable>base1</replaceable> with the same 427 <replaceable>val</replaceable>. 428 </para> 429 </listitem> 430 <listitem> 431 <para> 432 If <function>in_range</function> with 433 <replaceable>less</replaceable> = true is false for some 434 <replaceable>val</replaceable> and 435 <replaceable>base1</replaceable>, it must be false for every 436 <replaceable>base2</replaceable> <literal><=</literal> 437 <replaceable>base1</replaceable> with the same 438 <replaceable>val</replaceable>. 439 </para> 440 </listitem> 441 </itemizedlist> 442 Analogous statements with inverted conditions hold when 443 <replaceable>less</replaceable> = false. 444 </para> 445 446 <para> 447 If the type being ordered (<type>type1</type>) is collatable, the 448 appropriate collation OID will be passed to the 449 <function>in_range</function> function, using the standard 450 PG_GET_COLLATION() mechanism. 451 </para> 452 453 <para> 454 <function>in_range</function> functions need not handle NULL 455 inputs, and typically will be marked strict. 456 </para> 457 </listitem> 458 </varlistentry> 459 <varlistentry> 460 <term><function>equalimage</function></term> 461 <listitem> 462 <para> 463 Optionally, a btree operator family may provide 464 <function>equalimage</function> (<quote>equality implies image 465 equality</quote>) support functions, registered under support 466 function number 4. These functions allow the core code to 467 determine when it is safe to apply the btree deduplication 468 optimization. Currently, <function>equalimage</function> 469 functions are only called when building or rebuilding an index. 470 </para> 471 <para> 472 An <function>equalimage</function> function must have the 473 signature 474<synopsis> 475equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool 476</synopsis> 477 The return value is static information about an operator class 478 and collation. Returning <literal>true</literal> indicates that 479 the <function>order</function> function for the operator class is 480 guaranteed to only return <literal>0</literal> (<quote>arguments 481 are equal</quote>) when its <replaceable>A</replaceable> and 482 <replaceable>B</replaceable> arguments are also interchangeable 483 without any loss of semantic information. Not registering an 484 <function>equalimage</function> function or returning 485 <literal>false</literal> indicates that this condition cannot be 486 assumed to hold. 487 </para> 488 <para> 489 The <replaceable>opcintype</replaceable> argument is the 490 <literal><structname>pg_type</structname>.oid</literal> of the 491 data type that the operator class indexes. This is a convenience 492 that allows reuse of the same underlying 493 <function>equalimage</function> function across operator classes. 494 If <replaceable>opcintype</replaceable> is a collatable data 495 type, the appropriate collation OID will be passed to the 496 <function>equalimage</function> function, using the standard 497 <function>PG_GET_COLLATION()</function> mechanism. 498 </para> 499 <para> 500 As far as the operator class is concerned, returning 501 <literal>true</literal> indicates that deduplication is safe (or 502 safe for the collation whose OID was passed to its 503 <function>equalimage</function> function). However, the core 504 code will only deem deduplication safe for an index when 505 <emphasis>every</emphasis> indexed column uses an operator class 506 that registers an <function>equalimage</function> function, and 507 each function actually returns <literal>true</literal> when 508 called. 509 </para> 510 <para> 511 Image equality is <emphasis>almost</emphasis> the same condition 512 as simple bitwise equality. There is one subtle difference: When 513 indexing a varlena data type, the on-disk representation of two 514 image equal datums may not be bitwise equal due to inconsistent 515 application of <acronym>TOAST</acronym> compression on input. 516 Formally, when an operator class's 517 <function>equalimage</function> function returns 518 <literal>true</literal>, it is safe to assume that the 519 <literal>datum_image_eq()</literal> C function will always agree 520 with the operator class's <function>order</function> function 521 (provided that the same collation OID is passed to both the 522 <function>equalimage</function> and <function>order</function> 523 functions). 524 </para> 525 <para> 526 The core code is fundamentally unable to deduce anything about 527 the <quote>equality implies image equality</quote> status of an 528 operator class within a multiple-data-type family based on 529 details from other operator classes in the same family. Also, it 530 is not sensible for an operator family to register a cross-type 531 <function>equalimage</function> function, and attempting to do so 532 will result in an error. This is because <quote>equality implies 533 image equality</quote> status does not just depend on 534 sorting/equality semantics, which are more or less defined at the 535 operator family level. In general, the semantics that one 536 particular data type implements must be considered separately. 537 </para> 538 <para> 539 The convention followed by the operator classes included with the 540 core <productname>PostgreSQL</productname> distribution is to 541 register a stock, generic <function>equalimage</function> 542 function. Most operator classes register 543 <function>btequalimage()</function>, which indicates that 544 deduplication is safe unconditionally. Operator classes for 545 collatable data types such as <type>text</type> register 546 <function>btvarstrequalimage()</function>, which indicates that 547 deduplication is safe with deterministic collations. Best 548 practice for third-party extensions is to register their own 549 custom function to retain control. 550 </para> 551 </listitem> 552 </varlistentry> 553 <varlistentry> 554 <term><function>options</function></term> 555 <listitem> 556 <para> 557 Optionally, a B-tree operator family may provide 558 <function>options</function> (<quote>operator class specific 559 options</quote>) support functions, registered under support 560 function number 5. These functions define a set of user-visible 561 parameters that control operator class behavior. 562 </para> 563 <para> 564 An <function>options</function> support function must have the 565 signature 566<synopsis> 567options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns void 568</synopsis> 569 The function is passed a pointer to a <replaceable>local_relopts</replaceable> 570 struct, which needs to be filled with a set of operator class 571 specific options. The options can be accessed from other support 572 functions using the <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and 573 <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros. 574 </para> 575 <para> 576 Currently, no B-Tree operator class has an <function>options</function> 577 support function. B-tree doesn't allow flexible representation of keys 578 like GiST, SP-GiST, GIN and BRIN do. So, <function>options</function> 579 probably doesn't have much application in the current B-tree index 580 access method. Nevertheless, this support function was added to B-tree 581 for uniformity, and will probably find uses during further 582 evolution of B-tree in <productname>PostgreSQL</productname>. 583 </para> 584 </listitem> 585 </varlistentry> 586 </variablelist> 587 588</sect1> 589 590<sect1 id="btree-implementation"> 591 <title>Implementation</title> 592 593 <para> 594 This section covers B-Tree index implementation details that may be 595 of use to advanced users. See 596 <filename>src/backend/access/nbtree/README</filename> in the source 597 distribution for a much more detailed, internals-focused description 598 of the B-Tree implementation. 599 </para> 600 <sect2 id="btree-structure"> 601 <title>B-Tree Structure</title> 602 <para> 603 <productname>PostgreSQL</productname> B-Tree indexes are 604 multi-level tree structures, where each level of the tree can be 605 used as a doubly-linked list of pages. A single metapage is stored 606 in a fixed position at the start of the first segment file of the 607 index. All other pages are either leaf pages or internal pages. 608 Leaf pages are the pages on the lowest level of the tree. All 609 other levels consist of internal pages. Each leaf page contains 610 tuples that point to table rows. Each internal page contains 611 tuples that point to the next level down in the tree. Typically, 612 over 99% of all pages are leaf pages. Both internal pages and leaf 613 pages use the standard page format described in <xref 614 linkend="storage-page-layout"/>. 615 </para> 616 <para> 617 New leaf pages are added to a B-Tree index when an existing leaf 618 page cannot fit an incoming tuple. A <firstterm>page 619 split</firstterm> operation makes room for items that originally 620 belonged on the overflowing page by moving a portion of the items 621 to a new page. Page splits must also insert a new 622 <firstterm>downlink</firstterm> to the new page in the parent page, 623 which may cause the parent to split in turn. Page splits 624 <quote>cascade upwards</quote> in a recursive fashion. When the 625 root page finally cannot fit a new downlink, a <firstterm>root page 626 split</firstterm> operation takes place. This adds a new level to 627 the tree structure by creating a new root page that is one level 628 above the original root page. 629 </para> 630 </sect2> 631 632 <sect2 id="btree-deduplication"> 633 <title>Deduplication</title> 634 <para> 635 A duplicate is a leaf page tuple (a tuple that points to a table 636 row) where <emphasis>all</emphasis> indexed key columns have values 637 that match corresponding column values from at least one other leaf 638 page tuple in the same index. Duplicate tuples are quite common in 639 practice. B-Tree indexes can use a special, space-efficient 640 representation for duplicates when an optional technique is 641 enabled: <firstterm>deduplication</firstterm>. 642 </para> 643 <para> 644 Deduplication works by periodically merging groups of duplicate 645 tuples together, forming a single <firstterm>posting list</firstterm> tuple for each 646 group. The column key value(s) only appear once in this 647 representation. This is followed by a sorted array of 648 <acronym>TID</acronym>s that point to rows in the table. This 649 significantly reduces the storage size of indexes where each value 650 (or each distinct combination of column values) appears several 651 times on average. The latency of queries can be reduced 652 significantly. Overall query throughput may increase 653 significantly. The overhead of routine index vacuuming may also be 654 reduced significantly. 655 </para> 656 <note> 657 <para> 658 B-Tree deduplication is just as effective with 659 <quote>duplicates</quote> that contain a NULL value, even though 660 NULL values are never equal to each other according to the 661 <literal>=</literal> member of any B-Tree operator class. As far 662 as any part of the implementation that understands the on-disk 663 B-Tree structure is concerned, NULL is just another value from the 664 domain of indexed values. 665 </para> 666 </note> 667 <para> 668 The deduplication process occurs lazily, when a new item is 669 inserted that cannot fit on an existing leaf page. This prevents 670 (or at least delays) leaf page splits. Unlike GIN posting list 671 tuples, B-Tree posting list tuples do not need to expand every time 672 a new duplicate is inserted; they are merely an alternative 673 physical representation of the original logical contents of the 674 leaf page. This design prioritizes consistent performance with 675 mixed read-write workloads. Most client applications will at least 676 see a moderate performance benefit from using deduplication. 677 Deduplication is enabled by default. 678 </para> 679 <para> 680 <command>CREATE INDEX</command> and <command>REINDEX</command> 681 apply deduplication to create posting list tuples, though the 682 strategy they use is slightly different. Each group of duplicate 683 ordinary tuples encountered in the sorted input taken from the 684 table is merged into a posting list tuple 685 <emphasis>before</emphasis> being added to the current pending leaf 686 page. Individual posting list tuples are packed with as many 687 <acronym>TID</acronym>s as possible. Leaf pages are written out in 688 the usual way, without any separate deduplication pass. This 689 strategy is well-suited to <command>CREATE INDEX</command> and 690 <command>REINDEX</command> because they are once-off batch 691 operations. 692 </para> 693 <para> 694 Write-heavy workloads that don't benefit from deduplication due to 695 having few or no duplicate values in indexes will incur a small, 696 fixed performance penalty (unless deduplication is explicitly 697 disabled). The <literal>deduplicate_items</literal> storage 698 parameter can be used to disable deduplication within individual 699 indexes. There is never any performance penalty with read-only 700 workloads, since reading posting list tuples is at least as 701 efficient as reading the standard tuple representation. Disabling 702 deduplication isn't usually helpful. 703 </para> 704 <para> 705 B-Tree indexes are not directly aware that under MVCC, there might 706 be multiple extant versions of the same logical table row; to an 707 index, each tuple is an independent object that needs its own index 708 entry. <quote>Version duplicates</quote> may sometimes accumulate 709 and adversely affect query latency and throughput. This typically 710 occurs with <command>UPDATE</command>-heavy workloads where most 711 individual updates cannot apply the <acronym>HOT</acronym> 712 optimization (often because at least one indexed column gets 713 modified, necessitating a new set of index tuple versions — 714 one new tuple for <emphasis>each and every</emphasis> index). In 715 effect, B-Tree deduplication ameliorates index bloat caused by 716 version churn. Note that even the tuples from a unique index are 717 not necessarily <emphasis>physically</emphasis> unique when stored 718 on disk due to version churn. The deduplication optimization is 719 selectively applied within unique indexes. It targets those pages 720 that appear to have version duplicates. The high level goal is to 721 give <command>VACUUM</command> more time to run before an 722 <quote>unnecessary</quote> page split caused by version churn can 723 take place. 724 </para> 725 <tip> 726 <para> 727 A special heuristic is applied to determine whether a 728 deduplication pass in a unique index should take place. It can 729 often skip straight to splitting a leaf page, avoiding a 730 performance penalty from wasting cycles on unhelpful deduplication 731 passes. If you're concerned about the overhead of deduplication, 732 consider setting <literal>deduplicate_items = off</literal> 733 selectively. Leaving deduplication enabled in unique indexes has 734 little downside. 735 </para> 736 </tip> 737 <para> 738 Deduplication cannot be used in all cases due to 739 implementation-level restrictions. Deduplication safety is 740 determined when <command>CREATE INDEX</command> or 741 <command>REINDEX</command> is run. 742 </para> 743 <para> 744 Note that deduplication is deemed unsafe and cannot be used in the 745 following cases involving semantically significant differences 746 among equal datums: 747 </para> 748 <para> 749 <itemizedlist> 750 <listitem> 751 <para> 752 <type>text</type>, <type>varchar</type>, and <type>char</type> 753 cannot use deduplication when a 754 <emphasis>nondeterministic</emphasis> collation is used. Case 755 and accent differences must be preserved among equal datums. 756 </para> 757 </listitem> 758 759 <listitem> 760 <para> 761 <type>numeric</type> cannot use deduplication. Numeric display 762 scale must be preserved among equal datums. 763 </para> 764 </listitem> 765 766 <listitem> 767 <para> 768 <type>jsonb</type> cannot use deduplication, since the 769 <type>jsonb</type> B-Tree operator class uses 770 <type>numeric</type> internally. 771 </para> 772 </listitem> 773 774 <listitem> 775 <para> 776 <type>float4</type> and <type>float8</type> cannot use 777 deduplication. These types have distinct representations for 778 <literal>-0</literal> and <literal>0</literal>, which are 779 nevertheless considered equal. This difference must be 780 preserved. 781 </para> 782 </listitem> 783 </itemizedlist> 784 </para> 785 <para> 786 There is one further implementation-level restriction that may be 787 lifted in a future version of 788 <productname>PostgreSQL</productname>: 789 </para> 790 <para> 791 <itemizedlist> 792 <listitem> 793 <para> 794 Container types (such as composite types, arrays, or range 795 types) cannot use deduplication. 796 </para> 797 </listitem> 798 </itemizedlist> 799 </para> 800 <para> 801 There is one further implementation-level restriction that applies 802 regardless of the operator class or collation used: 803 </para> 804 <para> 805 <itemizedlist> 806 <listitem> 807 <para> 808 <literal>INCLUDE</literal> indexes can never use deduplication. 809 </para> 810 </listitem> 811 </itemizedlist> 812 </para> 813 814 </sect2> 815</sect1> 816 817</chapter> 818