1<!-- doc/src/sgml/btree.sgml -->
2
3<chapter id="btree">
4<title>B-Tree Indexes</title>
5
6   <indexterm>
7    <primary>index</primary>
8    <secondary>B-Tree</secondary>
9   </indexterm>
10
11<sect1 id="btree-intro">
12 <title>Introduction</title>
13
14 <para>
15  <productname>PostgreSQL</productname> includes an implementation of the
16  standard <acronym>btree</acronym> (multi-way balanced tree) index data
17  structure.  Any data type that can be sorted into a well-defined linear
18  order can be indexed by a btree index.  The only limitation is that an
19  index entry cannot exceed approximately one-third of a page (after TOAST
20  compression, if applicable).
21 </para>
22
23 <para>
24  Because each btree operator class imposes a sort order on its data type,
25  btree operator classes (or, really, operator families) have come to be
26  used as <productname>PostgreSQL</productname>'s general representation
27  and understanding of sorting semantics.  Therefore, they've acquired
28  some features that go beyond what would be needed just to support btree
29  indexes, and parts of the system that are quite distant from the
30  btree AM make use of them.
31 </para>
32
33</sect1>
34
35<sect1 id="btree-behavior">
36 <title>Behavior of B-Tree Operator Classes</title>
37
38 <para>
39  As shown in <xref linkend="xindex-btree-strat-table"/>, a btree operator
40  class must provide five comparison operators,
41  <literal>&lt;</literal>,
42  <literal>&lt;=</literal>,
43  <literal>=</literal>,
44  <literal>&gt;=</literal> and
45  <literal>&gt;</literal>.
46  One might expect that <literal>&lt;&gt;</literal> should also be part of
47  the operator class, but it is not, because it would almost never be
48  useful to use a <literal>&lt;&gt;</literal> WHERE clause in an index
49  search.  (For some purposes, the planner treats <literal>&lt;&gt;</literal>
50  as associated with a btree operator class; but it finds that operator via
51  the <literal>=</literal> operator's negator link, rather than
52  from <structname>pg_amop</structname>.)
53 </para>
54
55 <para>
56  When several data types share near-identical sorting semantics, their
57  operator classes can be grouped into an operator family.  Doing so is
58  advantageous because it allows the planner to make deductions about
59  cross-type comparisons.  Each operator class within the family should
60  contain the single-type operators (and associated support functions)
61  for its input data type, while cross-type comparison operators and
62  support functions are <quote>loose</quote> in the family.  It is
63  recommendable that a complete set of cross-type operators be included
64  in the family, thus ensuring that the planner can represent any
65  comparison conditions that it deduces from transitivity.
66 </para>
67
68 <para>
69  There are some basic assumptions that a btree operator family must
70  satisfy:
71 </para>
72
73 <itemizedlist>
74  <listitem>
75   <para>
76    An <literal>=</literal> operator must be an equivalence relation; that
77    is, for all non-null values <replaceable>A</replaceable>,
78    <replaceable>B</replaceable>, <replaceable>C</replaceable> of the
79    data type:
80
81    <itemizedlist>
82     <listitem>
83      <para>
84       <replaceable>A</replaceable> <literal>=</literal>
85       <replaceable>A</replaceable> is true
86       (<firstterm>reflexive law</firstterm>)
87      </para>
88     </listitem>
89     <listitem>
90      <para>
91       if <replaceable>A</replaceable> <literal>=</literal>
92       <replaceable>B</replaceable>,
93       then <replaceable>B</replaceable> <literal>=</literal>
94       <replaceable>A</replaceable>
95       (<firstterm>symmetric law</firstterm>)
96      </para>
97     </listitem>
98     <listitem>
99      <para>
100       if <replaceable>A</replaceable> <literal>=</literal>
101       <replaceable>B</replaceable> and <replaceable>B</replaceable>
102       <literal>=</literal> <replaceable>C</replaceable>,
103       then <replaceable>A</replaceable> <literal>=</literal>
104       <replaceable>C</replaceable>
105       (<firstterm>transitive law</firstterm>)
106      </para>
107     </listitem>
108    </itemizedlist>
109   </para>
110  </listitem>
111
112  <listitem>
113   <para>
114    A <literal>&lt;</literal> operator must be a strong ordering relation;
115    that is, for all non-null values <replaceable>A</replaceable>,
116    <replaceable>B</replaceable>, <replaceable>C</replaceable>:
117
118    <itemizedlist>
119     <listitem>
120      <para>
121       <replaceable>A</replaceable> <literal>&lt;</literal>
122       <replaceable>A</replaceable> is false
123       (<firstterm>irreflexive law</firstterm>)
124      </para>
125     </listitem>
126     <listitem>
127      <para>
128       if <replaceable>A</replaceable> <literal>&lt;</literal>
129       <replaceable>B</replaceable>
130       and <replaceable>B</replaceable> <literal>&lt;</literal>
131       <replaceable>C</replaceable>,
132       then <replaceable>A</replaceable> <literal>&lt;</literal>
133       <replaceable>C</replaceable>
134       (<firstterm>transitive law</firstterm>)
135      </para>
136     </listitem>
137    </itemizedlist>
138   </para>
139  </listitem>
140
141  <listitem>
142   <para>
143    Furthermore, the ordering is total; that is, for all non-null
144    values <replaceable>A</replaceable>, <replaceable>B</replaceable>:
145
146    <itemizedlist>
147     <listitem>
148      <para>
149       exactly one of <replaceable>A</replaceable> <literal>&lt;</literal>
150       <replaceable>B</replaceable>, <replaceable>A</replaceable>
151       <literal>=</literal> <replaceable>B</replaceable>, and
152       <replaceable>B</replaceable> <literal>&lt;</literal>
153       <replaceable>A</replaceable> is true
154       (<firstterm>trichotomy law</firstterm>)
155      </para>
156     </listitem>
157    </itemizedlist>
158
159    (The trichotomy law justifies the definition of the comparison support
160    function, of course.)
161   </para>
162  </listitem>
163 </itemizedlist>
164
165 <para>
166  The other three operators are defined in terms of <literal>=</literal>
167  and <literal>&lt;</literal> in the obvious way, and must act consistently
168  with them.
169 </para>
170
171 <para>
172  For an operator family supporting multiple data types, the above laws must
173  hold when <replaceable>A</replaceable>, <replaceable>B</replaceable>,
174  <replaceable>C</replaceable> are taken from any data types in the family.
175  The transitive laws are the trickiest to ensure, as in cross-type
176  situations they represent statements that the behaviors of two or three
177  different operators are consistent.
178  As an example, it would not work to put <type>float8</type>
179  and <type>numeric</type> into the same operator family, at least not with
180  the current semantics that <type>numeric</type> values are converted
181  to <type>float8</type> for comparison to a <type>float8</type>.  Because
182  of the limited accuracy of <type>float8</type>, this means there are
183  distinct <type>numeric</type> values that will compare equal to the
184  same <type>float8</type> value, and thus the transitive law would fail.
185 </para>
186
187 <para>
188  Another requirement for a multiple-data-type family is that any implicit
189  or binary-coercion casts that are defined between data types included in
190  the operator family must not change the associated sort ordering.
191 </para>
192
193 <para>
194  It should be fairly clear why a btree index requires these laws to hold
195  within a single data type: without them there is no ordering to arrange
196  the keys with.  Also, index searches using a comparison key of a
197  different data type require comparisons to behave sanely across two
198  data types.  The extensions to three or more data types within a family
199  are not strictly required by the btree index mechanism itself, but the
200  planner relies on them for optimization purposes.
201 </para>
202
203</sect1>
204
205<sect1 id="btree-support-funcs">
206 <title>B-Tree Support Functions</title>
207
208 <para>
209  As shown in <xref linkend="xindex-btree-support-table"/>, btree defines
210  one required and four optional support functions.  The five
211  user-defined methods are:
212 </para>
213 <variablelist>
214  <varlistentry>
215   <term><function>order</function></term>
216   <listitem>
217    <para>
218     For each combination of data types that a btree operator family
219     provides comparison operators for, it must provide a comparison
220     support function, registered in
221     <structname>pg_amproc</structname> with support function number 1
222     and
223     <structfield>amproclefttype</structfield>/<structfield>amprocrighttype</structfield>
224     equal to the left and right data types for the comparison (i.e.,
225     the same data types that the matching operators are registered
226     with in <structname>pg_amop</structname>).  The comparison
227     function must take two non-null values
228     <replaceable>A</replaceable> and <replaceable>B</replaceable> and
229     return an <type>int32</type> value that is
230     <literal>&lt;</literal> <literal>0</literal>,
231     <literal>0</literal>, or <literal>&gt;</literal>
232     <literal>0</literal> when <replaceable>A</replaceable>
233     <literal>&lt;</literal> <replaceable>B</replaceable>,
234     <replaceable>A</replaceable> <literal>=</literal>
235     <replaceable>B</replaceable>, or <replaceable>A</replaceable>
236     <literal>&gt;</literal> <replaceable>B</replaceable>,
237     respectively.  A null result is disallowed: all values of the
238     data type must be comparable.  See
239     <filename>src/backend/access/nbtree/nbtcompare.c</filename> for
240     examples.
241    </para>
242
243    <para>
244     If the compared values are of a collatable data type, the
245     appropriate collation OID will be passed to the comparison
246     support function, using the standard
247     <function>PG_GET_COLLATION()</function> mechanism.
248    </para>
249   </listitem>
250  </varlistentry>
251  <varlistentry>
252   <term><function>sortsupport</function></term>
253   <listitem>
254    <para>
255     Optionally, a btree operator family may provide <firstterm>sort
256      support</firstterm> function(s), registered under support
257     function number 2.  These functions allow implementing
258     comparisons for sorting purposes in a more efficient way than
259     naively calling the comparison support function.  The APIs
260     involved in this are defined in
261     <filename>src/include/utils/sortsupport.h</filename>.
262    </para>
263   </listitem>
264  </varlistentry>
265  <varlistentry>
266   <term><function>in_range</function></term>
267   <listitem>
268    <indexterm>
269     <primary>in_range support functions</primary>
270    </indexterm>
271
272    <indexterm>
273     <primary>support functions</primary>
274     <secondary>in_range</secondary>
275    </indexterm>
276    <para>
277     Optionally, a btree operator family may provide
278     <firstterm>in_range</firstterm> support function(s), registered
279     under support function number 3.  These are not used during btree
280     index operations; rather, they extend the semantics of the
281     operator family so that it can support window clauses containing
282     the <literal>RANGE</literal> <replaceable>offset</replaceable>
283     <literal>PRECEDING</literal> and <literal>RANGE</literal>
284     <replaceable>offset</replaceable> <literal>FOLLOWING</literal>
285     frame bound types (see <xref
286      linkend="syntax-window-functions"/>).  Fundamentally, the extra
287     information provided is how to add or subtract an
288     <replaceable>offset</replaceable> value in a way that is
289     compatible with the family's data ordering.
290    </para>
291
292    <para>
293     An <function>in_range</function> function must have the signature
294<synopsis>
295in_range(<replaceable>val</replaceable> type1, <replaceable>base</replaceable> type1, <replaceable>offset</replaceable> type2, <replaceable>sub</replaceable> bool, <replaceable>less</replaceable> bool)
296returns bool
297</synopsis>
298     <replaceable>val</replaceable> and
299     <replaceable>base</replaceable> must be of the same type, which
300     is one of the types supported by the operator family (i.e., a
301     type for which it provides an ordering).  However,
302     <replaceable>offset</replaceable> could be of a different type,
303     which might be one otherwise unsupported by the family.  An
304     example is that the built-in <literal>time_ops</literal> family
305     provides an <function>in_range</function> function that has
306     <replaceable>offset</replaceable> of type <type>interval</type>.
307     A family can provide <function>in_range</function> functions for
308     any of its supported types and one or more
309     <replaceable>offset</replaceable> types.  Each
310     <function>in_range</function> function should be entered in
311     <structname>pg_amproc</structname> with
312     <structfield>amproclefttype</structfield> equal to
313     <type>type1</type> and <structfield>amprocrighttype</structfield>
314     equal to <type>type2</type>.
315    </para>
316
317    <para>
318     The essential semantics of an <function>in_range</function>
319     function depend on the two Boolean flag parameters.  It should
320     add or subtract <replaceable>base</replaceable> and
321     <replaceable>offset</replaceable>, then compare
322     <replaceable>val</replaceable> to the result, as follows:
323     <itemizedlist>
324      <listitem>
325       <para>
326        if <literal>!</literal><replaceable>sub</replaceable> and
327        <literal>!</literal><replaceable>less</replaceable>, return
328        <replaceable>val</replaceable> <literal>&gt;=</literal>
329        (<replaceable>base</replaceable> <literal>+</literal>
330        <replaceable>offset</replaceable>)
331       </para>
332      </listitem>
333      <listitem>
334       <para>
335        if <literal>!</literal><replaceable>sub</replaceable> and
336        <replaceable>less</replaceable>, return
337        <replaceable>val</replaceable> <literal>&lt;=</literal>
338        (<replaceable>base</replaceable> <literal>+</literal>
339        <replaceable>offset</replaceable>)
340       </para>
341      </listitem>
342      <listitem>
343       <para>
344        if <replaceable>sub</replaceable> and
345        <literal>!</literal><replaceable>less</replaceable>, return
346        <replaceable>val</replaceable> <literal>&gt;=</literal>
347        (<replaceable>base</replaceable> <literal>-</literal>
348        <replaceable>offset</replaceable>)
349       </para>
350      </listitem>
351      <listitem>
352       <para>
353        if <replaceable>sub</replaceable> and
354        <replaceable>less</replaceable>, return
355        <replaceable>val</replaceable> <literal>&lt;=</literal>
356        (<replaceable>base</replaceable> <literal>-</literal>
357        <replaceable>offset</replaceable>)
358       </para>
359      </listitem>
360     </itemizedlist>
361     Before doing so, the function should check the sign of
362     <replaceable>offset</replaceable>: if it is less than zero, raise
363     error
364     <literal>ERRCODE_INVALID_PRECEDING_OR_FOLLOWING_SIZE</literal>
365     (22013) with error text like <quote>invalid preceding or
366      following size in window function</quote>.  (This is required by
367     the SQL standard, although nonstandard operator families might
368     perhaps choose to ignore this restriction, since there seems to
369     be little semantic necessity for it.) This requirement is
370     delegated to the <function>in_range</function> function so that
371     the core code needn't understand what <quote>less than
372      zero</quote> means for a particular data type.
373    </para>
374
375    <para>
376     An additional expectation is that <function>in_range</function>
377     functions should, if practical, avoid throwing an error if
378     <replaceable>base</replaceable> <literal>+</literal>
379     <replaceable>offset</replaceable> or
380     <replaceable>base</replaceable> <literal>-</literal>
381     <replaceable>offset</replaceable> would overflow.  The correct
382     comparison result can be determined even if that value would be
383     out of the data type's range.  Note that if the data type
384     includes concepts such as <quote>infinity</quote> or
385     <quote>NaN</quote>, extra care may be needed to ensure that
386     <function>in_range</function>'s results agree with the normal
387     sort order of the operator family.
388    </para>
389
390    <para>
391     The results of the <function>in_range</function> function must be
392     consistent with the sort ordering imposed by the operator family.
393     To be precise, given any fixed values of
394     <replaceable>offset</replaceable> and
395     <replaceable>sub</replaceable>, then:
396     <itemizedlist>
397      <listitem>
398       <para>
399        If <function>in_range</function> with
400        <replaceable>less</replaceable> = true is true for some
401        <replaceable>val1</replaceable> and
402        <replaceable>base</replaceable>, it must be true for every
403        <replaceable>val2</replaceable> <literal>&lt;=</literal>
404        <replaceable>val1</replaceable> with the same
405        <replaceable>base</replaceable>.
406       </para>
407      </listitem>
408      <listitem>
409       <para>
410        If <function>in_range</function> with
411        <replaceable>less</replaceable> = true is false for some
412        <replaceable>val1</replaceable> and
413        <replaceable>base</replaceable>, it must be false for every
414        <replaceable>val2</replaceable> <literal>&gt;=</literal>
415        <replaceable>val1</replaceable> with the same
416        <replaceable>base</replaceable>.
417       </para>
418      </listitem>
419      <listitem>
420       <para>
421        If <function>in_range</function> with
422        <replaceable>less</replaceable> = true is true for some
423        <replaceable>val</replaceable> and
424        <replaceable>base1</replaceable>, it must be true for every
425        <replaceable>base2</replaceable> <literal>&gt;=</literal>
426        <replaceable>base1</replaceable> with the same
427        <replaceable>val</replaceable>.
428       </para>
429      </listitem>
430      <listitem>
431       <para>
432        If <function>in_range</function> with
433        <replaceable>less</replaceable> = true is false for some
434        <replaceable>val</replaceable> and
435        <replaceable>base1</replaceable>, it must be false for every
436        <replaceable>base2</replaceable> <literal>&lt;=</literal>
437        <replaceable>base1</replaceable> with the same
438        <replaceable>val</replaceable>.
439       </para>
440      </listitem>
441     </itemizedlist>
442     Analogous statements with inverted conditions hold when
443     <replaceable>less</replaceable> = false.
444    </para>
445
446    <para>
447     If the type being ordered (<type>type1</type>) is collatable, the
448     appropriate collation OID will be passed to the
449     <function>in_range</function> function, using the standard
450     PG_GET_COLLATION() mechanism.
451    </para>
452
453    <para>
454     <function>in_range</function> functions need not handle NULL
455     inputs, and typically will be marked strict.
456    </para>
457   </listitem>
458  </varlistentry>
459  <varlistentry>
460   <term><function>equalimage</function></term>
461   <listitem>
462    <para>
463     Optionally, a btree operator family may provide
464     <function>equalimage</function> (<quote>equality implies image
465      equality</quote>) support functions, registered under support
466     function number 4.  These functions allow the core code to
467     determine when it is safe to apply the btree deduplication
468     optimization.  Currently, <function>equalimage</function>
469     functions are only called when building or rebuilding an index.
470    </para>
471    <para>
472     An <function>equalimage</function> function must have the
473     signature
474<synopsis>
475equalimage(<replaceable>opcintype</replaceable> <type>oid</type>) returns bool
476</synopsis>
477     The return value is static information about an operator class
478     and collation.  Returning <literal>true</literal> indicates that
479     the <function>order</function> function for the operator class is
480     guaranteed to only return <literal>0</literal> (<quote>arguments
481      are equal</quote>) when its <replaceable>A</replaceable> and
482     <replaceable>B</replaceable> arguments are also interchangeable
483     without any loss of semantic information.  Not registering an
484     <function>equalimage</function> function or returning
485     <literal>false</literal> indicates that this condition cannot be
486     assumed to hold.
487    </para>
488    <para>
489     The <replaceable>opcintype</replaceable> argument is the
490     <literal><structname>pg_type</structname>.oid</literal> of the
491     data type that the operator class indexes.  This is a convenience
492     that allows reuse of the same underlying
493     <function>equalimage</function> function across operator classes.
494     If <replaceable>opcintype</replaceable> is a collatable data
495     type, the appropriate collation OID will be passed to the
496     <function>equalimage</function> function, using the standard
497     <function>PG_GET_COLLATION()</function> mechanism.
498    </para>
499    <para>
500     As far as the operator class is concerned, returning
501     <literal>true</literal> indicates that deduplication is safe (or
502     safe for the collation whose OID was passed to its
503     <function>equalimage</function> function).  However, the core
504     code will only deem deduplication safe for an index when
505     <emphasis>every</emphasis> indexed column uses an operator class
506     that registers an <function>equalimage</function> function, and
507     each function actually returns <literal>true</literal> when
508     called.
509    </para>
510    <para>
511     Image equality is <emphasis>almost</emphasis> the same condition
512     as simple bitwise equality.  There is one subtle difference: When
513     indexing a varlena data type, the on-disk representation of two
514     image equal datums may not be bitwise equal due to inconsistent
515     application of <acronym>TOAST</acronym> compression on input.
516     Formally, when an operator class's
517     <function>equalimage</function> function returns
518     <literal>true</literal>, it is safe to assume that the
519     <literal>datum_image_eq()</literal> C function will always agree
520     with the operator class's <function>order</function> function
521     (provided that the same collation OID is passed to both the
522     <function>equalimage</function> and <function>order</function>
523     functions).
524    </para>
525    <para>
526     The core code is fundamentally unable to deduce anything about
527     the <quote>equality implies image equality</quote> status of an
528     operator class within a multiple-data-type family based on
529     details from other operator classes in the same family.  Also, it
530     is not sensible for an operator family to register a cross-type
531     <function>equalimage</function> function, and attempting to do so
532     will result in an error.  This is because <quote>equality implies
533      image equality</quote> status does not just depend on
534     sorting/equality semantics, which are more or less defined at the
535     operator family level.  In general, the semantics that one
536     particular data type implements must be considered separately.
537    </para>
538    <para>
539     The convention followed by the operator classes included with the
540     core <productname>PostgreSQL</productname> distribution is to
541     register a stock, generic <function>equalimage</function>
542     function.  Most operator classes register
543     <function>btequalimage()</function>, which indicates that
544     deduplication is safe unconditionally.  Operator classes for
545     collatable data types such as <type>text</type> register
546     <function>btvarstrequalimage()</function>, which indicates that
547     deduplication is safe with deterministic collations.  Best
548     practice for third-party extensions is to register their own
549     custom function to retain control.
550    </para>
551   </listitem>
552  </varlistentry>
553  <varlistentry>
554   <term><function>options</function></term>
555   <listitem>
556    <para>
557     Optionally, a B-tree operator family may provide
558     <function>options</function> (<quote>operator class specific
559     options</quote>) support functions, registered under support
560     function number 5.  These functions define a set of user-visible
561     parameters that control operator class behavior.
562    </para>
563    <para>
564     An <function>options</function> support function must have the
565     signature
566<synopsis>
567options(<replaceable>relopts</replaceable> <type>local_relopts *</type>) returns void
568</synopsis>
569     The function is passed a pointer to a <replaceable>local_relopts</replaceable>
570     struct, which needs to be filled with a set of operator class
571     specific options.  The options can be accessed from other support
572     functions using the <literal>PG_HAS_OPCLASS_OPTIONS()</literal> and
573     <literal>PG_GET_OPCLASS_OPTIONS()</literal> macros.
574    </para>
575    <para>
576     Currently, no B-Tree operator class has an <function>options</function>
577     support function.  B-tree doesn't allow flexible representation of keys
578     like GiST, SP-GiST, GIN and BRIN do.  So, <function>options</function>
579     probably doesn't have much application in the current B-tree index
580     access method.  Nevertheless, this support function was added to B-tree
581     for uniformity, and will probably find uses during further
582     evolution of B-tree in <productname>PostgreSQL</productname>.
583    </para>
584   </listitem>
585  </varlistentry>
586 </variablelist>
587
588</sect1>
589
590<sect1 id="btree-implementation">
591 <title>Implementation</title>
592
593 <para>
594  This section covers B-Tree index implementation details that may be
595  of use to advanced users.  See
596  <filename>src/backend/access/nbtree/README</filename> in the source
597  distribution for a much more detailed, internals-focused description
598  of the B-Tree implementation.
599 </para>
600 <sect2 id="btree-structure">
601  <title>B-Tree Structure</title>
602  <para>
603   <productname>PostgreSQL</productname> B-Tree indexes are
604   multi-level tree structures, where each level of the tree can be
605   used as a doubly-linked list of pages.  A single metapage is stored
606   in a fixed position at the start of the first segment file of the
607   index.  All other pages are either leaf pages or internal pages.
608   Leaf pages are the pages on the lowest level of the tree.  All
609   other levels consist of internal pages.  Each leaf page contains
610   tuples that point to table rows.  Each internal page contains
611   tuples that point to the next level down in the tree.  Typically,
612   over 99% of all pages are leaf pages.  Both internal pages and leaf
613   pages use the standard page format described in <xref
614    linkend="storage-page-layout"/>.
615  </para>
616  <para>
617   New leaf pages are added to a B-Tree index when an existing leaf
618   page cannot fit an incoming tuple.  A <firstterm>page
619    split</firstterm> operation makes room for items that originally
620   belonged on the overflowing page by moving a portion of the items
621   to a new page.  Page splits must also insert a new
622   <firstterm>downlink</firstterm> to the new page in the parent page,
623   which may cause the parent to split in turn.  Page splits
624   <quote>cascade upwards</quote> in a recursive fashion.  When the
625   root page finally cannot fit a new downlink, a <firstterm>root page
626    split</firstterm> operation takes place.  This adds a new level to
627   the tree structure by creating a new root page that is one level
628   above the original root page.
629  </para>
630 </sect2>
631
632 <sect2 id="btree-deduplication">
633  <title>Deduplication</title>
634  <para>
635   A duplicate is a leaf page tuple (a tuple that points to a table
636   row) where <emphasis>all</emphasis> indexed key columns have values
637   that match corresponding column values from at least one other leaf
638   page tuple in the same index.  Duplicate tuples are quite common in
639   practice.  B-Tree indexes can use a special, space-efficient
640   representation for duplicates when an optional technique is
641   enabled: <firstterm>deduplication</firstterm>.
642  </para>
643  <para>
644   Deduplication works by periodically merging groups of duplicate
645   tuples together, forming a single <firstterm>posting list</firstterm> tuple for each
646   group.  The column key value(s) only appear once in this
647   representation.  This is followed by a sorted array of
648   <acronym>TID</acronym>s that point to rows in the table.  This
649   significantly reduces the storage size of indexes where each value
650   (or each distinct combination of column values) appears several
651   times on average.  The latency of queries can be reduced
652   significantly.  Overall query throughput may increase
653   significantly.  The overhead of routine index vacuuming may also be
654   reduced significantly.
655  </para>
656  <note>
657   <para>
658    B-Tree deduplication is just as effective with
659    <quote>duplicates</quote> that contain a NULL value, even though
660    NULL values are never equal to each other according to the
661    <literal>=</literal> member of any B-Tree operator class.  As far
662    as any part of the implementation that understands the on-disk
663    B-Tree structure is concerned, NULL is just another value from the
664    domain of indexed values.
665   </para>
666  </note>
667  <para>
668   The deduplication process occurs lazily, when a new item is
669   inserted that cannot fit on an existing leaf page.  This prevents
670   (or at least delays) leaf page splits.  Unlike GIN posting list
671   tuples, B-Tree posting list tuples do not need to expand every time
672   a new duplicate is inserted; they are merely an alternative
673   physical representation of the original logical contents of the
674   leaf page.  This design prioritizes consistent performance with
675   mixed read-write workloads.  Most client applications will at least
676   see a moderate performance benefit from using deduplication.
677   Deduplication is enabled by default.
678  </para>
679  <para>
680   <command>CREATE INDEX</command> and <command>REINDEX</command>
681   apply deduplication to create posting list tuples, though the
682   strategy they use is slightly different.  Each group of duplicate
683   ordinary tuples encountered in the sorted input taken from the
684   table is merged into a posting list tuple
685   <emphasis>before</emphasis> being added to the current pending leaf
686   page.  Individual posting list tuples are packed with as many
687   <acronym>TID</acronym>s as possible.  Leaf pages are written out in
688   the usual way, without any separate deduplication pass.  This
689   strategy is well-suited to <command>CREATE INDEX</command> and
690   <command>REINDEX</command> because they are once-off batch
691   operations.
692  </para>
693  <para>
694   Write-heavy workloads that don't benefit from deduplication due to
695   having few or no duplicate values in indexes will incur a small,
696   fixed performance penalty (unless deduplication is explicitly
697   disabled).  The <literal>deduplicate_items</literal> storage
698   parameter can be used to disable deduplication within individual
699   indexes.  There is never any performance penalty with read-only
700   workloads, since reading posting list tuples is at least as
701   efficient as reading the standard tuple representation.  Disabling
702   deduplication isn't usually helpful.
703  </para>
704  <para>
705   B-Tree indexes are not directly aware that under MVCC, there might
706   be multiple extant versions of the same logical table row; to an
707   index, each tuple is an independent object that needs its own index
708   entry.  <quote>Version duplicates</quote> may sometimes accumulate
709   and adversely affect query latency and throughput.  This typically
710   occurs with <command>UPDATE</command>-heavy workloads where most
711   individual updates cannot apply the <acronym>HOT</acronym>
712   optimization (often because at least one indexed column gets
713   modified, necessitating a new set of index tuple versions &mdash;
714   one new tuple for <emphasis>each and every</emphasis> index).  In
715   effect, B-Tree deduplication ameliorates index bloat caused by
716   version churn.  Note that even the tuples from a unique index are
717   not necessarily <emphasis>physically</emphasis> unique when stored
718   on disk due to version churn.  The deduplication optimization is
719   selectively applied within unique indexes.  It targets those pages
720   that appear to have version duplicates.  The high level goal is to
721   give <command>VACUUM</command> more time to run before an
722   <quote>unnecessary</quote> page split caused by version churn can
723   take place.
724  </para>
725  <tip>
726   <para>
727    A special heuristic is applied to determine whether a
728    deduplication pass in a unique index should take place.  It can
729    often skip straight to splitting a leaf page, avoiding a
730    performance penalty from wasting cycles on unhelpful deduplication
731    passes.  If you're concerned about the overhead of deduplication,
732    consider setting <literal>deduplicate_items = off</literal>
733    selectively.  Leaving deduplication enabled in unique indexes has
734    little downside.
735   </para>
736  </tip>
737  <para>
738   Deduplication cannot be used in all cases due to
739   implementation-level restrictions.  Deduplication safety is
740   determined when <command>CREATE INDEX</command> or
741   <command>REINDEX</command> is run.
742  </para>
743  <para>
744   Note that deduplication is deemed unsafe and cannot be used in the
745   following cases involving semantically significant differences
746   among equal datums:
747  </para>
748  <para>
749   <itemizedlist>
750    <listitem>
751     <para>
752      <type>text</type>, <type>varchar</type>, and <type>char</type>
753      cannot use deduplication when a
754      <emphasis>nondeterministic</emphasis> collation is used.  Case
755      and accent differences must be preserved among equal datums.
756     </para>
757    </listitem>
758
759    <listitem>
760     <para>
761      <type>numeric</type> cannot use deduplication.  Numeric display
762      scale must be preserved among equal datums.
763     </para>
764    </listitem>
765
766    <listitem>
767     <para>
768      <type>jsonb</type> cannot use deduplication, since the
769      <type>jsonb</type> B-Tree operator class uses
770      <type>numeric</type> internally.
771     </para>
772    </listitem>
773
774    <listitem>
775     <para>
776      <type>float4</type> and <type>float8</type> cannot use
777      deduplication.  These types have distinct representations for
778      <literal>-0</literal> and <literal>0</literal>, which are
779      nevertheless considered equal.  This difference must be
780      preserved.
781     </para>
782    </listitem>
783   </itemizedlist>
784  </para>
785  <para>
786   There is one further implementation-level restriction that may be
787   lifted in a future version of
788   <productname>PostgreSQL</productname>:
789  </para>
790  <para>
791   <itemizedlist>
792    <listitem>
793     <para>
794      Container types (such as composite types, arrays, or range
795      types) cannot use deduplication.
796     </para>
797    </listitem>
798   </itemizedlist>
799  </para>
800  <para>
801   There is one further implementation-level restriction that applies
802   regardless of the operator class or collation used:
803  </para>
804  <para>
805   <itemizedlist>
806    <listitem>
807     <para>
808      <literal>INCLUDE</literal> indexes can never use deduplication.
809     </para>
810    </listitem>
811   </itemizedlist>
812  </para>
813
814 </sect2>
815</sect1>
816
817</chapter>
818