1<!-- doc/src/sgml/bloom.sgml -->
2
3<sect1 id="bloom" xreflabel="bloom">
4 <title>bloom</title>
5
6 <indexterm zone="bloom">
7  <primary>bloom</primary>
8 </indexterm>
9
10 <para>
11  <literal>bloom</literal> provides an index access method based on
12  <ulink url="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</ulink>.
13 </para>
14
15 <para>
16  A Bloom filter is a space-efficient data structure that is used to test
17  whether an element is a member of a set.  In the case of an index access
18  method, it allows fast exclusion of non-matching tuples via signatures
19  whose size is determined at index creation.
20 </para>
21
22 <para>
23  A signature is a lossy representation of the indexed attribute(s), and as
24  such is prone to reporting false positives; that is, it may be reported
25  that an element is in the set, when it is not.  So index search results
26  must always be rechecked using the actual attribute values from the heap
27  entry.  Larger signatures reduce the odds of a false positive and thus
28  reduce the number of useless heap visits, but of course also make the index
29  larger and hence slower to scan.
30 </para>
31
32 <para>
33  This type of index is most useful when a table has many attributes and
34  queries test arbitrary combinations of them.  A traditional btree index is
35  faster than a bloom index, but it can require many btree indexes to support
36  all possible queries where one needs only a single bloom index.  Note
37  however that bloom indexes only support equality queries, whereas btree
38  indexes can also perform inequality and range searches.
39 </para>
40
41 <sect2>
42  <title>Parameters</title>
43
44  <para>
45   A <literal>bloom</literal> index accepts the following parameters in its
46   <literal>WITH</literal> clause:
47  </para>
48
49   <variablelist>
50   <varlistentry>
51    <term><literal>length</literal></term>
52    <listitem>
53     <para>
54      Length of each signature (index entry) in bits. It is rounded up to the
55      nearest multiple of <literal>16</literal>. The default is
56      <literal>80</literal> bits and the maximum is <literal>4096</literal>.
57     </para>
58    </listitem>
59   </varlistentry>
60   </variablelist>
61   <variablelist>
62   <varlistentry>
63    <term><literal>col1 &mdash; col32</literal></term>
64    <listitem>
65     <para>
66      Number of bits generated for each index column. Each parameter's name
67      refers to the number of the index column that it controls.  The default
68      is <literal>2</literal> bits and the maximum is <literal>4095</literal>.
69      Parameters for index columns not actually used are ignored.
70     </para>
71    </listitem>
72   </varlistentry>
73   </variablelist>
74 </sect2>
75
76 <sect2>
77  <title>Examples</title>
78
79  <para>
80   This is an example of creating a bloom index:
81  </para>
82
83<programlisting>
84CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3)
85       WITH (length=80, col1=2, col2=2, col3=4);
86</programlisting>
87
88  <para>
89   The index is created with a signature length of 80 bits, with attributes
90   i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits.  We could
91   have omitted the <literal>length</literal>, <literal>col1</literal>,
92   and <literal>col2</literal> specifications since those have the default values.
93  </para>
94
95  <para>
96   Here is a more complete example of bloom index definition and usage, as
97   well as a comparison with equivalent btree indexes.  The bloom index is
98   considerably smaller than the btree index, and can perform better.
99  </para>
100
101<programlisting>
102=# CREATE TABLE tbloom AS
103   SELECT
104     (random() * 1000000)::int as i1,
105     (random() * 1000000)::int as i2,
106     (random() * 1000000)::int as i3,
107     (random() * 1000000)::int as i4,
108     (random() * 1000000)::int as i5,
109     (random() * 1000000)::int as i6
110   FROM
111  generate_series(1,10000000);
112SELECT 10000000
113</programlisting>
114
115  <para>
116   A sequential scan over this large table takes a long time:
117<programlisting>
118=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
119                                              QUERY PLAN
120------------------------------------------------------------------------------------------------------
121 Seq Scan on tbloom  (cost=0.00..2137.14 rows=3 width=24) (actual time=18.372..18.373 rows=0 loops=1)
122   Filter: ((i2 = 898732) AND (i5 = 123451))
123   Rows Removed by Filter: 100000
124 Planning Time: 0.400 ms
125 Execution Time: 18.397 ms
126(5 rows)
127</programlisting>
128  </para>
129
130  <para>
131   Even with the btree index defined the result will still be a
132   sequential scan:
133<programlisting>
134=# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6);
135CREATE INDEX
136=# SELECT pg_size_pretty(pg_relation_size('btreeidx'));
137 pg_size_pretty
138----------------
139 3992 kB
140(1 row)
141=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
142                                              QUERY PLAN
143------------------------------------------------------------------------------------------------------
144 Seq Scan on tbloom  (cost=0.00..2137.00 rows=2 width=24) (actual time=11.880..11.881 rows=0 loops=1)
145   Filter: ((i2 = 898732) AND (i5 = 123451))
146   Rows Removed by Filter: 100000
147 Planning Time: 0.154 ms
148 Execution Time: 11.896 ms
149(5 rows)
150</programlisting>
151  </para>
152
153  <para>
154   Having the bloom index defined on the table is better than btree in
155   handling this type of search:
156<programlisting>
157=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6);
158CREATE INDEX
159=# SELECT pg_size_pretty(pg_relation_size('bloomidx'));
160 pg_size_pretty
161----------------
162 1584 kB
163(1 row)
164=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
165                                                     QUERY PLAN
166---------------------------------------------------------------------------------------------------------------------
167 Bitmap Heap Scan on tbloom  (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1)
168   Recheck Cond: ((i2 = 898732) AND (i5 = 123451))
169   Rows Removed by Index Recheck: 25
170   Heap Blocks: exact=22
171   -&gt;  Bitmap Index Scan on bloomidx  (cost=0.00..1792.00 rows=2 width=0) (actual time=0.358..0.358 rows=25 loops=1)
172         Index Cond: ((i2 = 898732) AND (i5 = 123451))
173 Planning Time: 0.118 ms
174 Execution Time: 0.412 ms
175(8 rows)
176</programlisting>
177  </para>
178
179  <para>
180   Now, the main problem with the btree search is that btree is inefficient
181   when the search conditions do not constrain the leading index column(s).
182   A better strategy for btree is to create a separate index on each column.
183   Then the planner will choose something like this:
184<programlisting>
185=# CREATE INDEX btreeidx1 ON tbloom (i1);
186CREATE INDEX
187=# CREATE INDEX btreeidx2 ON tbloom (i2);
188CREATE INDEX
189=# CREATE INDEX btreeidx3 ON tbloom (i3);
190CREATE INDEX
191=# CREATE INDEX btreeidx4 ON tbloom (i4);
192CREATE INDEX
193=# CREATE INDEX btreeidx5 ON tbloom (i5);
194CREATE INDEX
195=# CREATE INDEX btreeidx6 ON tbloom (i6);
196CREATE INDEX
197=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451;
198                                                        QUERY PLAN
199---------------------------------------------------------------------------------------------------------------------------
200 Bitmap Heap Scan on tbloom  (cost=24.34..32.03 rows=2 width=24) (actual time=0.036..0.037 rows=0 loops=1)
201   Recheck Cond: ((i5 = 123451) AND (i2 = 898732))
202   -&gt;  BitmapAnd  (cost=24.34..24.34 rows=2 width=0) (actual time=0.033..0.034 rows=0 loops=1)
203         -&gt;  Bitmap Index Scan on btreeidx5  (cost=0.00..12.04 rows=500 width=0) (actual time=0.032..0.032 rows=0 loops=1)
204               Index Cond: (i5 = 123451)
205         -&gt;  Bitmap Index Scan on btreeidx2  (cost=0.00..12.04 rows=500 width=0) (never executed)
206               Index Cond: (i2 = 898732)
207 Planning Time: 0.531 ms
208 Execution Time: 0.072 ms
209(9 rows)
210</programlisting>
211   Although this query runs much faster than with either of the single
212   indexes, we pay a penalty in index size.  Each of the single-column
213   btree indexes occupies 2 MB, so the total space needed is 12 MB,
214   eight times the space used by the bloom index.
215  </para>
216 </sect2>
217
218 <sect2>
219  <title>Operator Class Interface</title>
220
221  <para>
222   An operator class for bloom indexes requires only a hash function for the
223   indexed data type and an equality operator for searching. This example
224   shows the operator class definition for the <type>text</type> data type:
225  </para>
226
227<programlisting>
228CREATE OPERATOR CLASS text_ops
229DEFAULT FOR TYPE text USING bloom AS
230    OPERATOR    1   =(text, text),
231    FUNCTION    1   hashtext(text);
232</programlisting>
233 </sect2>
234
235 <sect2>
236  <title>Limitations</title>
237  <para>
238   <itemizedlist>
239    <listitem>
240     <para>
241      Only operator classes for <type>int4</type> and <type>text</type> are
242      included with the module.
243     </para>
244    </listitem>
245
246    <listitem>
247     <para>
248      Only the <literal>=</literal> operator is supported for search.  But
249      it is possible to add support for arrays with union and intersection
250      operations in the future.
251     </para>
252    </listitem>
253
254    <listitem>
255     <para>
256       <literal>bloom</literal> access method doesn't support
257       <literal>UNIQUE</literal> indexes.
258     </para>
259    </listitem>
260
261    <listitem>
262     <para>
263       <literal>bloom</literal> access method doesn't support searching for
264       <literal>NULL</literal> values.
265     </para>
266    </listitem>
267   </itemizedlist>
268  </para>
269 </sect2>
270
271 <sect2>
272  <title>Authors</title>
273
274  <para>
275   Teodor Sigaev <email>teodor@postgrespro.ru</email>,
276   Postgres Professional, Moscow, Russia
277  </para>
278
279  <para>
280   Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>,
281   Postgres Professional, Moscow, Russia
282  </para>
283
284  <para>
285   Oleg Bartunov <email>obartunov@postgrespro.ru</email>,
286   Postgres Professional, Moscow, Russia
287  </para>
288 </sect2>
289
290</sect1>
291