1<!-- doc/src/sgml/bloom.sgml --> 2 3<sect1 id="bloom" xreflabel="bloom"> 4 <title>bloom</title> 5 6 <indexterm zone="bloom"> 7 <primary>bloom</primary> 8 </indexterm> 9 10 <para> 11 <literal>bloom</literal> provides an index access method based on 12 <ulink url="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</ulink>. 13 </para> 14 15 <para> 16 A Bloom filter is a space-efficient data structure that is used to test 17 whether an element is a member of a set. In the case of an index access 18 method, it allows fast exclusion of non-matching tuples via signatures 19 whose size is determined at index creation. 20 </para> 21 22 <para> 23 A signature is a lossy representation of the indexed attribute(s), and as 24 such is prone to reporting false positives; that is, it may be reported 25 that an element is in the set, when it is not. So index search results 26 must always be rechecked using the actual attribute values from the heap 27 entry. Larger signatures reduce the odds of a false positive and thus 28 reduce the number of useless heap visits, but of course also make the index 29 larger and hence slower to scan. 30 </para> 31 32 <para> 33 This type of index is most useful when a table has many attributes and 34 queries test arbitrary combinations of them. A traditional btree index is 35 faster than a bloom index, but it can require many btree indexes to support 36 all possible queries where one needs only a single bloom index. Note 37 however that bloom indexes only support equality queries, whereas btree 38 indexes can also perform inequality and range searches. 39 </para> 40 41 <sect2> 42 <title>Parameters</title> 43 44 <para> 45 A <literal>bloom</literal> index accepts the following parameters in its 46 <literal>WITH</literal> clause: 47 </para> 48 49 <variablelist> 50 <varlistentry> 51 <term><literal>length</literal></term> 52 <listitem> 53 <para> 54 Length of each signature (index entry) in bits. It is rounded up to the 55 nearest multiple of <literal>16</literal>. The default is 56 <literal>80</literal> bits and the maximum is <literal>4096</literal>. 57 </para> 58 </listitem> 59 </varlistentry> 60 </variablelist> 61 <variablelist> 62 <varlistentry> 63 <term><literal>col1 — col32</literal></term> 64 <listitem> 65 <para> 66 Number of bits generated for each index column. Each parameter's name 67 refers to the number of the index column that it controls. The default 68 is <literal>2</literal> bits and the maximum is <literal>4095</literal>. 69 Parameters for index columns not actually used are ignored. 70 </para> 71 </listitem> 72 </varlistentry> 73 </variablelist> 74 </sect2> 75 76 <sect2> 77 <title>Examples</title> 78 79 <para> 80 This is an example of creating a bloom index: 81 </para> 82 83<programlisting> 84CREATE INDEX bloomidx ON tbloom USING bloom (i1,i2,i3) 85 WITH (length=80, col1=2, col2=2, col3=4); 86</programlisting> 87 88 <para> 89 The index is created with a signature length of 80 bits, with attributes 90 i1 and i2 mapped to 2 bits, and attribute i3 mapped to 4 bits. We could 91 have omitted the <literal>length</literal>, <literal>col1</literal>, 92 and <literal>col2</literal> specifications since those have the default values. 93 </para> 94 95 <para> 96 Here is a more complete example of bloom index definition and usage, as 97 well as a comparison with equivalent btree indexes. The bloom index is 98 considerably smaller than the btree index, and can perform better. 99 </para> 100 101<programlisting> 102=# CREATE TABLE tbloom AS 103 SELECT 104 (random() * 1000000)::int as i1, 105 (random() * 1000000)::int as i2, 106 (random() * 1000000)::int as i3, 107 (random() * 1000000)::int as i4, 108 (random() * 1000000)::int as i5, 109 (random() * 1000000)::int as i6 110 FROM 111 generate_series(1,10000000); 112SELECT 10000000 113</programlisting> 114 115 <para> 116 A sequential scan over this large table takes a long time: 117<programlisting> 118=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; 119 QUERY PLAN 120------------------------------------------------------------------------------------------------------ 121 Seq Scan on tbloom (cost=0.00..2137.14 rows=3 width=24) (actual time=18.372..18.373 rows=0 loops=1) 122 Filter: ((i2 = 898732) AND (i5 = 123451)) 123 Rows Removed by Filter: 100000 124 Planning Time: 0.400 ms 125 Execution Time: 18.397 ms 126(5 rows) 127</programlisting> 128 </para> 129 130 <para> 131 Even with the btree index defined the result will still be a 132 sequential scan: 133<programlisting> 134=# CREATE INDEX btreeidx ON tbloom (i1, i2, i3, i4, i5, i6); 135CREATE INDEX 136=# SELECT pg_size_pretty(pg_relation_size('btreeidx')); 137 pg_size_pretty 138---------------- 139 3992 kB 140(1 row) 141=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; 142 QUERY PLAN 143------------------------------------------------------------------------------------------------------ 144 Seq Scan on tbloom (cost=0.00..2137.00 rows=2 width=24) (actual time=11.880..11.881 rows=0 loops=1) 145 Filter: ((i2 = 898732) AND (i5 = 123451)) 146 Rows Removed by Filter: 100000 147 Planning Time: 0.154 ms 148 Execution Time: 11.896 ms 149(5 rows) 150</programlisting> 151 </para> 152 153 <para> 154 Having the bloom index defined on the table is better than btree in 155 handling this type of search: 156<programlisting> 157=# CREATE INDEX bloomidx ON tbloom USING bloom (i1, i2, i3, i4, i5, i6); 158CREATE INDEX 159=# SELECT pg_size_pretty(pg_relation_size('bloomidx')); 160 pg_size_pretty 161---------------- 162 1584 kB 163(1 row) 164=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; 165 QUERY PLAN 166--------------------------------------------------------------------------------------------------------------------- 167 Bitmap Heap Scan on tbloom (cost=1792.00..1799.69 rows=2 width=24) (actual time=0.388..0.388 rows=0 loops=1) 168 Recheck Cond: ((i2 = 898732) AND (i5 = 123451)) 169 Rows Removed by Index Recheck: 25 170 Heap Blocks: exact=22 171 -> Bitmap Index Scan on bloomidx (cost=0.00..1792.00 rows=2 width=0) (actual time=0.358..0.358 rows=25 loops=1) 172 Index Cond: ((i2 = 898732) AND (i5 = 123451)) 173 Planning Time: 0.118 ms 174 Execution Time: 0.412 ms 175(8 rows) 176</programlisting> 177 </para> 178 179 <para> 180 Now, the main problem with the btree search is that btree is inefficient 181 when the search conditions do not constrain the leading index column(s). 182 A better strategy for btree is to create a separate index on each column. 183 Then the planner will choose something like this: 184<programlisting> 185=# CREATE INDEX btreeidx1 ON tbloom (i1); 186CREATE INDEX 187=# CREATE INDEX btreeidx2 ON tbloom (i2); 188CREATE INDEX 189=# CREATE INDEX btreeidx3 ON tbloom (i3); 190CREATE INDEX 191=# CREATE INDEX btreeidx4 ON tbloom (i4); 192CREATE INDEX 193=# CREATE INDEX btreeidx5 ON tbloom (i5); 194CREATE INDEX 195=# CREATE INDEX btreeidx6 ON tbloom (i6); 196CREATE INDEX 197=# EXPLAIN ANALYZE SELECT * FROM tbloom WHERE i2 = 898732 AND i5 = 123451; 198 QUERY PLAN 199--------------------------------------------------------------------------------------------------------------------------- 200 Bitmap Heap Scan on tbloom (cost=24.34..32.03 rows=2 width=24) (actual time=0.036..0.037 rows=0 loops=1) 201 Recheck Cond: ((i5 = 123451) AND (i2 = 898732)) 202 -> BitmapAnd (cost=24.34..24.34 rows=2 width=0) (actual time=0.033..0.034 rows=0 loops=1) 203 -> Bitmap Index Scan on btreeidx5 (cost=0.00..12.04 rows=500 width=0) (actual time=0.032..0.032 rows=0 loops=1) 204 Index Cond: (i5 = 123451) 205 -> Bitmap Index Scan on btreeidx2 (cost=0.00..12.04 rows=500 width=0) (never executed) 206 Index Cond: (i2 = 898732) 207 Planning Time: 0.531 ms 208 Execution Time: 0.072 ms 209(9 rows) 210</programlisting> 211 Although this query runs much faster than with either of the single 212 indexes, we pay a penalty in index size. Each of the single-column 213 btree indexes occupies 2 MB, so the total space needed is 12 MB, 214 eight times the space used by the bloom index. 215 </para> 216 </sect2> 217 218 <sect2> 219 <title>Operator Class Interface</title> 220 221 <para> 222 An operator class for bloom indexes requires only a hash function for the 223 indexed data type and an equality operator for searching. This example 224 shows the operator class definition for the <type>text</type> data type: 225 </para> 226 227<programlisting> 228CREATE OPERATOR CLASS text_ops 229DEFAULT FOR TYPE text USING bloom AS 230 OPERATOR 1 =(text, text), 231 FUNCTION 1 hashtext(text); 232</programlisting> 233 </sect2> 234 235 <sect2> 236 <title>Limitations</title> 237 <para> 238 <itemizedlist> 239 <listitem> 240 <para> 241 Only operator classes for <type>int4</type> and <type>text</type> are 242 included with the module. 243 </para> 244 </listitem> 245 246 <listitem> 247 <para> 248 Only the <literal>=</literal> operator is supported for search. But 249 it is possible to add support for arrays with union and intersection 250 operations in the future. 251 </para> 252 </listitem> 253 254 <listitem> 255 <para> 256 <literal>bloom</literal> access method doesn't support 257 <literal>UNIQUE</literal> indexes. 258 </para> 259 </listitem> 260 261 <listitem> 262 <para> 263 <literal>bloom</literal> access method doesn't support searching for 264 <literal>NULL</literal> values. 265 </para> 266 </listitem> 267 </itemizedlist> 268 </para> 269 </sect2> 270 271 <sect2> 272 <title>Authors</title> 273 274 <para> 275 Teodor Sigaev <email>teodor@postgrespro.ru</email>, 276 Postgres Professional, Moscow, Russia 277 </para> 278 279 <para> 280 Alexander Korotkov <email>a.korotkov@postgrespro.ru</email>, 281 Postgres Professional, Moscow, Russia 282 </para> 283 284 <para> 285 Oleg Bartunov <email>obartunov@postgrespro.ru</email>, 286 Postgres Professional, Moscow, Russia 287 </para> 288 </sect2> 289 290</sect1> 291