README

Extended statistics
===================

When estimating various quantities (e.g. condition selectivities) the default
approach relies on the assumption of independence. In practice that's often
not true, resulting in estimation errors.

Extended statistics track different types of dependencies between the columns,
hopefully improving the estimates and producing better plans.


Types of statistics
-------------------

There are currently three kinds of extended statistics:

    (a) ndistinct coefficients

    (b) soft functional dependencies (README.dependencies)

    (c) MCV lists (README.mcv)


Compatible clause types
-----------------------

Each type of statistics may be used to estimate some subset of clause types.

    (a) functional dependencies - equality clauses (AND), possibly IS NULL

    (b) MCV lists - equality and inequality clauses (AND, OR, NOT), IS [NOT] NULL

Currently, only OpExprs in the form Var op Const or Const op Var are
supported. However, it's feasible to expand the code later to also estimate
the selectivities of clauses such as Var op Var.


Complex clauses
---------------

We also support estimating more complex clauses - essentially AND/OR clauses
with (Var op Const) as leaves, as long as all the referenced attributes are
covered by a single statistics object.

For example this condition

    (a=1) AND ((b=2) OR ((c=3) AND (d=4)))

may be estimated using statistics on (a,b,c,d). If we only have statistics on
(b,c,d), we may estimate the second part, and estimate (a=1) using simple
per-column stats.

If we only have statistics on (a,b,c), we can't apply them to the whole clause
at this point, but it's worth pointing out that clauselist_selectivity() works
recursively, so when handling the second part (the OR-clause) we'll be able to
apply the statistics.

Note: The multi-statistics estimation patch also makes it possible to pass some
clauses as 'conditions' into the deeper parts of the expression tree.


Selectivity estimation
----------------------

Throughout the planner, clauselist_selectivity() remains in charge of most
selectivity estimate requests. clauselist_selectivity() can be instructed to
try to make use of any extended statistics on the given RelOptInfo, which it
will do if:

    (a) An actual valid RelOptInfo was given. Join relations are passed in as
        NULL and are therefore invalid.

    (b) The given relation actually has some extended statistics defined, and
        those statistics have actually been built.

When the above conditions are met, clauselist_selectivity() first attempts to
pass the clause list off to the extended statistics selectivity estimation
function. These functions may not find any clauses which they can perform any
estimation on. In such cases these clauses are simply ignored. When actual
estimation work is performed in these functions, they're expected to mark
which clauses they've performed estimations for, so that any other function
performing estimations knows which clauses are to be skipped.
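
This division of labor can be sketched as follows (a minimal Python
illustration of the contract, not the actual C code; the estimator callbacks
and the returned index set are hypothetical simplifications):

```python
def clauselist_selectivity_sketch(clauses, extended_estimator, simple_estimator):
    """Illustrative only: combine an extended-statistics estimate with
    per-clause estimates under the independence assumption.

    extended_estimator(clauses) returns (selectivity, estimated_indexes),
    where estimated_indexes marks the clauses it already covered.
    """
    sel, estimated = extended_estimator(clauses)
    for i, clause in enumerate(clauses):
        if i in estimated:
            continue                       # already covered by extended stats
        sel *= simple_estimator(clause)    # fall back to independence
    return sel
```

If the extended estimator covers no clauses, the result degrades gracefully
to the plain independence-based product over all clauses.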

Size of sample in ANALYZE
-------------------------

When performing ANALYZE, the number of rows to sample is determined as

    (300 * statistics_target)

That works reasonably well for statistics on individual columns, but perhaps
it's not enough for extended statistics. Papers analyzing estimation errors
all use samples proportional to the table size (usually finding that 1-3% of
the table is enough to build accurate stats).

The requested accuracy (number of MCV items or histogram bins) should also
be considered when determining the sample size, and in extended statistics
those are not necessarily limited by statistics_target.

This however merits further discussion, because collecting the sample is quite
expensive and increasing it further would make ANALYZE even more painful.
Judging by the experiments with the current implementation, the fixed size
seems to work reasonably well for now, so we leave this as future work.
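
For a back-of-the-envelope comparison (illustrative arithmetic only, not code
from ANALYZE; the 2% fraction is just a value in the middle of the 1-3% range
mentioned above):

```python
def analyze_sample_rows(statistics_target):
    # Fixed-size sample: the formula used by ANALYZE.
    return 300 * statistics_target

def proportional_sample_rows(table_rows, fraction=0.02):
    # A sample proportional to the table, as used in the papers.
    return int(table_rows * fraction)
```

With the default statistics_target of 100 the fixed sample is 30,000 rows,
while a 2% sample of a 10M-row table would be 200,000 rows.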

README.dependencies

Soft functional dependencies
============================

Functional dependencies are a concept well described in relational theory,
particularly in the definition of normalization and "normal forms". Wikipedia
has a nice definition of a functional dependency [1]:

    In a given table, an attribute Y is said to have a functional dependency
    on a set of attributes X (written X -> Y) if and only if each X value is
    associated with precisely one Y value. For example, in an "Employee"
    table that includes the attributes "Employee ID" and "Employee Date of
    Birth", the functional dependency

        {Employee ID} -> {Employee Date of Birth}

    would hold. It follows from the previous two sentences that each
    {Employee ID} is associated with precisely one {Employee Date of Birth}.

    [1] https://en.wikipedia.org/wiki/Functional_dependency

In practical terms, functional dependencies mean that a value in one column
determines values in some other column. Consider for example this trivial
table with two integer columns:

    CREATE TABLE t (a INT, b INT)
        AS SELECT i, i/10 FROM generate_series(1,100000) s(i);

Clearly, knowledge of the value in column 'a' is sufficient to determine the
value in column 'b', as it's simply (a/10). A more practical example may be
addresses, where knowledge of a ZIP code (usually) determines the city. Larger
cities may have multiple ZIP codes, so the dependency can't be reversed.

Many datasets might be normalized not to contain such dependencies, but often
that's not practical for various reasons. In some cases, it's actually a
conscious design choice to model the dataset in a denormalized way, either
for performance or to make querying easier.


Soft dependencies
-----------------

Real-world data sets often contain data errors, either because of data entry
mistakes (user mistyping the ZIP code) or perhaps issues in generating the
data (e.g. a ZIP code mistakenly assigned to two cities in different states).

A strict implementation would either ignore dependencies in such cases,
rendering the approach mostly useless even for slightly noisy data sets, or
result in sudden changes in behavior depending on minor differences between
samples provided to ANALYZE.

For this reason, extended statistics implement "soft" functional dependencies,
associating each functional dependency with a degree of validity (a number
between 0 and 1). This degree is then used to combine selectivities in a
smooth manner.


Mining dependencies (ANALYZE)
-----------------------------

The current algorithm is fairly simple - generate all possible functional
dependencies, and for each one count the number of rows consistent with it.
Then use the fraction of rows (supporting/total) as the degree.

To count the rows consistent with the dependency (a => b):

 (a) Sort the data lexicographically, i.e. first by 'a' then by 'b'.

 (b) For each group of rows with the same 'a' value, count the number of
     distinct values in 'b'.

 (c) If there's a single distinct value in 'b', the rows are consistent with
     the functional dependency; otherwise they contradict it.

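The three steps above can be sketched in Python (an illustrative model of the
counting, not the actual implementation in dependencies.c):

```python
from itertools import groupby

def dependency_degree(rows):
    """Degree of the soft dependency (a => b) for a sample of (a, b) rows:
    the fraction of rows whose 'a'-group has a single distinct 'b' value."""
    # (a) Sort lexicographically, first by 'a' then by 'b'.
    ordered = sorted(rows)
    supporting = 0
    # (b) Walk each group of rows sharing the same 'a' value.
    for _, group in groupby(ordered, key=lambda r: r[0]):
        group = list(group)
        # (c) The group supports the dependency iff 'b' is single-valued.
        if len({b for _, b in group}) == 1:
            supporting += len(group)
    return supporting / len(ordered)
```

For the trivial table above (b = a/10) the degree is 1.0; a few mistyped rows
lower the degree only slightly instead of discarding the dependency outright.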

Clause reduction (planner/optimizer)
------------------------------------

Applying the functional dependencies is fairly simple: given a list of
equality clauses, we compute selectivities of each clause and then use the
degree to combine them using this formula:

    P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?))

where 'd' is the degree of the functional dependency (a => b).

With more than two equality clauses, this process happens recursively. For
example for (a,b,c) we first use (a,b => c) to break the computation into

    P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?))

where 'e' is the degree of the functional dependency (a,b => c); then we can
apply (a => b) the same way on P(a=?,b=?).
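
The recursive combination amounts to a simple left-to-right fold (a Python
sketch of the formula above, not the code in dependencies.c):

```python
def combine_selectivities(sels, degrees):
    """sels:    per-clause equality selectivities [P(a=?), P(b=?), ...]
    degrees: degrees[i] is the degree of the dependency on the (i+1)-th
             column given the previous ones, e.g. (a => b), (a,b => c).

    Applies P(...,x=?) = P(...) * (d + (1-d) * P(x=?)) repeatedly.
    """
    sel = sels[0]
    for s, d in zip(sels[1:], degrees):
        sel *= d + (1.0 - d) * s
    return sel
```

With degree 1 the dependent clause contributes nothing new, so
P(a=?,b=?) = P(a=?); with degree 0 the formula degrades to the plain
independence-based product.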


Consistency of clauses
----------------------

Functional dependencies only express general dependencies between columns,
without referencing particular values. This assumes that the equality clauses
are in fact consistent with the functional dependency, i.e. that given a
dependency (a => b), the value in the (b=?) clause is the value determined by
(a=?). If that's not the case, the clauses are "inconsistent" with the
functional dependency and the result will be over-estimation.

This may happen, for example, when using conditions on the ZIP code and city
name with mismatching values (ZIP code for a different city), etc. In such a
case, the result set will be empty, but we'll estimate the selectivity using
the ZIP code condition.

In this case, the default estimation based on the AVIA principle (attribute
value independence assumption) happens to work better, but mostly by chance.

This issue is the price for the simplicity of functional dependencies. If the
application frequently constructs queries with clauses inconsistent with
functional dependencies present in the data, the best solution is not to
use functional dependencies, but one of the more complex types of statistics.


README.mcv

MCV lists
=========

Multivariate MCV (most-common values) lists are a straightforward extension
of regular MCV lists, tracking the most frequent combinations of values for a
group of attributes.

This works particularly well for columns with a small number of distinct
values, as the list may include all the combinations and approximate the
distribution very accurately.

For columns with a large number of distinct values (e.g. those with continuous
domains), the list will only track the most frequent combinations. If the
distribution is mostly uniform (all combinations about equally frequent), the
MCV list will be empty.

Estimates of some clauses (e.g. equality) based on MCV lists are more accurate
than when using histograms.

Also, MCV lists don't necessarily require sorting of the values (the fact that
we use sorting when building them is an implementation detail), but even more
importantly the ordering is not built into the approximation (while histograms
are built on ordering). So MCV lists work well even for attributes where the
ordering of the data type is disconnected from the meaning of the data. For
example, we know how to sort strings, but it's unlikely to make much sense for
city names (or other label-like attributes).


Selectivity estimation
----------------------

The estimation, implemented in mcv_clauselist_selectivity(), is quite simple
in principle - we need to identify the MCV items matching all the clauses and
sum the frequencies of all those items.
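
In a simplified model (a Python sketch; the real mcv_clauselist_selectivity()
operates on the deserialized MCV items and per-column operator functions):

```python
def mcv_selectivity(mcv_items, clauses):
    """mcv_items: list of (values_tuple, frequency) pairs.
    clauses:   predicates over the values tuple, combined with AND.

    Returns the summed frequency of the items matching all clauses."""
    return sum(freq
               for values, freq in mcv_items
               if all(clause(values) for clause in clauses))
```

The frequencies of the non-matching items (and of values not in the list at
all) contribute nothing to the sum.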

Currently MCV lists support estimation of the following clause types:

    (a) equality clauses      WHERE (a = 1) AND (b = 2)
    (b) inequality clauses    WHERE (a < 1) AND (b >= 2)
    (c) NULL clauses          WHERE (a IS NULL) AND (b IS NOT NULL)
    (d) OR clauses            WHERE (a < 1) OR (b >= 2)

It's possible to add support for additional clauses, for example:

    (e) multi-var clauses     WHERE (a > b)

and possibly others. These are tasks for the future, not yet implemented.


Hashed MCV (not yet implemented)
--------------------------------

Regular MCV lists have to include the actual values for each item, so if those
values are large the list may be quite large. This is especially true for
multivariate MCV lists, although the current implementation partially
mitigates this by de-duplicating the values before storing them on disk.

It's possible to only store hashes (32-bit values) instead of the actual
values, significantly reducing the space requirements. Obviously, this would
only make the MCV lists useful for estimating equality conditions (assuming
the 32-bit hashes make collisions rare enough).

This might also complicate matching the columns to available stats.


TODO Consider implementing hashed MCV lists, storing just 32-bit hashes
     instead of the actual values. This type of MCV list will be useful only
     for estimating equality clauses, and will reduce space requirements for
     large varlena types (in such cases we usually only want equality anyway).


Inspecting the MCV list
-----------------------

Inspecting the regular (per-attribute) MCV lists is trivial, as it's enough
to select the columns from pg_stats. The data is encoded as anyarrays, and
all the items have the same data type, so anyarray provides a simple way to
get a text representation.

With multivariate MCV lists the columns may use different data types, making
it impossible to use anyarrays. It might be possible to produce a similar
array-like representation, but that would complicate further processing and
analysis of the MCV list.

So instead the MCV lists are stored in a custom data type (pg_mcv_list),
which however makes it more difficult to inspect the contents. To make that
easier, there's a SRF returning detailed information about the MCV lists.

    SELECT m.* FROM pg_statistic_ext s,
                    pg_statistic_ext_data d,
                    pg_mcv_list_items(stxdmcv) m
              WHERE s.stxname = 'stts2'
                AND d.stxoid = s.oid;

It accepts one parameter - a pg_mcv_list value (which can only be obtained
from the pg_statistic_ext_data catalog, to defend against malicious input),
and returns these columns:

    - item index (0, ..., (nitems-1))
    - values (string array)
    - nulls only (boolean array)
    - frequency (double precision)
    - base_frequency (double precision)
104