• Home
  • History
  • Annotate
Name Date Size #Lines LOC

..03-May-2022-

MakefileH A D08-Nov-2021451 185

READMEH A D08-Nov-20213.7 KiB9865

README.dependenciesH A D08-Nov-20214.9 KiB11781

dependencies.cH A D08-Nov-202130.5 KiB1,109578

extended_stats.cH A D08-Nov-202113.5 KiB523330

mvdistinct.cH A D08-Nov-202116.9 KiB692401

README

1Extended statistics
2===================
3
4When estimating various quantities (e.g. condition selectivities) the default
5approach relies on the assumption of independence. In practice that's often
6not true, resulting in estimation errors.
7
8Extended statistics track different types of dependencies between the columns,
9hopefully improving the estimates and producing better plans.
10
11
12Types of statistics
13-------------------
14
15There are currently two kinds of extended statistics:
16
17    (a) ndistinct coefficients
18
19    (b) soft functional dependencies (README.dependencies)
20
21
22Compatible clause types
23-----------------------
24
25Each type of statistics may be used to estimate some subset of clause types.
26
27    (a) functional dependencies - equality clauses (AND), possibly IS NULL
28
29Currently, only OpExprs in the form Var op Const, or Const op Var are
30supported, however it's feasible to expand the code later to also estimate the
31selectivities on clauses such as Var op Var.
32
33
34Complex clauses
35---------------
36
37We also support estimating more complex clauses - essentially AND/OR clauses
38with (Var op Const) as leaves, as long as all the referenced attributes are
39covered by a single statistics object.
40
41For example this condition
42
43    (a=1) AND ((b=2) OR ((c=3) AND (d=4)))
44
45may be estimated using statistics on (a,b,c,d). If we only have statistics on
46(b,c,d) we may estimate the second part, and estimate (a=1) using simple stats.
47
48If we only have statistics on (a,b,c) we can't apply it at all at this point,
49but it's worth pointing out clauselist_selectivity() works recursively and when
50handling the second part (the OR-clause), we'll be able to apply the statistics.
51
52Note: The multi-statistics estimation patch also makes it possible to pass some
53clauses as 'conditions' into the deeper parts of the expression tree.
54
55
56Selectivity estimation
57----------------------
58
59Throughout the planner clauselist_selectivity() still remains in charge of
60most selectivity estimate requests. clauselist_selectivity() can be instructed
61to try to make use of any extended statistics on the given RelOptInfo, which
62it will do if:
63
64    (a) An actual valid RelOptInfo was given. Join relations are passed in as
65        NULL, therefore are invalid.
66
67    (b) The relation given actually has any extended statistics defined which
68        are actually built.
69
70When the above conditions are met, clauselist_selectivity() first attempts to
71pass the clause list off to the extended statistics selectivity estimation
72function. This functions may not find any clauses which is can perform any
73estimations on. In such cases these clauses are simply ignored. When actual
74estimation work is performed in these functions they're expected to mark which
75clauses they've performed estimations for so that any other function
76performing estimations knows which clauses are to be skipped.
77
78Size of sample in ANALYZE
79-------------------------
80
81When performing ANALYZE, the number of rows to sample is determined as
82
83    (300 * statistics_target)
84
85That works reasonably well for statistics on individual columns, but perhaps
86it's not enough for extended statistics. Papers analyzing estimation errors
87all use samples proportional to the table (usually finding that 1-3% of the
88table is enough to build accurate stats).
89
90The requested accuracy (number of MCV items or histogram bins) should also
91be considered when determining the sample size, and in extended statistics
92those are not necessarily limited by statistics_target.
93
94This however merits further discussion, because collecting the sample is quite
95expensive and increasing it further would make ANALYZE even more painful.
96Judging by the experiments with the current implementation, the fixed size
97seems to work reasonably well for now, so we leave this as future work.
98

README.dependencies

1Soft functional dependencies
2============================
3
4Functional dependencies are a concept well described in relational theory,
5particularly in the definition of normalization and "normal forms". Wikipedia
6has a nice definition of a functional dependency [1]:
7
8    In a given table, an attribute Y is said to have a functional dependency
9    on a set of attributes X (written X -> Y) if and only if each X value is
10    associated with precisely one Y value. For example, in an "Employee"
11    table that includes the attributes "Employee ID" and "Employee Date of
12    Birth", the functional dependency
13
14        {Employee ID} -> {Employee Date of Birth}
15
16    would hold. It follows from the previous two sentences that each
17    {Employee ID} is associated with precisely one {Employee Date of Birth}.
18
19    [1] https://en.wikipedia.org/wiki/Functional_dependency
20
21In practical terms, functional dependencies mean that a value in one column
22determines values in some other column. Consider for example this trivial
23table with two integer columns:
24
25    CREATE TABLE t (a INT, b INT)
26        AS SELECT i, i/10 FROM generate_series(1,100000) s(i);
27
28Clearly, knowledge of the value in column 'a' is sufficient to determine the
29value in column 'b', as it's simply (a/10). A more practical example may be
30addresses, where the knowledge of a ZIP code (usually) determines city. Larger
31cities may have multiple ZIP codes, so the dependency can't be reversed.
32
33Many datasets might be normalized not to contain such dependencies, but often
34it's not practical for various reasons. In some cases, it's actually a conscious
35design choice to model the dataset in a denormalized way, either because of
36performance or to make querying easier.
37
38
39Soft dependencies
40-----------------
41
42Real-world data sets often contain data errors, either because of data entry
43mistakes (user mistyping the ZIP code) or perhaps issues in generating the
44data (e.g. a ZIP code mistakenly assigned to two cities in different states).
45
46A strict implementation would either ignore dependencies in such cases,
47rendering the approach mostly useless even for slightly noisy data sets, or
48result in sudden changes in behavior depending on minor differences between
49samples provided to ANALYZE.
50
51For this reason, extended statistics implement "soft" functional dependencies,
52associating each functional dependency with a degree of validity (a number
53between 0 and 1). This degree is then used to combine selectivities in a
54smooth manner.
55
56
57Mining dependencies (ANALYZE)
58-----------------------------
59
60The current algorithm is fairly simple - generate all possible functional
61dependencies, and for each one count the number of rows consistent with it.
62Then use the fraction of rows (supporting/total) as the degree.
63
64To count the rows consistent with the dependency (a => b):
65
66 (a) Sort the data lexicographically, i.e. first by 'a' then 'b'.
67
68 (b) For each group of rows with the same 'a' value, count the number of
69     distinct values in 'b'.
70
71 (c) If there's a single distinct value in 'b', the rows are consistent with
72     the functional dependency, otherwise they contradict it.
73
74
75Clause reduction (planner/optimizer)
76------------------------------------
77
78Applying the functional dependencies is fairly simple: given a list of
79equality clauses, we compute selectivities of each clause and then use the
80degree to combine them using this formula
81
82    P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?))
83
84Where 'd' is the degree of functional dependency (a => b).
85
86With more than two equality clauses, this process happens recursively. For
87example for (a,b,c) we first use (a,b => c) to break the computation into
88
89    P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?))
90
91where 'e' is the degree of functional dependency (a,b => c); then we can
92apply (a=>b) the same way on P(a=?,b=?).
93
94
95Consistency of clauses
96----------------------
97
98Functional dependencies only express general dependencies between columns,
99without referencing particular values. This assumes that the equality clauses
100are in fact consistent with the functional dependency, i.e. that given a
101dependency (a=>b), the value in (b=?) clause is the value determined by (a=?).
102If that's not the case, the clauses are "inconsistent" with the functional
103dependency and the result will be over-estimation.
104
105This may happen, for example, when using conditions on the ZIP code and city
106name with mismatching values (ZIP code for a different city), etc. In such a
107case, the result set will be empty, but we'll estimate the selectivity using
108the ZIP code condition.
109
110In this case, the default estimation based on AVIA principle happens to work
111better, but mostly by chance.
112
113This issue is the price for the simplicity of functional dependencies. If the
114application frequently constructs queries with clauses inconsistent with
115functional dependencies present in the data, the best solution is not to
116use functional dependencies, but one of the more complex types of statistics.
117