1Soft functional dependencies 2============================ 3 4Functional dependencies are a concept well described in relational theory, 5particularly in the definition of normalization and "normal forms". Wikipedia 6has a nice definition of a functional dependency [1]: 7 8 In a given table, an attribute Y is said to have a functional dependency 9 on a set of attributes X (written X -> Y) if and only if each X value is 10 associated with precisely one Y value. For example, in an "Employee" 11 table that includes the attributes "Employee ID" and "Employee Date of 12 Birth", the functional dependency 13 14 {Employee ID} -> {Employee Date of Birth} 15 16 would hold. It follows from the previous two sentences that each 17 {Employee ID} is associated with precisely one {Employee Date of Birth}. 18 19 [1] https://en.wikipedia.org/wiki/Functional_dependency 20 21In practical terms, functional dependencies mean that a value in one column 22determines values in some other column. Consider for example this trivial 23table with two integer columns: 24 25 CREATE TABLE t (a INT, b INT) 26 AS SELECT i, i/10 FROM generate_series(1,100000) s(i); 27 28Clearly, knowledge of the value in column 'a' is sufficient to determine the 29value in column 'b', as it's simply (a/10). A more practical example may be 30addresses, where the knowledge of a ZIP code (usually) determines city. Larger 31cities may have multiple ZIP codes, so the dependency can't be reversed. 32 33Many datasets might be normalized not to contain such dependencies, but often 34it's not practical for various reasons. In some cases, it's actually a conscious 35design choice to model the dataset in a denormalized way, either because of 36performance or to make querying easier. 37 38 39Soft dependencies 40----------------- 41 42Real-world data sets often contain data errors, either because of data entry 43mistakes (user mistyping the ZIP code) or perhaps issues in generating the 44data (e.g. a ZIP code mistakenly assigned to two cities in different states). 45 46A strict implementation would either ignore dependencies in such cases, 47rendering the approach mostly useless even for slightly noisy data sets, or 48result in sudden changes in behavior depending on minor differences between 49samples provided to ANALYZE. 50 51For this reason, extended statistics implement "soft" functional dependencies, 52associating each functional dependency with a degree of validity (a number 53between 0 and 1). This degree is then used to combine selectivities in a 54smooth manner. 55 56 57Mining dependencies (ANALYZE) 58----------------------------- 59 60The current algorithm is fairly simple - generate all possible functional 61dependencies, and for each one count the number of rows consistent with it. 62Then use the fraction of rows (supporting/total) as the degree. 63 64To count the rows consistent with the dependency (a => b): 65 66 (a) Sort the data lexicographically, i.e. first by 'a' then 'b'. 67 68 (b) For each group of rows with the same 'a' value, count the number of 69 distinct values in 'b'. 70 71 (c) If there's a single distinct value in 'b', the rows are consistent with 72 the functional dependency, otherwise they contradict it. 73 74 75Clause reduction (planner/optimizer) 76------------------------------------ 77 78Applying the functional dependencies is fairly simple: given a list of 79equality clauses, we compute selectivities of each clause and then use the 80degree to combine them using this formula 81 82 P(a=?,b=?) = P(a=?) * (d + (1-d) * P(b=?)) 83 84Where 'd' is the degree of functional dependency (a => b). 85 86With more than two equality clauses, this process happens recursively. For 87example for (a,b,c) we first use (a,b => c) to break the computation into 88 89 P(a=?,b=?,c=?) = P(a=?,b=?) * (e + (1-e) * P(c=?)) 90 91where 'e' is the degree of functional dependency (a,b => c); then we can 92apply (a=>b) the same way on P(a=?,b=?). 93 94 95Consistency of clauses 96---------------------- 97 98Functional dependencies only express general dependencies between columns, 99without referencing particular values. This assumes that the equality clauses 100are in fact consistent with the functional dependency, i.e. that given a 101dependency (a=>b), the value in (b=?) clause is the value determined by (a=?). 102If that's not the case, the clauses are "inconsistent" with the functional 103dependency and the result will be over-estimation. 104 105This may happen, for example, when using conditions on the ZIP code and city 106name with mismatching values (ZIP code for a different city), etc. In such a 107case, the result set will be empty, but we'll estimate the selectivity using 108the ZIP code condition. 109 110In this case, the default estimation based on AVIA principle happens to work 111better, but mostly by chance. 112 113This issue is the price for the simplicity of functional dependencies. If the 114application frequently constructs queries with clauses inconsistent with 115functional dependencies present in the data, the best solution is not to 116use functional dependencies, but one of the more complex types of statistics. 117