January 23, 2006
Coverage: [DBCB] Chapter 5, pp. 224-226 and Chapter 16, pp. 821-835
Warm-up
- how is
related to
?
- finished example from end of last class
- called an equijoin (!= theta-join)
Big Picture
- once we have arrived at the preferred logical query plan, we must
convert it to a physical query plan (PQP)
- consider several PQPs and estimate the cost of each
- called cost-based enumeration
- must select an order and grouping, algorithms, additional
operators, parameter passing mechanism
- finally, choose the PQP with the least cost and pass it to the query
execution engine
- estimation vs. execution
- #tuples in intermediate relations of a plan affect its cost
- #tuples is necessary; it is sufficient?
- desiderata
- accurate
- easy to compute
- logically consistent
Estimation Notation
- B(R) represents the #blocks required to store all of
the tuples of relation R
- T(R) represents the #tuples in R
- V(R, a) represents the #(distinct values)
for attribute a in relation R.
- V(R, [a_1, a_2, ...,
a_n]) considers the values in attributes a_1, a_2, ..., a_n
as a whole and equals
![]()
This metric is called the value count for attribute a
of relation R
How to Estimate Size of a Projection
- T(R) is computable
- is the size of the relation resulting from another operator computable?
if so, which?
- remember we are dealing with extended relational algebra
- under what circumstances, would the result grow?
Zipfian Distribution
- frequency of ith most common values are in proportion to
1/sqrt(i)
- e.g., if most common value appears 1000 times, then the 2nd
most common would be expected to occur 1000/sqrt(2) or 707 times, and so on
- common in many types of data
- relative frequencies of words in English sentences
- US state populations
- distribution of attributes irrelevant if constant is chosen randomly
- if constant chosen with a Zipfian distribution, then we would
expect T(S) to be larger than T(R)/V(R,a)
How to Estimate Size of a Selection
involving an inequality comparison
- half or a third?
- T(S) = T(R)/3
involving a `not equals'
- rare
- T(S) = T(R) or
- T(S) = T(R)(V(R,a)-1)/V(R,a) = T(R) - T(R)/V(R,a)
involving an AND
involving an OR
- might assume that no tuple satisfies both conditions
- can be an overestimate or even absurd
- how?
- T(S) = min(T(R), sum of each condition)
- assume C_1 and C_2 are independent
- T(S) = n(1-(1-(m_1/n))(1-(m_2/n)))
- intuition
- 1-(m_1/n) is the fraction of tuples that do not
satisfy C_1; likewise for C_2
- their product is fraction not in S
- 1 minus this fraction is the fraction in S
- less simple, more accurate
involving a NOT
- T(S) = T(R) - (those that satisfy C)
in summary, selectivity factor
- 1/V(R,A) for any attribute A compared to a constant
- 1/3 for any inequality
- 1 for `not equal'
Example (courtesy [DBCB] example 16.21, p. 824):
How to Estimate Size of a Natural Join with
Single Join Attribute
- assume R(X,Y)
S(Y,Z)
where Y is a single
attribute
- two simplifying assumptions
- containment of value sets
- preservation of value sets
References
[DBCB] H. Garcia-Molina, J. D. Ullman, and J. Widom.
Database Systems: The Complete Book. Prentice Hall, 2002.