February 1, 2006
Coverage: [DBCB] Chapter 16, pp. 847-856
Study of Databases Unifies Core Areas of CS
- query languages: programming languages
- query processing and optimization: algorithms
- transaction management: operating systems
Join Algorithms
- e.g., one-pass join, hash-join, nested-loop join, index-join
- most are asymmetric
- convention: smaller relation stored in main memory;
larger read a block at a time
- build relation and probe relation
Join Trees
- join trees
- left-deep tree (all right children are leaves)
- right-deep tree (all left children are leaves)
- bushy tree (neither left- or right-deep)
- how many left-deep tree shapes exist for n relations?
- total #tree shapes given by the following recurrence relation:
-
T(1) = 1
T(n) =
T(i)T(n-i)
- intuition behind this recurrence relation
- how to go from #shapes to #trees?
- estimated #trees for 6 relations = 30,240
- clearly #left/right-deep trees grows much slower
than #bushy trees as n increases
Join Ordering
- approaches
- enumerate all possible trees
- consider only a subset of the total possible
- use a heuristic to guide the search
- we'll restrict our search to left-deep trees, why?
- query processor only searches through left-deep trees
- one-pass join algorithm
- on a left-deep tree,
requires memory for at most two temporary relations at a time
- B(R) + B(R
S)
main-memory buffers
- what about right-deep trees?
Algorithmic Paradigms
- divide and conquer
- greedy
- dynamic programming
- an approach to enumeration which involves computing and saving solutions
to sub-problems in an intelligent way so that they can be re-used (retrieved)
later to solve larger instances of the problem
- mathematicians usually mean table when they say programming
- e.g., dynamic programming, linear programming, non-linear programming
- algorithm design is more of an art than science
Dynamic Programming for Join Ordering
- construct a table with an entry for each subset of one or more of the
n relations
- each table entry is a 3-tuple
- estimated size of the join
- least cost of computing the join
- sum of the sizes of intermediate relations (excluding that for each
individual relation and the entire join)
- optimal logical plan
- let's restrict ourselves to left-deep trees
- will be reflected in the cost-estimate as well
- fill-in entries of the table bottom up (i.e., first size-1 subsets, then
size-2 subsets, ...,
size-n subset) and reuse table entries as you proceed along
References
[DBCB] H. Garcia-Molina, J. D. Ullman, and J. Widom.
Database Systems: The Complete Book. Prentice Hall, 2002.