February 8, 2006
Coverage: [DBCB] Chapter 13, pp. 605-638; Chapter 15, p. 736;
and [DBMS] Chapter 16, pp. 436-439, 457-461
Analysis of Block-based Nested-Loop Join
- outer loop runs for B(S)/(M-1) iterations
- each outer iteration reads M-1 blocks of S
and B(R) blocks of R
- leads to
B(S)/M-1(M-1 + B(R)) disk i/o's
- B(S)B(R)/M approximates this,
assuming M, B(S), and B(R) are large,
but M is smallest
- should outer loop go through the
smaller or larger of the two relations?
- comparison to the one-pass join algorithm
Review of Database Design
- requirements analysis
- conceptual design
- schema refinement -- normalization
- physical database design
- tuning encompasses the prior three
Indices
- an index is a data structure which takes a property of
records as input and returns quickly all records that satisfy that
property
- also dictates how the data is physically organized
- implementation choices of data structures
- pointers
- B-trees (actually a family of trees, e.g., B+tree is the particular
instance used most often)
- hash tables
Dense vs. Sparse Index
- dense index
- contains a pointer to each record
- structure is an exact replica of data on disk
- places no constraints on the physical organization of the data on disk
- inefficient storage
- fast access
- good for set and counting operations: MAX, MIN, AVG
- graphic depiction on attribute age
- we assumed each block contains two records
- sparse index
- one pointer to each block
- assumes data is sorted on disk
- graphic depiction on attribute age
- comparison to the index at the back of a book
- same concepts exist for b-trees and hash tables
- indices in commercial DBMS's
- Sybase: sparse
- IBM's DB2: dense
- ORACLE: dense or sparse, you can tune it
Primary vs. Secondary Index
- primary index
- puts constrains on data
- can use dense or sparse (intelligent) index
- secondary index
- does not assume anything about the organization of the data
- must use dense index
B-Trees
- all leaves reside at the same level
- a generalization of a binary search tree for secondary memory
- each node can have => 2 children
- each node = 1 disk block
- Example (courtesy [DBCB] example 13.19, p. 633):
- 1 block = 4k = 4096 bytes
- an int = 4 bytes
- a pointer to a block = 8 bytes
- 4n + 8(n+1) = 4096
- n = 340
- a typical b-tree does not extend more than 3 levels
- for a reasonably-sized db (~1mil records)
- 2 disk i/o's because root is usually in the main memory buffer
- F = fan-out: the #pointers per index block
- N = #primary data blocks
- height of tree is proportional to log_F (N)
- clearly important to maximize F to minimize h
- in the study of databases, complexity is usually measure in
disk assesses, not just time
- #disk i/o's needed to access a block = h+1
- where h is the height of the tree
- +1 for the actually block to be read
- +2 for insertion/deletion
- not time complexity
- time complexity: O(hn)
- assuming linear search within a block
- can improve this by using binary search within each block
- take-away: all the data structures you learn for main memory have analogs in
secondary memory
- these are all issues that the DBMS's storage manager must deal with
Database Tuning
- why tune?
- primarily to improve performance
- sources of poor performance
- imprecise data structures
- random vs. sequential disk access
- short bursts of database interaction
- delays due to multiple transactions
- consider workloads
- mix of queries and updates
- many forms:
- tune hardware
- tune OS
- tune data structures and indices
- not a day-to-day activity
- conducted, perhaps, every 6 months
- more of an art than science
- tuning tools:
- RDB Expert for ORACLE
- AutoAdmin for MS SQL Server
Options for Database Tuning
- tune schema (sometimes called schema evolution
- normalization: BCNF or 3NF?
- vertical partitioning
- horizontal partitioning
- taxpayers with AGI <= 50,000 and > 50,000
- use VIEW to combine partitions
- (maintainability of) storing aggregate values
- sometimes we accept a degree of redundancy
- Parts(id, sales, store) and PartsAggr(id, totalsales)
- consider insert, delete, and update actions
- SUM (easiest)
- AVG (requires O(1) storage)
- MAX (insert easy; delete and update difficult)
- index issues
- primary vs. secondary
- dense vs. sparse
- b-trees vs. hash tables
- place index on which attribute?
- when should we use a secondary index?
References
[DBCB] H. Garcia-Molina, J. D. Ullman, and J. Widom.
Database Systems: The Complete Book. Prentice Hall, 2002.
[DBMS] R. Ramakrishnan and J. Gehrke.
Database Management Systems. McGraw-Hill, Third Edition, 2002.