CPS 432/562 Homework #3 Solution Sketches
- (5+5+5=15 points) In addition to indices, commercial DBMSs provide
table scan operations -- these are operations
that are meant to read many blocks (pages)
in one stroke, provided the table is stored in contiguous locations.
In fact, one of the tricks employed by DBMS vendors is to store data pertaining
to one single relation on the same cylinder. Thus, if you are attempting
to read all data from a single relation, then you just have to position
the arm once for all reads from all the platters (the first
and one of the expensive parts of the access time --- the seek time ---
thus needs to be accounted for only once, in the beginning).
For example, Ingres can read nearly 10 pages in a scan and this
is sometimes beneficial (up to 10 times faster) to using a secondary index,
even (and particularly) if the index returns all the pages of the table.
Assume that a table has 2 million records and a query returns 75,000 records.
Assume, further, that a block contains 22 records. Is it beneficial to
have a secondary index? Why/why not? What if there are only 2 records
per block?
What guidelines can you provide for when to use secondary indices?
First, realize that since the question involves secondary indices, it implies a
dense index.
- If each block contains 22 records, we will require about 90,910 blocks.
Sine the
query returns 75,000 records, the probability that each block has `an
answer record' is high (> 82%). Assuming the table scan can read
10 pages per scan, it requires 9,091 disk accesses.
On the other hand,
use of the secondary index requires
random access to retrieve every record in the answer (i.e., 75,000
disk accesses). Therefore,
the table scan will use fewer disk accesses.
-
If each block has only 2 records, we will require about 1
million blocks of storage and 100,000 disk accesses.
The query, on the other hand, will thus access only 7.5% of the
total blocks. Hence, a secondary index will result in fewer
disk accesses (75,000).
Moral of the story: use a secondary index when the average query which uses the
index will retrieve far fewer records than there are blocks. Thus, you need
both a large number of records + high discrimination, to justify the
implementation of a secondary index.
- (1+2+2=5 points) Exercise 15.3.3 (parts a, b, c) on p. 737
from [DBCB]: Suppose B(R) = B(S) = 10,000.
What value of M would be necessary to compute R
S using the
nested-loop algorithm with no more than
a) 100,000, b) 25,000, and 15,000 disk I/O's?
- 10009/9
- 20003/3
- 20,001
- (10+10+10=30 points) Exercise 13.3.1 (parts b, c, and d) on p. 646
from [DBCB]:
Suppose that
blocks can hold either 10 records or 99 keys and 100 pointers. Also assume
that the average B-tree node is 70% full;
i.e., it will have 69 keys
and 70 pointers. We can use B-trees as part of several different structures.
For each structure defined below, determine (i) the total number
of blocks needed for a 1,000,000-record file, and (ii) the average
number of disk accesses to retrieve a record given its search key.
You may assume that nothing is in main memory initially, and that the
search key is the primary key for all the records.
- The data file is a sequential, unsorted file,
with 10 records per block. The B-tree is a dense index.
- 114,495 blocks =
100,000 (= 1,000,000/10, for data; @ lowest level) +
14,286 (= 1,000,000/70 @ leaf level) +
205 (= 14,286/70 @ pre-leaf level) + 3 (= 205/70 @ second level)
+ 1 (the root)
- 5
- The data file is a sequential file, sorted on the search key,
with 10 records per block. The B-tree is a sparse index.
- 101,451 blocks = 100,000 (= 1,000,000/10, for data; @ lowest level) +
1,429 (= 100,000/70 @ leaf level) + 21 (= 1,429/70 @ second level)
+ 1 (the root)
- 4
- Instead of the B-tree leaves having pointers to records, the B-tree
leaves hold the records themselves. A block can hold 10 records,
but on average, a leaf block is 70% full. i.e.,
there are 7 records per leaf block.
- 144,930 blocks =
142,858 (= 1,000,000/7, for data; @ leaf level) + 2,041 (= 142,858/70,
@ pre-leaf level) + 30 (= 2,041/70 @ second level) + 1 (the root)
- 4
- (10 points) Required only for CPS 562 students
We mentioned in class that the choice of an index,
to a certain extent, depends on (and influences) the physical organization of
data on secondary storage. Give an example of a physical organization
of data (i.e., something like sorted files, B-trees, or hash tables)
where a secondary index is needed even on the primary key!
Explain why.
An important goal of a DBMS data organization is to `try to organize blocks so
that if we use the maximum data inside each retrieved block' and avoid
unnecessary disk accesses. An example of such a data organization is a
`clustered file' where records belonging to two different relations are
actually clustered together based on some access patterns. For example, assume
that we require queries such as `Find the office number of the department where
Saverio is an instructor' and `Find all the instructors in the CPS department.'
Notice that both queries involve two relations, namely Departments and
Instructors. If we knew that such queries are common, we can take
advantage of this fact and put information pertaining to the departments and
instructors together. Thus, we could have a clustered file with interspersed
records where one department lists all instructors followed by the next
department and so on. The goal is that records that will be retrieved together
are `clustered' in the same block, so that the retrieved blocks are more likely
to contain all of the necessary data. Thus, we would have a secondary index on
`Instructor Name' even though it is the primary key for the Instructor
relation!
The other example is the heap
data structure which does not impose any
ordering on the individual records. In other words, records are
organized in
the order in which they arrived (presumably). Now, what use is such
a data structure? The only real advantage comes when you are reading the whole
relation (the table scan
operation, again!), in no particular order. Thus,
heaps provide both storage efficiency and fast table scans. Insertions are
easier than other data structures, because they do not involve any `searching'
(before inserting). Of course, they are pretty inefficient for searches or
deletions.
If your answer was hash tables, it is a moot-point whether they satisfy the
requirements of the question. It depends the particular hashing function and
your particular assumptions.
|