March 22, 2006
Coverage:
[DBCB] Chapter 20, pp. 1064-1069
Query Plan Selection in Mediators
- problem: how can a mediator, with very little knowledge
of the sources in which it interacts, select an effective query plan?
- mediators cannot rely solely on cost-based measures
- we need a new approach
- more a question of plan feasibility vs. plan cost
Why are the Capabilities of Sources Limited?
- there are several reasons, most
are clustered around the following issues
- the legacy database problem
- still in existence for backwards capability
- many applications rely on these
- may not be a relational DBMS or a DBMS at all; could be
flat files
- may not be queried with SQL; some use COBOL
- may have to be queried in very specific ways
- security
- examples:
- for purposes of competition, Amazon
does not permit the following query
SELECT *
FROM Books
(note: Amazon actually uses models from information
retrieval)
- Amazon provides the sales rank of
every book, yet you cannot query for the
top-n books sold on a particular day
- for purposes of patient privacy, a medical site
may only expose aggregates of several
underlying patient attributes rather than
revealing individual data points
- some sites block robots, spiders
- consult robots.txt for the policy
- limited availability of indices
- without an index on author, it would be
extremely difficult (and slow) to ask for
all the books written by a particular author
- problems with banging on a website through
its web interface to harvest its data
(courtesy [DOW] p. 3)
- wrapper is brittle; minor HTML formatting changes
could break it
- more importantly, may need to download (and reconstruct)
entire
database to answer a query that might only need a few tuples
to answer
How Can a Source Describe its Capabilities?
- use a simple notation
- free
- bound
- unspecified
- choice[S], where S is a set of possible values
- optional[S]
- superscript any character with ' to suppress the output of the
associated attribute's value
- a string of these characters, one for each attribute,
is called an adornment
- a capabilities specification = a set of adornments
- query must match one of the adornments
- example (adapted from [DBCB] Example 20.10, p. 1066-1067):
- consider Books(isbn, author, title, soft, used)
- ways to query Books
- b'uuuu
- ubbo[yes,no]o[yes,no]
- can limit scope on values of author and title
attributes further as some combinations may be illegal
- an adornment may not be able to answer a query
- an adornment is sound, but may not be complete
Capability-based Query Plan Selection
- mediator first considers queries it can ask of its sources
- it asks those which can help answer the original query
- in answering those queries, additional attributes may become bound
- repeat this process until either
- all conditions of the original query are resolved meaning that
the query plan is feasible, or
- construction of no additional source queries are possible,
yet the original query still remains unanswered; this means the
query is impossible to answer
Example of Capability-based Query Plan Selection
- example (adapted from [DBCB] Example 20.11, p. 1068-1069)
- consider the following relations with their associated adornments:
Books(isbn, title, author) (ufb)
Options(isbn, option) (bu or uc[soft,used])
- consider a query for the isbns and titles of all used
books by Orwell
- the following are some query plans:
-
- query Books for the isbns and titles for
all books written by Orwell
- for each isbn from 1(a),
query Options (using its bu adornment)
for option and filter those that are not used
-
- query Options (using the uc[soft,used] adornment)
for the isbns of all used books
- do 1(a)
- take the intersection of the results of 2(a) and 2(b)
-
- do 2(a)
- using these isbns, query Books for those whose
author is Orwell
- plan #3 can be eliminated immediately because it is not feasible, why?
- therefore the mediator must make a choice among plans #1 and #2
- how can we evaluate which of these two plans are preferable?
- can we foster a marriage between the ideas of
cost-based estimation we studied earlier with capability-based
plan selection?
- plan #1 requires 1+n queries, where n
is the number of queries required for 1(b)
- plan #2 requires 2 queries plus an intersection operation
- this analysis stacks the cards in favor of plan #2
- however, consider that plan #2 results in a relation with no
titles
- which is not faithful to the original query
- would require additional queries to collect titles from Books
- this makes plan #1 appear more favorable
- moral of the story, incorporating cost-based approaches is
context-sensitive
- try to find some quantitative aspect
- here we use #sub-queries required
- mediator might need to do some long-term observation of
data sources to be more effective
- discern times of heavy and light loads
References
[DBCB] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems:
The Complete Book. Prentice Hall, 2002.
[DOW] S. Abiteboul, P. Buneman, and D. Suciu.
Data on the Web: From Relations to Semistructured Data and XML.
Morgan Kaufmann, 1999.