CPS 432/562 Homework #5 Solution Sketches
- (5+5=10 points) Exercise 18.6.1 (only parts b and c) on p. 963 from [DBCB]:
Consider, for variety, an object-oriented database.
The objects of class C are stored on two blocks,
B1 and B2.
Block B1 contains objects O1 and
O2, while block B2 contains
objects O3, O4, and
O5. Class extents, blocks, and objects form a
hierarchy of lockable
database elements. Give the sequence of lock requests and the response
of a warning-protocol-based scheduler
to the following sequences of requests. You may assume all requests
occur just before they are needed, and all unlocks occur at the
end of the transaction.
- r1(O5); w2(O5);
r2(O3); w1(O4);
isl1(C); isl1(B2);
sl1(O5); r1(O5);
ixl2(C); ixl2(B2);
xl2(O5); w2(O5);
isl2(C); isl2(B2);
sl2(O3); r2(O3);
ul2(O5); ul2(O3);
ixl1(C); ixl1(B2);
xl1(O4); w1(O4);
ul1(O5); ul1(O4);
where the scheduler would deny the xl2(O5);
request because
transaction #1 hold an S lock on element O5
at the time of the request.
- r1(O1); r1(O3);
r2(O1); w2(O4);
w2(O5);
isl1(C); isl1(B1);
sl1(O1); r1(O1);
isl1(B2);
sl1(O3); r1(O3);
ul1(O1); ul1(O3);
isl2(C); isl2(B1);
sl2(O1); r2(O1);
ixl2(C); ixl2(B2);
xl2(O4); w2(O4);
xl2(O5); w2(O5);
ul2(O4); ul2(O5);
- (10 points) Exercise 20.3.2 (only part c) on p. 1070 from [DBCB]:
Consider the following relations:
Computer(number, proc, speed, memory, hd)
Monitors(number, screen, maxResX, maxResY)
Understand that the attribute number of both relations
to refer to the number of a complete system, some of whose
attributes are found in one source and some in the other. Suppose
that the adornments describing access to the Computers
relation are buuuu, ubbff, and uuubb, while
the adornments for Monitors are bfff and ubbb.
For the following query, give 2-3 feasible plans (exclude
any plans that are obviously more expensive than other
plans on your list):
Find all systems with a G4 processor running at 750 megahertz with 256
megabytes of memory, a 40 gigabyte disk, and a 17-inch monitor.
- Query the Computers
source using the ubbff adornment
for machines with a G4, 750 MHz speed processor
with 256Mb of memory and a 40Gb
hard disk, then use
the retrieved system numbers to query Monitors (using the
bfff adornment) for each number,
to see if the system has a 17" screen size.
- Query the Computers source using the uuubb adornment for
machines with with 256Mb of memory and a 40Gb hard disk, then use the retrieved
system numbers to query Computers (using the buuuu adornment)
for machines (with those numbers) with a G4, 750 MHz speed processor. Finally,
use the retrieved system numbers to query Monitors (using the
bfff adornment) for each number, to see if the system has a 17" screen
size.
- (15 points) Describe a data management or
search problem which involves integrating information
from multiple web sources. For example, planning a vacation may require culling
information from online flight and hotel reservation systems, car rental
websites, and tourist guides. Filling your taxes online is another popular
example, especially around this time of year. Here we would require documents
detailing our income and benefits from employer HR websites, end of year
statements from our banks and investment companies, and online filing forms from
federal and state websites. Indicate which web sources you will use and how
they satisfy your information needs. Next discuss which approach to
information integration (federation, mediation, or warehousing) you would take
and why. Be specific. Include advantages and disadvantages, and tradeoffs.
Describe how you will extract information from the sources to answer queries.
For example, will you build wrappers? If so, how sensitive will they be to the
underlying sources, how will they be implemented,
and how will they adapt? How will
your wrappers or interfaces expose the underlying data and capabilities
of the sources to other components. For instance,
will they use relational schemas, flat files,
or information hierarchies? Are any indices necessary?
Finally, give 3 specific queries
that your solution approach would accommodate and for one of them provide a
feasible query plan.
(one possible answer, courtesy Matt Mize)
Consider the
data management problem of comparing courses
offered by distance-learning programs at various universities. I want to allow
students, faculty, and staff at various institutions to be able to compare
courses being offered, how often they are offered, and their
price. The following are some of the sites that I would include:
I would take a data warehousing approach where all of the data is updated on a
nightly basis. I did not opt for a federated database as I would probably not
have access to the database backends. Moreover, there is no guarantee
that the sites are even driven from a database backend. I chose not to use a
mediator as the data will be updated infrequently, e.g., typically one major
update for each semester or term, and, thus,
the overhead of a mediator would not
be worth the benefit. I would update the warehouse nightly because the courses
offered are not going to be modified frequently enough to warrant in situ
updates, nor would I have a way of being notified when a site was updated.
Furthermore, incremental updates would be too complex to track when parsing a
webpage.
To extract the data, I would develop a web crawler and use
XML files, specifying what data to pull from
what site and which HTML tags to seek to identify
particular data, to configure it. For example, on the Miami site, the
following XML stanza
<abstract>
<indicator type"text" count="1" value="Syllabus:"/>
<value type="after" count="1" tag="p"/>
</abstract>
would specify that
the web crawler should seek course syllabi from
Miami of Ohio in the first paragraph tag after the first occurrence of
the string
`Syllabus:' on the page. On the Yale CME site, an abstract could
be found with the following specification:
<abstract>
<indicator type="value" value="title"/>
<value type="link" count="1">
<indicator type="tag">
<tag>td</tag>
<attribute name="valign" value="TOP"/>
<attribute name="bgcolor" value="#eeeeee"/>
</indicator>
</value>
</abstract>
This specification indicates that the abstract can be found from the course
page by following the first link after the title. It also specifies that the
crawler can find the abstract within the <td> tag with
attributes that match valign="top" and bgcolor="#eeeeee"
exactly on the liked page.
An XML specification is useful because I would only need one web crawler to
traverse all of the pages and, also, if the page layout changes, no code would
need to be modified; only the XML configuration file would need to be updated.
In addition, a series of tags could be used to indicate how to transform
received data to be consistent with what is stored in the warehouse. While the
XML configuration files would allow a high level of flexibility, they would
also add complexity, especially in developing a schema for them which captures
all possible data representations. Once I have scraped the necessary data from
the sites, I would represent it in relations in the warehouse to allow it to be
queried.
We could run the following three queries:
- Retrieve the syllabi for all courses offered in the field of Religion.
- Which school offers the least expensive course in the field of Medicine?
- Which school offers the most online courses?
To answer query one, I would:
- Cull the websites at night using the web-crawler
and XML configuration files and store the syllabi in the warehouse.
Also, the web crawler would store the field to which the courses relate.
- Display all of the courses and syllabi being offered in the field of Religion at any school represented in the warehouse.
- (15 points) Read one of the articles in the section on data warehouses in
the December 2001 issue of IEEE
Computer (if you are seeking access from a computer outside of the UD
network, you made need to may need to authenticate yourself here)
and write a 2 page (11pt Times
font, single-spaced, 1" margins) analysis of it. Your
analysis should contain about a page of text summarizing the article. Your
summary should answer the following questions: What problem does the paper
address? What solution approach was taken and why? What is the most central
contribution to the database community made in the paper? In addition, your
analysis should address the advantages and disadvantages of the solution, how it
scales, and how it aligns with existing approaches. Since these articles are
nearly 5 years old now, briefly discuss what new developments in the field of
database systems have augmented/superseded this work. Lastly, identify and
explain what new things you learned, beyond what we have covered class, by
reading this article.
(one summary and critique, courtesy Matt Mize)
I read the article titled `Multidimensional Database Technology,' by T. B.
Pedersen and C. S. Jensen. The article provided a brief introduction to the
uses of multidimensional database technology, how multidimensional database
technology works, and where it needs to go in the near future. The authors'
approach was simply to discuss, at a medium-high level, multidimensional
technology. The most central contribution to the database community provided
by the article was to give people who might be
unfamiliar with multidimensional
database technology a first look at it.
First, the authors discuss the uses and intended uses of the multidimensional
data model. They specify that it should be used when the objective is to
analyze data as opposed to running online transactions. Data should be treated
as multidimensional when it can be divided into facts and
measures. Facts represent the data, and measures represent how we might
want to access that data. For example, a company might track their sales,
where
a purchase is a fact, and what was bought, where it was bought, and when it was
bought represent measures.
The authors then went on to explain that spreadsheets and relations are
inadequate techniques for storing this type of data. Spreadsheets are
inadequate because they tie the way in which
data is stored too closely to the
way data is presented. When the number of measures increase, spreadsheets no
longer work. Relations fail because, although there is greater flexibility in
data storage, it is difficult or impossible to formulate many of the queries
that the end-user might consider most interesting, e.g., querying for the
top-10 selling products.
Once they present the reasons to use multidimensional databases, the authors
discuss the structure of the data itself beginning with the data cube. The
data cube is first not necessarily a cube. It can have more than three
dimensions. Although when it begins to have more than 10 to 15 dimensions
performance problems can arise. However, inside each cube are cells that
correspond to an intersection of the different measures, also known as
dimensions. At the intersection of the dimensions the fact that was
mentioned earlier is stored. Using the example above, it becomes very easy to
find out how many of a certain item was bought at a specific place at a
specific time. The item type, where it was bought, and time it was bought all
become dimensions of the cube. We can simply look at the intersection of
those three dimensions to the desired information.
After describing the layout of the cube, the authors expound on dimensions.
Here, the most interesting observation is that in multidimensional databases,
some redundancy is tolerable. This is primarily because the
data in the cubes is derived from another source and that source typically
eliminates redundancy on its own.
After discussing dimensions, the authors move on to facts. They point out that
in most multidimensional data models, facts are implicitly defined by the
combination of the dimension values. The facts can be customized to certain
levels of granularity. Typically, the smaller the granularity, the more
detailed the report. However, performance can degrade with
a large number of small facts.
The authors discusses measures next. A measure has two components: a
fact and a formula representing how that fact can be aggregated to provide
higher-level data. The authors then discuss the three types of measures:
additive, semiadditive, and nonadditive. An additive measure can be
taken along any of the cube's dimensions. A semiadditive measure can
only be taken along certain dimensions. A nonadditive measure cannot be
taken along any of the cube's dimensions.
Next, the authors discuss types of operations that can be performed on a
multidimensional database. These operations include slice-and-dice
(reduces the cube's dimensions), drill-down and roll-up
(aggregates the dimensions), drill-across (combines several cubes
that share a dimension), ranking (returns the top n or bottom
n cells), and rotating (groups the data by a different dimension).
The authors cover the two primary implementations of multidimensional
databases: multidimensional online analytical processing (MOLAP) and relational
online analytical processing (ROLAP). MOLAP uses specialized multidimensional
structures while ROLAP uses traditional relational database technology.
Next, the authors discuss how the time needed to answer queries can be reduced
by storing aggregates in the multidimensional database and then constructing the
answers to other queries using these pre-built aggregates (this is reminiscent
of dynamic programming). They also discuss the difficulties that may arise as
the data stored in the cubes becomes more complex. Finally, the authors
conclude with a summary of the future directions of multidimensional technology.
Since this article provides an overview of a technology rather than a solution
to a specific problem, it is difficult to analyze the advantages and
disadvantages of the approach. The article provided an
effective overview of how multidimensional databases work. I think that
authors could have included more discussion on the uses of multidimensional
databases beyond OLAP.
Again, since this article was more of an overview,
it is difficult to address work
that has superseded it. The authors did not discuss formal
data cubes. They mentioned the importance of pre-computing aggregates to
enable faster real-time query processing, but did not refer to these
pre-aggregates and associated facts as a formal data cube.
One concept that the authors included that we did not cover in class is that of
measure. While I knew that we could aggregate the data in a cube along
a dimension to provide a new view, I had never heard the term
measure.
Overall, I found the article to be satisfactory. It provided a nice overview
of multidimensional database
technology and the motivation for it. I do think, however, that had I
not had some background from class, I would have had a difficult time
understanding everything, especially since the examples were scarce and
understanding this type of database structure without any real-world examples
is challenging. However, given what I had learned in class, this article
provided an interesting second perspective through which I can begin to develop
a more comprehensive understand of how and why data cubes work.
- (10 points) Required only for CPS 562 students
We studied the usefulness of an SIX mode for multiple-granularity locking.
Explain why an exclusive and intension-shared (XIS) mode is of no use.
(courtesy Tom Epps)
An XIS lock is of no use because the X
lock dominates the IS (as well as any other) lock.
However, the S lock does not dominate the IX
and the IX lock does not dominate the S,
rendering the existence of the SIX lock useful. If a transaction
has an X lock on an object, the transaction has exclusive
access to that object (which includes all of its children).
Thus, providing support for that transaction to declare
its intention to read at a finer level of granularity
would be pointless.
|