CPS 432/562 Homework #5 Solution Sketches


  1. (5+5=10 points) Exercise 18.6.1 (only parts b and c) on p. 963 from [DBCB]: Consider, for variety, an object-oriented database. The objects of class C are stored on two blocks, B1 and B2. Block B1 contains objects O1 and O2, while block B2 contains objects O3, O4, and O5. Class extents, blocks, and objects form a hierarchy of lockable database elements. Give the sequence of lock requests and the response of a warning-protocol-based scheduler to the following sequences of requests. You may assume all requests occur just before they are needed, and all unlocks occur at the end of the transaction.

    1. r1(O5); w2(O5); r2(O3); w1(O4);

      • isl1(C); isl1(B2); sl1(O5); r1(O5); ixl2(C); ixl2(B2); xl2(O5); w2(O5); isl2(C); isl2(B2); sl2(O3); r2(O3); ul2(O5); ul2(O3); ixl1(C); ixl1(B2); xl1(O4); w1(O4); ul1(O5); ul1(O4); where the scheduler would deny the xl2(O5); request because transaction #1 hold an S lock on element O5 at the time of the request.
    2. r1(O1); r1(O3); r2(O1); w2(O4); w2(O5);

      • isl1(C); isl1(B1); sl1(O1); r1(O1); isl1(B2); sl1(O3); r1(O3); ul1(O1); ul1(O3); isl2(C); isl2(B1); sl2(O1); r2(O1); ixl2(C); ixl2(B2); xl2(O4); w2(O4); xl2(O5); w2(O5); ul2(O4); ul2(O5);

  2. (10 points) Exercise 20.3.2 (only part c) on p. 1070 from [DBCB]: Consider the following relations:
    Computer(number, proc, speed, memory, hd)
    Monitors(number, screen, maxResX, maxResY)
    
    Understand that the attribute number of both relations to refer to the number of a complete system, some of whose attributes are found in one source and some in the other. Suppose that the adornments describing access to the Computers relation are buuuu, ubbff, and uuubb, while the adornments for Monitors are bfff and ubbb. For the following query, give 2-3 feasible plans (exclude any plans that are obviously more expensive than other plans on your list):

      Find all systems with a G4 processor running at 750 megahertz with 256 megabytes of memory, a 40 gigabyte disk, and a 17-inch monitor.
      • Query the Computers source using the ubbff adornment for machines with a G4, 750 MHz speed processor with 256Mb of memory and a 40Gb hard disk, then use the retrieved system numbers to query Monitors (using the bfff adornment) for each number, to see if the system has a 17" screen size.
      • Query the Computers source using the uuubb adornment for machines with with 256Mb of memory and a 40Gb hard disk, then use the retrieved system numbers to query Computers (using the buuuu adornment) for machines (with those numbers) with a G4, 750 MHz speed processor. Finally, use the retrieved system numbers to query Monitors (using the bfff adornment) for each number, to see if the system has a 17" screen size.

  3. (15 points) Describe a data management or search problem which involves integrating information from multiple web sources. For example, planning a vacation may require culling information from online flight and hotel reservation systems, car rental websites, and tourist guides. Filling your taxes online is another popular example, especially around this time of year. Here we would require documents detailing our income and benefits from employer HR websites, end of year statements from our banks and investment companies, and online filing forms from federal and state websites. Indicate which web sources you will use and how they satisfy your information needs. Next discuss which approach to information integration (federation, mediation, or warehousing) you would take and why. Be specific. Include advantages and disadvantages, and tradeoffs. Describe how you will extract information from the sources to answer queries. For example, will you build wrappers? If so, how sensitive will they be to the underlying sources, how will they be implemented, and how will they adapt? How will your wrappers or interfaces expose the underlying data and capabilities of the sources to other components. For instance, will they use relational schemas, flat files, or information hierarchies? Are any indices necessary? Finally, give 3 specific queries that your solution approach would accommodate and for one of them provide a feasible query plan.

      (one possible answer, courtesy Matt Mize)
      Consider the data management problem of comparing courses offered by distance-learning programs at various universities. I want to allow students, faculty, and staff at various institutions to be able to compare courses being offered, how often they are offered, and their price. The following are some of the sites that I would include:

      I would take a data warehousing approach where all of the data is updated on a nightly basis. I did not opt for a federated database as I would probably not have access to the database backends. Moreover, there is no guarantee that the sites are even driven from a database backend. I chose not to use a mediator as the data will be updated infrequently, e.g., typically one major update for each semester or term, and, thus, the overhead of a mediator would not be worth the benefit. I would update the warehouse nightly because the courses offered are not going to be modified frequently enough to warrant in situ updates, nor would I have a way of being notified when a site was updated. Furthermore, incremental updates would be too complex to track when parsing a webpage.

      To extract the data, I would develop a web crawler and use XML files, specifying what data to pull from what site and which HTML tags to seek to identify particular data, to configure it. For example, on the Miami site, the following XML stanza

      <abstract>
         <indicator type"text" count="1" value="Syllabus:"/>
         <value type="after" count="1" tag="p"/>
      </abstract>
      

      would specify that the web crawler should seek course syllabi from Miami of Ohio in the first paragraph tag after the first occurrence of the string `Syllabus:' on the page. On the Yale CME site, an abstract could be found with the following specification:
      <abstract>
         <indicator type="value" value="title"/>
         <value type="link" count="1">
            <indicator type="tag">
               <tag>td</tag>
               <attribute name="valign" value="TOP"/>
               <attribute name="bgcolor" value="#eeeeee"/>
            </indicator>
         </value>
      </abstract>
      

      This specification indicates that the abstract can be found from the course page by following the first link after the title. It also specifies that the crawler can find the abstract within the <td> tag with attributes that match valign="top" and bgcolor="#eeeeee" exactly on the liked page.

      An XML specification is useful because I would only need one web crawler to traverse all of the pages and, also, if the page layout changes, no code would need to be modified; only the XML configuration file would need to be updated. In addition, a series of tags could be used to indicate how to transform received data to be consistent with what is stored in the warehouse. While the XML configuration files would allow a high level of flexibility, they would also add complexity, especially in developing a schema for them which captures all possible data representations. Once I have scraped the necessary data from the sites, I would represent it in relations in the warehouse to allow it to be queried.

      We could run the following three queries:

      1. Retrieve the syllabi for all courses offered in the field of Religion.
      2. Which school offers the least expensive course in the field of Medicine?
      3. Which school offers the most online courses?

      To answer query one, I would:

      1. Cull the websites at night using the web-crawler and XML configuration files and store the syllabi in the warehouse. Also, the web crawler would store the field to which the courses relate.
      2. Display all of the courses and syllabi being offered in the field of Religion at any school represented in the warehouse.

  4. (15 points) Read one of the articles in the section on data warehouses in the December 2001 issue of IEEE Computer (if you are seeking access from a computer outside of the UD network, you made need to may need to authenticate yourself here) and write a 2 page (11pt Times font, single-spaced, 1" margins) analysis of it. Your analysis should contain about a page of text summarizing the article. Your summary should answer the following questions: What problem does the paper address? What solution approach was taken and why? What is the most central contribution to the database community made in the paper? In addition, your analysis should address the advantages and disadvantages of the solution, how it scales, and how it aligns with existing approaches. Since these articles are nearly 5 years old now, briefly discuss what new developments in the field of database systems have augmented/superseded this work. Lastly, identify and explain what new things you learned, beyond what we have covered class, by reading this article.

      (one summary and critique, courtesy Matt Mize)
      I read the article titled `Multidimensional Database Technology,' by T. B. Pedersen and C. S. Jensen. The article provided a brief introduction to the uses of multidimensional database technology, how multidimensional database technology works, and where it needs to go in the near future. The authors' approach was simply to discuss, at a medium-high level, multidimensional technology. The most central contribution to the database community provided by the article was to give people who might be unfamiliar with multidimensional database technology a first look at it.

      First, the authors discuss the uses and intended uses of the multidimensional data model. They specify that it should be used when the objective is to analyze data as opposed to running online transactions. Data should be treated as multidimensional when it can be divided into facts and measures. Facts represent the data, and measures represent how we might want to access that data. For example, a company might track their sales, where a purchase is a fact, and what was bought, where it was bought, and when it was bought represent measures.

      The authors then went on to explain that spreadsheets and relations are inadequate techniques for storing this type of data. Spreadsheets are inadequate because they tie the way in which data is stored too closely to the way data is presented. When the number of measures increase, spreadsheets no longer work. Relations fail because, although there is greater flexibility in data storage, it is difficult or impossible to formulate many of the queries that the end-user might consider most interesting, e.g., querying for the top-10 selling products.

      Once they present the reasons to use multidimensional databases, the authors discuss the structure of the data itself beginning with the data cube. The data cube is first not necessarily a cube. It can have more than three dimensions. Although when it begins to have more than 10 to 15 dimensions performance problems can arise. However, inside each cube are cells that correspond to an intersection of the different measures, also known as dimensions. At the intersection of the dimensions the fact that was mentioned earlier is stored. Using the example above, it becomes very easy to find out how many of a certain item was bought at a specific place at a specific time. The item type, where it was bought, and time it was bought all become dimensions of the cube. We can simply look at the intersection of those three dimensions to the desired information.

      After describing the layout of the cube, the authors expound on dimensions. Here, the most interesting observation is that in multidimensional databases, some redundancy is tolerable. This is primarily because the data in the cubes is derived from another source and that source typically eliminates redundancy on its own.

      After discussing dimensions, the authors move on to facts. They point out that in most multidimensional data models, facts are implicitly defined by the combination of the dimension values. The facts can be customized to certain levels of granularity. Typically, the smaller the granularity, the more detailed the report. However, performance can degrade with a large number of small facts.

      The authors discusses measures next. A measure has two components: a fact and a formula representing how that fact can be aggregated to provide higher-level data. The authors then discuss the three types of measures: additive, semiadditive, and nonadditive. An additive measure can be taken along any of the cube's dimensions. A semiadditive measure can only be taken along certain dimensions. A nonadditive measure cannot be taken along any of the cube's dimensions.

      Next, the authors discuss types of operations that can be performed on a multidimensional database. These operations include slice-and-dice (reduces the cube's dimensions), drill-down and roll-up (aggregates the dimensions), drill-across (combines several cubes that share a dimension), ranking (returns the top n or bottom n cells), and rotating (groups the data by a different dimension).

      The authors cover the two primary implementations of multidimensional databases: multidimensional online analytical processing (MOLAP) and relational online analytical processing (ROLAP). MOLAP uses specialized multidimensional structures while ROLAP uses traditional relational database technology.

      Next, the authors discuss how the time needed to answer queries can be reduced by storing aggregates in the multidimensional database and then constructing the answers to other queries using these pre-built aggregates (this is reminiscent of dynamic programming). They also discuss the difficulties that may arise as the data stored in the cubes becomes more complex. Finally, the authors conclude with a summary of the future directions of multidimensional technology.

      Since this article provides an overview of a technology rather than a solution to a specific problem, it is difficult to analyze the advantages and disadvantages of the approach. The article provided an effective overview of how multidimensional databases work. I think that authors could have included more discussion on the uses of multidimensional databases beyond OLAP.

      Again, since this article was more of an overview, it is difficult to address work that has superseded it. The authors did not discuss formal data cubes. They mentioned the importance of pre-computing aggregates to enable faster real-time query processing, but did not refer to these pre-aggregates and associated facts as a formal data cube.

      One concept that the authors included that we did not cover in class is that of measure. While I knew that we could aggregate the data in a cube along a dimension to provide a new view, I had never heard the term measure.

      Overall, I found the article to be satisfactory. It provided a nice overview of multidimensional database technology and the motivation for it. I do think, however, that had I not had some background from class, I would have had a difficult time understanding everything, especially since the examples were scarce and understanding this type of database structure without any real-world examples is challenging. However, given what I had learned in class, this article provided an interesting second perspective through which I can begin to develop a more comprehensive understand of how and why data cubes work.

  5. (10 points) Required only for CPS 562 students
    We studied the usefulness of an SIX mode for multiple-granularity locking. Explain why an exclusive and intension-shared (XIS) mode is of no use.

      (courtesy Tom Epps)
      An XIS lock is of no use because the X lock dominates the IS (as well as any other) lock. However, the S lock does not dominate the IX and the IX lock does not dominate the S, rendering the existence of the SIX lock useful. If a transaction has an X lock on an object, the transaction has exclusive access to that object (which includes all of its children). Thus, providing support for that transaction to declare its intention to read at a finer level of granularity would be pointless.


Return Home