April 19, 2006
Coverage:
[DBCB] Chapter 4, pp. 173-185 and Chapter 20, pp. 1099-1096
Review of Mining Association Rules
- 4 transactions:
| Tx# | Items |
| 1 | a, c, d |
| 2 | b, c, e |
| 3 | a, b, c, e |
| 4 | b, e |
- AIS algorithm (1993) adds a `d'
- these are the initials of the authors
- first is Agrawal
- a-priori algorithm does not need a `d'
- query to get F1
| Item set | Frequency |
| a | 2 |
| b | 3 |
| c | 3 |
| e | 3 |
- query F1, F1
to get C2
| Candidate Item set |
| a, b |
| a, c |
| a, e |
| b, c |
| b, e |
| c, e |
- query C2 to get F2
| Item set | Frequency |
| a, c | 2 |
| b, c | 2 |
| b, e | 3 |
| c, e | 2 |
- query F2, F2
to get C3
| Candidate Item set |
| a, b, c |
| a, b, e |
| a, c, e |
| b, c, e |
- query C3 to get F3
| Item set | Frequency |
| b, c, e | 2 |
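The level-wise queries above can be sketched in a few lines of Python. This is only an illustrative sketch, not the algorithm as published: the minimum support of 2 is an assumption inferred from the F1 table, and the helper names (`support`, `frequent`, `next_candidates`) are made up for this example. Candidates are formed as unions of two frequent item sets from the previous level, so the candidate sets it builds may be smaller than the ones listed in the tables, while the frequent item sets are unaffected.

```python
from itertools import combinations

# The four transactions from the table above.
transactions = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]
MIN_SUPPORT = 2  # assumed threshold; the notes do not state it explicitly


def support(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent(candidates):
    """Keep the candidates whose support meets the threshold."""
    counts = {c: support(c) for c in candidates}
    return {c: n for c, n in counts.items() if n >= MIN_SUPPORT}


def next_candidates(freq, k):
    """Join frequent (k-1)-item sets pairwise; keep unions of size k."""
    return {a | b for a, b in combinations(freq, 2) if len(a | b) == k}


items = set().union(*transactions)
levels = [frequent({frozenset([i]) for i in items})]  # F1
k = 2
while levels[-1]:
    levels.append(frequent(next_candidates(levels[-1], k)))
    k += 1
levels.pop()  # drop the trailing empty level

for size, level in enumerate(levels, start=1):
    for itemset, n in sorted(level.items(), key=lambda p: sorted(p[0])):
        print(f"F{size}: {sorted(itemset)} support={n}")
```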
Pruning Criterion for Confidence?
- can we develop a pruning criterion for
confidence like we did for support?
- no single search space for the entire set of association rules
as there is for frequent item sets
- one space per frequent item set
- if confidence(AB -> CD) = support(ABCD) / support(AB) = 60%,
- then confidence(ABC -> D) = support(ABCD) / support(ABC) >= 60%,
because support(ABC) <= support(AB)
- take any confident rule, move items to the left, and you
get another confident rule
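As a quick numeric check of this property, the sketch below reuses the four review transactions and compares a rule before and after moving one item to the left. The rule b -> ce is chosen only for illustration, and the function names are hypothetical.

```python
transactions = [
    {"a", "c", "d"},
    {"b", "c", "e"},
    {"a", "b", "c", "e"},
    {"b", "e"},
]


def support(items):
    """Number of transactions containing every item in items."""
    return sum(1 for t in transactions if set(items) <= t)


def confidence(lhs, rhs):
    """confidence(lhs -> rhs) = support(lhs + rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)


# Start from the rule b -> ce and move c from the right side to the left.
before = confidence("b", "ce")   # support(bce) / support(b)
after = confidence("bc", "e")    # support(bce) / support(bc)
print(before, after)
```

The numerator is the same in both rules; only the denominator shrinks (or stays equal), so the confidence cannot drop.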
Semistructured Data
- semistructured data has a loose, self-describing structure
and, in this sense, is schemaless
- consider how individuals present their contact information on their
webpage
- some will give cell number while others will not
- some will give home number while others will not
- some will give both cell and home numbers
- some will present the number as a string, e.g., `937-229-4079',
and others as an integer, e.g., 9372294079
- some will list address as (street, city, state, zip) while
others might use (city, state, zip), and still others might use
(street, building, city, state, zip)
- in short, data will vary in structure, size, and type
- schema is thus implicit in semistructured data, i.e., the
schema and the data are woven into each other
- the semistructured data model tries to capture these
impurities
- and thus, its hallmark is flexibility
- contrast with other data models, such as the relational, E/R, and
object-relational models
- other models trade flexibility for efficiency
- can create indices
- speed with which queries can be answered
- semistructured data approaches are popular in small-scale
systems where it is practical to trade efficiency for flexibility
- we don't have to look too far for an information system which
uses semistructured data models
- our good ole Lotus Notes system here at UD does!
- semistructured data can be used to integrate information
from heterogeneous sources
Semistructured Data Model
- semistructured data is typically
modeled by an edge-labeled, directed graph
- atomic data is represented in the leaves
- labeled edges describe
- attribute names, or
- relationships, or,
in other words, schema
- one of the first semistructured data models was called
the Object Exchange Model (OEM) and was developed
as part of the Tsimmis
(The Stanford-IBM Manager of Multiple Information Sources)
system at Stanford
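To make the graph view concrete, here is a minimal sketch that stores an edge-labeled structure as nested Python dictionaries: each key is an edge label, each value is a list of targets, and atomic data sits at the leaves. This is not OEM's actual syntax; the representation, the `paths` helper, and the two people are all invented for illustration, and the instance happens to be tree-shaped.

```python
# Each object is a dict from edge label to a list of targets; leaves hold atomic values.
# The people and values here are made up for illustration.
people = {
    "person": [
        {
            "name": ["Alice"],
            "phone": [{"cell": ["937-555-0100"]}],  # number stored as a string
            "address": [{"city": ["Dayton"], "state": ["OH"], "zip": ["45469"]}],
        },
        {
            "name": ["Bob"],
            "phone": [{"home": [9375550101]}, {"cell": [9375550102]}],  # integers
            "address": [{"street": ["Main St"], "building": ["B"],
                         "city": ["Dayton"], "state": ["OH"], "zip": ["45469"]}],
        },
    ]
}


def paths(node, prefix=()):
    """Yield (label path, atomic value) for every leaf reachable from node."""
    if isinstance(node, dict):
        for label, targets in node.items():
            for target in targets:
                yield from paths(target, prefix + (label,))
    else:
        yield prefix, node


for path, value in paths(people):
    print(".".join(path), "=", repr(value))
```

The two person objects differ in structure, size, and type, yet the same traversal handles both because the labels travel with the data.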
Example of Semistructured Data
Introduction to XML
XML (eXtensible Markup Language) is a W3C
(World Wide Web Consortium) initiative which complements HTML
to standardize data exchange on the web. XML is simply a data exchange format.
Like HTML, it is a text format derived from SGML (Standard Generalized
Markup Language).
XML was designed with the following simple objectives:
- to provide a standard for data representation on the web
- to facilitate the publication of electronic data with structure
- to provide a simple syntax which is both human and machine readable
While HTML was designed to describe the presentation of a document,
XML is designed to describe the content of a document.
Therefore, as opposed to HTML, which is a predefined markup language with
a fixed set of tags, XML is a language for designing markup languages or
in other words, a metalanguage (hence extensible).
Below you will notice that there is a close
relationship between XML and semistructured data;
XML is ideally suited to represent semistructured data.
HTML versus XML
The most salient difference between HTML and XML is that HTML describes
presentation and XML describes content. An HTML document
rendered in a web browser is human readable. XML aims to be
both human and machine readable.
Consider the following HTML.
<html>
<head><title>Books</title></head>
<body>
<h2>Books</h2>
<hr>
<em>Sense and Sensibility</em>, <b>Jane Austen</b>, 1811<br>
<em>Pride and Prejudice</em>, <b>Jane Austen</b>, 1813<br>
<em>Alice in Wonderland</em>, <b>Lewis Carroll</b>, 1866<br>
<em>Through the Looking Glass</em>, <b>Lewis Carroll</b>, 1872<br>
</body>
</html>
It is rendered in a browser as follows.
The HTML above describes how bibliography information is to be
presented and formatted for a human to view in a web browser. Knowing that
Sense and Sensibility is enclosed in italic tags does not however help
a program determine that it is the title of a book. XML attempts to
describe web data to address this void.
The following XML describes the contents of the books HTML page above.
<books>
<book>
<title>Sense and Sensibility</title>
<author>Jane Austen</author>
<year>1811</year>
</book>
<book>
<title>Pride and Prejudice</title>
<author>Jane Austen</author>
<year>1813</year>
</book>
<book>
<title>Alice in Wonderland</title>
<author>Lewis Carroll</author>
<year>1866</year>
</book>
<book>
<title>Through the Looking Glass</title>
<author>Lewis Carroll</author>
<year>1872</year>
</book>
</books>
A program parsing this data can take advantage of the fact that all book
titles are enclosed in <title> tags. Where would a program find such
information? An XML document may contain an optional description of its
grammar that describes which tags are used in the XML document and how such
tags can be nested. A grammar is a schema or road map for the XML document
and is necessary for a program to process the document automatically.
Originally, an XML grammar was specified in a DTD (Document Type Definition).
However, a newer standard, XML Schema (XSchema), has been adopted
to address some of the limitations of DTDs.
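As a small illustration of a program locating titles by tag name, the sketch below uses Python's standard `xml.etree.ElementTree` module on two of the book entries. It assumes the tag names are already known to the programmer; it does not read a DTD or schema.

```python
import xml.etree.ElementTree as ET

# Two of the <book> entries from the XML above.
BOOKS_XML = """
<books>
  <book>
    <title>Sense and Sensibility</title>
    <author>Jane Austen</author>
    <year>1811</year>
  </book>
  <book>
    <title>Pride and Prejudice</title>
    <author>Jane Austen</author>
    <year>1813</year>
  </book>
</books>
"""

root = ET.fromstring(BOOKS_XML)
# Every title is found by tag name, regardless of how it would be displayed.
titles = [book.findtext("title") for book in root.iter("book")]
print(titles)
```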
As can be seen above, XML does not contain any information indicating
how the document should be rendered in a browser. Thus, XML
factors data from presentation. The beauty of this feature is that
the same data can be presented in a variety of ways without having to
replicate any data, e.g., consider tailoring presentation
to multiple devices with different
output capabilities.
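One way to picture this separation is to parse the data once and render it twice. The sketch below is only an illustration; the function names `as_html` and `as_text` are hypothetical, and the two output formats are chosen arbitrarily.

```python
import xml.etree.ElementTree as ET
from html import escape

BOOK_XML = (
    "<books><book><title>Sense and Sensibility</title>"
    "<author>Jane Austen</author><year>1811</year></book></books>"
)
# Parse once into plain dictionaries keyed by tag name.
books = [
    {child.tag: child.text for child in book}
    for book in ET.fromstring(BOOK_XML).iter("book")
]


def as_html(books):
    """Rich presentation, mirroring the HTML page shown earlier."""
    return "".join(
        f"<em>{escape(b['title'])}</em>, <b>{escape(b['author'])}</b>, "
        f"{escape(b['year'])}<br>\n"
        for b in books
    )


def as_text(books):
    """Plain presentation for a device that cannot render markup."""
    return "\n".join(f"{b['title']} by {b['author']} ({b['year']})" for b in books)


print(as_html(books))
print(as_text(books))
```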
XML Syntax
All XML documents should begin with a preamble (the XML declaration), e.g.,
<?xml version="1.0" standalone="yes"?>.
The document's body follows.
The building blocks of XML documents are elements and attributes.
Each document has an all-encompassing or top-level element. This element
corresponds to the root of the semistructured data graph being modeled.
XML Elements
An element, the most basic XML component, is text surrounded by matching
tags, including the tags. An element contains content.
Content can be
- raw text, called PCDATA (parsed character data) in XML lingo, e.g.,
<title>Sense and Sensibility</title>
- other elements (sub-elements), e.g.,
<book>
<title>Sense and Sensibility</title>
<author>Jane Austen</author>
<year>1811</year>
</book>
- a mixture of the two, e.g.,
<book>
The book <title>Sense and Sensibility</title> is interesting.
</book>
XML also has an empty element.
When would one ever want to use an empty element?
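The kinds of content can be inspected with Python's standard `xml.etree.ElementTree` module, as in the sketch below. The empty `<book/>` element carrying a `title` attribute is an assumed example of one possible use: an element whose information lives entirely in its attributes.

```python
import xml.etree.ElementTree as ET

text_only = ET.fromstring("<title>Sense and Sensibility</title>")
nested = ET.fromstring(
    "<book><title>Sense and Sensibility</title><year>1811</year></book>"
)
mixed = ET.fromstring(
    "<book>The book <title>Sense and Sensibility</title> is interesting.</book>"
)
empty = ET.fromstring('<book title="Sense and Sensibility"/>')

print(text_only.text)                   # raw text (PCDATA)
print([child.tag for child in nested])  # sub-elements
# Mixed content: text before the first child is .text,
# text after a child is that child's .tail.
print(repr(mixed.text), repr(mixed[0].tail))
# An empty element has no text and no children but can still carry attributes.
print(empty.text, len(empty), empty.attrib)
```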
XML Attributes
Attributes are associated with elements and add more semantic information to
the element. They are defined like HTML attributes as (name, value) pairs.
Consider the following XML containing attributes.
<book>
<title language="English">Sense and Sensibility</title>
<author format="first last">Jane Austen</author>
<year style="yyyy">1811</year>
</book>
XML attributes reveal XML's origin as a document markup language in that
they introduce ambiguity in data exchange. One must decide whether to
represent information as elements or attributes. For example,
consider the following two alternatives.
<book>
<title>Sense and Sensibility</title>
<author>Jane Austen</author>
<year>1811</year>
</book>
<book title="Sense and Sensibility" author="Jane Austen" year="1811"/>
Tags versus Attributes
| Tags | Attributes |
| An instance can be repeated within a tag | An instance cannot be repeated within an attribute |
| Contents can be one of three types (raw, subelements, or mixture) | Associated value must be a string |
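The practical difference shows up in how a program reads the two alternatives. The sketch below uses Python's standard `xml.etree.ElementTree`; the second author added at the end is hypothetical and exists only to show that a sub-element can repeat.

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    "<book><title>Sense and Sensibility</title>"
    "<author>Jane Austen</author><year>1811</year></book>"
)
as_attributes = ET.fromstring(
    '<book title="Sense and Sensibility" author="Jane Austen" year="1811"/>'
)

# Same information, two access paths.
print(as_elements.findtext("author"))
print(as_attributes.get("author"))

# A sub-element can repeat; an attribute name can appear only once per tag.
ET.SubElement(as_elements, "author").text = "A. Second Author"  # hypothetical co-author
print([a.text for a in as_elements.findall("author")])
```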
Well-formed vs. Valid XML
We have presented few constraints on XML thus far. To be well-formed,
an XML document must be properly nested and the attributes within each tag
must be unique.
Well-formedness only ensures that the document will parse into a labeled
tree; it does not imply that the document conforms to its grammar.
A well-formed document that does conform to its grammar is called valid XML.
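A parser can check well-formedness without any grammar. The sketch below uses Python's standard `xml.etree.ElementTree`, which reports a `ParseError` for improperly nested tags and for repeated attributes; the helper name `is_well_formed` is made up for this example. Checking validity would additionally require a DTD or schema and a validating parser, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


properly_nested = "<book><title>Sense and Sensibility</title></book>"
overlapping_tags = "<book><title>Sense and Sensibility</book></title>"
repeated_attribute = '<book year="1811" year="1813"/>'

for sample in (properly_nested, overlapping_tags, repeated_attribute):
    print(is_well_formed(sample), sample)
```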
Other XML Constructs
How XML syntax differs from HTML
- new tags may be defined at will
- tags may be nested to arbitrary depth
- a document may contain an optional description of its grammar
Representing the Instructor-Course Database in XML
<?xml version="1.0" standalone="yes"?>
<instructor-course-data>
<instructor><name>Buckley</name><bldg>AN</bldg><no>135</no></instructor>
<instructor>
<name>Perugini</name>
<office><bldg>AN</bldg><no>145</no></office>
<address><street>Patterson</street><city>Dayton</city></address>
</instructor>
<course>
<dept>CPS</dept><no>430</no><credits>3</credits>
</course>
</instructor-course-data>
- how can we represent who teaches what?
<instructor>
<name>Perugini</name>
<office><bldg>AN</bldg><no>145</no></office>
<address><street>Patterson</street><city>Dayton</city></address>
<teaches>
<course><dept>CPS</dept><no>430</no><credits>3</credits></course>
<course><dept>CPS</dept><no>444</no><credits>3</credits></course>
</teaches>
</instructor>
- but this will lead to redundancy, right?
- mismatch: models for semistructured data are
graph-based while an XML document has a tree structure
Using ID's and IDREF's
Use attributes to help represent semistructured data that
does not have tree form.
<instructor-course-data>
<instructor iid="jb" teaches="430, 432">
<name>Buckley</name><bldg>AN</bldg><no>135</no>
</instructor>
<instructor iid="sp" teaches="430, 432, 444, 445">
<name>Perugini</name>
<office><bldg>AN</bldg><no>145</no></office>
<address><street>Patterson</street><city>Dayton</city></address>
</instructor>
<course cid="430" taught_by="jb, sp">
<dept>CPS</dept><no>430</no><credits>3</credits>
</course>
</instructor-course-data>
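A program can follow these references by first indexing elements on their identifying attributes. The sketch below is one possible approach using Python's standard `xml.etree.ElementTree`; it treats `iid` and `cid` as plain attributes, splits the comma-separated lists exactly as written above, and the helper name `refs` is hypothetical. It does not consult a DTD, so nothing here enforces ID/IDREF typing.

```python
import xml.etree.ElementTree as ET

ICD_XML = """
<instructor-course-data>
  <instructor iid="jb" teaches="430, 432">
    <name>Buckley</name><bldg>AN</bldg><no>135</no>
  </instructor>
  <instructor iid="sp" teaches="430, 432, 444, 445">
    <name>Perugini</name>
    <office><bldg>AN</bldg><no>145</no></office>
    <address><street>Patterson</street><city>Dayton</city></address>
  </instructor>
  <course cid="430" taught_by="jb, sp">
    <dept>CPS</dept><no>430</no><credits>3</credits>
  </course>
</instructor-course-data>
"""

root = ET.fromstring(ICD_XML)
# Index each kind of object by its identifying attribute.
instructors = {i.get("iid"): i for i in root.iter("instructor")}
courses = {c.get("cid"): c for c in root.iter("course")}


def refs(element, attr):
    """Split a comma-separated reference list into individual ids."""
    return [r.strip() for r in element.get(attr, "").split(",") if r.strip()]


# Follow taught_by references from course 430 back to instructor names.
taught_by = [instructors[r].findtext("name") for r in refs(courses["430"], "taught_by")]
# References with no matching <course> element in this fragment stay unresolved.
unresolved = [r for r in refs(instructors["sp"], "teaches") if r not in courses]
print(taught_by, unresolved)
```

Each course is stored once and shared by reference, which avoids the replication of the nested version at the cost of an extra lookup step.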
Files Used In Class
- books.xml (books data)
- icd.xml (instructor and course data)
- icd2.xml (captures who teaches what using
replication)
- icd3.xml (captures who teaches what
using ID's and IDREF's)
References
| [AIS] | R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 207-216, 1993. |
| [DBCB] | H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The Complete Book. Prentice Hall, 2002. |
| [DMXD] | S. S. Chawathe. Describing and Manipulating XML Data. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, Vol. 22, No. 3, pp. 3-9, September 1999. |