April 24, 2006
Coverage:
[DBCB] Chapter 4, pp. 180-185 and other references listed below
Demo of Scheme Implementation of the A-Priori Algorithm for
Mining Association Rules
- primarily involves two functions:
- getCk: joins Fk-1 with itself to compute the candidate item sets of size k
- getFk: queries Ck to compute the frequent item sets of size k
- Full source code is available here
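The two functions can be sketched in Python as follows. This is a minimal illustration and not the Scheme code demonstrated in class: the names get_ck, get_fk, and apriori, the representation of item sets as frozensets, and the sample baskets are all assumptions made for this sketch.

```python
from itertools import combinations

def get_ck(f_prev, k):
    """Join F(k-1) with itself to build the candidate item sets of size k."""
    candidates = set()
    for a in f_prev:
        for b in f_prev:
            union = a | b
            if len(union) != k:
                continue
            # prune: every (k-1)-subset of a candidate must itself be frequent
            if all(frozenset(s) in f_prev for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def get_fk(ck, baskets, min_support):
    """Keep the candidates that are contained in at least min_support baskets."""
    return {c for c in ck if sum(1 for b in baskets if c <= b) >= min_support}

def apriori(baskets, min_support):
    """Collect the frequent item sets level by level until a level is empty."""
    singletons = {frozenset([item]) for b in baskets for item in b}
    fk = get_fk(singletons, baskets, min_support)
    frequent = set(fk)
    k = 2
    while fk:
        fk = get_fk(get_ck(fk, k), baskets, min_support)
        frequent |= fk
        k += 1
    return frequent

# Hypothetical market baskets, used only to exercise the functions.
baskets = [frozenset(b) for b in (
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
)]
result = apriori(baskets, min_support=3)
```

With a support threshold of 3, every single item is frequent but {bread, milk} is the only frequent pair, so no candidate of size 3 survives the join and the loop stops.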
DTDs
- a DTD (Document Type Definition) is a grammar for an XML document
- * (0 or more), + (one or more), ? (optional), and | (or) are DTD metacharacters
- DTD for instructor-course data
<!ELEMENT instructor-course-data (instructor | course)*>
<!ELEMENT instructor (name, ((office, address+) | (bldg, no)))>
<!ATTLIST instructor
iid ID #IMPLIED
teaches IDREFS #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT office (bldg, no)>
<!ELEMENT address (street?, city, state, zip?)>
<!ELEMENT street (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT state (#PCDATA)>
<!ELEMENT zip (#PCDATA)>
<!ELEMENT course (dept, no)>
<!ATTLIST course
cid ID #IMPLIED
taught-by IDREFS #IMPLIED>
<!ELEMENT dept (#PCDATA)>
<!ELEMENT no (#PCDATA)>
- use a comma to specify a sequence
- e.g., instructor and course elements may appear in any order in
instructor-course-data, while no must follow bldg in office
- a DTD can either be inlined into an XML document or factored into
a separate file
- notice that a DTD is more expressive, and thus more flexible,
than a relational database schema
- to keep the DTD in a separate file, use the
following preamble to your XML document:
<?xml version="1.0"?>
<!DOCTYPE instructor-course-data SYSTEM "icd.dtd">
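The content models in the DTD above are regular expressions over child element names, so one way to see what the comma and the metacharacters mean is to translate them by hand into Python regular expressions. The sketch below does that. The encoding of each child name as a space-terminated token, the function names children_ok and conforms, and the sample documents are assumptions of this sketch. It ignores the ATTLIST declarations, and a real application would rely on a validating XML parser instead.

```python
import re
import xml.etree.ElementTree as ET

# Content models from the DTD, hand-translated to regular expressions over
# space-terminated child names (this encoding is an assumption of the sketch).
CONTENT_MODELS = {
    "instructor-course-data": r"((instructor |course ))*",
    "instructor": r"name ((office (address )+)|(bldg no ))",
    "office": r"bldg no ",
    "address": r"(street )?city state (zip )?",
    "course": r"dept no ",
}

def children_ok(element):
    """Check one element's sequence of children against its content model."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # treated as a (#PCDATA) element: no element children allowed
        return len(element) == 0
    child_names = "".join(child.tag + " " for child in element)
    return re.fullmatch(model, child_names) is not None

def conforms(xml_text):
    """True if every element in the document satisfies its content model."""
    root = ET.fromstring(xml_text)
    return all(children_ok(e) for e in root.iter())

# Hypothetical documents: course before instructor is fine, no before bldg is not.
GOOD = ("<instructor-course-data>"
        "<course><dept>D</dept><no>1</no></course>"
        "<instructor><name>N</name><bldg>B</bldg><no>2</no></instructor>"
        "</instructor-course-data>")
BAD = ("<instructor-course-data><instructor><name>N</name>"
       "<office><no>2</no><bldg>B</bldg></office>"
       "<address><city>C</city><state>S</state></address>"
       "</instructor></instructor-course-data>")
```

Here conforms(GOOD) holds because the (instructor | course)* model accepts the two elements in either order, while conforms(BAD) fails because office requires bldg before no.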
XSchema
- XSchema (XML Schema) is a new standard for specifying XML document grammars
- it addresses some of the limitations of DTDs
- in particular, XSchema
- has support for data types, and
- uses XML syntax (why would this be helpful?)
- XSchema for instructor-course data
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.w3schools.com"
           xmlns="http://www.w3schools.com"
           elementFormDefault="qualified">
  <xs:element name="instructor-course-data">
    <xs:complexType>
      <!-- any number of instructor and course elements, in any order -->
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="instructor">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
              <xs:element name="office">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="bldg" type="xs:string"/>
                    <xs:element name="no" type="xs:integer"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
              <xs:element name="address">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="street" type="xs:string"/>
                    <xs:element name="city" type="xs:string"/>
                    <xs:element name="state" type="xs:string"/>
                    <xs:element name="zip" type="xs:integer"/>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <!-- attributes are declared inside xs:complexType, after the content model -->
            <xs:attribute name="iid" type="xs:integer"/>
            <xs:attribute name="teaches" type="xs:string"/>
          </xs:complexType>
        </xs:element>
        <xs:element name="course">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="dept" type="xs:string"/>
              <xs:element name="no" type="xs:integer"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
Yacc for XML?
- DTDs and XSchemas are grammars for XML documents
- we could develop a program generator which
takes a grammar as input and automatically produces a program to
process XML documents conforming to that grammar
- since an XSchema is an XML document, we can develop our
program generator as an XSLT stylesheet and use an XSLT
processor to facilitate the program generation
- XSLT to come ....
- program generation is a powerful idea (and abstraction)
which applies to many areas of computer science
- akin to YACC
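The program-generation idea can be sketched in Python rather than XSLT. Because an XSchema is itself an XML document, an ordinary XML parser can read it, and a short script can emit source code for a reader of conforming documents. Everything named here is an assumption of the sketch: the function generate_classes, the one-element sample schema, the layout of the generated class, and the sample values. It also assumes element names are valid Python identifiers and handles only flat sequences of simply typed children.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

SCHEMA = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="course">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="dept" type="xs:string"/>
        <xs:element name="no" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Mapping from schema data types to Python constructors (only two are covered).
PY_TYPES = {"xs:string": "str", "xs:integer": "int"}

def generate_classes(schema_text):
    """Emit Python source: one reader class per element with a complex type."""
    lines = ["import xml.etree.ElementTree as ET", ""]
    root = ET.fromstring(schema_text)
    for element in root.iter(XS + "element"):
        sequence = element.find(XS + "complexType/" + XS + "sequence")
        if sequence is None:
            continue  # simply typed element: becomes a field, not a class
        fields = [(f.get("name"), PY_TYPES.get(f.get("type"), "str"))
                  for f in sequence.findall(XS + "element")]
        lines.append(f"class {element.get('name').title()}:")
        lines.append("    @classmethod")
        lines.append("    def from_xml(cls, text):")
        lines.append("        node, obj = ET.fromstring(text), cls()")
        for name, py_type in fields:
            lines.append(f"        obj.{name} = {py_type}(node.findtext('{name}'))")
        lines.append("        return obj")
        lines.append("")
    return "\n".join(lines)

source = generate_classes(SCHEMA)      # the generated program, as text
namespace = {}
exec(source, namespace)                # load the generated program
course = namespace["Course"].from_xml("<course><dept>D</dept><no>7</no></course>")
```

The generated Course class converts no to an integer because the schema declares it as xs:integer, which is the kind of information a DTD cannot supply to a generator.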
Mining for Schema
- another application of data mining
- mining for schema in semistructured data [NAM]
- given a set of XML documents, mine for
a DTD or XSchema
- XTRACT is a system developed at Bell Laboratories
which mines a DTD from a set of XML documents [XTRACT]
- given n mined schemas, how can we evaluate which
is best?
- develop metrics like excess and deficit [NAM]
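To make the mining task concrete, here is a deliberately naive baseline in Python. It is not the algorithm used by XTRACT or by [NAM]: it simply records every sequence of children observed under each element and lists the sequences as alternatives. The function name mine_dtd and the sample documents are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def mine_dtd(documents):
    """Return naive DTD declarations covering exactly the observed child sequences."""
    observed = defaultdict(set)        # element name -> set of child-name tuples
    for text in documents:
        for element in ET.fromstring(text).iter():
            observed[element.tag].add(tuple(child.tag for child in element))
    declarations = []
    for name in sorted(observed):
        sequences = sorted(observed[name])
        nonempty = [s for s in sequences if s]
        if not nonempty:
            model = "(#PCDATA)"        # never seen with element children
        else:
            model = "(" + " | ".join("(" + ", ".join(s) + ")" for s in nonempty) + ")"
            if len(nonempty) < len(sequences):
                model += "?"           # some occurrences had no children at all
        declarations.append(f"<!ELEMENT {name} {model}>")
    return declarations

# Hypothetical input documents.
documents = [
    "<address><city>C</city><state>S</state></address>",
    "<address><street>X</street><city>C</city><state>S</state><zip>1</zip></address>",
]
mined = mine_dtd(documents)
```

For these two documents the address declaration comes out as ((city, state) | (street, city, state, zip)), which accepts only what was seen. A miner that generalizes might instead propose (street?, city, state, zip?), which also accepts documents that were never observed; choosing between such candidates is the evaluation problem raised above.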
XML Suite of Technologies and Tools
XML has grown from simple text markup for data interchange
into a mature technology with a rich suite of associated tools.
While these tools are beyond the scope of this lecture, below we provide
references to a few and encourage you to explore them.
- SAX and DOM
- SAX (Simple API for XML) is an event-driven standard for parsing XML data
- it provides a simple, easy-to-use API for processing XML
- DOM (Document Object Model) is also an API for XML
- DOM is object-oriented; it reads the entire XML document and
constructs a parse tree from it
- Cascading Stylesheets (CSS)
- XSL (eXtensible Stylesheet Language) is a language for expressing stylesheets
- it consists of three parts:
- XSLT (XSL Transformations)
- XPath (the XML Path language)
- XSL FO (XSL Formatting Objects)
- RDF (Resource Description Framework) is a proposal for representing
metadata in XML
- it has its own data model and syntax
- a way to make declarations about a resource and connect
those declarations to derive meaning
- has been a key ingredient in the W3C's vision of the semantic web
- the structure and content of the Open Directory Project, a popular
human-compiled directory of the web, are described in RDF
- what is RDF?
- XHTML (Extensible HyperText Markup Language)
- HTML is not XML
- while all tags in XML must be closed, some tags in HTML need
not be closed (e.g., <br>, <li>, or <p>); furthermore, HTML is not
case-sensitive, while XML is case-sensitive
- a W3C initiative for HTML in XML
- a language which describes document presentation in XML
- XML Tools
- XML Editors
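The difference between the SAX and DOM styles listed above can be sketched with the parsers in Python's standard library (xml.sax and xml.dom.minidom). The sample document and the BookCounter handler are assumptions made for this illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document, used only for illustration.
DOCUMENT = "<books><book>A</book><book>B</book></books>"

class BookCounter(xml.sax.ContentHandler):
    """SAX style: the parser calls back on each start tag; no tree is kept."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "book":
            self.count += 1

handler = BookCounter()
xml.sax.parseString(DOCUMENT.encode("utf-8"), handler)
sax_count = handler.count

# DOM style: the whole document is read into a tree, which is then navigated.
tree = minidom.parseString(DOCUMENT)
dom_titles = [node.firstChild.data for node in tree.getElementsByTagName("book")]
```

The SAX handler sees the document as a stream and keeps only a counter, so memory use does not grow with document size. The DOM version holds the full tree in memory, which makes it convenient to revisit or modify any node.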
Cascading Stylesheets (CSS)
XML is not very useful in isolation. To be viewed in a browser, an XML
document needs to be converted into HTML or otherwise given a presentation.
There are two methods for this: CSS (Cascading Style Sheets) and
XSL (eXtensible Stylesheet Language).
CSS does not actually perform an explicit conversion from XML to HTML;
rather, it is a lightweight method which associates each element with a
style of presentation. Click here
for a nice page on the differences between CSS and XSLT.
The following CSS code specifies how a browser
should render books.xml
(courtesy O'Reilly Mozilla DevCenter).
books {
  display: block;
  height: 200px;
  width: 280px;
  border: 1px solid #000;
  overflow: auto;
  background-color: #eee;
  font: 12px verdana;
}
book {
  display: block;
  padding: 10px;
  margin-bottom: 10px;
  border-top: 1px solid #ccc;
  border-bottom: 1px solid #ccc;
  background-color: #fff;
}
CSS permits multiple presentations for the same document.
To associate an XML document with a CSS stylesheet,
use the following processing instruction after the <?xml version="1.0"?>
declaration in the XML document.
<?xml-stylesheet href="books.css" type="text/css"?>
To associate an HTML document with a CSS, use the following element
within the <head> element of the HTML document:
<link rel="stylesheet" href="homepage.css" type="text/css">
CSS can now be used to do some pretty slick rendering.
Illustration of HTML list where list items appear to be
images.
Simple Example of Adding a CSS to HTML
homepage.css
body {
  background-color: #E0F7F0;
}
h1 {
  font-weight: bold;
  text-align: left;
  color: black;
  font-family: "Verdana", Arial, sans-serif;
}
/* unvisited links */
a:link {
  color: #CC9933;
  font-size: medium;
  font-family: Verdana, Arial, Helvetica, sans-serif;
}
index.html
<html>
  <head>
    <title>A simple webpage</title>
    <link rel="stylesheet" href="homepage.css" type="text/css">
  </head>
  <body>
    <h1>Header 1</h1>
    <a href="http://espn.com">ESPN.com</a>
  </body>
</html>
Files Used In Class
- icd.dtd (DTD for instructor-course data)
- icd4.xml (instructor-course data with reference to its DTD)
- icd.xsd (XSchema for instructor-course data)
- icd5.xml (instructor-course data with reference to its XSchema)
- books.css (CSS for books XML data)
- books.xml (books data with reference to a CSS)
- homepage.css (CSS for simple webpage)
- index.html (simple webpage with reference to a CSS)
References
[CSS] S. Callihan. Cascading Style Sheets (CSS) by Example. Que, 2001.
[DBCB] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems:
The Complete Book. Prentice Hall, 2002.
[DMXD] S. S. Chawathe. Describing and Manipulating XML Data.
Bulletin of the IEEE Computer Society Technical Committee on Data Engineering,
Vol. 22, No. 3, pp. 3-9, September 1999.
[DOM] J. Marini. The Document Object Model: Processing Structured Documents.
Osborne/McGraw-Hill, 2002.
[LXML] E. T. Ray. Learning XML. O'Reilly, Second edition, 2003.
[NAM] S. Nestorov, S. Abiteboul, and R. Motwani. Extracting Schema from
Semistructured Data. In Proceedings of the ACM International Conference on
Management of Data (SIGMOD), pp. 295-306, 1998.
[XTRACT] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim.
XTRACT: A System for Extracting Document Type Descriptors from XML Documents.
In Proceedings of the ACM International Conference on Management of Data
(SIGMOD), pp. 165-176, 2000.