April 19, 2006



Coverage: [DBCB] Chapter 4, pp. 173-185 and Chapter 20, pp. 1099-1096


Review of Mining Association Rules


Pruning Criterion for Confidence?


Semistructured Data


Semistructured Data Model


Example of Semistructured Data


Introduction to XML

XML (eXtensible Markup Language) is a W3C (World Wide Web Consortium) initiative which complements HTML to standardize data exchange on the web. XML is simply a data exchange format. Like HTML, it is a text format derived from SGML (Standard Generalized Markup Language).

XML was designed with the following simple objectives:

While HTML was designed to describe the presentation of a document, XML is designed to describe the content of a document. Therefore, as opposed to HTML, which is a predefined markup language with a fixed set of tags, XML is a language for designing markup languages or, in other words, a metalanguage (hence extensible).

Below you will notice that there is a close relationship between XML and semistructured data; XML is ideally suited to represent semistructured data. For more information, see:


HTML versus XML

The most salient difference between HTML and XML is that HTML describes presentation and XML describes content. An HTML document rendered in a web browser is human readable. XML aims to be both human and machine readable.

Consider the following HTML.

<html>
<head><title>Books</title></head>

<body>

<h2>Books</h2>
<hr>

<em>Sense and Sensibility</em>, <b>Jane Austen</b>, 1811<br>
<em>Pride and Prejudice</em>, <b>Jane Austen</b>, 1813<br>
<em>Alice in Wonderland</em>, <b>Lewis Carroll</b>, 1866<br>
<em>Through the Looking Glass</em>, <b>Lewis Carroll</b>, 1872<br>

</body>
</html>

It is rendered in a browser as follows.

The HTML above describes how bibliography information is to be presented and formatted for a human to view in a web browser. Knowing that Sense and Sensibility is enclosed in emphasis (<em>) tags does not, however, help a program determine that it is the title of a book. XML attempts to fill this void by describing what web data means rather than how it looks.

The following XML describes the contents of the books HTML page above.

<books>
   <book>
      <title>Sense and Sensibility</title>
      <author>Jane Austen</author>
      <year>1811</year>
   </book>

   <book>
      <title>Pride and Prejudice</title>
      <author>Jane Austen</author>
      <year>1813</year>
   </book>

   <book>
      <title>Alice in Wonderland</title>
      <author>Lewis Carroll</author>
      <year>1866</year>
   </book>

   <book>
      <title>Through the Looking Glass</title>
      <author>Lewis Carroll</author>
      <year>1872</year>
   </book>
</books>

A program parsing this data can take advantage of the fact that all book titles are enclosed in <title> tags. Where would a program find such information? An XML document may contain an optional description of its grammar that describes which tags are used in the XML document and how such tags can be nested. A grammar is a schema or road map for the XML document and necessary for a program to process the document automatically. Originally an XML grammar was specified in a DTD (Document Type Definition). However, a newer standard, XML Schema (XSD), has been adopted to address some of the limitations of DTDs.
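As a minimal sketch of how a program might exploit the <title> tags, the following Python uses the standard-library xml.etree.ElementTree module on a shortened copy of the books document. The function name book_titles is an illustrative choice, not part of any standard, and no grammar (DTD or schema) is consulted here; the code simply assumes the nesting shown above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the books document from these notes.
BOOKS_XML = """<books>
   <book>
      <title>Sense and Sensibility</title>
      <author>Jane Austen</author>
      <year>1811</year>
   </book>
   <book>
      <title>Alice in Wonderland</title>
      <author>Lewis Carroll</author>
      <year>1866</year>
   </book>
</books>"""


def book_titles(xml_text):
    """Return the text of the <title> child of every <book> under the root."""
    root = ET.fromstring(xml_text)
    return [book.findtext("title") for book in root.findall("book")]


titles = book_titles(BOOKS_XML)
print(titles)
```

Because the program looks up elements by name, it keeps working if the presentation of the data changes, which is not true of a program that scrapes the italic text out of the HTML version.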

As can be seen above, XML does not contain any information indicating how the document should be rendered in a browser. Thus, XML factors data from presentation. The beauty of this feature is that the same data can be presented in a variety of ways without having to replicate any data, e.g., consider tailoring presentation to multiple devices with different output capabilities.
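To illustrate presenting the same data in more than one way, here is a rough sketch that renders one XML source both as the HTML lines shown earlier and as plain text. The helper names (fields, as_html, as_plain_text) and the plain-text layout are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET
from html import escape

BOOKS_XML = """<books>
   <book>
      <title>Sense and Sensibility</title>
      <author>Jane Austen</author>
      <year>1811</year>
   </book>
   <book>
      <title>Through the Looking Glass</title>
      <author>Lewis Carroll</author>
      <year>1872</year>
   </book>
</books>"""


def fields(book):
    """Pull (title, author, year) out of one <book> element."""
    return tuple(book.findtext(tag) for tag in ("title", "author", "year"))


def as_html(root):
    """Render each book as one line of the HTML page shown earlier."""
    rows = []
    for book in root.findall("book"):
        title, author, year = (escape(value) for value in fields(book))
        rows.append(f"<em>{title}</em>, <b>{author}</b>, {year}<br>")
    return "\n".join(rows)


def as_plain_text(root):
    """Render the same data for a device that cannot display markup."""
    rows = []
    for book in root.findall("book"):
        title, author, year = fields(book)
        rows.append(f"{title} ({year}), {author}")
    return "\n".join(rows)


root = ET.fromstring(BOOKS_XML)
print(as_html(root))
print(as_plain_text(root))
```

Both renderings read from the same parsed tree, so no data is replicated; only the formatting functions differ.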


XML Syntax

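An XML document should begin with the following declaration. It is optional in XML 1.0 but recommended; when present, it must be the very first thing in the document.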

<?xml version="1.0"?>

The document's body follows. The building blocks of XML documents are elements and attributes. Each document has exactly one all-encompassing, or top-level, element. This element corresponds to the root of the semistructured data graph being modeled.
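The correspondence between the top-level element and the root of a tree can be seen by parsing a small document and walking it. This is only a sketch using Python's standard-library parser; the outline function is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

# The declaration must start at the first character of the text.
DOCUMENT = """<?xml version="1.0"?>
<books>
   <book>
      <title>Sense and Sensibility</title>
   </book>
</books>"""


def outline(element, depth=0):
    """Return the element names of the tree, indented by nesting depth."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOCUMENT)  # the top-level element is the tree's root
tree_lines = outline(root)
print("\n".join(tree_lines))
```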


XML Elements

An element, the most basic XML component, is text surrounded by matching tags, including the tags themselves. An element contains content. Content can be raw text, other elements (subelements), or a mixture of the two.

XML also has an empty element.

<paperback></paperback>
can be abbreviated to
<paperback/>

When would one ever want to use an empty element?
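One possible answer, offered here as an assumption rather than as the answer given in class, is that an empty element can act as a flag: its presence or absence carries the information. The sketch below also checks that the long and abbreviated forms parse to the same thing. The is_paperback helper is hypothetical.

```python
import xml.etree.ElementTree as ET

LONG_FORM = "<book><title>Sense and Sensibility</title><paperback></paperback></book>"
SHORT_FORM = "<book><title>Sense and Sensibility</title><paperback/></book>"
NO_FLAG = "<book><title>Sense and Sensibility</title></book>"


def is_paperback(book):
    """Treat the presence of an empty <paperback> element as a boolean flag."""
    return book.find("paperback") is not None


def is_empty(element):
    """An empty element has no subelements and no text."""
    return len(element) == 0 and not element.text


long_book = ET.fromstring(LONG_FORM)
short_book = ET.fromstring(SHORT_FORM)
plain_book = ET.fromstring(NO_FLAG)

print(is_paperback(long_book), is_paperback(short_book), is_paperback(plain_book))
```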


XML Attributes

Attributes are associated with elements and add more semantic information to the element. They are defined like HTML attributes as (name, value) pairs. Consider the following XML containing attributes.

<book>
   <title language="English">Sense and Sensibility</title>
   <author format="first last">Jane Austen</author>
   <year style="yyyy">1811</year>
</book>

XML attributes reveal XML's origin as a document markup language in that they introduce ambiguity in data exchange. One must decide whether to represent information as elements or attributes. For example, consider the following two alternatives.

<book>
   <title>Sense and Sensibility</title>
   <author>Jane Austen</author>
   <year>1811</year>
</book>

<book title="Sense and Sensibility" author="Jane Austen" year="1811"/>
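The ambiguity has a practical cost: a program must know which representation was chosen before it can read the data. The following sketch, with hypothetical helper names from_elements and from_attributes, recovers the same record from each alternative using Python's standard-library parser.

```python
import xml.etree.ElementTree as ET

AS_ELEMENTS = """<book>
   <title>Sense and Sensibility</title>
   <author>Jane Austen</author>
   <year>1811</year>
</book>"""

AS_ATTRIBUTES = '<book title="Sense and Sensibility" author="Jane Austen" year="1811"/>'


def from_elements(book):
    """Read the record when each field is a subelement."""
    return {child.tag: child.text for child in book}


def from_attributes(book):
    """Read the record when each field is an attribute."""
    return dict(book.attrib)


record_a = from_elements(ET.fromstring(AS_ELEMENTS))
record_b = from_attributes(ET.fromstring(AS_ATTRIBUTES))
print(record_a == record_b)
```

The two readers are not interchangeable: applying from_attributes to the element form yields an empty record, so both parties in a data exchange have to agree on one convention.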


Tags versus Attributes

Tags                                                                 Attributes
An instance can be repeated within a tag                             An instance cannot be repeated within an attribute
Contents can be one of three types (raw, subelements, or mixture)    Associated value must be a string


Well-formed vs. Valid XML

We have presented few constraints on XML thus far. To be well-formed, an XML document must have properly nested tags, and no element may carry the same attribute name more than once. Well-formedness only ensures that the document will parse into a labeled tree. It does not imply that the document conforms to its grammar; a well-formed document that also conforms to its grammar is called valid XML.
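A well-formedness check can be sketched by simply attempting to parse the text. The example below relies on Python's standard-library parser, which is non-validating: it rejects improper nesting and duplicate attributes but does not check a document against a DTD or schema, so it says nothing about validity. The function name is_well_formed is an illustrative choice.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses into a tree, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


properly_nested = "<book><title>Sense and Sensibility</title></book>"
badly_nested = "<book><title>Sense and Sensibility</book></title>"
duplicate_attribute = '<book year="1811" year="1813"/>'

for text in (properly_nested, badly_nested, duplicate_attribute):
    print(is_well_formed(text), text)
```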


Other XML Constructs


How XML syntax differs from HTML


Representing the Instructor-Course Database in XML


Using ID's and IDREF's

Use attributes to help represent semistructured data that does not have tree form.

<instructor-course-data>
   <instructor iid="jb" teaches="430, 432">
      <name>Buckley</name><bldg>AN</bldg><no>135</no>
   </instructor>

   <instructor iid="sp" teaches="430, 432, 444, 445">
      <name>Perugini</name>
      <office><bldg>AN</bldg><no>145</no></office>
      <address><street>Patterson</street><city>Dayton</city></address>
   </instructor>

   <course cid="430" taught_by="jb, sp">
      <dept>CPS</dept><no>430</no><credits>3</credits>
   </course>
</instructor-course-data>
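The attribute values above act as references that a program must resolve itself. The sketch below builds lookup tables keyed on iid and cid and follows the references in both directions. It assumes the comma-separated convention used in the listing; note that if these attributes were declared with the DTD types ID and IDREFS, the values would instead have to be space-separated XML names, which cannot begin with a digit. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

DATA = """<instructor-course-data>
   <instructor iid="jb" teaches="430, 432">
      <name>Buckley</name><bldg>AN</bldg><no>135</no>
   </instructor>
   <instructor iid="sp" teaches="430, 432, 444, 445">
      <name>Perugini</name>
      <office><bldg>AN</bldg><no>145</no></office>
      <address><street>Patterson</street><city>Dayton</city></address>
   </instructor>
   <course cid="430" taught_by="jb, sp">
      <dept>CPS</dept><no>430</no><credits>3</credits>
   </course>
</instructor-course-data>"""

root = ET.fromstring(DATA)

# Index each kind of element by its identifying attribute.
instructors = {i.get("iid"): i for i in root.findall("instructor")}
courses = {c.get("cid"): c for c in root.findall("course")}


def split_refs(value):
    """Split a comma-separated reference list, as written in the listing."""
    return [ref.strip() for ref in value.split(",") if ref.strip()]


def instructor_names(cid):
    """Follow taught_by references from a course to instructor names."""
    refs = split_refs(courses[cid].get("taught_by"))
    return [instructors[ref].findtext("name") for ref in refs]


def resolvable_courses(iid):
    """Follow teaches references, keeping only courses present in the data."""
    refs = split_refs(instructors[iid].get("teaches"))
    return [ref for ref in refs if ref in courses]


print(instructor_names("430"))
print(resolvable_courses("sp"))
```

Only course 430 appears in this excerpt, so the other references in teaches do not resolve here; a validating parser with IDREFS declarations would report such dangling references as validity errors.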


Files Used In Class


References


