CIS 97YT - Validity and DTDs

Validity and DTDs

Just because a document is well-formed doesn't mean it's a good document. For example, consider this HTML fragment:

<body>
    <p>Example of Strange HTML</p>
    <head>
        <title author="Joe Doakes">Example</title>
    </head>
    <zorko>How did this get here?</zorko>
</body>

This fragment is 100% well-formed, but it is (at least to us) clearly not correct. We know that a <head> can't appear inside a <body>, that <title> has no author attribute, and that there is no such element as <zorko>.

We know this is invalid HTML because we have been taught how HTML elements can and cannot be put together. Browsers have this “knowledge” wired into them, and are taught to ignore unknown elements and attributes, and to do their best with badly placed elements.

In the case of the catalog in part B of assignment one, you saw elements like these:

<department name="Office Supplies">
<price amt="8.95"/>

<department code="CP24">
Computer Peripherals
<price units="USD">10.95</price>

So which set is correct? Is the department name an attribute or text that comes within the <department> element? Is <price> an empty element or not? Is the units attribute optional or not? Of course, you don't know the answers to these questions, because you have never seen this particular markup before.

More to the point, an XML processor can't tell you which of the catalog elements is correct unless we have some way of telling it which combinations of tags and attributes are valid. We obviously don't want to “wire in” the rules for the catalog markup language. Then the XML processor would be good only for this particular markup, just as browsers are good only for HTML. What we need is some general machine-readable notation that describes a markup language's “grammar” and a way to point the XML processor to the appropriate grammar for the document we're parsing.
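
To make the idea concrete, here is a rough sketch of what such a grammar could look like, written in the DTD notation covered later in this session. It assumes, purely for illustration, that the first set of catalog elements is the correct one and that a department holds one or more prices; neither assumption comes from the assignment.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE department [
  <!-- Assumed for illustration: a department holds one or more prices. -->
  <!ELEMENT department (price+)>
  <!-- The department name is an attribute, not text content. -->
  <!ATTLIST department name CDATA #REQUIRED>
  <!-- A price is an empty element; the amount lives in an attribute. -->
  <!ELEMENT price EMPTY>
  <!ATTLIST price amt CDATA #REQUIRED>
]>
<department name="Office Supplies">
  <price amt="8.95"/>
</department>
```

Under this particular grammar, the second set of elements would be rejected: code is not a declared attribute of <department>, the required name attribute is missing, and neither element is allowed to contain text.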

Specifying a Grammar

There are several different notations for specifying the valid combinations of elements and attributes:

DTD: The Document Type Definition, a direct descendant of its SGML counterpart. Has its own special notation; it's the one we'll be covering in this session and the next.
XML Schema: From the World Wide Web Consortium, this is an XML-based markup language for describing grammar. It's very complicated.
RELAX: Written by Dr. Murata Makoto, RELAX is a simpler notation for describing the grammar of a markup language. It's also XML-based.
TREX: Yet another simple, concise notation, originated by James Clark, one of the big names in the SGML and XML world.
RELAX NG: RELAX Next Generation; Clark and Murata realized they were both headed in the same direction and combined their efforts. This is the other notation that we'll be learning.
Schematron: Written by Rick Jelliffe, Schematron works by finding patterns in the XML document rather than by specifying a grammar.

DTDs

We will start out with DTDs, simply because they are the most common form of modeling a document (specifying a grammar).

Connecting a DTD to a Document

In order to say that you want a document to be validated with a certain DTD, you name the DTD in a <!DOCTYPE> declaration, which has one of these forms:

<!DOCTYPE root-element SYSTEM "uri-of-dtd">
<!DOCTYPE root-element PUBLIC "public-identifier" "uri-of-dtd">

The first form is used for “local” DTDs that are on a server or in a file on your machine. The second form is for markup languages that are widely known and advertised. An XML processor that has the DTD “wired in” can recognize it by its public-identifier; if the processor doesn't recognize the identifier, it fetches the DTD from the URI instead. In both cases, the root-element is, of course, the root element of the document to be validated.
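
Neither form shows where the declaration sits in a file. As a minimal sketch: it goes after the XML declaration (if there is one) and before the root element. The relative URI catalog.dtd is an assumption here; it stands for a DTD file kept in the same directory as the document.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The DOCTYPE follows the XML declaration and precedes the root element. -->
<!DOCTYPE catalog SYSTEM "catalog.dtd">
<catalog>
    <!-- catalog content goes here -->
</catalog>
```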

Here are examples of <!DOCTYPE> declarations. The first is a local one for the catalog example; the second is the one used at the top of this document, which is written in XHTML.

<!DOCTYPE catalog SYSTEM "/usr/local/deanza/cis97yt/catalog.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

DTD Basics

Read pages 143-157 of Learning XML, stopping at the section labelled NOTATION.

Case Study

Presume that you are in charge of a database of amateur wrestling clubs for the state of California. The state is divided into several associations, each covering a geographic area of the state; there's the Northern Area Wrestling Association, the Southern California Wrestling Association, etc. Each club is affiliated with one of these associations. For each club, we need to keep track of:

The club name
The club's code (assigned by the State office)
The year the club was chartered
Location
Person or persons to contact, each of whom has:
- Phone number(s) for home, work, fax, and cell
- email address(es)
Age group(s) that the club serves; can be any combination of Kids, Cadets, Juniors, and Open
Additional club info, such as costs, practice times, etc.

Note that this English-language description gives us some great clues as to what the markup and its DTD should look like. It will also be a basis for documentation for people who will enter data in the master document of clubs.

Elements or Attributes

When we design our markup, we will have to decide which information should be specified as elements with content, and which should be specified as attributes. In general, as Learning XML says on page 172, an element holds content that is part of the document; an attribute modifies the behavior of an element. In the example at hand, everything is going to be an element except the association name, club code, type of phone number, and age groups served. Our refined description with element and attribute names is as follows:

The <club-database> element contains one or more <association> elements.
An <association> has a unique id attribute and contains one or more <club> elements.
A <club> element has a unique id attribute.
Each <club> contains the following elements in order:
- A <charter> year
- The club <name>
- The club's <location>
- Either a <contact-list> element (see below) or, if there's only one contact person, a single <contact> element.
- An empty <age-groups> element. It has a type attribute whose value consists of the letters K, C, J, and O; thus, a club for Cadets and Juniors would be specified as <age-groups type="CJ"/>.
- An optional <info> element that gives extra information about the club.

The <contact-list> element contains one or more <contact> elements, each of which contains:

A <person> element which contains a person's name.
One or more <phone> elements, each of which contains a phone number. This element has a type attribute which has one of the following values: "home", "work", "fax", or "cell". If not specified, the default is "home".
Zero or more <email> elements, each of which contains an email address.

Here is a sample file created according to this description:

<?xml version="1.0" encoding="UTF-8"?>
<club-database>
<association id="SCVWA">
<club id="H23">
    <charter>2000</charter>
    <name>California Gold</name>
    <location>San Jose</location>
    <contact-list>
        <contact>
            <person>Donald Morton</person>
            <phone>408-555-0102</phone>
        </contact>
    </contact-list>
    <age-groups type="KCJ"/>
    <info>
        Practice on Mondays and Wednesdays 6-7:30pm at Glen Park HS.
        Cost is $45.00, includes USA membership card and t-shirt.
    </info>
</club>
<club id="H26">
    <charter>2002</charter>
    <name>Campbell Bulldogs</name>
    <location>San Jose</location>
    <contact-list>
        <contact>
            <person>John Moreson</person>
            <phone type="home">408-555-1092</phone>
            <phone type="work">650-555-7442</phone>
            <email>j_moreson@anyco.org</email>
        </contact>
        <contact>
            <person>Roger McClarty</person>
            <phone type="work">408-555-0960 x3251</phone>
            <email>mcclarty_r@someschool.edu</email>
        </contact>
    </contact-list>
    <age-groups type="KCJO"/>
</club>
</association>
</club-database>

Building the DTD

We build the DTD from the top down, from outermost to innermost elements, just as the description was written. This top-down approach is usually the best, since that's probably how you designed your markup. It's also possible to sketch the structure from the top down and then fill in the details from the bottom up, although it's not a method I'd recommend. Here are the high-level elements. Notice that the declarations use lots of whitespace, and that each element's attribute list is written immediately after the element's declaration.

<!ELEMENT   club-database   (association+) >
<!ELEMENT    association     (club+) >
<!ATTLIST    association
    id      ID          #REQUIRED
>

Next, the description for a club:

<!ELEMENT   club
    (charter, name, location, (contact | contact-list), age-groups, info?)
>
<!ATTLIST    club
    id      ID          #REQUIRED >

<!ELEMENT    charter     (#PCDATA) >
<!ELEMENT    name        (#PCDATA) >
<!ELEMENT    location    (#PCDATA) >
<!ELEMENT    contact-list (contact+) >
<!ELEMENT    age-groups  EMPTY >
<!ATTLIST    age-groups
    type    CDATA       #REQUIRED >
<!ELEMENT    info        (#PCDATA) >
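
The sample file uses <contact-list> in both clubs, so the other branch of the (contact | contact-list) choice never appears. As a sketch, a club with a single contact person could look like the fragment below, which would sit inside an <association>. The club code, names, and phone number are invented for illustration.

```xml
<club id="H31">
    <charter>1998</charter>
    <name>Example Wrestling Club</name>
    <location>Fresno</location>
    <!-- One contact person, so <contact> appears directly, with no <contact-list>. -->
    <contact>
        <person>Pat Example</person>
        <phone type="cell">559-555-0143</phone>
    </contact>
    <age-groups type="CJ"/>
    <!-- The optional <info> element is omitted. -->
</club>
```

Note that the DTD checks only that <age-groups> has a type attribute. Because the attribute is declared as CDATA, a value such as "XYZ" would also pass validation; the restriction to the letters K, C, J, and O lives in the English description, not in the grammar.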

And finally, an individual contact:

<!ELEMENT   contact     (person, phone+, email*) >
<!ELEMENT    person      (#PCDATA) >
<!ELEMENT    phone       (#PCDATA) >
<!ATTLIST    phone
    type    ( home | work | fax | cell )    "home" >
<!ELEMENT    email       (#PCDATA) >
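
To try the declarations out, one option is to put them in an internal DTD subset, between square brackets inside the <!DOCTYPE> declaration, so that the grammar and a small test document travel in one file. This is a sketch for experimenting, not the SYSTEM form described earlier; the instance data is cut down from the sample file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE club-database [
  <!ELEMENT club-database (association+)>
  <!ELEMENT association   (club+)>
  <!ATTLIST association   id ID #REQUIRED>

  <!ELEMENT club
      (charter, name, location, (contact | contact-list), age-groups, info?)>
  <!ATTLIST club          id ID #REQUIRED>

  <!ELEMENT charter       (#PCDATA)>
  <!ELEMENT name          (#PCDATA)>
  <!ELEMENT location      (#PCDATA)>
  <!ELEMENT contact-list  (contact+)>
  <!ELEMENT age-groups    EMPTY>
  <!ATTLIST age-groups    type CDATA #REQUIRED>
  <!ELEMENT info          (#PCDATA)>

  <!ELEMENT contact       (person, phone+, email*)>
  <!ELEMENT person        (#PCDATA)>
  <!ELEMENT phone         (#PCDATA)>
  <!ATTLIST phone         type (home | work | fax | cell) "home">
  <!ELEMENT email         (#PCDATA)>
]>
<club-database>
  <association id="SCVWA">
    <club id="H23">
      <charter>2000</charter>
      <name>California Gold</name>
      <location>San Jose</location>
      <contact-list>
        <contact>
          <person>Donald Morton</person>
          <!-- No type attribute: a validating parser supplies type="home". -->
          <phone>408-555-0102</phone>
        </contact>
      </contact-list>
      <age-groups type="KCJ"/>
    </club>
  </association>
</club-database>
```

If you save this under a name such as clubs-inline.xml (a made-up name) and have the xmllint tool from libxml2 installed, running xmllint --valid --noout clubs-inline.xml should print nothing when the document is valid. Try moving <name> ahead of <charter>, or changing a phone type to "pager", to see the kind of errors a validating parser reports.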