Just because a document is well-formed doesn't mean it's a good document. For example, consider this HTML fragment:
<body>
  <p>Example of Strange HTML</p>
  <head>
    <title author="Joe Doakes">Example</title>
  </head>
  <zorko>How did this get here?</zorko>
</body>
This fragment is 100% well-formed, but it is (at least to us) clearly not correct. We know that the <head> can't be inside the <body>, that there's no such thing as an author attribute on a <title>, and that there is no <zorko> element.
We know this is invalid HTML because we have been taught how HTML elements can and cannot be put together. Browsers have this “knowledge” wired into them, and are taught to ignore unknown elements and attributes, and to do their best with badly placed elements.
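The claim that the fragment above is well-formed can be checked with any non-validating parser. Here is a minimal sketch, assuming Python's standard xml.etree.ElementTree module as the parser (these notes don't prescribe a tool; the helper name is_well_formed is made up for illustration):

```python
import xml.etree.ElementTree as ET

# The strange HTML fragment from above.
fragment = """<body>
  <p>Example of Strange HTML</p>
  <head>
    <title author="Joe Doakes">Example</title>
  </head>
  <zorko>How did this get here?</zorko>
</body>"""

def is_well_formed(text):
    """Return True if text parses as XML. This says nothing about validity."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

root = ET.fromstring(fragment)
child_tags = [child.tag for child in root]

print(is_well_formed(fragment))                    # True
print(child_tags)                                  # ['p', 'head', 'zorko']
print(is_well_formed("<body><p>unclosed</body>"))  # False
```

The parser happily accepts <zorko> and the misplaced <head>; it only complains when the tags don't nest properly.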
In the case of the catalog in part B of assignment one, you saw elements like these:
<department name="Office Supplies">
<price amt="8.95"/>

<department code="CP24"> Computer Peripherals
<price units="USD">10.95</price>
So which set is correct? Is the department name an attribute or text that comes within the <department> element? Is <price> an empty element or not? Is the units attribute optional or not? Of course, you don't know the answers to these questions, because you have never seen this particular markup before.
More to the point, an XML processor can't tell you which of the catalog elements is correct unless we have some way of telling it which combinations of tags and attributes are valid. We obviously don't want to “wire in” the rules for the catalog markup language. Then the XML processor would be good only for this particular markup, just as browsers are good only for HTML. What we need is some general machine-readable notation that describes a markup language's “grammar” and a way to point the XML processor to the appropriate grammar for the document we're parsing.
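To make the "rules as data" idea concrete before any real notation is introduced, here is a toy sketch in Python. It is not a DTD and not how a real XML processor works: the grammar format, the name attribute_errors, and the decision to treat the first catalog style as the correct one are all assumptions made for illustration, and closing </department> tags have been added so the snippets parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: which attributes each element must or may carry.
# Treating the first catalog style as "correct" is an arbitrary assumption.
CATALOG_GRAMMAR = {
    "department": {"required": {"name"}, "optional": set()},
    "price": {"required": {"amt"}, "optional": {"units"}},
}

def attribute_errors(xml_text, grammar):
    """Return a list of messages for elements that break the grammar."""
    errors = []
    for elem in ET.fromstring(xml_text).iter():
        rule = grammar.get(elem.tag)
        if rule is None:
            errors.append(f"unknown element <{elem.tag}>")
            continue
        present = set(elem.attrib)
        for missing in sorted(rule["required"] - present):
            errors.append(f"<{elem.tag}> lacks required attribute {missing}")
        for extra in sorted(present - rule["required"] - rule["optional"]):
            errors.append(f"<{elem.tag}> has unknown attribute {extra}")
    return errors

first = '<department name="Office Supplies"><price amt="8.95"/></department>'
second = ('<department code="CP24">Computer Peripherals'
          '<price units="USD">10.95</price></department>')

print(attribute_errors(first, CATALOG_GRAMMAR))
for message in attribute_errors(second, CATALOG_GRAMMAR):
    print(message)
```

The checking function knows nothing about catalogs; swapping in a different dictionary would make it check a different markup language. That separation is what a grammar notation gives us.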
There are several different notations for specifying the valid combinations of elements and attributes:
We will start out with DTDs, simply because they are the most common form of modeling a document (specifying a grammar).
In order to say that you want a document to be validated with a certain DTD, you name the DTD in a <!DOCTYPE> declaration, which has one of these forms:
<!DOCTYPE root-element SYSTEM "uri-of-dtd">
<!DOCTYPE root-element PUBLIC "public-identifier" "uri-of-dtd">
The first form is used for “local” DTDs that are on a server or in a file on your machine. The second form is for markup languages that are widely known and advertised. The public-identifier can be used when the DTD is “wired in” to the XML processor; if the processor doesn't recognize it, then the URI is used. In both forms, the identifiers are written in quotes, and the root-element is, of course, the root element of the document to be validated.
Here are examples of <!DOCTYPE> declarations. The first is a local one for the catalog example; the second is the one that was used at the top of this document, which is written in XHTML.
<!DOCTYPE catalog SYSTEM "/usr/local/deanza/cis97yt/catalog.dtd">
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
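To see how a processor picks these pieces apart, here is a small sketch using the StartDoctypeDeclHandler callback of Python's standard xml.parsers.expat module. The helper name read_doctype and the empty root elements appended to each declaration are assumptions for illustration. Expat does not validate and, as configured here, does not fetch the DTD; it only reports what the declaration says.

```python
import xml.parsers.expat

def read_doctype(xml_text):
    """Return (root_name, system_id, public_id) from the DOCTYPE, or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, public_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # True: this is the final chunk of input
    return found[0] if found else None

# Each declaration is followed by an empty root element so the text parses.
local_doc = ('<!DOCTYPE catalog SYSTEM "/usr/local/deanza/cis97yt/catalog.dtd">'
             '<catalog/>')
public_doc = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
              '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html/>')

print(read_doctype(local_doc))
print(read_doctype(public_doc))
print(read_doctype("<catalog/>"))
```

For the SYSTEM form the public identifier comes back as None; a document with no <!DOCTYPE> at all yields None for the whole result.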
Read pages 143-157 of Learning XML, stopping at the section labelled NOTATION.
Presume that you are in charge of a database of amateur wrestling clubs for the state of California. The state is divided into several associations, each covering a geographic area of the state; there's the Northern Area Wrestling Association, the Southern California Wrestling Association, etc. Each club is affiliated with one of these associations. For each club, we need to keep track of:
Note that this English-language description gives us some great clues as to what the markup and its DTD should look like. It will also be a basis for documentation for people who will enter data in the master document of clubs.
When we design our markup, we will have to decide which information should be specified as elements with content, and which should be specified as attributes. In general, as Learning XML says on page 172, an element holds content that is part of the document; an attribute modifies the behavior of an element. In the example at hand, everything is going to be an element except the association name, club code, type of phone number, and age groups served. Our refined description with element and attribute names is as follows:
- The <club-database> element contains one or more <association> elements.
- An <association> element contains one or more <club> elements.
- A <club> element has a unique id attribute.
- A <club> contains the following elements in order:
  - A <charter> year
  - A <name>
  - A <location>
  - A <contact-list> element (see below). If there's only one contact person, then a single <contact> element.
  - An <age-groups> element. It has a type attribute whose value consists of the letters K, C, J, and O; thus, a club for Cadets and Juniors would be specified as <age-groups type="CJ"/>.
  - An optional <info> element that gives extra information about the club.

The <contact-list> element contains one or more <contact> elements, each of which contains:

- A <person> element which contains a person's name.
- One or more <phone> elements, each of which contains a phone number. This element has a type attribute which has one of the following values: "home", "work", "fax", or "cell". If not specified, the default is "home".
- Zero or more <email> elements, each of which contains an email address.

Here is a sample file created according to this description:
<?xml version="1.0" encoding="UTF-8"?>
<club-database>
  <association id="SCVWA">
    <club id="H23">
      <charter>2000</charter>
      <name>California Gold</name>
      <location>San Jose</location>
      <contact-list>
        <contact>
          <person>Donald Morton</person>
          <phone>408-555-0102</phone>
        </contact>
      </contact-list>
      <age-groups type="KCJ"/>
      <info>
        Practice on Mondays and Wednesdays 6-7:30pm at Glen Park HS.
        Cost is $45.00, includes USA membership card and t-shirt.
      </info>
    </club>
    <club id="H26">
      <charter>2002</charter>
      <name>Campbell Bulldogs</name>
      <location>San Jose</location>
      <contact-list>
        <contact>
          <person>John Moreson</person>
          <phone type="home">408-555-1092</phone>
          <phone type="work">650-555-7442</phone>
          <email>j_moreson@anyco.org</email>
        </contact>
        <contact>
          <person>Roger McClarty</person>
          <phone type="work">408-555-0960 x3251</phone>
          <email>mcclarty_r@someschool.edu</email>
        </contact>
      </contact-list>
      <age-groups type="KCJO"/>
    </club>
  </association>
</club-database>
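Markup laid out this way is easy for a program to consume. As a sketch, here is how club H26 from the sample could be read with Python's standard xml.etree.ElementTree module. The excerpt omits the XML declaration and club H23, and the function name club_summary and the shape of the dictionary it returns are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Excerpt of the sample file: club H26 only, without the XML declaration.
sample = """<club-database>
  <association id="SCVWA">
    <club id="H26">
      <charter>2002</charter>
      <name>Campbell Bulldogs</name>
      <location>San Jose</location>
      <contact-list>
        <contact>
          <person>John Moreson</person>
          <phone type="home">408-555-1092</phone>
          <phone type="work">650-555-7442</phone>
          <email>j_moreson@anyco.org</email>
        </contact>
        <contact>
          <person>Roger McClarty</person>
          <phone type="work">408-555-0960 x3251</phone>
          <email>mcclarty_r@someschool.edu</email>
        </contact>
      </contact-list>
      <age-groups type="KCJO"/>
    </club>
  </association>
</club-database>"""

def club_summary(club):
    """Collect a club's id, name, age groups, and contacts into a dict."""
    contacts = [
        {
            "person": contact.findtext("person"),
            # The type is read exactly as written; no default is applied here.
            "phones": [(p.get("type"), p.text) for p in contact.findall("phone")],
        }
        # iter() finds <contact> whether or not it sits in a <contact-list>.
        for contact in club.iter("contact")
    ]
    return {
        "id": club.get("id"),
        "name": club.findtext("name"),
        "age_groups": list(club.find("age-groups").get("type")),
        "contacts": contacts,
    }

root = ET.fromstring(sample)
summaries = [club_summary(club) for club in root.iter("club")]
print(summaries[0]["name"], summaries[0]["age_groups"])
```

Note that this code silently trusts the document's structure. If <age-groups> were missing, club.find would return None and the program would crash, which is exactly the kind of surprise that validating against a DTD is meant to prevent.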
We build the DTD from the top down, from outermost to innermost elements, just as the description was written. This top-down approach is usually the best, since that's probably how you designed your markup. It's also possible to start at the top and fill in the details from the bottom up, although it's not a method I'd recommend. Here are the high-level elements. Notice that the declarations use lots of whitespace, and that each element's attributes are declared immediately after the element.
<!ELEMENT club-database (association+) >

<!ELEMENT association (club+) >
<!ATTLIST association
    id ID #REQUIRED >
Next, the description for a club:
<!ELEMENT club (charter, name, location,
    (contact | contact-list), age-groups, info?) >
<!ATTLIST club
    id ID #REQUIRED >

<!ELEMENT charter (#PCDATA) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT location (#PCDATA) >

<!ELEMENT contact-list (contact+) >

<!ELEMENT age-groups EMPTY >
<!ATTLIST age-groups
    type CDATA #REQUIRED >

<!ELEMENT info (#PCDATA) >
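A content model such as the one for <club> is essentially a regular expression over the names of an element's children. The sketch below makes that concrete by translating the model by hand into a Python regular expression. A real validating parser derives this from the DTD itself; the names CLUB_MODEL and club_children_ok are invented for this illustration, and the test elements leave <contact-list> empty for brevity, since only the <club> model is being checked.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content model:
# (charter, name, location, (contact | contact-list), age-groups, info?)
CLUB_MODEL = re.compile(
    r"charter name location (contact|contact-list) age-groups( info)?"
)

def club_children_ok(club):
    """Check the order of a <club> element's children against the model."""
    sequence = " ".join(child.tag for child in club)
    return CLUB_MODEL.fullmatch(sequence) is not None

good = ET.fromstring(
    "<club id='H26'><charter>2002</charter><name>Campbell Bulldogs</name>"
    "<location>San Jose</location><contact-list/><age-groups type='KCJO'/></club>"
)
# Same children, but <name> comes before <charter>.
swapped = ET.fromstring(
    "<club id='H26'><name>Campbell Bulldogs</name><charter>2002</charter>"
    "<location>San Jose</location><contact-list/><age-groups type='KCJO'/></club>"
)

print(club_children_ok(good))     # True
print(club_children_ok(swapped))  # False
```

The comma in a content model means "followed by", the vertical bar means "or", and the question mark means "optional", just as the corresponding pieces of the regular expression do.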
And finally, an individual contact:
<!ELEMENT contact (person, phone+, email*) >

<!ELEMENT person (#PCDATA) >

<!ELEMENT phone (#PCDATA) >
<!ATTLIST phone
    type ( home | work | fax | cell ) "home" >

<!ELEMENT email (#PCDATA) >
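One visible effect of these declarations is the default value for the type attribute of <phone>. The sketch below assumes Python's standard xml.etree.ElementTree, whose underlying expat parser does not validate but does read declarations placed in an internal DTD subset (the square-bracket section of <!DOCTYPE>) and fills in attribute defaults. Putting the contact declarations inline like this, rather than in a separate DTD file, is done here only to keep the example self-contained.

```python
import xml.etree.ElementTree as ET

# The contact declarations from above, placed in an internal DTD subset.
with_dtd = """<!DOCTYPE contact [
<!ELEMENT contact (person, phone+, email*) >
<!ELEMENT person (#PCDATA) >
<!ELEMENT phone (#PCDATA) >
<!ATTLIST phone
    type ( home | work | fax | cell ) "home" >
<!ELEMENT email (#PCDATA) >
]>
<contact>
  <person>Donald Morton</person>
  <phone>408-555-0102</phone>
</contact>"""

# The same contact with no DTD at all.
without_dtd = """<contact>
  <person>Donald Morton</person>
  <phone>408-555-0102</phone>
</contact>"""

def phone_types(xml_text):
    """Return the type attribute of every <phone>, or None where absent."""
    return [phone.get("type") for phone in ET.fromstring(xml_text).iter("phone")]

print(phone_types(with_dtd))
print(phone_types(without_dtd))
```

With the declarations present, the phone number is reported as a "home" number even though the document never says so; without them, the attribute is simply missing. Because this parser does not validate, it would not complain about a contact with no <phone> at all; catching that requires a validating parser.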