CIS 97YT - Lecture Notes

Markup comes from the bad old days before word processors. If you needed a brochure, you'd type it on a typewriter, and then literally mark it up with a red pen to tell the typesetter what you wanted it to look like. The typesetter would follow your instructions and return a finished document to you:

How to Buy a Wrench

There are two kinds of wrenches: wrenches with fixed size, and adjustable wrenches.

In this instance, we're using markup not only to show how text should be presented (italic versus normal text), but also to tell how the document is structured: some of the words form a heading, while the other words are just ordinary text.

The idea of using markup to impose structure on otherwise anonymous data is such a good one that people came up with a standardized way to create markup languages for general use. This method was called the Standard Generalized Markup Language, or SGML. SGML really isn't a language in and of itself; it is more of a “rulebook” that tells you how to develop these markup languages. Any markup language that follows the SGML rulebook is called an application of SGML.

The most widely known application of SGML is a language used to mark up text for delivery and presentation on the World Wide Web. That language is HTML, the HyperText Markup Language. In HTML, we can mark up the example above to send to a web browser instead of a typesetter:

<h3>How to Buy a Wrench</h3>
<p>
There are two kinds of wrenches: wrenches with fixed size, and
<i>adjustable</i> wrenches.
</p>

There are many other applications of SGML, but they're mostly found in large corporations and government agencies. That's because the SGML rulebook is very complex, which makes it hard to learn. For example, SGML allows optional opening and closing tags. Quick: is </li> required or not? How about <body>? Additionally, it's difficult (and expensive!) to develop tools that can manage data that's marked up according to those rules.
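To see why optional tags make tools harder to write, here is a minimal sketch using Python's standard html.parser module. The EventLogger class, the tokenize function, and the sample list markup are made up for this illustration; they are not part of the original notes. The tokenizer reports only the tags that are actually written, so when </li> is left out, a tool has to know the rulebook well enough to work out where each list item ends.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records the tags and text that literally appear in the markup."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Ignore whitespace-only runs between tags.
        if data.strip():
            self.events.append(("text", data.strip()))


def tokenize(markup):
    """Return the list of events the tokenizer sees in the markup."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Hypothetical sample: a list whose closing </li> tags were omitted.
events = tokenize("<ul><li>one<li>two</ul>")
for event in events:
    print(event)
```

No ("end", "li") event ever shows up in the output. Deciding that the first item ends where the second begins is left entirely to whatever program consumes these events.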

HTML Doesn't Do It All

While HTML is a good thing, it doesn't solve all our problems. Consider the following two tables. While the data is structured into rows and cells, there's nothing to tell you (other than your intuition) that the first table gives maximum and minimum temperatures, while the second table gives current and maximum capacities for water reservoirs.

<table border="1">
<tr>
  <td>Chicago</td><td>13</td><td>6</td>
</tr>
<tr>
  <td>Dallas</td><td>60</td><td>20</td>
</tr>
</table>

<table border="1">
<tr>
  <td>Calero</td><td>5538</td><td>10050</td>
</tr>
<tr>
  <td>Uvas</td><td>6095</td><td>9935</td>
</tr>
</table>
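A program reading these tables is in the same position as you are, minus the intuition. The sketch below is an illustration, not part of the original notes: it uses Python's xml.etree.ElementTree to pull the cells out of both tables, which works only because these two fragments happen to be well-formed. The rows function and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

WEATHER = """<table border="1">
<tr>
  <td>Chicago</td><td>13</td><td>6</td>
</tr>
<tr>
  <td>Dallas</td><td>60</td><td>20</td>
</tr>
</table>"""

RESERVOIRS = """<table border="1">
<tr>
  <td>Calero</td><td>5538</td><td>10050</td>
</tr>
<tr>
  <td>Uvas</td><td>6095</td><td>9935</td>
</tr>
</table>"""


def rows(table_markup):
    """Return each table row as a list of cell strings."""
    table = ET.fromstring(table_markup)
    return [[cell.text for cell in row.findall("td")]
            for row in table.findall("tr")]


weather_rows = rows(WEATHER)
reservoir_rows = rows(RESERVOIRS)
print(weather_rows)    # [['Chicago', '13', '6'], ['Dallas', '60', '20']]
print(reservoir_rows)  # [['Calero', '5538', '10050'], ['Uvas', '6095', '9935']]
```

Both results have exactly the same shape: a name followed by two numbers. Nothing in the markup says which number is a maximum temperature and which is a reservoir capacity.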

XML Solves the Problems

To solve the complexity issue, XML was designed as a subset of SGML. It eliminates the features that make SGML difficult to learn and parse while retaining 90% of the power of SGML. Tools that analyze and display XML are easier to write, and are widespread and inexpensive. Since XML is a subset of SGML, it lets you devise any set of tags you wish, thus solving the problem of differentiating what would otherwise be anonymous numbers:

<temperatures>
<city name="Chicago">
    <max>13</max><min>6</min>
</city>
<city name="Dallas">
    <max>60</max><min>20</min>
</city>
</temperatures>

<water-banks>
<reservoir name="Calero">
   <current>5538</current><capacity>10050</capacity>
</reservoir>
<reservoir name="Uvas">
   <current>6095</current><capacity>9935</capacity>
</reservoir>
</water-banks>
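Once the tags carry meaning, a program can ask for values by name instead of by position. Here is a minimal sketch, again using Python's xml.etree.ElementTree; the function names read_temperatures and is_well_formed are assumptions made for this example. It also shows the stricter rulebook at work: unlike SGML, XML does not let you leave out a closing tag.

```python
import xml.etree.ElementTree as ET

TEMPERATURES = """<temperatures>
<city name="Chicago">
    <max>13</max><min>6</min>
</city>
<city name="Dallas">
    <max>60</max><min>20</min>
</city>
</temperatures>"""


def read_temperatures(markup):
    """Map each city name to its max and min, looked up by tag name."""
    root = ET.fromstring(markup)
    return {
        city.get("name"): {
            "max": int(city.findtext("max")),
            "min": int(city.findtext("min")),
        }
        for city in root.findall("city")
    }


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


temps = read_temperatures(TEMPERATURES)
print(temps["Dallas"]["max"])  # 60

# A closing </max> tag has been left out here, so the parser rejects it.
print(is_well_formed('<city name="Chicago"><max>13</city>'))  # False
```

The lookup temps["Dallas"]["max"] would still work if the <min> and <max> elements were swapped in the document, which is not true of a table cell found by counting columns.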

An XML Document and its Terminology

<p>Here is some <b>important</b> and
<i>useful</i> information.</p>

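In this document, the <p> element is the parent of five children: three runs of text, plus the <b> and <i> elements. Each of these children is the sibling of the other children. Note that the <b> and <i> elements also have children.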
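The parent, child, and sibling relationships can be inspected directly. The sketch below is an illustration using Python's xml.dom.minidom, with a hypothetical helper named describe. It lists the children of the <p> element above and then looks inside <b>.

```python
from xml.dom import minidom

SOURCE = ("<p>Here is some <b>important</b> and\n"
          "<i>useful</i> information.</p>")


def describe(node):
    """Summarize a node as ('text', content) or ('element', tag name)."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.tagName)


p = minidom.parseString(SOURCE).documentElement
children = [describe(child) for child in p.childNodes]
for child in children:
    print(child)

# The <b> element is itself a parent: its only child is a run of text.
b = p.getElementsByTagName("b")[0]
b_children = [describe(child) for child in b.childNodes]
print(b_children)  # [('text', 'important')]
```

The <p> element has five children here: text, <b>, text, <i>, and text. The text between the elements counts as a child just as much as the elements do.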

Lecture 1 Notes
