Regular Expressions

Relax NG lets you create patterns of elements, attributes, and content and then test documents against them. Regular expressions let you create patterns of characters and test strings against them. For example, anything that matches three digits, a dash, three more digits, another dash, and four digits is considered to be a valid US phone number. But let's start out with simple regular expressions.

Ordinary Strings

If I specify a pattern with plain characters in it, the pattern will match that string, and only that string. Thus <param name="pattern">bat</param> will be valid only if the content or attribute value to which it's attached is exactly the word bat. Note to Perl programmers: Relax NG patterns are presumed to be anchored with ^ at the start and $ at the end.
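The anchoring rule is easy to try outside a validator. What follows is a minimal sketch in Python rather than Relax NG: Python's re.fullmatch is used as a stand-in because, like a Relax NG pattern, it only succeeds when the whole string matches. The helper name is_valid is made up for this illustration.

```python
import re

def is_valid(pattern, value):
    """Return True only if the entire value matches the pattern."""
    # fullmatch succeeds only when the pattern covers the whole string,
    # which approximates the implicit ^ ... $ anchoring described above.
    return re.fullmatch(pattern, value) is not None

print(is_valid("bat", "bat"))     # True
print(is_valid("bat", "batman"))  # False: extra characters at the end
print(is_valid("bat", "combat"))  # False: extra characters at the start
```

An unanchored search (re.search in Python) would accept all three strings, which is the behavior Perl programmers might expect by default.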

Note: from here on in, we'll leave out the <param> tags; they're the same in all the examples. We'll just show the pattern itself.

Character Classes

Now let's say I wanted to match any of the words bat, bet, bit, or but. I would specify this pattern: b[aeiu]t. The characters in the square brackets are called a character class, and they tell the pattern matcher to match any one of those characters. That's one character. The preceding pattern will not match beat, beet, bait, or beaut. We'll see how to get around this problem later.

Let's say I wanted a pattern that matched any uppercase letter followed by any even digit. I could write this pattern: [ABCDEFGHIJKLMNOPQRSTUVWXYZ][02468], but there's an easier way to specify a contiguous range: [A-Z][02468]. You can have as many ranges as you like within a single set of square brackets. The following character class will match any uppercase letter A through G, lowercase letter r through u, or the letter m: [A-Gr-um]. Order doesn't matter; we could equally well have specified it as [mr-uA-G]. If you want to match a hyphen as part of a character class, put it at the beginning or at the end. The following will match the letter A, C, or a hyphen: [-AC], but this one matches A, B, or C and no hyphen: [A-C].
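Here is a small sketch that checks the character-class claims above. It is written in Python, with re.fullmatch standing in for Relax NG's whole-string matching; the word lists and variable names are invented for the example.

```python
import re

# One character from the class, no more and no less
words = ["bat", "bet", "bit", "but", "beat", "beet", "bait", "beaut"]
one_vowel = [w for w in words if re.fullmatch(r"b[aeiu]t", w)]
print(one_vowel)  # ['bat', 'bet', 'bit', 'but']

# An uppercase letter followed by an even digit
codes = ["A2", "Z8", "a2", "A3", "AB2"]
upper_even = [c for c in codes if re.fullmatch(r"[A-Z][02468]", c)]
print(upper_even)  # ['A2', 'Z8']

# Order inside the brackets does not matter
same = all(
    bool(re.fullmatch(r"[A-Gr-um]", ch)) == bool(re.fullmatch(r"[mr-uA-G]", ch))
    for ch in "ABGHmrstuvz"
)
print(same)  # True

# A leading hyphen is a literal hyphen; a hyphen between two characters is a range
hyphen_literal = bool(re.fullmatch(r"[-AC]", "-"))
hyphen_range = bool(re.fullmatch(r"[A-C]", "-"))
print(hyphen_literal, hyphen_range)  # True False
```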

If you want to match anything except a vowel, put a caret (^) at the beginning of the character class: [^aeiou]. Note that this will match any single character except those five letters; it will match the letter b as well as a comma or the digit 7.
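A quick sketch of the negated class, again in Python as a stand-in for a Relax NG validator. The sample characters are chosen for illustration; note that an uppercase A is accepted, since only the five lowercase letters are excluded.

```python
import re

not_vowel = re.compile(r"[^aeiou]")

# Map each sample character to whether the negated class accepts it
results = {ch: bool(not_vowel.fullmatch(ch)) for ch in ["b", ",", "7", "a", "A"]}
print(results)  # {'b': True, ',': True, '7': True, 'a': False, 'A': True}
```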

Abbreviations for Character Classes

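Some character classes are so useful that abbreviations have been developed for them:

\s whitespace (space, tab, newline, or carriage return)
\d any digit
\w any "word" character (roughly: letters, digits, and symbols, but not punctuation or separators)
\i any character that can start an XML name (a letter, underscore, or colon)
\c any character that can appear in an XML name

Their inverses are specified by \S, \D, \W, \I, and \C.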
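The following sketch shows three of these abbreviations next to their inverses. It uses Python, which is an assumption of convenience: Python's re module supports \s, \d, \w and their uppercase inverses, but it has no \i or \c, and the exact membership of each class can differ slightly from Relax NG's. The function name classes_matching is hypothetical.

```python
import re

def classes_matching(ch):
    """List which shorthand classes match the single character ch."""
    shorthands = [r"\s", r"\S", r"\d", r"\D", r"\w", r"\W"]
    return [s for s in shorthands if re.fullmatch(s, ch)]

print(" ".join(classes_matching("7")))  # \S \d \w
print(" ".join(classes_matching(" ")))  # \s \D \W
print(" ".join(classes_matching("x")))  # \S \D \w
print(" ".join(classes_matching("!")))  # \S \D \W
```

Each character matches exactly one member of every lowercase/uppercase pair, which is what "inverse" means here.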

Quantifiers

We could write the pattern for that phone number as \d\d\d-\d\d\d-\d\d\d\d, but we can also use quantifiers to make our job easier. You can specify a quantifier by following a character (or character class) with one of:

{n} exactly n occurrences
{n,m} at least n but no more than m occurrences
{n,} n or more occurrences
+ one or more occurrences (same as {1,})
? zero or one occurrences (same as {0,1})
* zero or more occurrences (same as {0,})

This information lets us rewrite the phone number as \d{3}-\d{3}-\d{4}. The pattern b[aeiou]+t will match bat, beet, beaut, and beaieueaot. Hey, it's not a perfect world.
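To see that the quantified pattern is equivalent to the spelled-out one, here is a short sketch in Python (a stand-in for Relax NG, using re.fullmatch for whole-string matching). The phone numbers are made-up sample values.

```python
import re

long_form = r"\d\d\d-\d\d\d-\d\d\d\d"
short_form = r"\d{3}-\d{3}-\d{4}"

numbers = ["555-123-4567", "555-1234-567", "5551234567"]
# Both patterns should agree on every sample
long_ok = [n for n in numbers if re.fullmatch(long_form, n)]
short_ok = [n for n in numbers if re.fullmatch(short_form, n)]
print(long_ok)   # ['555-123-4567']
print(short_ok)  # ['555-123-4567']

# + means one or more, so at least one vowel is required between b and t
words = ["bt", "bat", "beet", "beaut", "beaieueaot"]
vowel_runs = [w for w in words if re.fullmatch(r"b[aeiou]+t", w)]
print(vowel_runs)  # ['bat', 'beet', 'beaut', 'beaieueaot']
```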

Grouping and Alternates

Let's say we want to make the first three digits and hyphen of a phone number optional. We have to group those items together and follow them with a question mark: (\d{3}-)?\d{3}-\d{4}
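A brief sketch of the optional group, in Python with re.fullmatch standing in for Relax NG's anchored matching. The sample numbers are invented.

```python
import re

phone = re.compile(r"(\d{3}-)?\d{3}-\d{4}")

samples = ["555-123-4567", "123-4567", "55-123-4567", "-123-4567"]
# The group is all-or-nothing: three digits and a hyphen, or neither
accepted = [s for s in samples if phone.fullmatch(s)]
print(accepted)  # ['555-123-4567', '123-4567']
```

Without the parentheses, the question mark would apply only to the hyphen, so the first three digits would still be required.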

You use the vertical bar | symbol to specify choices. For example, let's say you want to specify that the src attribute of an <img> element must end with .jpg, .gif, or .png:

<attribute name="src">
    <data type="string">
        <param name="pattern">.+\.(jpg|gif|png)</param>
    </data>
</attribute>

Analyzing the pattern one section at a time: the attribute must match one or more of any kind of character (.+) followed by a period (\.). Notice that we need a backslash to indicate that this isn't the dot-that-means-any-character-at-all, but a real dot. If you want to match an actual parenthesis, star, question mark, plus sign, vertical bar, or backslash, it must also be preceded by a backslash.

Finally, we have a group of alternatives separated by vertical bars.
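The image pattern can be exercised with a few file names. This is a sketch in Python, not a Relax NG validation run; re.fullmatch approximates the anchored matching, and the file names are made up.

```python
import re

image_src = re.compile(r".+\.(jpg|gif|png)")

names = ["logo.png", "photos/cat.jpg", "a.b.gif", ".gif", "logo.PNG", "logojpg", "archive.png.zip"]
accepted = [n for n in names if image_src.fullmatch(n)]
print(accepted)  # ['logo.png', 'photos/cat.jpg', 'a.b.gif']

# An unescaped dot matches any character; an escaped dot matches only a period
any_char = bool(re.fullmatch(r"a.c", "abc"))
real_dot = bool(re.fullmatch(r"a\.c", "abc"))
print(any_char, real_dot)  # True False
```

Note that .gif on its own is rejected because .+ needs at least one character before the period, and logo.PNG is rejected because the pattern is case-sensitive.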

This is by no means an exhaustive explanation of regular expressions, but it should be enough to get you started, and is certainly enough to let you construct useful and non-trivial patterns.