Regular Expressions

If there’s one thing that humans do well, it’s pattern matching. You can categorize the numbers in the following list with barely any thought:

321-40-0909
302-555-8754
3-15-66
95124-0448

You can tell at a glance which of the following words can’t possibly be valid English words by the pattern of consonants and vowels:

grunion vortenal pskov trebular elibm talus

Regular expressions are Perl’s method of letting your program look for patterns:

A fraction is a series of digits followed by a slash, followed by another series of digits.
A valid name consists of a series of letters, a comma followed by zero or more spaces, followed by another series of letters.
A simple condition consists of a variable name or number, followed by one of < <= > >= == or !=, followed by another variable name or number.
A valid number consists of an optional minus sign, a run of digits, and an optional decimal point which is possibly followed by more digits.

The Simplest Patterns

The simplest pattern to look for is a single letter. If you want to see if a variable $x contains the letter e, for example, you can use this code:

$x = <STDIN>;
chomp $x;    # remove the trailing newline so the message prints on one line
if ($x =~ m/e/)
{
    print "$x contains the letter e.\n";
}
else
{
    print "$x does not contain the letter e.\n";
}

The =~ means "contains the pattern"; the pattern itself is enclosed in slashes after the m operator, which stands for match. Note that you do not put quote marks around your pattern!

Of course, you can put more than one letter in your pattern. You can look for the word eat anywhere in a word:

$x =~ m/eat/

This will successfully match the words eat, heater, and treat, but won’t match easy, metal, or hat. You may be saying, "So what? I can do the same thing with the index string function." Yes, you can, but now let’s do something that isn’t so easy to do with index:

Matching any single character

Let’s make a pattern that will match the letter e followed by any character at all, followed by the letter t. To say "any character at all", you use a dot. Here’s the pattern:

$x =~ m/e.t/

This will match better, either, and best (the dot will match the t, i, and s in those words). It will not match beast (two letters between the e and t), ketch (no letters between the e and t), or crease (no letter t at all!).
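To check these claims yourself, a short sketch along the following lines runs the pattern against each of the words above. The helper name has_e_dot_t is made up for this example; it is not a built-in function.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the word contains the letter e,
# any single character, and then the letter t; returns 0 otherwise.
sub has_e_dot_t
{
    my ($word) = @_;
    return $word =~ m/e.t/ ? 1 : 0;
}

foreach my $word ("better", "either", "best", "beast", "ketch", "crease")
{
    if (has_e_dot_t($word))
    {
        print "$word matches\n";
    }
    else
    {
        print "$word does not match\n";
    }
}
```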

Matching classes of characters

Now let’s find out how to narrow down the field a bit. We’d like to be able to find a pattern consisting of the letter b, any vowel (a, e, i, o, or u), followed by the letter t. To say "any one of a certain series of characters", you enclose them in square brackets:

$x =~ m/b[aeiou]t/

This matches words like bat, bet, rabbit, robotic, and abutment. It won’t match boot, because there are two letters between the b and t, and the class matches only a single character. (We’ll see how to check for multiple vowels later.)

There are abbreviations for establishing a series of letters: [a-f] is the same as [abcdef]; [A-Gm-p] is the same as [ABCDEFGmnop]; [0-9] matches a single digit (same as [0123456789]).
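As a small illustration of classes and ranges, the sketch below tests the vowel pattern and a class built from two ranges. The helper names and the hexadecimal-digit example are assumptions made for this illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the word contains b, one vowel, then t.
sub has_b_vowel_t
{
    my ($word) = @_;
    return $word =~ m/b[aeiou]t/ ? 1 : 0;
}

# Hypothetical helper: true if the text contains a lowercase hexadecimal
# digit, written as two ranges inside one class.
sub has_hex_digit
{
    my ($text) = @_;
    return $text =~ m/[0-9a-f]/ ? 1 : 0;
}

foreach my $word ("bat", "rabbit", "abutment", "boot")
{
    print "$word: ", (has_b_vowel_t($word) ? "match" : "no match"), "\n";
}
foreach my $text ("xyz7", "xyz")
{
    print "$text: ", (has_hex_digit($text) ? "match" : "no match"), "\n";
}
```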

You may also complement (negate) a class; you can look for the letter e followed by anything except a vowel, followed by the letter t; or any character except a capital letter:

$x =~ m/e[^aeiou]t/
$x =~ m/[^A-Z]/
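One point worth trying out is that a negated class matches any character not listed, including punctuation and spaces, not just other letters. The following sketch, with made-up helper names and test strings, shows this.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains e, one character that
# is not a lowercase vowel, then t.
sub has_e_nonvowel_t
{
    my ($text) = @_;
    return $text =~ m/e[^aeiou]t/ ? 1 : 0;
}

# Hypothetical helper: true if the text contains any character
# that is not a capital letter.
sub has_non_capital
{
    my ($text) = @_;
    return $text =~ m/[^A-Z]/ ? 1 : 0;
}

# "re-tie" matches because the dash is "not a vowel" too.
foreach my $text ("best", "beet", "re-tie")
{
    print "$text: ", (has_e_nonvowel_t($text) ? "match" : "no match"), "\n";
}

# "PERL 5" matches because the space and the digit are not capitals.
foreach my $text ("PERL", "Perl", "PERL 5")
{
    print "$text: ", (has_non_capital($text) ? "match" : "no match"), "\n";
}
```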

There are some classes that are so useful that Perl provides quick and easy abbreviations:

Abbreviation	Means	Same as
`\d`	a digit	`[0-9]`
`\w`	a "word" character; uppercase letter, lowercase letter, digit, or underscore. This is actually more like a variable name character, but let’s not quibble.	`[A-Za-z0-9_]`
`\s`	a "whitespace" character	`[ \r\t\n\f]`
And their complements...
`\D`	a non-digit	`[^0-9]`
`\W`	a non-word character	`[^A-Za-z0-9_]`
`\S`	a non-whitespace character	`[^ \r\t\n\f]`

Thus, this pattern matches a Social Security number; again, we’ll see a shorter way later on.

$x =~ m/\d\d\d-\d\d-\d\d\d\d/
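As a rough check, this sketch runs the pattern against the four numbers from the start of this page, plus one longer string. The helper name looks_like_ssn and the last test string are invented for the example. Because the pattern is not tied to the ends of the string, it also finds a match inside longer text.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the text contains something shaped like
# a Social Security number (3 digits, 2 digits, 4 digits, with dashes).
sub looks_like_ssn
{
    my ($text) = @_;
    return $text =~ m/\d\d\d-\d\d-\d\d\d\d/ ? 1 : 0;
}

my @samples = ("321-40-0909", "302-555-8754", "3-15-66", "95124-0448",
    "ID: 321-40-0909 (on file)");

foreach my $text (@samples)
{
    if (looks_like_ssn($text))
    {
        print "$text contains the pattern\n";
    }
    else
    {
        print "$text does not contain the pattern\n";
    }
}
```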

Anchors

All the patterns we’ve seen so far will find a match anywhere within a string, which is usually, but not always, what we want. For example, we might insist on a capital letter, but only as the very first character in the string. Or, we might say that an employee ID number has to end with a digit. Or, we might want to find the word go only if it is at the beginning of a word, so that we will find it in "You met another, and pfft you was gone." but won’t mistakenly find it in "I forgot my umbrella." This is the purpose of an anchor: to make sure that we are at a certain boundary before we continue the match. Unlike character classes, which match individual characters in a string, anchors do not match any character; they simply establish that we are on the correct boundaries.

The up-arrow ^ matches the beginning of a line, and the dollar sign $ matches the end of a line. Thus, ^[A-Z] matches a capital letter at the beginning of the line. Note that if we put the ^ inside the square brackets, that would mean something entirely different! A pattern \d$ matches a digit at the end of a line. These are the boundaries you will use most often. Sometimes a string has multiple lines in it (since it contains \n newlines). By default, ^ and $ then match only at the beginning and end of the whole string ($ also matches just before a final newline); if you add the m modifier after the pattern, they match at the beginning and end of every line within the string. In either case, you can use \A and \Z to indicate that the next characters must be at the beginning or end of the entire string.
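The difference between these anchors shows up only on strings that contain a newline. The sketch below uses hypothetical helper names and made-up test strings. It relies on the m modifier, written after the closing slash, which lets ^ and $ match at line breaks inside the string; without it, they match only at the ends of the whole string.

```perl
use strict;
use warnings;

# Hypothetical helpers, each wrapping one anchored pattern.
sub starts_with_capital
{
    my ($text) = @_;
    return $text =~ m/^[A-Z]/ ? 1 : 0;      # ^ without the m modifier
}

sub some_line_starts_with_capital
{
    my ($text) = @_;
    return $text =~ m/^[A-Z]/m ? 1 : 0;     # ^ with the m modifier
}

sub string_starts_with_capital
{
    my ($text) = @_;
    return $text =~ m/\A[A-Z]/m ? 1 : 0;    # \A is unaffected by m
}

sub ends_with_digit
{
    my ($text) = @_;
    return $text =~ m/\d$/ ? 1 : 0;
}

my $two_lines = "first line\nSecond line";

print "^ on 'Perl 5': ", starts_with_capital("Perl 5"), "\n";
print "\$ on 'Perl 5': ", ends_with_digit("Perl 5"), "\n";
print "^ on two lines, no m: ", starts_with_capital($two_lines), "\n";
print "^ on two lines, with m: ", some_line_starts_with_capital($two_lines), "\n";
print "\\A on two lines: ", string_starts_with_capital($two_lines), "\n";
```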

The other two anchors are \b and \B, which stand for a "word boundary" and "non-word boundary". For example, if we want to find met at the beginning of a word, we write the pattern /\bmet/, which will match "The metal plate" and "The metropolitan lifestyle", but not "Wear your bike helmet". The pattern /ing\b/ will match "Hiking is fun" and "Reading, writing, and arithmetic", but not "Gold ingots are heavy". Finally, the pattern /\bhat\b/ matches only "The hat is red", but not "That is the question" or "she hates anchovies" or "the shattered glass".

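While \b matches at the breakpoint between a word character and a non-word character (or the start or end of the string), \B matches everywhere else: between two word characters or between two non-word characters. Thus /\Bmet/ and /ing\B/ match the opposite examples of the preceding paragraph, and /\Bhat\B/ matches only "the shattered glass".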
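The four hat sentences make a handy test set. This sketch reports which of the two patterns, if either, matches each one; the helper name hat_report and its labels are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: says which of the two patterns match a sentence.
sub hat_report
{
    my ($sentence) = @_;
    my @found;
    if ($sentence =~ m/\bhat\b/)
    {
        push @found, "whole word";
    }
    if ($sentence =~ m/\Bhat\B/)
    {
        push @found, "inside word";
    }
    return join(", ", @found) || "neither";
}

foreach my $sentence ("The hat is red", "That is the question",
    "she hates anchovies", "the shattered glass")
{
    print "$sentence: ", hat_report($sentence), "\n";
}
```

Two of the sentences match neither pattern: in "That" and "hates" one side of hat is a boundary and the other is not.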

Repetition

All of these classes match only one character; what if we want to match three digits in a row, or an arbitrary number of vowels? You can follow any class or character by a repetition count:

Pattern	Matches
`/b[aeiou]{2}t/`	`b` followed by two vowels, followed by `t`
`/A\d{3,}/`	The letter `A` followed by 3 or more digits
`/[A-Z]{0,5}/`	Zero to five capital letters (write the `0` explicitly; the short form `{,5}` is accepted only by Perl 5.34 and later)
`/\w{3,7}/`	Three to seven word characters

This lets us rewrite our social security number pattern match as /\d{3}-\d{2}-\d{4}/.
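To see repetition counts at work, the sketch below checks that the long and short Social Security patterns agree on a few inputs, and tries the two-vowel pattern from the table. The helper names and the sample words are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern written both ways.
sub ssn_long
{
    my ($text) = @_;
    return $text =~ m/\d\d\d-\d\d-\d\d\d\d/ ? 1 : 0;
}

sub ssn_short
{
    my ($text) = @_;
    return $text =~ m/\d{3}-\d{2}-\d{4}/ ? 1 : 0;
}

# Hypothetical helper: b, exactly two vowels, then t.
sub has_b_two_vowels_t
{
    my ($word) = @_;
    return $word =~ m/b[aeiou]{2}t/ ? 1 : 0;
}

foreach my $text ("321-40-0909", "302-555-8754", "95124-0448")
{
    print "$text: long=", ssn_long($text), " short=", ssn_short($text), "\n";
}

# "beauty" has three vowels between the b and the t, so it does not match.
foreach my $word ("boot", "bat", "beauty")
{
    print "$word: ", (has_b_two_vowels_t($word) ? "match" : "no match"), "\n";
}
```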

There are three repetitions that are so common that Perl has special symbols for them: * means "zero or more," + means "one or more," and ? means "zero or one". Thus, if you want to look for lines consisting of last names followed by a first initial, you’d use this pattern:

/^\w+,\s*[A-Z]$/

This matches, starting at the beginning of the line, a word of one or more characters followed by a comma, zero or more whitespace characters, and a single capital letter, which must be at the end of the line.
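Here is a sketch that exercises all three symbols. The first helper wraps the pattern above. The second is one possible reading of the "valid number" description at the top of this page (an optional minus sign, a run of digits, an optional decimal point, and possibly more digits). Both helper names are made up, and the number pattern is an assumption rather than the only way to write it.

```perl
use strict;
use warnings;

# Hypothetical helper: last name, comma, optional whitespace, initial.
sub is_name_and_initial
{
    my ($text) = @_;
    return $text =~ m/^\w+,\s*[A-Z]$/ ? 1 : 0;
}

# Hypothetical helper: -? is the optional minus sign, \d+ the run of
# digits, \.? the optional decimal point, \d* any digits after it.
sub is_number
{
    my ($text) = @_;
    return $text =~ m/^-?\d+\.?\d*$/ ? 1 : 0;
}

foreach my $text ("Smith, J", "Smith,J", "Smith J", "Smith, John")
{
    print "$text: ", (is_name_and_initial($text) ? "match" : "no match"), "\n";
}
foreach my $text ("42", "-12.5", "7.", ".5", "1.2.3")
{
    print "$text: ", (is_number($text) ? "match" : "no match"), "\n";
}
```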

Grouping

So far so good, but what if we want to scan for a last name followed by an optional comma-whitespace-initial, thus matching either a bare last name like "Smith" or a full "Smith, J"? We need to put the comma, whitespace, and initial into a unit with parentheses:

/^\w+(,\s*[A-Z])?$/

There’s a side effect of grouping - whenever we use parentheses to group something, the match operation stores the matched area in a buffer which we can access later on. Let’s put a group around the last name as well:

/^(\w+)(,\s*[A-Z])?$/

The last name that is matched goes into buffer number 1, and the comma-and-initial go into buffer number 2. We access them after the match with variables $1 and $2.

print "Enter name: ";
while ($info = <STDIN>)
{
    chomp $info;
    if ($info =~ m/^(\w+)(,\s*[A-Z])?$/)
    {
        print "Last name is $1\n";
        if (defined $2)    # $2 is undefined when the optional group did not match
        {
            print "Initial is $2\n";
        }
    }
    else
    {
        print "Name not in proper format.\n";
    }
    print "Next name: ";
}

Here’s a sample run of this program:

Enter name: Smith
Last name is Smith
Next name: Smith, J
Last name is Smith
Initial is , J
Next name: Smith John
Name not in proper format.

Oops. That second one isn’t what we want. The group stores the entire matched substring, which includes the comma. We’d like to store only the initial. We can do this two ways. First, we can include yet another set of parentheses:

m/^(\w+)(,\s*([A-Z]))?$/

If we do it this way, then the capital letter is stored in $3 and the entire comma-and-initial is stored in $2. The other way to do this is to say that the outer parentheses should group, but not store any result; we do that by putting a question mark and colon immediately after the opening parenthesis:

m/^(\w+)(?:,\s*([A-Z]))?$/

In this case, the initial is in $2, since the second open parenthesis doesn’t use up one of the buffers. As you can see, patterns can very quickly become difficult to read.
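As a sketch of the non-storing group in use, the helper below (split_name is a made-up name) returns the last name and the initial. When the optional part is absent, the second buffer is left undefined, so the code tests it with defined.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (last name, initial), using a group that
# does not store its result around the optional comma-and-initial part.
# Returns an empty list if the text is not in the expected format.
sub split_name
{
    my ($text) = @_;
    if ($text =~ m/^(\w+)(?:,\s*([A-Z]))?$/)
    {
        my ($last, $initial) = ($1, $2);    # $2 is undef if no initial
        return ($last, $initial);
    }
    return ();
}

foreach my $text ("Smith", "Smith, J", "Smith John")
{
    my ($last, $initial) = split_name($text);
    if (!defined $last)
    {
        print "$text: not in proper format\n";
    }
    elsif (defined $initial)
    {
        print "$text: last name $last, initial $initial\n";
    }
    else
    {
        print "$text: last name $last, no initial\n";
    }
}
```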

There’s another way to store the buffers found by a match. Let’s say we want to match a phone number and find the area code, prefix, and number. Note that when we want to match a literal parenthesis, we have to precede it with a backslash to make it "not part of a group". We can do it this way:

$data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/;
$area_code = $1;
$prefix = $2;
$number = $3;
print "Area code is $area_code\n";

Or we can assign the results of the match to a list on the left hand side of an equal sign:

($area_code, $prefix, $number) =
   ($data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/);
print "Area code is $area_code\n";
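A fuller sketch of the list form follows, with one pair of grouping parentheses around each part to be stored. The helper name parse_phone and the sample numbers are invented for the example. When the match fails, the list is empty, so it is worth checking before using the variables.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (area code, prefix, number), or an empty
# list if the text does not contain a phone number like (302) 555-8754.
sub parse_phone
{
    my ($data) = @_;
    my @parts = ($data =~ m/\((\d{3})\)\s*(\d{3})-(\d{4})/);
    return @parts;
}

foreach my $data ("(302) 555-8754", "302-555-8754")
{
    my ($area_code, $prefix, $number) = parse_phone($data);
    if (defined $area_code)
    {
        print "Area code is $area_code\n";
        print "Prefix is $prefix\n";
        print "Number is $number\n";
    }
    else
    {
        print "$data is not in the expected format.\n";
    }
}
```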

Modifiers

You may follow a pattern by a modifier letter; the two that we’ll examine here are i and g. The i modifier gives a case-insensitive match. Thus, this pattern will match the word fish in any combination of upper and lower case, even FiSh:

m/fish/i

The other useful modifier is the g modifier, which finds all the matches in a string. You use this in conjunction with arrays. The following statement will find every occurrence of a capital letter followed by an optional dash and a single digit, and store them in the array @results:

@results = ($info=~m/([A-Z]-?\d)/g);

Matching this pattern against the string:

"Insert tabs B3, D-7, and C6 into slot A9."

would fill the @results array with the strings "B3", "D-7", "C6" and "A9".
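The following sketch wraps that statement in a helper and adds a second version that combines the g and i modifiers. Both helper names and the second test string are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every tab label found in the text.
sub find_tabs
{
    my ($info) = @_;
    my @results = ($info =~ m/([A-Z]-?\d)/g);
    return @results;
}

# Hypothetical helper: the same search, but with the i modifier added
# so that lowercase labels are found as well.
sub find_tabs_any_case
{
    my ($info) = @_;
    my @results = ($info =~ m/([A-Z]-?\d)/gi);
    return @results;
}

my @results = find_tabs("Insert tabs B3, D-7, and C6 into slot A9.");
print "Found ", scalar(@results), " labels: @results\n";

my @mixed = find_tabs_any_case("Insert tabs b3 and D-7.");
print "Found ", scalar(@mixed), " labels: @mixed\n";
```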