Chapter 9: Pattern Matching with Regular Expressions

After learning everything you've learned so far, you may think you've got a pretty good foundation in programming Perl, since you'd already be a good way through most of the concepts many other languages entail. But if you put down this book today and did nothing else with Perl beyond what I've already taught you, you'd miss one of the most powerful and flexible aspects of Perl—that of pattern matching using a technique called regular expressions. Pattern matching is more than just searching for some set of characters in your data; it’s a way of looking at data and processing that data in a manner that can be incredibly efficient and amazingly easy to program. Learning Perl without learning regular expressions is like trying to understand snowboarding without ever encountering snow. In other words, don't stop now—you're just getting to the good part!

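Today, we'll dive deep into regular expressions: why they're useful, how they're built, and how they work. Tomorrow we'll continue the discussion and cover more advanced uses of regular expressions. Today, specifically, you'll learn about the pattern-matching operators, simple patterns (character sequences, anchors, and alternation), character classes, and quantifiers.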

The Whys and Wherefores of Pattern Matching

Pattern matching is the technique of searching a string containing text or binary data for some set of characters based on a specific search pattern. When you search for a string of characters in a file using the Find command in your word processor, or when you use a search engine to look for something on the Web, you're using a simple version of pattern matching: your criterion is "find these characters." In those environments, you can often customize your criteria in particular ways, for example, to search for this or that, to search for this or that but not the other thing, to search for whole words only, or to search only for those words that are 12 points and underlined. Pattern matching in Perl, however, can be even more complicated than that. Using Perl, you can define an incredibly specific set of search criteria, and do it in an incredibly small amount of space using a pattern-definition mini-language called regular expressions.

Perl's regular expressions, often called just regexes or REs, borrow from the regular expressions used in many Unix tools, such as grep(1) and sed(1). As with many other features Perl has borrowed from other places, however, Perl includes slight changes and lots of added capabilities. If you're used to using regular expressions, you'll be able to pick up Perl's regular expressions fairly easily, since most of the same rules apply (although there are some gotchas to be aware of, particularly if you've used sophisticated regular expressions in the past).

Note: The term regular expressions may seem sort of nonsensical. They don't really seem to be expressions, nor is it easy to figure out what's regular about them. Don't get hung up on the term itself; regular expression is a term borrowed from mathematics that refers to the actual language with which you write patterns for pattern matching in Perl.

I used the example of the search engine and the Find command earlier to describe the sorts of things that pattern matching can do. It’s important for you not to get hung up on thinking that pattern matching is only good for plain old searching. The sorts of things regular expressions can do in Perl include:

This is only a partial list, of course—you can apply Perl's regular expressions to all kinds of tasks. Generally, if there's a task for which you'd want to iterate over a string or over your data in another language, that task is probably better solved in Perl using regular expressions. Many of the operations you learned about yesterday for finding bits of strings can be better done with patterns.

Pattern Matching Operators and Expressions

To use pattern matching in Perl, you figure out what you want to find, you write a regular expression to find it, and then you stick that pattern in a situation where the result of finding (or not finding) that pattern makes sense. As with other aspects of Perl, where you put a pattern and what context you use it in determines how that pattern is used.

We'll start with a fairly simple case—patterns in a boolean scalar context, where if a string contains the pattern, the expression returns true.

To construct patterns in this way, you use two operators: the regular expression operator m// and the pattern-match operator =~, like this:

if ($string =~ m/foo/) {
    # do something...
}

What that test inside the if says is: if the string contained in $string contains the pattern foo, return true. Note that the =~ operator is not an assignment operator, even though it looks like one. =~ is used exclusively for pattern matching, and means, effectively, "find the pattern on the right somewhere in the string on the left." You'll sometimes find =~ called the binding operator.

The pattern itself is contained between the slashes in m//. This particular pattern is one of the simplest patterns you can create—it’s just three specific characters in sequence (you'll learn more about what constitutes a match and what doesn't later on). The pattern could just as easily be m/.*\d+/ or m/^[+-]?\d+\.?\d*$/ or some other seemingly incomprehensible set of characters (don't panic yet; you'll learn how to decipher those patterns soon).

For these sorts of patterns, the m is optional and can be left off the pattern itself (and usually is). In addition, you can leave off the variable and the =~ if you want to search the contents of the default variable $_. Commonly in Perl, you'll see shorthand pattern matching like this one:

if (/^\d+/) { # ...

Which is equivalent to

if ($_ =~ m/^\d+/) { # ...
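Here's a minimal sketch of the two forms side by side. The sample string and the variable names are made up for illustration; the for loop is there only to put the string into $_.

```perl
use strict;
use warnings;

my $string = "42 is the answer";   # sample data, made up for this sketch

# Long form: bind the pattern to a named variable with =~
my $long_form = ($string =~ m/^\d+/) ? "match" : "no match";

# Short form: no m, no variable, no =~; the pattern is tested against $_
my $short_form;
for ($string) {                    # inside this loop, $_ holds $string
    $short_form = /^\d+/ ? "match" : "no match";
}

print "long form:  $long_form\n";
print "short form: $short_form\n";
```

Both lines report a match, because both tests apply the same pattern to the same data.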

You've already learned a simple case of this yesterday with the grep function, which can use patterns to find a bit of a string inside the $_ list element:

@foothings = grep /foo/, @strings;

That line, in turn, is equivalent to this long form:

@foothings = grep { $_ =~ /foo/ } @strings;
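To convince yourself that the two forms really do the same thing, you could run both over the same list. The list below is invented sample data.

```perl
use strict;
use warnings;

my @strings = ("food", "bar", "tofoo", "Foo", "buffoon");  # made-up sample data

# Short form: grep tests the pattern against each element through $_
my @short = grep /foo/, @strings;

# Long form: the same test written out as a block
my @long = grep { $_ =~ /foo/ } @strings;

print "short: @short\n";
print "long:  @long\n";
```

Both lists come out as food, tofoo, and buffoon. "Foo" is left out because of the capital F.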

As we work through today's lesson, you'll learn different ways of using patterns in different contexts and for different reasons. Much of the work of learning pattern matching, however, involves actually learning the regular expression syntax to build patterns, so let's stick with this one situation for now.

Simple Patterns

We'll start with some of the simplest and most basic patterns you can create: patterns that match specific sequences of characters, patterns that match only at specific places in a string, and patterns that combine alternatives using what's called alternation.

Character Sequences

One of the simplest patterns is just a sequence of characters you want to match, like this:

/foo/

/this or that/

/ /

/Laura/

/patterns that match specific sequences/

All of these patterns will match if the data contains those characters in that order. All the characters must match, including spaces. The word or in the second pattern doesn't have any special significance (it’s not a logical or); that pattern will only match if the data contains the string this or that somewhere inside it.

Note that characters in patterns can be matched anywhere in a string. Word boundaries are not relevant for these patterns—the pattern /if/ will match in the string "if wishes were horses" and in the string "there is no difference." The pattern /if /, however, because it contains a space, will only match in the first string where the characters i, f, and the one space occur in that order.

Upper- and lowercase are relevant for characters: /kazoo/ will only match kazoo and not Kazoo or KAZOO. To make a particular search case-insensitive, you can use the i option after the pattern itself (the i indicates ignore case), like this:

/kazoo/i # search for any upper and lowercase versions
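As a quick check of those rules, this sketch runs the patterns from the last few paragraphs against the same strings and collects the results. The @report array is just a convenience for this example.

```perl
use strict;
use warnings;

my @report;   # collected results, so they can be printed in one go

# Literal characters match anywhere, and spaces count
for my $s ("if wishes were horses", "there is no difference") {
    my $plain = ($s =~ /if/)  ? "matches" : "does not match";
    my $space = ($s =~ /if /) ? "matches" : "does not match";
    push @report, "/if/ $plain, /if / $space: $s";
}

# Case matters unless you add the i option
for my $s ("kazoo", "Kazoo", "KAZOO") {
    my $exact  = ($s =~ /kazoo/)  ? "matches" : "does not match";
    my $nocase = ($s =~ /kazoo/i) ? "matches" : "does not match";
    push @report, "/kazoo/ $exact, /kazoo/i $nocase: $s";
}

print "$_\n" for @report;
```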

Alternately, you can also create patterns that will search for either upper- or lowercase letters, as you'll learn about in the next section.

You can include most alphanumeric characters in patterns, including string escapes for binary data (octal and hex escapes). There are a number of characters that you cannot match without escaping them. These characters are called metacharacters and refer to bits of the pattern language and not to the literal character. These are the metacharacters to watch out for in patterns:

^ $ . + ? * { ( ) \ / | [

If you want to actually match a metacharacter in a string—for example, search for an actual question mark—you can escape it using a backslash, just as you would in a regular string:

/\?/ # matches question mark
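Here's a small sketch of what escaping buys you. The strings are invented, and the last two tests lean on the dot metacharacter, which matches any single character when it isn't escaped.

```perl
use strict;
use warnings;

# Sample strings, made up for this sketch (single quotes keep $ literal)
my $question = 'Really?';
my $plain    = 'Really';
my $price    = 'costs $5';
my $odd      = '3x14';

my $has_qmark   = ($question =~ /\?/)    ? 1 : 0;   # literal ? is present
my $no_qmark    = ($plain    =~ /\?/)    ? 1 : 0;   # no ? in this string
my $has_dollar  = ($price    =~ /\$5/)   ? 1 : 0;   # literal $ followed by 5
my $dot_any     = ($odd      =~ /3.14/)  ? 1 : 0;   # unescaped . accepts the x
my $dot_literal = ($odd      =~ /3\.14/) ? 1 : 0;   # escaped . needs a real dot

print "$has_qmark $no_qmark $has_dollar $dot_any $dot_literal\n";
```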

Matching at Word or Line Boundaries

When you create a pattern to match a sequence of characters, those characters can appear anywhere inside the string and the pattern will still match. But sometimes you want a pattern to match those characters only if they occur at a specific place—for example, match /if/ only when it’s a whole word, or /kazoo/ only if it occurs at the start of the line (that is, the beginning of the string).

Note: I'm making an assumption here that the data you're searching is a line of input, where the line is a single string with no embedded newline characters. Given that assumption, the terms string, line, and data are effectively interchangeable. Tomorrow, we'll talk about how patterns deal with newlines.

To match a pattern at a specific position, you use pattern anchors. To anchor a pattern at the start of the string, use ^:

/^Kazoo/ # match only if Kazoo occurs at the start of the line

To match at the end of the string, use $:

/end$/ # match only if end occurs at the end of the line

Once again, think of the pattern as a sequence of things in which each part of the pattern must match the data you're applying it to. Perl's pattern matching routines actually begin searching at a position just before the first character, which will match ^. They then move to each character in turn until the end of the line, where $ matches. If there's a newline at the end of the string, the position marked by $ is just before that newline character.

So, for example, let's see what happens when you try to match the pattern /^foo/ to the string "to be or not to be" (which, obviously, won't match, but let's try it anyhow). Perl starts at the beginning of the line, which matches the ^ character. That part of the pattern is true. It then tests the first character. The pattern wants to see an f there, but it got a t instead, so the pattern stops and returns false.

What happens if you try to apply the pattern to the string "fob"? The match will get farther—it'll match the start of the line, the f and the o, but then fail at the b. And keep in mind that /^foo/ will not match in the string " foo"—the foo is not at the very start of the line where the pattern expects it to be. It will only match when all four parts of the pattern match the string.

Some interesting but potentially tricky uses of ^ and $—can you guess what these patterns will match?

/^/

/^1$/

/^$/

The first pattern matches any strings that have a start of the line. It would be very weird strings indeed that didn't have the start of a line, so this pattern will match any string data whatsoever, even the empty string.

The second one wants to find the start of the line, the numeral 1, and then the end of the line. So it'll only match if the string contains 1 and only 1—it won't match "123" or "foo 1" or even " 1 ".

The third pattern will match only if the start of the line is immediately followed by the end of the line—that is, if there is no actual data. This pattern will only match an empty line. Keep in mind that because $ occurs just before the newline character, this last pattern will match both "" and "\n".
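If you'd like to check those three answers yourself, a sketch like this one will do it. The check subroutine and its labels are invented for the example; the labels exist only because some of the sample strings contain a newline and wouldn't print nicely.

```perl
use strict;
use warnings;

my @rows;

# Record which of the three patterns match a string.
# $label is a printable name for the sample.
sub check {
    my ($label, $s) = @_;
    my $start = ($s =~ /^/)   ? "yes" : "no";
    my $one   = ($s =~ /^1$/) ? "yes" : "no";
    my $empty = ($s =~ /^$/)  ? "yes" : "no";
    push @rows, "$label: /^/ $start, /^1\$/ $one, /^\$/ $empty";
}

check('empty string',   "");
check('newline only',   "\n");
check('1',              "1");
check('1 plus newline', "1\n");
check('123',            "123");
check('foo 1',          "foo 1");
check('space 1 space',  " 1 ");

print "$_\n" for @rows;
```

Note that "1\n" still matches /^1$/, and "\n" still matches /^$/, because $ is happy to match just before a trailing newline.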

Another boundary to match is a word boundary—where a word boundary is considered the position between a word character (a letter, number, or underscore) and some other character such as whitespace or punctuation. A word boundary is indicated using a \b escape. So /\bif\b/ will match only when the whole word "if" exists in the string—but not when the characters i and f appear in the middle of a word (as in "difference."). You can use \b to refer to both the start and end of a word; /\bif/, for example, will match in both "if I were king" and "that result is iffy," and even in "As if!", but not in "bomb the aquifer" or "the serif is obtuse."

You can also search for a pattern not in a word boundary using the \B escape. With this, /\Bif/ will match only when the characters i and f occur inside a word and not at the start of a word.
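To see the three boundary patterns next to each other, here's a sketch that runs them over the example strings from this section and sorts the strings into lists.

```perl
use strict;
use warnings;

my @strings = (
    "if I were king",
    "that result is iffy",
    "As if!",
    "bomb the aquifer",
    "the serif is obtuse",
    "there is no difference",
);

my @whole_word = grep { /\bif\b/ } @strings;   # "if" as a complete word
my @word_start = grep { /\bif/ }   @strings;   # "if" at the start of a word
my @inside     = grep { /\Bif/ }   @strings;   # "if" not at the start of a word

print "whole word:    ", join("; ", @whole_word), "\n";
print "start of word: ", join("; ", @word_start), "\n";
print "inside a word: ", join("; ", @inside),     "\n";
```

The first list holds "if I were king" and "As if!", the second adds "that result is iffy", and the third holds the aquifer, serif, and difference strings.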

Matching Alternatives

Sometimes, when you're building a pattern, you may want to search for more than one pattern in the same string and then test based on whether all the patterns were found, or perhaps any of the set of patterns was found. You could, of course, do this with the regular Perl logical expressions for boolean AND (&& or and) and OR (|| or or) with multiple pattern-matching expressions, something like this:

if (($in =~ /this/) || ($in =~ /that/)) { ...

Then, if the string contains /this/ or if it contains /that/, the whole test will return true.

In the case of an OR search (match this pattern or that pattern—either one will work), however, there is a regular expression metacharacter you can use: the pipe character (|). So, for example, the long if test in that example could just be written as:

if ($in =~ /this|that/) { ...

Using the | character inside a pattern is officially known as alternation because it allows you to match alternate patterns. A true value for the pattern occurs if any of the alternatives match.

Any anchoring characters you use with an alternation character apply only to the pattern on the same side of the pipe. So, for example, the pattern /^this|that/ means "this at the start of the line" or "that anywhere," and not "either this or that at the start of a line." If you wanted the latter form you could use /^this|^that/, but a better way is to group your patterns using parentheses:

/^(this|that)/

For this pattern, Perl first matches the start of the line, and then tries to match all the characters in "this." If it can't match "this", it'll then back up to the start of the line and try to match "that." For a pattern like /^this|that/, it'll first try to match everything on the left side of the pipe (start of line, followed by this), and if it can't do that, it'll back up and search the entire string for "that".

An even better version would be to pull everything the two alternatives have in common out of the group (not just the ^ that matches the beginning of the line, but also the th characters) and group only the parts that differ, like this:

/^th(is|at)/

This last version means that Perl won't even try the alternation unless th has already been matched at the start of the line, and then there will be a minimum of backing up to match the pattern. With regular expressions, the less work Perl has to do to match something, the better.

You can use grouping for any kinds of alternation within a pattern. For example, /(1st|2nd|3rd|4th) time/ will match "1st time", "2nd time", and so on—as long as the data contains one of the alternations inside the parentheses and the string " time" (note the space).
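The difference between the ungrouped and grouped forms is easy to miss, so here's a sketch that runs all three versions over the same made-up strings.

```perl
use strict;
use warnings;

my @strings = ("this is it", "is that so", "that is it", "then what");

# ^ applies only to "this"; "that" may appear anywhere
my @loose    = grep { /^this|that/ }   @strings;

# ^ applies to the whole group
my @grouped  = grep { /^(this|that)/ } @strings;

# same matches as @grouped, with the shared "th" pulled out of the group
my @factored = grep { /^th(is|at)/ }   @strings;

print "loose:    ", join("; ", @loose),    "\n";
print "grouped:  ", join("; ", @grouped),  "\n";
print "factored: ", join("; ", @factored), "\n";
```

Only the first pattern lets "is that so" through, because its "that" isn't at the start of the line.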

Matching Groups of Characters

So far, so good? The regular expressions we've been building so far shouldn't strike you as being that complex, particularly if you look at each pattern in the way that Perl looks at it, character by character and alternate by alternate, taking grouping into effect. Now we're going to start looking at some of the shortcuts that regular expressions provide for describing and grouping various kinds of characters.

Character Classes

Say you had a string, and you wanted to match one of five words in that string: pet, get, met, set, and bet. You could do this:

/pet|get|met|set|bet/

That would work. At each position in the string, Perl would try pet, then get, then met, and so on, moving on to the next position only after every alternative had failed. A shorter way, both in the number of characters you have to type and in the work Perl has to do, would be to group characters so that we don't duplicate the et part each time:

/(p|g|m|s|b)et/

In this case, Perl searches through the entire string for p, g, m, s, or b, and if it finds one of those, it'll try to match et just after it. Much more efficient!

This sort of pattern, where you have lots of alternates of single characters, is such a common case that there's regular expression syntax for it. The set of alternating characters is called a character class, and you enclose it inside brackets. So, for example, that same pet/get/met pattern would look like this using a character class:

/[pgmsb]et/

That's a savings of at least a couple of characters, and it’s even slightly easier to read. Perl will do the same thing as the alternation character, in this case: it'll look for any of the characters inside the character class before testing any of the characters outside it.

The rules for the characters that can appear inside a character class are different from those that can appear outside of one. Most of the metacharacters become plain ordinary characters inside a character class (the exceptions being a right bracket, which needs to be escaped for obvious reasons; a backslash, which is still the escape character; a caret (^), which can't appear first; and a hyphen, which has a special meaning inside a character class). So, for example, a pattern to match on punctuation at the end of a sentence (punctuation after a word boundary and before two spaces) might look like this:

/\b[.!?]  /

Whereas . and ? have special meanings outside the character class, here they're plain old characters.
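Here's a sketch that compares the alternation and character class versions on a made-up word list, and then shows . and ? behaving as ordinary characters inside brackets.

```perl
use strict;
use warnings;

my @words = qw(pet get met set bet let wet);

my @alternation = grep { /(p|g|m|s|b)et/ } @words;
my @class       = grep { /[pgmsb]et/ }     @words;

# Inside brackets, . and ? are plain characters, so "What" and "abc" fail
my @punct = grep { /[.!?]/ } ("What", "What?", "a.b", "abc");

print "alternation: @alternation\n";
print "class:       @class\n";
print "punctuation: @punct\n";
```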

Ranges

What if you wanted to match, say, all the lowercase characters a through f (as you might in a hexadecimal number, for example)? You could do:

/[abcdef]/

Looks like a job for a range, doesn't it? You can do ranges inside character classes, but you don't use the range operator .. that you learned about on Day 4. Regular expressions use a hyphen for ranges instead (which is why you have to backslash it if you actually want to match a hyphen). So, for example, lowercase a through f looks like this:

/[a-f]/

You can use any range of numbers or characters, as in /[0-9]/, /[a-z]/ or /[A-Z]/. You can even combine them: /[0-9a-z]/ will match the same thing as /[0123456789abcdefghijklmnopqrstuvwxyz]/.
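As a sketch, here are a few ranges tried against a short list of single characters (invented for the example), including a class with a backslashed hyphen that matches an actual hyphen.

```perl
use strict;
use warnings;

my @chars = qw(a f g 7 Z -);

my @a_to_f   = grep { /[a-f]/ }    @chars;   # the range a through f
my @alnum    = grep { /[0-9a-z]/ } @chars;   # digits and lowercase letters
my @a_dash_f = grep { /[a\-f]/ }   @chars;   # just a, a hyphen, and f

print '/[a-f]/:    ', "@a_to_f\n";
print '/[0-9a-z]/: ', "@alnum\n";
print '/[a\-f]/:   ', "@a_dash_f\n";
```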

Negated Character Classes

Brackets define a class of characters to match in a pattern. You can also define a set of characters not to match using negated character classes—just make sure the first character in your character class is a caret (^). So, for example, to match anything that isn't an A or a B, use:

/[^AB]/

Note that the caret inside a character class is not the same as the caret outside one. The former is used to create a negated character class, and the latter is used to mean the beginning of a line.

If you want to actually search for the caret character inside a character class, you're welcome to—just make sure it’s not the first character or escape it (it might be best just to escape it either way to cut down on the rules you have to keep track of):

/[\^?.%]/ # search for ^, ?, ., %

You'll most likely end up using a lot of negated character classes in your regular expressions, so keep this syntax in mind. Note one subtlety: negated character classes don't negate the entire value of the pattern. If /[12]/ means "return true if the data contains 1 or 2", /[^12]/ does not mean "return true if the data doesn't contain 1 or 2." If that were the case, you'd get a match even if the string in question was empty. What negated character classes really mean is "match any character that's not these characters." There must be at least one actual character to match for a negated character class to work.
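That subtlety is easier to see with data. This sketch compares the negated class /[^12]/ with a plain logical not applied to /[12]/; the sample strings are invented.

```perl
use strict;
use warnings;

my @samples = ("", "12", "1221", "123", "abc");

# Contains at least one character that is neither 1 nor 2
my @negated_class = grep { /[^12]/ } @samples;

# Contains no 1 and no 2 at all (the match itself is negated with !)
my @no_1_or_2     = grep { !/[12]/ } @samples;

print "negated class:   ", join(", ", map { "'$_'" } @negated_class), "\n";
print "negated pattern: ", join(", ", map { "'$_'" } @no_1_or_2),     "\n";
```

The negated class picks out '123' and 'abc', while the negated match picks out the empty string and 'abc'.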

Special Classes

If character class ranges are still too much for you to type, there are also special character classes (and negated character classes) that have their own escape codes. You'll see these a lot in regular expressions, particularly those that match numbers in specific formats. Note that these special codes don't need to be enclosed between brackets; you can use them all by themselves to refer to that class of characters.

Table 9.1 shows the list of special character class codes:

Table 9.1. Character Class Codes.
Code   Equivalent character class   What it means
\d     [0-9]                        Any digit
\D     [^0-9]                       Any character not a digit
\w     [0-9a-zA-Z_]                 Any "word character"
\W     [^0-9a-zA-Z_]                Any character not a word character
\s     [ \t\n\r\f]                  Any whitespace character (space, tab, newline, carriage return, form feed)
\S     [^ \t\n\r\f]                 Any non-whitespace character

The notion of word characters (\w and \W) is a bit mystifying: why is an underscore considered a word character, but punctuation isn't? In reality, word characters have little to do with words, but are the valid characters you can use in variable names: numbers, letters, and underscores. Any other characters are not considered word characters.

You can use these character codes anywhere you need a specific type of character. For example, the \d code refers to any digit. With \d, you could create patterns that match any three digits /\d\d\d/, or, perhaps, any three digits, a dash, and any four digits, to represent a phone number such as 555-1212: /\d\d\d-\d\d\d\d/. All this repetition isn't necessarily the best way to go, however, as you'll learn in a bit when we cover quantifiers.
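Here's a sketch of those two \d patterns at work on some made-up strings.

```perl
use strict;
use warnings;

# Three digits in a row, anywhere in the string
my @three_digits = grep { /\d\d\d/ } ("abc", "a1b2c3", "room 101");

# Three digits, a dash, four digits
my @phone = grep { /\d\d\d-\d\d\d\d/ }
    ("555-1212", "my number is 555-1212.", "55-1212", "555 1212");

print 'three digits: ', join("; ", @three_digits), "\n";
print 'phone number: ', join("; ", @phone),        "\n";
```

Note that the phone pattern isn't anchored, so it finds the number in the middle of a sentence too.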

Matching Any Character with . (dot)

The broadest possible character class you can get is to match based on any character whatsoever. For that, you'd use the dot character (.). So, for example, the following pattern will match lines that contain one character and one character only:

/^.$/

You'll use the dot more often in patterns with quantifiers (which you'll learn about next), but the dot can be used to indicate fields of a certain width, for example:

/^..:/

This pattern will match only if the line starts with two characters and a colon.
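A quick sketch of both dot patterns, using invented strings:

```perl
use strict;
use warnings;

# Exactly one character between the start and end of the line
my @one_char = grep { /^.$/ } ("a", " ", "ab", "");

# Any two characters at the start of the line, then a colon
my @two_then_colon = grep { /^..:/ } ("ab: rest", "a: rest", "abc: rest");

print "one character:       ", join(", ", map { "'$_'" } @one_char), "\n";
print "two chars and colon: ", join(", ", map { "'$_'" } @two_then_colon), "\n";
```

A single space counts as a character for the first pattern, but the empty string doesn't match because the dot has nothing to match.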

More about the dot operator after we pause for an example.

An Example: Optimizing Numspeller

Remember the numspeller script from yesterday? This was the script that took a single-digit number and converted it into a word. You may remember that when I described the numspeller script, I mentioned it would be easier to write using regular expressions. So, now that you know something of regular expressions, let's rewrite the script to use regular expressions instead of all those if statements.

And, while we're at it, why don't we revise the part of number speller that verifies the input. We can do a lot more in terms of input validation with regular expressions, to the point of absurdity. In fact, we'll approach absurdity with the input validation in this script. This version tests for a number of things that could be entered, and replies with various comments (many of them sarcastic):

% numspeller2.pl
Enter the number you want to spell(0-9): foo
You can't fool me. There are letters in there.
Enter the number you want to spell(0-9): 45foo
You can't fool me. There are letters in there.
Enter the number you want to spell(0-9): ###
huh? That *really* doesn't look like a number
Enter the number you want to spell(0-9): -45
That's a negative number. Positive only, please!
Enter the number you want to spell(0-9): 789
Too big! 0 through 9, please.
Enter the number you want to spell(0-9): 4
Thanks!
Number 4 is four
Try another number (y/n)?: x
y or n, please
Try another number (y/n)?: n
%

Instead of showing you this script and then working through it line by line, let's go in the reverse direction: I'm going to show you sections from both the old and new versions of numspeller, explain them, and then at the end, I'll list the whole thing so you can get the big picture.

Let's start with the loop that accepts a number as input. This is what the loop looked like in the old version of numspeller:

while () {
    print 'Enter the number you want to spell: ';
    chomp($num = <STDIN>);
    if ($num gt "9" ) { # test for strings
        print "No strings.  0 through 9 please..\n";
        next;
    }
    if ($num > 9) { # numbers w/more than 1 digit
        print "Too big. 0 through 9 please.\n";
        next;
    }
    if ($num < 0) { # negative numbers
        print "No negative numbers.  0 through 9 please.\n";
        next;
    }
    last;
}

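We can easily replace the three tests in this loop with regular expressions that make more sense, and we can also test for more sophisticated kinds of things. Our new loop will test for three major groups of things: a single digit (the correct input), input that contains any nondigit character, and numbers greater than 9.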

That second test can then be broken into sub-tests for things like alphabetic characters, negative numbers (starting with -), floating-point numbers (with a decimal point), or totally bizarre characters. Here's the new version of our loop, which also makes use of the $_ variable to save us some typing in the pattern matching tests:

while () {
    print 'Enter the number you want to spell(0-9): ';
    chomp($_ = <STDIN>);
    if (/^\d$/) {  # correct input
        print "Thanks!\n";
        last;
    } elsif (/^$/) {  # empty input
        print "You didn't enter anything.\n";
    } elsif (/\D/) { # nonnumbers
        if (/[a-zA-Z]/) { # letters
            print "You can't fool me.  There are letters in there.\n";
        } elsif (/^-\d/) { # negative numbers
            print "That's a negative number.  Positive only, please!\n";
        } elsif (/\./) { # decimals
            print "That looks like it could be a floating-point number.\n";
            print "I can't spell a floating-point number.  Try again.\n";
        } elsif (/[\W_]/) {  # other chars
            print "huh?  That *really* doesn't look like a number\n";
        }
    } elsif ($_ > 9) {  # digits only, but more than one of them
        print "Too big!  0 through 9, please.\n";
    }
}

Let's look at those regular expressions, line by line, so you know what's getting matched here:
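One way to see what each test catches is to run the same patterns against a handful of inputs. The sketch below wraps them in a subroutine I've called classify (a name invented here, not part of numspeller) that returns a short label instead of printing a message. The letters test is written as [a-zA-Z] so that only letters count.

```perl
use strict;
use warnings;

# classify is a hypothetical helper for this sketch, not part of numspeller
sub classify {
    my ($input) = @_;
    for ($input) {                            # test everything against $_
        return "ok"      if /^\d$/;           # exactly one digit
        return "empty"   if /^$/;             # nothing entered
        if (/\D/) {                           # at least one nondigit
            return "letters"  if /[a-zA-Z]/;
            return "negative" if /^-\d/;
            return "decimal"  if /\./;
            return "other"    if /[\W_]/;     # punctuation, underscore, ...
        }
        return "too big" if $_ > 9;           # digits only, more than one
    }
    return "unknown";                         # for example "00"
}

my @inputs = ("4", "", "foo", "45foo", "-45", "3.5", "###", "789");
my @labels = map { classify($_) } @inputs;
print "'$inputs[$_]' => $labels[$_]\n" for 0 .. $#inputs;
```

The order of the tests matters: "45foo" and "-45" both contain nondigits, so they never reach the numeric comparison at the bottom.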

The next part of the old numspeller script was a set of if...elsif statements that compared the input value to a number string. Using regular expressions, the default variable $_, and logical expressions used as conditionals, we can reduce the chain of ifs that looked like this:

if ($num == 1) { print 'one'; }
elsif ($num == 2) { print 'two'; }
elsif ($num == 3) { print 'three'; }
elsif ($num == 4) { print 'four'; }
# ... other numbers removed for space

Into a set of logicals that look like this:

/1/ && print 'one';
/2/ && print 'two';
/3/ && print 'three';
/4/ && print 'four';
# ... and so on

Cool, eh? It’s almost switch-like, and, arguably, easier to read.
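One thing to keep in mind is that these patterns aren't anchored, so the trick behaves like a switch only because the input loop has already guaranteed a single digit. Here's a sketch that shows what happens otherwise. The spell_digits subroutine is invented for this example, and it collects the words rather than printing them.

```perl
use strict;
use warnings;

# spell_digits is a hypothetical helper: it collects words instead of printing
sub spell_digits {
    my ($input) = @_;
    my @words;
    for ($input) {
        # the right side of && runs only if the pattern on the left matches
        /1/ && push @words, 'one';
        /2/ && push @words, 'two';
        /3/ && push @words, 'three';
        /4/ && push @words, 'four';
    }
    return "@words";
}

print "2:  ", spell_digits("2"),  "\n";   # one pattern matches
print "13: ", spell_digits("13"), "\n";   # two patterns match
print "7:  ", spell_digits("7"),  "\n";   # none of these four match
```

With "13" as input, both /1/ and /3/ succeed, so you get "one three" where a real switch would have stopped at the first match.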

Finally, we'll rewrite our little yes-or-no loop to repeat the entire script. The old version looked like this:

while () {
    print 'Try another number (y/n)?: ';
    chomp ($exit = <STDIN>);
    $exit = lc $exit;
    if ($exit ne 'y' && $exit ne 'n') {
        print "y or n, please\n";
    }
    else { last; }
}

There's actually nothing terribly wrong with this version, but since this is the pattern matching lesson, let's use pattern matching here, too:

while () {
    print 'Try another number (y/n)?: ';
    chomp ($exit = <STDIN>);
    $exit = lc $exit;
    if ($exit =~ /^[yn]$/) {  # exactly y or n, nothing else
        last;
    }
    else {
        print "y or n, please\n";
    }
}

Note the differences between this loop and the input loop. In the input loop, we stored the input in the $_ variable, so we could just put the pattern into the test itself. Here we're matching against the string in the $exit variable, so we have to use the =~ operator instead. In the pattern itself, we test to see if what was typed was either y or n (Y and N will get converted to lowercase with the lc function), and if so, exit the loop and return to the outer loop, which repeats the script if necessary.
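To see what anchoring does to a test like this one, compare a pattern anchored only at the start with one anchored at both ends. This is a sketch with made-up answers, assumed to be lowercased already.

```perl
use strict;
use warnings;

my @answers = ("y", "n", "yes", "no", "x", "");   # already lowercased

my @start_only = grep { /^[yn]/ }  @answers;   # begins with y or n
my @both_ends  = grep { /^[yn]$/ } @answers;   # is exactly y or n

print "anchored at start only: @start_only\n";
print "anchored at both ends:  @both_ends\n";
```

The distinction matters here because the outer loop compares $exit to "n" exactly. An answer of "no" would get past a pattern anchored only at the start, and the script would then carry on as if you'd said yes.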

Note: In this example, I've used quite a few regular expressions, many of them gratuitous. It’s worth mentioning at this point that you shouldn't necessarily use regular expressions everywhere simply because they're cool. The Perl regular expression engine is really powerful for really powerful things, but there is some overhead in terms of efficiency if you use it for simple things. Simple tests and if statements will often execute faster than regular expressions. If you're concerned about the efficiency of your code, keep that in mind.

Listing 9.1 shows the full code for the new version of numspeller.pl:

Listing 9.1. The numspeller2.pl Script.

#!/usr/bin/perl -w
# numberspeller:  prints out word approximations of numbers
# simple version, only does single-digits

$exit = "";  # whether or not to exit the script.

while ($exit ne "n") {

    while () {
        print 'Enter the number you want to spell(0-9): ';
        chomp($_ = <STDIN>);
        if (/^\d$/) {
            print "Thanks!\n";
            last;
        } elsif (/^$/) {
            print "You didn't enter anything.\n";
        } elsif (/\D/) {        # nonnumbers
            if (/[a-zA-Z]/) { # letters
                print "You can't fool me.  There are letters in there.\n";
            } elsif (/^-\d/) { # negative numbers
                print "That's a negative number.  Positive only, please!\n";
            } elsif (/\./) { # decimals
                print "That looks like it could be a floating-point number.\n";
                print "I can't spell a floating-point number.  Try again.\n";
            } elsif (/[\W_]/) {  # other chars
                print "huh?  That *really* doesn't look like a number\n";
            }
        } elsif ($_ > 9) {
            print "Too big!  0 through 9, please.\n";
        }
    }

    print "Number $_ is ";
    /1/ && print 'one';
    /2/ && print 'two';
    /3/ && print 'three';
    /4/ && print 'four';
    /5/ && print 'five';
    /6/ && print 'six';
    /7/ && print 'seven';
    /8/ && print 'eight';
    /9/ && print 'nine';
    /0/ && print 'zero';
    print "\n";

    while () {
        print 'Try another number (y/n)?: ';
        chomp ($exit = <STDIN>);
        $exit = lc $exit;
        if ($exit =~ /^[yn]$/) {
            last;
        }
        else {
            print "y or n, please\n";
        }
    }
}

Matching Multiple Instances of Characters

Ready for more? The second group of regular expression syntax to explore is that of quantifiers. Whereas the patterns you've seen up to now refer to individual things or groups of individual things, quantifiers allow you to indicate multiple instances of things, or potentially no things. They get their name because they indicate some quantity of characters or groups of characters in the pattern you're looking for.

Perl's regular expressions include three quantifier metacharacters: ?, *, and +. Each refers to some multiple of the character or group that appears just before it in the pattern.

Optional Characters with ?

Let's start with ?, which matches a sequence that may or may not have the character immediately preceding it (that is, it matches zero or one instance of that character). So, for example, take this pattern:

/be?ar/

The question mark in that pattern refers to the character preceding it (e). This pattern would match with the string "step up to the bar" and with the string "grin and bear it"—because both "bar" and "bear" will match this pattern. The string you're searching must have the b, the a, and the r, but the e is optional.

Once again, think in terms of how the string is processed. The b is matched first. Then the next character is tested. If it’s an e, no problem, we move on to the next character both in the string and in the pattern (the a). If it’s not an e, that's still no problem, we move onto the next character in the pattern to see if it matches instead.

You can create groups of optional characters with parentheses:

/bamboo(zle)?/

The parentheses make that whole group of characters (zle) optional, so this pattern will match both bamboo and bamboozle. The thing just before the ? is the optional thing, be it a single character or a group.

Note: Why bother creating a pattern like this? It would seem that the (zle) part of this pattern is irrelevant, and that just plain /bamboo/ would work just as well, with fewer characters. In these easy cases, where we're just trying to find out whether something matches, yes or no, it doesn't matter. Tomorrow, when you learn how to extract the thing that matched and create more complex patterns, the distinction will be more important.

You can also use character classes with ?:

/thing \d?/

This pattern will match the strings "thing 1", "thing 9," and so on, but will also match "thing " (note the space). Any character in the character class can appear either zero or one times for the pattern to match.
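If you'd like to see the optional character at work, here's a minimal sketch you can run. The sample strings are made up for illustration; the patterns are the ones from this section.

```perl
#!/usr/bin/perl -w

# Strings to try against /be?ar/, where the e is optional.
@samples = ("step up to the bar", "grin and bear it", "beear", "br");
@matched = ();

foreach $s (@samples) {
    if ($s =~ /be?ar/) {
        push @matched, $s;      # found "bar" or "bear" somewhere in $s
    }
}

foreach $s (@matched) {
    print "matched: $s\n";
}

# A group can be optional too: both of these contain a match.
$plain = ("bamboo"    =~ /bamboo(zle)?/) ? 1 : 0;
$long  = ("bamboozle" =~ /bamboo(zle)?/) ? 1 : 0;
print "bamboo: $plain, bamboozle: $long\n";
```

Only the first two samples end up in @matched: "beear" has one e too many for ?, and "br" is missing the required a.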

Multiple Characters with *

A second quantifier is *, which works similarly to ? except that * allows zero or any number of the preceding character to appear, not just zero or one instance as ? does. Take this pattern:

/xy*z/

In this pattern, the x and the z are required, but the y can appear any number of times including not at all. This pattern will match xyz, xyyz, xyyyyyyyyyyyyyyyyz, or just plain old xz without the y.

As with ?, you can use groups or character classes before the *. One use of * is to use it with the dot character—which means that any number of any characters could appear at that position:

/this.*/

This pattern matches the strings "thisthat", "this is not my sweater. The blue one with the flowers is mine," or even just "this"—remember, the character at the end doesn't have to exist for there to be a match.
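As a quick check of the zero-or-more behavior, the following sketch tries /xy*z/ against a handful of strings (chosen here purely as examples) and records which ones match.

```perl
#!/usr/bin/perl -w

# Record, for each test string, whether /xy*z/ found a match (1) or not (0).
%result = ();
foreach $s ("xz", "xyz", "xyyyyz", "xyaz") {
    $result{$s} = ($s =~ /xy*z/) ? 1 : 0;
}

foreach $s (sort keys %result) {
    print "$s: $result{$s}\n";
}

# With .* the rest of the string can be anything, including nothing.
$bare = ("this" =~ /this.*/) ? 1 : 0;
print "plain 'this': $bare\n";
```

The string "xyaz" is the only one that fails: the * lets the y repeat or disappear, but it doesn't let some other character take its place.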

A common mistake is to forget that * stands for "zero or more instances," and to use it like this:

if (/^[0-9]*$/) {
# contains numbers
}

The intent here is to create a pattern that matches only if the input contains numbers and only numbers. And this pattern will indeed match "7," "1540," "15443" and so on. But it'll also match the empty string—because the * means that no numbers whatsoever will also produce a match. Usually, when you want to require something to appear at least once, you want to use + instead of *.

Note also that "match zero or more numbers," as that example would imply, does not mean that it will match any string that happens to have zero numbers. It won't match the string "lederhosen", for example. Matching zero or more numbers does not imply any other matches; if you want it to match characters other than numbers, you'll need to include those characters in the pattern. With regular expressions, you have to be very specific about what you want to match.
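To watch the difference between * and + on the inputs discussed above, you could run something like this sketch. The + version of the pattern is an assumption about what the "numbers only" test was meant to be.

```perl
#!/usr/bin/perl -w

# For each input, note whether the * version and the + version match.
%star = ();
%plus = ();
foreach $in ("1540", "", "lederhosen") {
    $star{$in} = ($in =~ /^[0-9]*$/) ? "yes" : "no";
    $plus{$in} = ($in =~ /^[0-9]+$/) ? "yes" : "no";
    print "'$in': * says $star{$in}, + says $plus{$in}\n";
}
```

The two versions agree on "1540" and on "lederhosen"; they disagree only on the empty string, which the * version accepts and the + version rejects.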

Requiring at Least One Instance with +

The + metacharacter works identically to *, with one significant difference: instead of allowing zero or more instances of the given character or group, it requires that character or group to appear at least once ("one or more instances"). So given a pattern like the one we used for *:

/xy+z/

This pattern will match "xyz", "xyyz," and "xyyyyyyyyyyz", but it will not match "xz." The y must appear at least once.

As with * and ?, you can use groups and character classes with +.
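Here's a short sketch of + applied to a group and to a character class. Both patterns and all the test strings are invented for this illustration, and the ^ and $ anchors are there so that nothing else can sneak into the match.

```perl
#!/usr/bin/perl -w

# A group with +: one or more repetitions of "ha" and nothing else.
@laughs = ();
foreach $s ("ha", "hahaha", "h", "haha!") {
    push @laughs, $s if $s =~ /^(ha)+$/;
}
print "laughs: @laughs\n";

# A character class with +: one or more 0s and 1s and nothing else.
@binary = ();
foreach $s ("1011", "", "1021") {
    push @binary, $s if $s =~ /^[01]+$/;
}
print "binary: @binary\n";
```

The empty string fails the second test because + insists on at least one character from the class.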

Restricting the Number of Instances

For both * and + the given character or group can appear any number of times; there is no upper limit (characters with ? can appear at most once). But what if you want to match a specific number of instances? What if the pattern you're looking for does require a lower or upper limit, and any more or less than that won't match? You can use the optional curly bracket metacharacters to set limits on the quantity, like this:

/\d{1,4} /

This pattern matches if the data includes one digit, two digits, three digits, or four digits, any of them followed by a space; it won't match if there aren't any digits whatsoever. Keep in mind that the pattern isn't anchored, so a longer run of digits followed by a space still contains a match (the last four digits and the space); put an anchor or a word boundary in front if more than four digits should fail. The first number inside the brackets is the minimum number of instances to match; the second is the maximum. Or you can match an exact number by just including the number itself:

/a{5}b/

This pattern will only match if it can find five a's in a row followed by one b; fewer than five a's before the b won't match. It’s exactly equivalent to /aaaaab/. A less specific use of {} for an exact number of instances might be something like this:

/\$\d+\.\d{2}/

Can you work through this pattern and figure out what it matches? It uses a number of escaped characters, so it might be confusing. First, it matches a dollar sign (\$), then one or more digits (\d+), then a decimal point (\.), and finally exactly two more digits (\d{2}). Put it all together and this pattern matches monetary input: $45.23 would match just fine, as would $0.45 or $15.00, but $.45 and $34.2 would not. This pattern requires at least one digit on the left side of the decimal point and two digits on the right. Because nothing anchors the end of the pattern, a third digit after those two (as in $34.234) won't prevent a match; add $ at the end of the pattern if it should.
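If you want to try the monetary pattern on real input, here's a minimal sketch. The anchored variant (with ^ and $ added) is not from the text above; it's included as an assumption about what you'd want if stray extra digits should be rejected.

```perl
#!/usr/bin/perl -w

# Single quotes keep Perl from treating $45 and friends as variables.
@amounts = ('$45.23', '$0.45', '$15.00', '$.45', '$34.2', '$34.234');
%unanchored = ();
%anchored   = ();

foreach $amt (@amounts) {
    $unanchored{$amt} = ($amt =~ /\$\d+\.\d{2}/)   ? 1 : 0;   # pattern from the text
    $anchored{$amt}   = ($amt =~ /^\$\d+\.\d{2}$/) ? 1 : 0;   # tied to both ends
    print "$amt: unanchored $unanchored{$amt}, anchored $anchored{$amt}\n";
}
```

The two versions disagree only on $34.234: the unanchored pattern is satisfied by the $34.23 at the front, while the anchored one insists that the string end right after the second digit.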

Back to the curly brackets. You can set a lower bound on the match, but not an upper bound, by leaving off the maximum number but keeping the comma:

/ba{4,}t/

This pattern matches b, then four or more instances of the letter a, and then t. Three instances of a in a row won't match, but twenty a's will.

Note that you could represent +, * and ? in curly bracket format:

/x{0,1}/ # same as /x?/

/x{0,}/ # same as /x*/

/x{1,}/ # same as /x+/
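The following sketch checks both claims: that {4,} sets a minimum with no maximum, and that the shorthand quantifiers behave like their curly bracket spellings. The test strings, and the z tacked onto each pattern so the anchors have something to hold on to, are choices made for this example only.

```perl
#!/usr/bin/perl -w

# {4,} sets a minimum with no maximum.
@bats = ();
foreach $s ("baaat", "baaaat", "b" . ("a" x 20) . "t") {
    push @bats, $s if $s =~ /ba{4,}t/;
}
print "matched ", scalar(@bats), " of 3\n";

# Each shorthand quantifier should agree with its curly bracket spelling.
$agree = 1;
foreach $s ("z", "xz", "xxz", "xxxz", "y") {
    $agree = 0 if (($s =~ /^x?z$/) ? 1 : 0) != (($s =~ /^x{0,1}z$/) ? 1 : 0);
    $agree = 0 if (($s =~ /^x*z$/) ? 1 : 0) != (($s =~ /^x{0,}z$/)  ? 1 : 0);
    $agree = 0 if (($s =~ /^x+z$/) ? 1 : 0) != (($s =~ /^x{1,}z$/)  ? 1 : 0);
}
print "shorthand and curly forms agree: $agree\n";
```

A handful of strings proves nothing in general, of course, but it's a cheap way to catch a typo in a quantifier.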

More About Building Patterns

We started this lesson with a basic overview of how to use patterns in your Perl scripts using an if test and the =~ operator—or, if you're searching in $_, you can leave off the =~ part altogether. Now that you know something of constructing patterns with regular expression syntax, let's return to Perl, and look at some different ways of using patterns in your Perl scripts, including interpolating variables into patterns and using patterns in loops.

Patterns and Variables

In all the examples so far, we've used patterns as hard-coded sets of characters in the test of a Perl script. But what if you want to match different things based on some sort of input? How do you change the search pattern on the fly?

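Easy. Patterns, like double-quoted strings, can contain variables, and the value of the variable is substituted into the pattern. (Store the pattern in a single-quoted string, so that Perl leaves the backslashes and the trailing $ alone.)

$pattern = '^\d{3}$';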

if (/$pattern/) { ...

The variable in question can contain a string with any kind of pattern, including metacharacters. You can use this technique to combine patterns in different ways, or to search for patterns based on input. For example, here's a simple script that prompts you for both a pattern and some data to search, and then prints true or false depending on whether there's a match:

#!/usr/bin/perl -w

print 'Enter the pattern: ';
chomp($pat = <STDIN>);

print 'Enter the string: ';
chomp($in = <STDIN>);

if ($in =~ /$pat/) { print "true\n"; }
else { print "false\n"; }

You may find this script (or one like it) useful yourself, as you learn more about regular expressions.
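Combining patterns from variables might look something like the sketch below. The variable names ($sign, $digits) and the test strings are hypothetical; the point is only that several small pieces can be interpolated into one larger pattern. The pieces are stored in single quotes so the backslash in \d survives.

```perl
#!/usr/bin/perl -w

# Small building blocks, stored exactly as typed thanks to single quotes.
$sign   = '[+-]?';      # an optional plus or minus
$digits = '\d+';        # one or more digits

# Assemble them into a pattern for simple decimal numbers.
@numbers = ();
foreach $in ("42.5", "-3.14", "abc", "7.") {
    if ($in =~ /^$sign$digits\.$digits$/) {
        push @numbers, $in;
    }
}
print "numbers: @numbers\n";
```

The string "7." is rejected because the second $digits needs at least one digit after the decimal point.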

Patterns and Loops

One way of using patterns in Perl scripts is to use them as tests, as we have up to this point. In this context (a scalar boolean context), they evaluate to true or false based on whether the pattern matches the data. Another way to use a pattern is as the test in a loop, with the /g option at the end of the pattern, like this:

while (/pattern/g) {
# loop
}

The /g option is used to match all the occurrences of the pattern in the given string (here, $_, but you can use the =~ operator to match somewhere else). In an if test, the /g option won't matter, because the test will return true at the first match it finds. In the case of while (or a for loop), however, the /g will cause the test to return true each time the pattern occurs in the string, and the statements in the block will execute that number of times as well.

Note: We're still talking about using patterns in a scalar context, here; the /g just causes interesting things to happen in loops. We'll get to using patterns in list context tomorrow.
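Here's a minimal sketch of the difference between a plain test and a /g loop, using a made-up sentence stored in $_.

```perl
#!/usr/bin/perl -w

$_ = "the cat sat on the mat with the hat";

# Without /g, the test only asks whether there is at least one match.
$found = /at/ ? "yes" : "no";

# With /g in a while test, the loop body runs once per match.
$count = 0;
while (/at/g) {
    $count++;
}

print "Any 'at'? $found\n";
print "Number of 'at's: $count\n";
```

The loop body runs four times here, once each for cat, sat, mat, and hat.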

Another Example: Counting

Here's an example of a script that makes use of that patterns-in-loops feature I just mentioned to work through a file (or any number of files) and count the occurrences of some pattern in that file. With this script you could, for example, count the number of times your name occurs in a file, or find out how many hits to your Web site came from America Online (aol.com). I ran it on a draft of this lesson and found that I've used the word pattern 184 times so far.

Listing 9.2 shows this simple script:

Listing 9.2. count.pl

1:  #!/usr/bin/perl -w
2:
3:  $pat = ""; # thing to search for
4:  $count = 0; # number of times it occurs
5:
6:  print 'Search for what? ';
7:  chomp($pat = <STDIN>);
8:  while (<>) {
9:      while (/$pat/g) {
10:         $count++;
11:     }
12: }
13:
14: print "Found /$pat/ $count times.\n";

As with all the scripts we've built that cycle through files using <>, you'll have to call this one on the command line with the name of a file:

% count.pl logfile
Search for what? aol.com
Found /aol.com/ 3456 times.
%

Nothing in Listing 9.2 should look overly surprising, although there are a few points to note. Remember that using while with the file input characters (<>) sets each line of input to the default variable $_. Since patterns will also match with that value by default, we don't need a temporary variable to hold each line of input. The first while loop (line 8), then, reads each line from the input files. The second while loop searches that single line of input repeatedly and increments $count each time it finds the pattern in each line. This way, we can get the total number of instances of the given pattern, both inside each line and for all the lines in the input.

One other important thing to note about this script: if you have it search for a phrase instead of a single word—for example, find all instances of both a first and last name—then there is a possibility that that phrase could fall across multiple lines. This script will miss those instances, since neither line will completely match the pattern. Tomorrow, you'll learn how to search for a pattern that can fall on multiple lines.
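To see that limitation for yourself without creating a file, you could run a sketch like this one. The two lines of input (and the name in them) are hypothetical stand-ins for what <> would read from a file.

```perl
#!/usr/bin/perl -w

# Hypothetical input standing in for a file: the name appears twice,
# but the second time it is split across a line break.
@lines = ("Jane Doe wrote the first draft, and then Jane\n",
          "Doe revised it.\n");

$pat   = "Jane Doe";
$count = 0;
foreach (@lines) {
    while (/$pat/g) {
        $count++;
    }
}

print "Found /$pat/ $count times.\n";
```

The count comes out as 1, not 2, because each line is searched on its own and neither line holds the whole of the second "Jane Doe".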

Pattern Precedence

Back in Day 2, you may remember we had a little chart that showed the precedence of the various operators, and allowed you to figure out which parts of an expression would evaluate first in a larger expression. Metacharacters in patterns have the same sort of precedence rules, so you can figure out which characters or groups of characters those metacharacters refer to. Table 9.2 shows that precedence, where characters closer to the top of the table group tighter than those lower down.

Table 9.2. Pattern Metacharacter Precedence.

Character              Meaning
( )                    grouping and memory
? + * { }              quantifiers
x \x $ ^ (?= ) (?!)    characters, anchors, look-ahead
|                      alternation

As with expressions, you can group characters with () to force them to be evaluated as a sequence.

Note: You haven't learned about all these metacharacters yet. Tomorrow, we'll explore more of them.
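A couple of concrete cases may make the precedence rules easier to picture. This sketch uses invented patterns and strings; it compares an alternation with and without parentheses, and a quantifier applied to a single character versus a group.

```perl
#!/usr/bin/perl -w

# Alternation binds loosest: /^foo|bar$/ means (^foo) or (bar$).
%loose = ();
%tight = ();
foreach $s ("foo", "food", "rebar", "bar") {
    $loose{$s} = ($s =~ /^foo|bar$/)   ? 1 : 0;
    $tight{$s} = ($s =~ /^(foo|bar)$/) ? 1 : 0;   # parentheses regroup it
    print "$s: without parens $loose{$s}, with parens $tight{$s}\n";
}

# Quantifiers bind tighter than sequence: in /ab+/ the + applies to b only.
$b_only = ("abbb" =~ /^ab+$/)   ? 1 : 0;   # a followed by several b's
$group  = ("abab" =~ /^(ab)+$/) ? 1 : 0;   # the pair ab, repeated
$mixed  = ("abab" =~ /^ab+$/)   ? 1 : 0;   # fails: the second a is in the way
print "b_only $b_only, group $group, mixed $mixed\n";
```

Without the parentheses, "food" and "rebar" both match, because each anchor belongs to only one side of the |.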

Going Deeper

In this lesson, I've given you the basics of regular expressions so you can get started, and tomorrow you'll learn even more uses of regular expressions. For more information about any of these things, the perlre man page can be quite enlightening. For this section, let's look at a few other features I haven't discussed elsewhere in this lesson.

More Uses of Patterns

At the start of this lesson, you learned about the =~ operator for matching patterns to scalar variables other than $_. In addition to =~, you can also use !~, like this:

$thing !~ /pattern/;

!~ is the logical not version of =~; in other words, it will return true only if the pattern is NOT found in $thing.
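One place !~ comes in handy is filtering things out. In this sketch (the list of lines is made up), only the lines that don't start with a # survive.

```perl
#!/usr/bin/perl -w

# Keep only the lines that do NOT start with a # character.
@lines = ("# setup", "print 1;", "# cleanup", "exit;");
@code  = ();
foreach $line (@lines) {
    if ($line !~ /^#/) {
        push @code, $line;
    }
}

foreach $line (@code) {
    print "$line\n";
}
```

You could write the same test as if (!($line =~ /^#/)), but !~ says it with less punctuation.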

Another useful function for patterns is the pos function, which works similarly to the index function, except with patterns. You can use this function to find out the exact position inside the string where a match was made using m//g, or to start a pattern match at a specific position in a string. The pos function takes a scalar value (often a variable) as an argument, and returns the offset of the character after the last character of the match. For example:

$find = "123 345 456 346";
while ($find =~ /3/g) {
    push @positions, pos $find;
}

This code snippet builds an array of the offsets just past each place in the string $find where the number 3 appears (3, 5, and 13 in this case).

For more information on the pos function, see the perlfunc man page.
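Here's a sketch that shows pos used both ways: reading it after each match, and assigning to it to choose where the next /g match starts. The starting offset of 6 is an arbitrary choice for the demonstration.

```perl
#!/usr/bin/perl -w

$find = "123 345 456 346";

# Collect the offset just past each 3.
@positions = ();
while ($find =~ /3/g) {
    push @positions, pos $find;
}
print "positions: @positions\n";

# Assigning to pos tells the next /g match where to start looking.
pos($find) = 6;
$next = -1;
if ($find =~ /3/g) {
    $next = pos $find;
}
print "first 3 after offset 6 ends at offset $next\n";
```

Starting from offset 6 skips the first two 3s, so the only match left is the one near the end of the string.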

Pattern Delimiters and Escapes

All the patterns we've seen so far began and ended with slashes, with everything in between the characters or metacharacters to match. The slashes are themselves metacharacters, which means that if you want to actually search for a slash, you must backslash it. This can be problematic for patterns that actually contain lots of slashes—for example, Unix path names, which are all separated by slashes. You can easily end up with a pattern that looks something like this:

/\/usr(\/local)*\/bin\//;

That's rather difficult to read (more so than many other regular expressions). Fortunately, Perl has a way around this: you don't have to use // to surround a pattern—you can use any non-alphanumeric character you want to. The only catch is that if you use a different character you must include the m on the m// expression (you can also replace the delimiters for substitution, but you have to use the s/// for that anyhow). You'll also have to escape uses of those delimiters inside the pattern itself. For example, the above expression could be written like this:

m%/usr(/local)*/bin/%;
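To convince yourself that the delimiter changes nothing about what gets matched, try a sketch like this. The three path names are examples only. The version with curly brackets relies on the fact that paired brackets work as delimiters too, with the opening bracket at the front and the closing one at the end.

```perl
#!/usr/bin/perl -w

# The same pattern written with three different delimiters.
%slash   = ();
%percent = ();
%brace   = ();
foreach $path ("/usr/local/bin/perl", "/usr/bin/perl", "/opt/bin/perl") {
    $slash{$path}   = ($path =~ /\/usr(\/local)*\/bin\//) ? 1 : 0;
    $percent{$path} = ($path =~ m%/usr(/local)*/bin/%)    ? 1 : 0;
    $brace{$path}   = ($path =~ m{/usr(/local)*/bin/})    ? 1 : 0;
    print "$path: $slash{$path} $percent{$path} $brace{$path}\n";
}
```

All three forms give the same answer for every path; which one you pick is purely a matter of which is easiest to read.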

Alternately, if you're creating a search pattern for a number of non-alphanumeric characters that are also pattern metacharacters, you may end up backslashing an awful lot of those characters, making the pattern difficult to read. Using the \Q escape you can essentially turn off pattern processing for a set of characters, and then use \E to turn it back on again. For example, if you were searching for a pattern containing the characters {(^*)} (for whatever reason), this pattern would search for those literal characters:

/\Q{(^*)}\E/;

Using \Q to turn off pattern processing is also useful for variable interpolation inside patterns, to prevent unusual results in search pattern input:

/From:\s*\Q$from\E/;
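What do those "unusual results" look like? Here's a sketch with a hypothetical address that happens to contain a dot and a plus sign, tried with and without \Q against one genuine header line and one bogus one.

```perl
#!/usr/bin/perl -w

$from  = 'j.doe+lists@example.com';            # hypothetical address
$real  = 'From: j.doe+lists@example.com';
$bogus = 'From: jXdoeelists@exampleYcom';

# Without \Q, the . and + in $from act as metacharacters.
$plain_real  = ($real  =~ /From:\s*$from/) ? 1 : 0;
$plain_bogus = ($bogus =~ /From:\s*$from/) ? 1 : 0;

# With \Q...\E, every character in $from is matched literally.
$quoted_real  = ($real  =~ /From:\s*\Q$from\E/) ? 1 : 0;
$quoted_bogus = ($bogus =~ /From:\s*\Q$from\E/) ? 1 : 0;

print "plain:  real $plain_real, bogus $plain_bogus\n";
print "quoted: real $quoted_real, bogus $quoted_bogus\n";
```

Without \Q the results are exactly backward: the e+ in the pattern wants one or more e's rather than a literal plus sign, so the real header fails while the bogus one slips through. With \Q the real header matches and the bogus one doesn't.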

Summary

Pattern matching and regular expressions are, arguably, Perl's most powerful feature. Whereas other languages may provide regular expression libraries or functions, pattern matching is intrinsic to Perl's operation and tightly bound to many other aspects of the language. Perl without regular expressions is just another funny-looking language. Perl with regular expressions is incredibly useful.

Today you learned all about patterns: building them, using them, saving bits of them, and putting them together with other parts of Perl. You learned about the various metacharacters you can use inside regular expressions: metacharacters for anchoring a pattern (^, $, \B, \b), for creating a character class ([] and [^]), for alternating between different patterns (|) and for matching multiples of characters (+, *, ?).

With that language for creating patterns, you can then apply those patterns to strings using the m// expression. By default, patterns affect the string stored in the $_ variable, unless you use the =~ operator to apply the pattern to any variable.

Tomorrow, we'll expand on what you've learned here, building on the patterns you've already learned with additional patterns and more and better ways to use those patterns.

Q&A

Q.What's the difference between m// and just //?

A.Nothing, really. The m is optional, unless you're using a different character for the pattern delimiter. They both do the same thing.

Q.Alternation produces a logical OR situation in the pattern. How do I do a logical AND?

A.The easiest way is simply to use multiple patterns and the && or and operators, like this:

/pat1/ && /pat2/;

If you know the order in which the two patterns will appear, you can just do something like this:

/pat1.*pat2/
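Here's a small sketch of both approaches side by side, using a made-up string in $_ so you can see where the order matters.

```perl
#!/usr/bin/perl -w

$_ = "eggs and spam";

$both     = (/spam/ && /eggs/) ? 1 : 0;   # both words appear, in any order
$spam_1st = (/spam.*eggs/)     ? 1 : 0;   # spam first, then eggs
$eggs_1st = (/eggs.*spam/)     ? 1 : 0;   # eggs first, then spam

print "both: $both, spam first: $spam_1st, eggs first: $eggs_1st\n";
```

The && version is true here regardless of order, while only the single pattern that lists the words in the order they actually appear will match.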

Q.I've got a pattern that searches for numbers: /\d*/. It matches for numbers, all right, but it also matches for all other strings. What am I doing wrong?

A.You're using * when you mean +. Remember that * means "zero or more instances." That means if your string has no numbers whatsoever, it'll still match—you've got zero instances. + is used for at least one instance.

Workshop

The workshop provides quiz questions to help you solidify your understanding of the material covered and exercises to give you experience in using what you've learned. Try and understand the quiz and exercise answers before you go on to tomorrow's lesson.

Quiz

1.Define the terms pattern matching and regular expressions.

2.What sort of tasks is pattern matching useful for? Name three.

3.What do each of the following patterns do?

/ice\s*cream/

/\d\d\d/

/^\d+$/

/ab?c[,.:]d/

/xy|yz+/

/[\d\s]{2,3}/

/"[^"]"/

4.Assume that $_ contains the value 123 kazoo kazoo 456. What is the result of the following expressions?

if (/kaz/) { # true or false?

while (/kaz/g) { # what happens?

if (/^\d+/) { # true or false?

if (/^\d?\s/) { # true or false?

if (/\d{4}/) { # true or false?

 

Exercises

1.Write patterns to match the following things:

2. BUG BUSTER: What's wrong with this code?

print 'Enter a string: ';
chomp($input = <STDIN>);
print 'Search for what? ';
chomp($pat = <STDIN>);

if (/$pat/) {
   # pattern found, handle it
}

3.BUG BUSTER: How about this one?

print 'Search for what? ';
chomp($pat = <STDIN>);
while (<>) {
    while (/$pat/) {
        $count++;
    }
}

4. Yesterday, we created a script called morenames.pl that let you sort a list of names and search for different parts. The searching part used a rather convoluted mechanism of each and grep to find the pattern. Rewrite that part of the script to use patterns instead.

Answers

Here are the answers to the Workshop questions in the previous section.

Quiz Answers

1.Pattern matching is the concept in Perl of writing a pattern which is then applied to a string or a set of data. Regular expressions are the language you use to write patterns.

2.There are many uses of pattern matching—you are limited only by your imagination. A few of them include:

a. Input validation

b. Counting the number of things in a string

c. Extracting data from a string based on certain criteria

d. Splitting a string into different elements

e. Replacing a specific pattern with some other string

f. Finding regular (or irregular) patterns in a data set

3.The answers are as follows:

a. This pattern matches the characters "ice" and "cream", separated by zero or more whitespace characters.

b. This pattern matches three digits in a row.

c. This pattern matches one or more digits on a line by themselves with no other characters or whitespace.

d. This pattern matches an 'a', an optional 'b', a 'c', one of a comma, period, or colon, and a 'd'. "ac.d" will match, as will "abc,d", but not "abcd".

e. This pattern will match either 'xy' or 'y' with one or more 'z's.

f. This pattern will match either a digit or a whitespace character appearing at least two but no more than three times.

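g. As written, this pattern matches exactly one character that isn't a double quote, sitting between an opening and a closing quote. To match all the characters in between opening and closing quotes, the character class needs a quantifier: /"[^"]*"/.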

4.The answers are:

a. True

b. The loop repeats for every instance of 'kaz' in the string (twice, in this case)

c. True. The pattern matches one or more digits at the start of a line.

d. False. This pattern matches 0 or one digits at the start of the line, followed by whitespace. It doesn't match the three digits we have in this string.

e. False. This pattern matches four digits in a row; we only have three digits here.

Exercise Answers

1.As with all Perl, there are different ways of doing different things. Here are some possible solutions:

/[.!?"]\s+[A-Z]\w+\b/

/\d+%/

/[+-]\d+\.?\d+/

/([a-zA-Z]{3})\s*\1/

2.There's a mismatch between where the pattern is trying to match and where the actual data is. The pattern in the if statement is trying to match the pattern against $_, but as per the second line, the actual input is in $input. Use this if test instead:

if ($input =~ /$pat/) {

3.This one's sneaky, because there's nothing syntactically wrong with this statement. The second while loop, the one with the pattern in it, will look for the pattern in $_, which is correct. But the test is a simple true-or-false test: does that pattern exist? The first line that has that pattern in it will register as true, and then $count will get incremented. But then the test will occur again, and it'll still be true, and the counter will get incremented again, and again, infinitely. There's nothing here to stop the loop from iterating.

The /g option to the pattern in that while loop is what sets up the special case where the while loop will loop only as many times as the pattern was matched in the string and then stop. If you're using patterns inside loops, don't forget the /g.
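If you'd like to watch this happen without hanging your terminal, here's a sketch with a safety valve. The cutoff of five passes is arbitrary and exists only so the demonstration stops; the string is the one from the quiz.

```perl
#!/usr/bin/perl -w

$_ = "123 kazoo kazoo 456";

# Without /g the test is true every time, so we need an escape hatch.
$runaway = 0;
while (/kaz/) {
    $runaway++;
    last if $runaway >= 5;    # arbitrary cutoff, only here to stop the demo
}

# With /g the loop runs once per match and then stops on its own.
$count = 0;
while (/kaz/g) {
    $count++;
}

print "without /g we bailed out after $runaway passes\n";
print "with /g the loop ran $count times\n";
```

The first loop only ends because of the last statement; the second ends by itself after two passes.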

4.The only parts that need to change are the lines that build the @keys array (the ones that use grep to search for the pattern). Here we'll use a foreach loop and a test on both the key and the value. We'll also add the /i option to make it case insensitive, and reset the @keys list to the empty list so that it doesn't build up between searches. Here's the new version of option 3:

} elsif ($in eq '3') {      # find a name (1 or more)

   print "Search for what? ";
   chomp($search = <STDIN>);

   @keys = ();
   foreach (keys %names) {
       if (/$search/i or $names{$_} =~ /$search/i) {
           push @keys, $_;
       }
   }

   if (@keys) {
       print "Names matched: \n";
       foreach $name (sort @keys) {
           print "   $names{$name} $name\n";
       }
   } else {
       print "None found.\n";
   }
}