NAME

proctut5 - Thinking "Regex"

SYNOPSIS

This tutorial describes how to think like a regular expression engine would. Once your brain becomes accustomed to thinking like a regular expression engine, reading and writing regular expressions becomes a much easier and less error-prone experience.

DISCLAIMER

First of all, I should probably point out that there have been many excellent books written on the art of writing regular expressions. The finest, in my opinion, is still Jeffrey Friedl's Mastering Regular Expressions (O'Reilly), to which I have referred elsewhere. Friedl clearly explains how the several kinds of regular expression engines work, covers efficiency considerations, and addresses many other useful aspects of regular expressions.

I don't do that. This is just an introduction to one kind of regular expression (namely procmail regular expressions), and space and time are my largest considerations, not completeness. So if this article leaves you hungry for more, go get some real food; otherwise, I hope that this will serve at least as a means whereby you can say, "golly, I never understood that before and now I do."

On to the discussion!

DESCRIPTION

Probably the kindest thing I can do for you is to disarm you of any ill feelings you may have toward regular expressions. By now (assuming you've read the previous tutorials), you should be comfortable using asterisks, plus signs, dots, and their friends.

What remains for you to become a master regular expression reader and writer is to actually read and write regular expressions. I'm just here to look over your shoulder and read these first few patterns to you in English so that you can do them for yourself later on.

We're going to discuss how regular expression engines work, in a nutshell, especially procmail's regular expression engine (since it behaves a little differently than most regular expression engines do). Along the way, we'll cover a few examples and hopefully demystify how stuff (namely procmail's regex matching) works.

Regex Engines Are Stoopid

With monotonous regularity, a regular expression engine will match precisely what it is told to, over and over, in exactly the same way. The behavior is predictable and easy for humans to understand once you know the three simple rules that procmail follows. These rules are:

  1. leftmost match
  2. shortest match
  3. longest match to the right of \/

We'll discuss each of these rules in turn.

Leftmost Match

Procmail does have something in common with other regular expression programs: it tries to match as far left as possible. This means that it begins its examination at the left end and tries to make the pattern match. It only moves right if it can't match. An illustration may help make things clearer.

Let's pretend we have the following pattern:

    subjects

and the following line:

    Subject: Messages should have clear subjects

The regular expression engine examines the pattern character by character:

    Line:    Subject: Messages should have clear subjects
    Pattern: s

The regex engine found a match right off the bat! Our pattern's first character ("s") matches the first character of the line we're examining ("S"), because procmail ignores case by default. It continues to match as much of the pattern as it can:

    Line:    Subject: Messages should have clear subjects
    Pattern: subject

So far so good. But now the next character of our pattern (also an "s") does not match the next character in the line (a colon, ":"). The regular expression engine now aborts, moves right, and starts over, comparing the first character in the pattern with the second character in the line:

    Line:    Subject: Messages should have clear subjects
    Pattern:  s

Nope.

    Line:    Subject: Messages should have clear subjects
    Pattern:   s

Nope.

    Line:    Subject: Messages should have clear subjects
    Pattern:    s

Nope. And so forth until it finds another "s":

    Line:    Subject: Messages should have clear subjects
    Pattern:            s

But since the second character of the pattern ("u") doesn't match the next character of the line ("s"), we keep on moving right until we finally get to this point:

    Line:    Subject: Messages should have clear subjects
    Pattern:                                     s

at which point we are able to match the entire pattern:

    Line:    Subject: Messages should have clear subjects
    Pattern:                                     subjects

Procmail prefers to match things at the left end of the string simply because that's where it starts from, but it prefers even more to have the pattern match completely, so it moves right as needed.

This behavior is what is meant when we say "leftmost match".
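
The walk-through above can be sketched in a few lines of Python. This is only an illustration of the "try here, otherwise move right" idea for a literal pattern, not procmail's actual implementation; the function name leftmost_match is made up for this example, and lowercasing both strings stands in for procmail's default case-insensitivity.

```python
def leftmost_match(pattern: str, line: str) -> int:
    """Return the index of the leftmost place where a literal pattern
    matches the line, or -1 if it matches nowhere."""
    # Assumption: fold case on both sides to mimic procmail's default.
    pattern, line = pattern.lower(), line.lower()
    for start in range(len(line) - len(pattern) + 1):
        matched = 0
        # Compare character by character, as in the illustrations.
        while matched < len(pattern) and line[start + matched] == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            return start  # the whole pattern matched: stop here
        # Otherwise abort this attempt and move one position right.
    return -1


line = "Subject: Messages should have clear subjects"
print(leftmost_match("subjects", line))  # 36: the final word of the line
print(leftmost_match("subject", line))   # 0: matches at the far left
```

Notice that the shorter pattern "subject" never needs to move right at all: the attempt at position 0 succeeds, and that is the match we get.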

Shortest Match

Unlike most regular expression engines (Perl and egrep, for example), procmail is not greedy; it is stingy. This means that it finds the shortest match that will satisfy the pattern supplied and then calls it quits. You may come across the mantra "leftmost-longest" when referring to other programs that do regular expressions, but not with procmail[1].

For example, consider the following pattern:

    ^Subject: .*dog

and the following subject line:

    Subject: are dogs mentioned in the pseudographia?

In a traditional regular expression engine, this pattern will match like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: .............................*dog

See how the dot-star pattern is greedy? The asterisk wants to match as much as it possibly can before quitting. It could have left off at the first mention of "dog" but instead continued until it found another "dog" in the sentence, allowing the asterisk to consume as much of the line as it could. In procmail, however, the asterisk and plus sign are stingy: they want to match as little as possible:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

And because, often, all we want out of a condition is that it be either true or false, this minimal matching behavior is sufficient: we don't care whether the dot-star (or dot-plus) pattern matched a little or a lot. The asterisk and plus sign quantifiers are kind of like slack in a rope, giving as much or little as needed, but preferring to give little.

This behavior is what is meant when we say "shortest match".
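
If you'd like to see the two behaviors side by side, here is a rough sketch using Python's re module. Python is not procmail: its engine is greedy by default, and I am assuming that its "lazy" quantifier (.*?) is a fair stand-in for procmail's stingy asterisk in this simple case.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# Greedy dot-star: how a traditional engine matches.
greedy = re.search(r"^Subject: .*dog", line)

# Lazy dot-star: used here only to imitate procmail's stingy matching.
stingy = re.search(r"^Subject: .*?dog", line)

print(repr(greedy.group(0)))  # 'Subject: are dogs mentioned in the pseudog'
print(repr(stingy.group(0)))  # 'Subject: are dog'

# Either way the condition is "true", which is usually all we care about.
print(bool(greedy), bool(stingy))
```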

Longest Match to the Right of \/

We just described how procmail is stingy when it matches with quantifiers: it makes the shortest match it can. Under some circumstances, however, procmail quantifiers will be greedy: they will match as much as they can. Not to worry! Under most circumstances it doesn't really matter whether procmail is stingy or greedy. The single time procmail quantifiers are greedy is explained here.

Most of the time when we write a procmail condition, we're only concerned whether the condition is true or false. Consider the case where we simply want to see whether a word occurs in a header:

    * H ?? foo

This will match the string "foo" (in any mix of upper and lower case) wherever it appears in the headers of an email message, even inside a longer word:

    From: football@sports.tld
    Subject: don't be a fool!
    To: "Tom Foo" <tom@foo-family.com>

It doesn't matter to us at all whether procmail is stingy or greedy. It just has to find "foo" somewhere in the headers for the condition to be true.
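
As a quick sanity check, here is the same true-or-false test sketched in Python. The re.IGNORECASE flag is my stand-in for procmail's default case-insensitivity, and joining the headers with newlines is only an approximation of how procmail sees the header block.

```python
import re

headers = [
    "From: football@sports.tld",
    "Subject: don't be a fool!",
    'To: "Tom Foo" <tom@foo-family.com>',
]

# Each of these lines contains "foo" somewhere, in some mix of case.
hits = [bool(re.search("foo", header, re.IGNORECASE)) for header in headers]
print(hits)  # [True, True, True]

# Searching the header block as a whole stops at the leftmost "foo",
# which is the one inside "football" on the first line.
first = re.search("foo", "\n".join(headers), re.IGNORECASE)
print(first.start())  # 6
```

One hit is enough to make the condition true; how much text the match covered never enters into it.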

Consider a second case. We want to find the name of the MIME multipart boundary so that we can do some processing later. MIME boundary names look something like this:

    Content-Type: multipart/alternative;
        boundary="----=_NextPart_Foo"

Our problem is that the name is the part between double-quotes after 'boundary='; we want to extract or save everything between the quotes into a variable so we can use it in a recipe later. To accomplish this, we have to use the match operator. The match operator is a backslash and a slash together, like this:

    \/

The procmailrc(5) manpage indicates the purpose of the \/ operator:

    \/ splits the expression in two parts. Everything matching the
    right part will be assigned to the MATCH environment variable.

The match operator tells procmail to save everything to the right of the operator in a procmail variable named $MATCH. Our task is to "extract" a string from a line of text and the \/ operator is perfectly suited for this. Were we simply wanting to match a boundary line, we might write something like this:

    * H ?? boundary=".+"

But this will not preserve anything for us. We want to save the name of the boundary, so we use the \/ operator:

    * H ?? boundary="\/.+

Wait a minute: we understand that the plus sign operator is stingy, right? So under normal circumstances, the dot-plus combo will only match one character. That isn't enough since the boundary name is longer than that.

Enter greed. Greed is our friend when it comes to saving text we've matched. In the above example, the dot-plus combo will become greedy and match as much as it can to the end of the line (since the dot matches any character except newlines). This means that instead of matching a single character, the dot-plus combo will match this:

    ----=_NextPart_Foo"

The above string will be stored in a variable called MATCH, which we refer to as $MATCH. We could print its contents right now:

    LOG="$MATCH
    "

The only flaw with this regular expression is that it matches just a little too much. We don't really want the trailing double-quote, so we'll need to use something besides the dot for our wildcard:

    * H ?? boundary="\/[^"]+

Perfect! We create a character class that includes everything that is not a double-quote (that is, it matches everything a dot does, except double-quote characters). This, because it is greedy to the right of the \/ operator, will match the entire boundary name without the final double-quote.

Let's read the pattern in English:

    Match the word "boundary=", followed by a double-quote, followed
    by one or more non-double-quote characters.

    Everything matching the pattern to the right of '\/' (i.e., all
    consecutive characters that are not double-quotes) is matched
    "greedy" and is saved in the variable $MATCH.

This construct might appear in a recipe like this:

    :0
    * H ?? boundary="\/[^"]+
    {
        LOG="Found boundary name: ${MATCH}
    "
    }
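
To experiment with the two boundary patterns outside of procmail, here is a small Python sketch. The helper emulate_match is hypothetical (it is not part of procmail or of Python); it simply treats everything to the right of \/ as a greedy capture group, which is my assumption about how best to imitate $MATCH.

```python
import re
from typing import Optional


def emulate_match(left: str, right: str, text: str) -> Optional[str]:
    """Imitate procmail's \\/ operator: 'left' is the part of the pattern
    before \\/, 'right' is the part after it. Returns what $MATCH would
    hold in this emulation, or None if the pattern does not match."""
    found = re.search(left + "(" + right + ")", text, re.IGNORECASE | re.MULTILINE)
    return found.group(1) if found else None


header = 'Content-Type: multipart/alternative;\n    boundary="----=_NextPart_Foo"'

# boundary="\/.+      grabs a little too much (note the trailing quote)
print(emulate_match(r'boundary="', r'.+', header))     # ----=_NextPart_Foo"

# boundary="\/[^"]+   stops just before the closing double-quote
print(emulate_match(r'boundary="', r'[^"]+', header))  # ----=_NextPart_Foo
```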

Now let's revisit our previous example. Here is our regular expression, except that we've added the \/ operator:

    ^Subject: .*\/dog

and here is our line:

    Subject: are dogs mentioned in the pseudographia?

Previously, and without the \/ operator, we remember that our pattern matched the line minimally like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

How will it match now? Let's walk through it in English:

    Match "Subject: " followed by zero or more characters minimally
    (we're still left of \/!). Match the word "dog" and save it in
    $MATCH.

This means that we still match like above:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

and the $MATCH variable contains the word 'dog'. Now let's move our dot-star pattern to the right of the \/ operator, like this:

    ^Subject: \/.*dog

Now what does the $MATCH variable contain? Let's walk through it in English again:

    Match "Subject: ". Greedily match zero or more characters (.*)
    followed by "dog".

Now, because the dot-star is greedy (on the right of \/), our $MATCH variable contains:

    are dogs mentioned in the pseudog

because the dot-star pattern matched like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: .............................*dog

This is the difference between minimal (stingy) and maximal (greedy) matching.
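
Here are the last two patterns again as a Python sketch, under the same assumptions as before: a lazy quantifier (.*?) imitates procmail's stingy matching to the left of \/, and a capture group around a greedy pattern imitates what lands in $MATCH to the right of it. The variable names are mine.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# ^Subject: .*\/dog   dot-star is LEFT of \/, so it stays stingy.
left_of_operator = re.search(r"^Subject: .*?(dog)", line)

# ^Subject: \/.*dog   dot-star is RIGHT of \/, so it turns greedy.
right_of_operator = re.search(r"^Subject: (.*dog)", line)

print(repr(left_of_operator.group(1)))   # 'dog'
print(repr(right_of_operator.group(1)))  # 'are dogs mentioned in the pseudog'
```

Moving the operator two characters to the left changes nothing about whether the condition is true; it only changes what you get to keep.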

SUMMARY

Procmail finds the "leftmost-shortest" match that it can, according to the CAVEATS section of the procmailsc(5) manpage. Most of the time it doesn't matter whether we match shortest or longest, since we're only looking for a "true or false" answer. However, when we want to "save" or "extract" a string out of a line being scanned, we use the match operator \/, which also has the side effect of making procmail find the "leftmost-longest" match for the part of the pattern that appears to the right of \/.

NOTES

Note 1

Procmail quantifiers are always stingy except when they are to the right of the match operator (\/), in which case they are greedy. See the CAVEATS section of procmailsc(5).

PREVIOUS

Simple Regular Expressions, Part III

NEXT

proctut6

SEE ALSO

procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1), Jeffrey Friedl's Mastering Regular Expressions (O'Reilly), regex(3).

AUTHOR

Scott Wiersdorf <scott@perlcode.org>

COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

REVISION

$Id: proctut5.pod,v 1.1 2003/10/18 03:15:46 deep Exp $