proctut5 - Thinking "Regex"
This tutorial describes how to think the way a regular expression engine does. Once your brain becomes accustomed to thinking like a regular expression engine, reading and writing regular expressions become much easier and less error-prone.
First of all, I should probably point out that many excellent books have been written on the art of writing regular expressions. The finest, in my opinion, is still Jeffrey Friedl's Mastering Regular Expressions (O'Reilly), to which I have referred elsewhere. Friedl clearly explains how the several kinds of regular expression engines work, covers efficiency considerations, and discusses many other useful aspects of regular expressions.
I don't do that. This is just an introduction to one kind of regular expression (namely procmail regular expressions), and space and time are my largest considerations, not completeness. So if this article leaves you hungry for more, go get some real food; otherwise, I hope that this will serve at least as a means whereby you can say, "golly, I never understood that before and now I do."
On to the discussion!
Probably the kindest thing I can do for you is to disarm you of any ill feelings you may have toward regular expressions. By now (assuming you've read the previous tutorials), you should be comfortable using asterisks, plus signs, dots, and their friends.
What remains for you to become a master regular expression reader and writer is to actually read and write regular expressions. I'm just here to look over your shoulder and read these first few patterns to you in English so that you can do them for yourself later on.
We're going to discuss how regular expression engines work, in a nutshell, especially procmail's regular expression engine (since it behaves a little differently than most regular expression engines do). Along the way, we'll cover a few examples and hopefully demystify how stuff (namely procmail's regex matching) works.
With monotonous regularity, a regular expression engine will match precisely what it is told to, over and over, in exactly the same way. The behavior is predictable and easily understood by humans if a few simple rules are kept in mind. These rules are:
leftmost match
shortest match
longest match to the right of \/
We'll discuss each of these rules in turn.
Procmail does have something in common with other regular expression programs: it tries to match as far left as possible. This means that it begins its examination at the left end and tries to make the pattern match. It only moves right if it can't match. An illustration may help make things clearer.
Let's pretend we have the following pattern:
subjects
and the following line:
Subject: Messages should have clear subjects
The regular expression engine examines the pattern character by character:
Line:    Subject: Messages should have clear subjects
Pattern: s
The regex engine found a match right off the bat! Our pattern's first character ("s") matches the first character of the line we're examining ("S"; procmail ignores case unless told otherwise). It continues to match as much of the pattern as it can:
Line:    Subject: Messages should have clear subjects
Pattern: subject
So far so good. But now the next character of our pattern (also an "s") does not match the next character in the line (a colon, ":"). The regular expression engine now aborts, moves right, and starts over, comparing the first character in the pattern with the second character in the line:
Line:    Subject: Messages should have clear subjects
Pattern:  s
Nope.
Line:    Subject: Messages should have clear subjects
Pattern:   s
Nope.
Line:    Subject: Messages should have clear subjects
Pattern:    s
Nope. And so forth until it finds another "s":
Line:    Subject: Messages should have clear subjects
Pattern:            s
But since the second character of the pattern ("u") doesn't match the next character of the line ("s"), we keep on moving right until we finally get to this point:
Line:    Subject: Messages should have clear subjects
Pattern:                                     s
at which point we are able to match the entire pattern:
Line:    Subject: Messages should have clear subjects
Pattern:                                     subjects
Procmail prefers to match things at the left end of the string simply because that's where it starts from, but it prefers even more to have the pattern match completely, so it moves right as needed.
This behavior is what is meant when we say "leftmost match".
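To make the scan concrete, here is a small Python sketch of the walk-through above. It is not procmail's actual engine: the function name leftmost_match is made up for illustration, it handles only literal patterns, and it lowercases both strings on the assumption that we want procmail's default case-insensitive behavior.

```python
def leftmost_match(pattern: str, line: str) -> int:
    """Return the index of the leftmost match of a literal pattern, or -1."""
    pat = pattern.lower()   # assumption: ignore case, like procmail's default
    text = line.lower()
    for start in range(len(text) - len(pat) + 1):
        # Compare the pattern character by character from this position.
        matched = 0
        while matched < len(pat) and text[start + matched] == pat[matched]:
            matched += 1
        if matched == len(pat):
            return start    # the whole pattern matched, so stop here
        # Otherwise give up on this position and move one character right.
    return -1


line = "Subject: Messages should have clear subjects"
print(leftmost_match("subjects", line))  # 36
```

The partial match on "Subject:" at position 0 is abandoned at the colon, and the scan keeps moving right until the whole pattern fits at position 36.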
Unlike most regular expression engines (perl and egrep, for example), procmail is not greedy; rather, it is stingy. This means that it finds the shortest match that will satisfy the pattern supplied and then calls it quits. You may come across the mantra "leftmost-longest" when referring to other programs that do regular expressions, but not with procmail[1].
For example, consider the following pattern:
^Subject: .*dog
and the following subject line:
Subject: are dogs mentioned in the pseudographia?
In a traditional regular expression engine, this pattern will match as follows:
Line:     Subject: are dogs mentioned in the pseudographia?
Pattern: ^Subject: .............................*dog
See how the dot-star pattern is greedy? The asterisk wants to match as much as it possibly can before quitting. It could have left off at the first mention of "dog" but instead continued until it found another "dog" in the sentence, allowing the asterisk to consume as much of the line as it could. In procmail, however, the asterisk and plus sign are stingy: they want to match as little as possible:
Line:     Subject: are dogs mentioned in the pseudographia?
Pattern: ^Subject: ...*dog
And because, often, all we want out of a condition is that it be either true or false, this minimal matching behavior is sufficient: we don't care whether the dot-star (or dot-plus) pattern matched a little or a lot. The asterisk and plus sign quantifiers are kind of like slack in a rope, giving as much or little as needed, but preferring to give little.
This behavior is what is meant when we say "shortest match".
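Python's re module is a handy place to watch both behaviors side by side, since it offers a greedy .* and a lazy .*? quantifier. The sketch below is only an analogy: the lazy quantifier stands in for procmail's stingy matching, and Python's engine is not the one procmail uses.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# Greedy: .* grabs as much as it can and still lets "dog" match.
greedy = re.match(r"^Subject: .*dog", line)

# Lazy: .*? grabs as little as it can (a stand-in for procmail's stinginess).
stingy = re.match(r"^Subject: .*?dog", line)

print(repr(greedy.group(0)))  # 'Subject: are dogs mentioned in the pseudog'
print(repr(stingy.group(0)))  # 'Subject: are dog'
```

Both patterns succeed, so a condition built on either one is true; they differ only in how much of the line they consume.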
We just described how procmail is stingy when it matches with quantifiers: it makes the shortest match it can. Under some circumstances, however, procmail quantifiers will be greedy: they will match as much as they can. Not to worry! Under most circumstances it doesn't really matter whether procmail is stingy or greedy. The single time procmail quantifiers are greedy is explained here.
Most of the time when we write a procmail condition, we're only concerned with whether the condition is true or false. Consider the case where we simply want to see whether a word occurs in a header:
* H ?? foo
This will match "foo" wherever it appears in the headers of an email message, including inside longer words:
From: football@sports.tld
Subject: don't be a fool!
To: "Tom Foo" <tom@foo-family.com>
It doesn't matter to us at all whether procmail is stingy or greedy. It just has to find "foo" somewhere in the headers for the condition to be true.
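As a rough illustration in Python (a stand-in, not procmail itself), a plain true-or-false search gives the same answer for every one of these headers, no matter how much or how little of the line a quantifier would have consumed. The re.IGNORECASE flag is my assumption, chosen to mimic procmail's default case-insensitive matching.

```python
import re

headers = [
    "From: football@sports.tld",
    "Subject: don't be a fool!",
    'To: "Tom Foo" <tom@foo-family.com>',
]

# True for each header that contains "foo" anywhere, in any letter case.
results = [bool(re.search(r"foo", h, re.IGNORECASE)) for h in headers]
print(results)  # [True, True, True]
```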
Consider a second case. We want to find the name of the MIME multipart boundary so that we can do some processing later. MIME boundary names look something like this:
Content-Type: multipart/alternative;
boundary="----=_NextPart_Foo"
Our problem is that the name is the part between double-quotes after 'boundary='; we want to extract or save everything between the quotes into a variable so we can use it in a recipe later. To accomplish this, we have to use the match operator. The match operator is a backslash and a slash together, like this:
\/
The procmailrc(5) manpage indicates the purpose of the \/ operator:
\/ splits the expression in two parts. Everything matching the
right part will be assigned to the MATCH environment variable.
The match operator tells procmail to save whatever matches the part of the pattern to the right of the operator in a procmail variable named MATCH (referenced as $MATCH). Our task is to "extract" a string from a line of text, and the \/ operator is perfectly suited for this. If we simply wanted to match a boundary line, we might write something like this:
* H ?? boundary=".+"
But this will not preserve anything for us. We want to save the name of the boundary, so we use the \/ operator:
* H ?? boundary="\/.+
Wait a minute--we understand that quantifiers such as the plus sign are stingy, right? So under normal circumstances, the dot-plus combo will only match one character. That isn't enough, since the boundary name is longer than that.
Enter greed. Greed is our friend when it comes to saving text we've matched. In the above example, the dot-plus combo will become greedy and match as much as it can to the end of the line (since the dot matches any character except newlines). This means that instead of matching a single character, the dot-plus combo will match this:
----=_NextPart_Foo"
The above string will be stored in a variable called '$MATCH'. We could print its contents right now:
LOG="$MATCH
"
The only flaw with this regular expression is that it matches just a little too much. We don't really want the trailing double-quote, so we'll need to use something besides the dot for our wildcard:
* H ?? boundary="\/[^"]+
Perfect! We create a character class that includes everything that is not a double-quote (that is, it matches everything a dot does, except double-quote characters). This, because it is greedy to the right of the \/ operator, will match the entire boundary name without the final double-quote.
Let's read the pattern in English:
Match the text "boundary=", followed by a double-quote, followed
by one or more non-double-quote characters.
Everything matching the pattern to the right of '\/' (i.e., all
consecutive characters that are not double-quotes) is matched
greedily and is saved in the variable $MATCH.
This construct might appear in a recipe like this:
# Log the MIME boundary name when a boundary parameter is found in the headers
:0
* H ?? boundary="\/[^"]+
{
  LOG="Found boundary name: ${MATCH}
"
}
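The same extraction can be imitated in Python, with a capture group standing in for the \/ operator. This is only a sketch: the header is written on one line for simplicity, the variable names are hypothetical, and Python's greedy quantifiers play the role of procmail's behavior to the right of \/.

```python
import re

# Assumption: the folded Content-Type header has been joined into one line.
header = 'Content-Type: multipart/alternative; boundary="----=_NextPart_Foo"'

# The dot matches the closing quote too, so the capture runs one character long.
with_dot = re.search(r'boundary="(.+)', header)

# A negated character class stops just before the closing double-quote.
with_class = re.search(r'boundary="([^"]+)', header)

print(with_dot.group(1))    # ----=_NextPart_Foo"
print(with_class.group(1))  # ----=_NextPart_Foo
```

The parenthesized group plays the part of $MATCH: it holds only what matched to the right of the point where \/ would sit.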
Now let's revisit our previous example. Here is our regular expression, except that we've added the \/ operator:
^Subject: .*\/dog
and here is our line:
Subject: are dogs mentioned in the pseudographia?
Previously, and without the \/ operator, we remember that our pattern matched the line minimally like this:
Line:     Subject: are dogs mentioned in the pseudographia?
Pattern: ^Subject: ...*dog
How will it match now? Let's walk through it in English:
Match "Subject: " followed by zero or more characters minimally
(we're still left of \/!). Match the word "dog" and save it in
$MATCH.
This means that we still match like above:
Line:     Subject: are dogs mentioned in the pseudographia?
Pattern: ^Subject: ...*dog
and the $MATCH variable contains the word 'dog'. Now let's move our dot-star pattern to the right of the \/ operator, like this:
^Subject: \/.*dog
Now what does the $MATCH variable contain? Let's walk through it in English again:
Match "Subject: ". Greedily match zero or more characters (.*)
followed by "dog".
Now, because the dot-star is greedy (on the right of \/), our $MATCH variable contains:
are dogs mentioned in the pseudog
because the dot-star pattern matched like this:
Line:     Subject: are dogs mentioned in the pseudographia?
Pattern: ^Subject: .............................*dog
This is the difference between minimal (stingy) and maximal (greedy) matching.
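Here is the same comparison sketched in Python, for readers who want to experiment. The capture group marks the part that would land in $MATCH; a lazy .*? imitates procmail's stingy matching to the left of \/, and an ordinary greedy .* imitates the matching to the right of it. These are analogies I am assuming for illustration, not a description of procmail's internals.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# Like ^Subject: .*\/dog : the dot-star is left of the operator, so stingy.
left = re.search(r"^Subject: .*?(dog)", line)

# Like ^Subject: \/.*dog : the dot-star is right of the operator, so greedy.
right = re.search(r"^Subject: (.*dog)", line)

print(repr(left.group(1)))   # 'dog'
print(repr(right.group(1)))  # 'are dogs mentioned in the pseudog'
```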
Procmail finds the "leftmost-shortest" match that it can, according to the CAVEATS section of the procmailsc(5) manpage. Most of the time it doesn't matter whether we match shortest or longest, since we're only looking for a "true-false" match. However, when we want to "save" or "extract" a string out of a line being scanned, we use the match operator \/, which also has the side effect of making procmail find the "leftmost-longest" match for the part of the pattern that appears to the right of \/.
[1] Except to the right of the match operator (\/), in which case quantifiers are greedy. See the CAVEATS section of procmailsc(5).
Simple Regular Expressions, Part III
proctut6
procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1), Jeffrey Friedl's Mastering Regular Expressions (O'Reilly), regex(3).
Scott Wiersdorf <scott@perlcode.org>
Copyright (c) 2003 Scott Wiersdorf. All rights reserved.
$Id: proctut5.pod,v 1.1 2003/10/18 03:15:46 deep Exp $