=head1 NAME

proctut5 - Thinking "Regex"

=head1 SYNOPSIS

This tutorial describes how to think like a regular expression engine.
Once your brain becomes accustomed to thinking that way, reading and
writing regular expressions becomes a much easier and less error-prone
experience.

=head1 DISCLAIMER

First of all, I should probably point out that there have been many
excellent books written on the art of writing regular expressions. The
finest, in my opinion, is still Jeffrey Friedl's I<Mastering Regular
Expressions> (O'Reilly), to which I have referred elsewhere. Friedl
clearly explains how the several kinds of regular expression engines
work, covers efficiency considerations, and many other useful aspects
of regular expressions.

I don't do that. This is just an introduction to one kind of regular
expression (namely procmail regular expressions), and space and time
are my largest considerations, not completeness. So if this article
leaves you hungry for more, go get some real food; otherwise, I hope
that this will serve at least as a means whereby you can say, "golly,
I never understood that before and now I do."

On to the discussion!

=head1 DESCRIPTION

Probably the kindest thing I can do for you is to disarm you of any
ill feelings you may have toward I<regular expressions>. By now
(assuming you've read the previous tutorials), you should be
comfortable using asterisks, plus signs, dots, and their friends.

What remains for you to become a master regular expression reader and
writer is to actually read and write regular expressions. I'm just
here to look over your shoulder and read these first few patterns to
you in English so that you can do them for yourself later on.

We're going to discuss how regular expression engines work, in a
nutshell, especially procmail's regular expression engine (since it
behaves a little differently than most regular expression engines do).
Along the way, we'll cover a few examples and hopefully demystify how
stuff (namely procmail's regex matching) works.

=head2 Regex Engines Are Stoopid

With monotonous regularity, a regular expression engine will match
precisely what it is told to over and over in exactly the same way.
The behavior is predictable and easily understandable by humans if a
few simple rules are kept in mind. These rules are:

=over

=item 1

leftmost match

=item 2

shortest match

=item 3

longest match to the right of C<\/>

=back

We'll discuss each of these rules in turn.

=head2 Leftmost Match

Procmail does have something in common with other regular expression
programs: it tries to match as far left as possible. This means that
it begins its examination at the left end and tries to make the
pattern match. It only moves right if it can't match. An illustration
may help make things clearer.

Let's pretend we have the following pattern:

    subjects

and the following line:

    Subject: Messages should have clear subjects

The regular expression engine examines the pattern character by
character:

    Line:    Subject: Messages should have clear subjects
    Pattern: s

The regex engine found a match right off the bat! Our pattern's first
character ("s") matches the first character of the line we're
examining ("S"), because procmail ignores case by default. It
continues to match as much of the pattern as it can:

    Line:    Subject: Messages should have clear subjects
    Pattern: subject

So far so good. But now the next character of our pattern (also an
"s") does I<not> match the next character in the line (a colon, ":").
The regular expression engine now aborts, moves right, and starts
over, comparing the first character in the pattern with the second
character in the line:

    Line:    Subject: Messages should have clear subjects
    Pattern:  s

Nope.

    Line:    Subject: Messages should have clear subjects
    Pattern:   s

Nope.

    Line:    Subject: Messages should have clear subjects
    Pattern:    s

Nope. And so forth until it finds another "s":

    Line:    Subject: Messages should have clear subjects
    Pattern:            s

But since the second character of the pattern ("u") doesn't match the
next character of the line ("s"), we keep on moving right until we get
finally to this point:

    Line:    Subject: Messages should have clear subjects
    Pattern:                                     s

at which point we are able to match the entire pattern:

    Line:    Subject: Messages should have clear subjects
    Pattern:                                     subjects

Procmail I<prefers> to match things at the left end of the string
simply because that's where it starts from, but it prefers even
I<more> to have the pattern match completely, so it moves right as
needed.

This behavior is what is meant when we say "leftmost match".
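
If you'd like to watch that walk happen, here is a minimal sketch in
Python. It is not procmail's code, just the sliding comparison
described above written out by hand; the function names are made up
for this illustration, and the case-insensitive comparison mirrors
procmail's default behavior.

```python
def matched_prefix_len(pattern, line, start):
    """Count how many leading pattern characters match line at `start`."""
    count = 0
    for offset, wanted in enumerate(pattern):
        pos = start + offset
        # Stop at the end of the line or at the first mismatch.
        if pos >= len(line) or line[pos].lower() != wanted.lower():
            break
        count += 1
    return count


def leftmost_match(pattern, line):
    """Return the smallest start offset where the whole pattern matches, or -1."""
    for start in range(len(line) - len(pattern) + 1):
        if matched_prefix_len(pattern, line, start) == len(pattern):
            return start
    return -1


line = "Subject: Messages should have clear subjects"
print(matched_prefix_len("subjects", line, 0))  # 7: "subject" matches, then ":" fails
print(leftmost_match("subjects", line))         # 36: the last word on the line
```

The first call shows the false start at the left end of the line; the
second shows where the engine finally settles after moving right.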

=head2 Shortest Match

Unlike most regular expression engines (perl and egrep, for example),
procmail is not I<greedy>; rather, it is I<stingy>. This means that
it finds the shortest match that will satisfy the pattern supplied and
then calls it quits. You may come across the mantra "leftmost-longest"
in connection with other programs that do regular expressions, but it
does not apply to procmailL<[1]|proctut5.pod/Note_1>.

For example, consider the following pattern:

    ^Subject: .*dog

and the following subject line:

    Subject: are dogs mentioned in the pseudographia?

In a traditional regular expression engine, this pattern will match
the line like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: .............................*dog

See how the dot-star pattern is I<greedy>? The asterisk wants to match
as much as it possibly can before quitting. It could have left off at
the first mention of "dog" but instead continued until it found
another "dog" in the sentence, allowing the asterisk to consume as
much of the line as it could. In procmail, however, the asterisk and
plus sign are I<stingy>: they want to match as little as possible:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

And because, often, all we want out of a condition is that it be
either true or false, this minimal matching behavior is sufficient: we
don't care whether the dot-star (or dot-plus) pattern matched a little
or a lot. The asterisk and plus sign quantifiers are kind of like
slack in a rope, giving as much or little as needed, but preferring
to give little.

This behavior is what is meant when we say "shortest match".
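
The two behaviors can be compared side by side in Python, whose C<re>
module is greedy by default but offers a lazy C<*?> quantifier. This
is only an analogy: I am assuming the lazy quantifier is a fair
stand-in for procmail's stinginess on this particular pattern, not
claiming that the two engines work the same way inside.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# Python's default quantifier is greedy, like a traditional engine.
greedy = re.match(r"Subject: .*dog", line)

# The lazy form (*?) stands in here for procmail's stingy matching.
stingy = re.match(r"Subject: .*?dog", line)

print(greedy.group(0))  # Subject: are dogs mentioned in the pseudog
print(stingy.group(0))  # Subject: are dog
```

Both patterns succeed, which is all a true-or-false condition needs;
they differ only in how much of the line they swallow.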

=head2 Longest Match to the Right of \/

We just described how procmail is I<stingy> when it matches with
quantifiers: it makes the shortest match it can. Under some
circumstances, however, procmail quantifiers will be I<greedy>: they
will match as much as they can. Not to worry! Under most
circumstances it doesn't really matter whether procmail is I<stingy>
or I<greedy>. The one situation in which procmail quantifiers are
greedy is explained here.

Most of the time when we write a procmail condition, we're only
concerned with whether the condition is true or false. Consider the
case where we simply want to see whether a word occurs in a header:

    * H ?? foo

This will match the string "foo" anywhere in the headers of an email
message, whether it stands alone or is part of a longer word:

    From: football@sports.tld
    Subject: don't be a fool!
    To: "Tom Foo" <tom@foo-family.com>

It doesn't matter to us at all whether procmail is stingy or greedy.
It just has to find "foo" somewhere in the headers for the condition
to be true.
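
As a rough check, the same true-or-false question can be asked in
Python. This sketch assumes that an unanchored, case-insensitive
C<re.search> is a reasonable approximation of a bare procmail
condition; it tests each header line separately, which is a
simplification of how procmail scans the whole header.

```python
import re

headers = [
    "From: football@sports.tld",
    "Subject: don't be a fool!",
    'To: "Tom Foo" <tom@foo-family.com>',
]

# Keep every header line in which "foo" appears anywhere, ignoring case.
hits = [h for h in headers if re.search("foo", h, re.IGNORECASE)]

print(len(hits))  # 3: "football", "fool" and "Foo" all contain "foo"
```

Note that none of the hits needed "foo" to be a word on its own.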

Consider a second case. We want to find the name of the MIME multipart
boundary so that we can do some processing later. MIME boundary names
look something like this:

    Content-Type: multipart/alternative;
        boundary="----=_NextPart_Foo"

Our problem is that the name is the part between double-quotes after
'boundary='; we want to I<extract> or I<save> everything between the
quotes into a variable so we can use it in a recipe later. To
accomplish this, we have to use the B<match> operator. The match
operator is a backslash and a slash together, like this:

    \/

The procmailrc(5) manpage indicates the purpose of the C<\/> operator:

    \/ splits the expression in two parts. Everything matching the
    right part will be assigned to the MATCH environment variable.

The match operator tells procmail to save everything to the right of
the operator in a procmail variable named B<$MATCH>. Our task is to
"extract" a string from a line of text and the C<\/> operator is
perfectly suited for this. Were we simply wanting to match a boundary
line, we might write something like this:

    * H ?? boundary=".+"

But this will not preserve anything for us. We want to I<save> the
name of the boundary, so we use the C<\/> operator:

    * H ?? boundary="\/.+

Wait a minute: we understand that quantifiers such as the plus sign
are stingy, right? So under normal circumstances, the dot-plus combo
will only match one character. That isn't enough since the boundary
name is longer than that.

Enter I<greed>. Greed is our friend when it comes to saving text we've
matched. In the above example, the dot-plus combo will become greedy
and match as much as it can to the end of the line (since the dot
matches any character except newlines). This means that instead of
matching a single character, the dot-plus combo will match this:

    ----=_NextPart_Foo"

The above string will be stored in a variable called '$MATCH'. We
could print its contents right now:

    LOG="$MATCH
    "

The only flaw with this regular expression is that it matches just a
little too much. We don't really want the trailing double-quote, so
we'll need to use something besides the dot for our wildcard:

    * H ?? boundary="\/[^"]+

Perfect! We create a character class that includes everything that is
not a double-quote (that is, it matches everything a dot does, except
double-quote characters). This, because it is greedy to the right of
the C<\/> operator, will match the entire boundary name without the
final double-quote.

Let's read the pattern in English:

    Match the string "boundary=", followed by a double-quote, followed
    by one or more non-double-quote characters.

    Everything matching the pattern to the right of '\/' (i.e., all
    consecutive characters that are not double-quotes) is matched
    greedily and is saved in the variable $MATCH.

This construct might appear in a recipe like this:

    :0
    * H ?? boundary="\/[^"]+
    {
        LOG="Found boundary name: ${MATCH}
    "
    }
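
The difference between the dot and the character class can be tried
out in Python. In this sketch a capture group plays the part of
everything to the right of C<\/>, and the captured text plays the part
of $MATCH; that mapping is my assumption for illustration, not
something procmail provides.

```python
import re

header = 'Content-Type: multipart/alternative;\n    boundary="----=_NextPart_Foo"'

# The dot matches any character except a newline, so it runs to end of line.
too_much = re.search(r'boundary="(.+)', header, re.IGNORECASE)

# The negated class stops at the first double-quote it meets.
just_right = re.search(r'boundary="([^"]+)', header, re.IGNORECASE)

print(too_much.group(1))    # ----=_NextPart_Foo"
print(just_right.group(1))  # ----=_NextPart_Foo
```

Both captures are greedy; the second one simply has nothing left to
grab once it reaches the closing quote.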

Now let's revisit our previous example. Here is our regular
expression, except that we've added the C<\/> operator:

    ^Subject: .*\/dog

and here is our line:

    Subject: are dogs mentioned in the pseudographia?

Previously, and without the C<\/> operator, we remember that our
pattern matched the line minimally like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

How will it match now? Let's walk through it in English:

    Match "Subject: " followed by zero or more characters minimally
    (we're still left of \/!). Match the word "dog" and save it in
    $MATCH.

This means that we I<still> match like above:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: ...*dog

and the $MATCH variable contains the word 'dog'. Now let's move our
dot-star pattern to the I<right> of the C<\/> operator, like this:

    ^Subject: \/.*dog

Now what does the $MATCH variable contain? Let's walk through it in
English again:

    Match "Subject: ". Greedily match zero or more characters (.*)
    followed by "dog".

Now, because the dot-star is greedy (on the right of C<\/>), our
$MATCH variable contains:

    are dogs mentioned in the pseudog

because the dot-star pattern matched like this:

    Line:     Subject: are dogs mentioned in the pseudographia?
    Pattern: ^Subject: .............................*dog

This is the difference between minimal (stingy) and maximal (greedy)
matching.
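
Here is the same pair of patterns imitated in Python. The sketch
assumes that a lazy quantifier approximates procmail's matching to the
left of C<\/>, that a greedy one approximates matching to the right of
it, and that a capture group marks what would land in $MATCH. It is an
imitation for study, not procmail's engine.

```python
import re

line = "Subject: are dogs mentioned in the pseudographia?"

# Imitates ^Subject: .*\/dog (stingy dot-star on the left, "dog" saved).
left = re.match(r"Subject: .*?(dog)", line)

# Imitates ^Subject: \/.*dog (greedy dot-star on the right, all of it saved).
right = re.match(r"Subject: (.*dog)", line)

print(left.group(1))   # dog
print(right.group(1))  # are dogs mentioned in the pseudog
```

Moving the dot-star across the capture boundary is the only change
between the two patterns, yet the saved text is very different.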

=head1 SUMMARY

Procmail finds the "leftmost-shortest" match that it can, according
to the CAVEATS section of the procmailsc(5) manpage. Most of the time it
doesn't matter whether we match shortest or longest, since we're only
looking for a "true-false" match. However, when we want to "save" or
"extract" a string out of a line being scanned, we use the match
operator C<\/>, which also has the side-effect of making procmail find
the "leftmost-longest" match in patterns that appear to the right of
C<\/>.

=head1 NOTES

=over 4

=item Note 1

Procmail quantifiers are always stingy except when they are to the
right of the matching operator (C<\/>), in which case quantifiers are
greedy. See the CAVEATS section of procmailsc(5).

=back

=head1 PREVIOUS

L<Simple Regular Expressions, Part III|proctut4.pod>

=head1 NEXT

I<proctut6>

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1),
Jeffrey Friedl's I<Mastering Regular Expressions> (O'Reilly), regex(3).

=head1 AUTHOR

Scott Wiersdorf <scott@perlcode.org>

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut5.pod,v 1.1 2003/10/18 03:15:46 deep Exp $