=head1 NAME

proctut2 - Simple Regular Expressions, Part I

=head1 SYNOPSIS

Procmail regular expressions are used chiefly in condition lines and
are integral to writing successful, succinct L<recipes|proctut1.pod>.
A knowledge of regular expressions carries over into other domains,
including shell scripting, general programming, and other text
processing utilities.

Familiarity with regular expressions will allow you to read and
understand other people's regular expressions as well as craft your
own to make your mail processing precise and efficient.

=head1 DESCRIPTION

Procmail uses I<regular expressions> to "find things" in email
messages.  For example, here is a simple procmail recipe that uses a
regular expression in the condition for finding the phrase "table
tennis":

    :0 HB:
    * table tennis
    /var/mail/pingpong

Any email message with the phrase "table tennis" appearing in either
the header or the body will match this condition and the recipe's
action line will trigger (delivering the mail to F</var/mail/pingpong>).

"But wait," you say, "that's just a simple string match."

Yes; simple string matching is but a taste of what regular expressions
can accomplish. For the remainder of this document we will freely
interchange the phrase "regular expression" with the more economical
"regex"L<[1]|proctut2.pod/Note_1> and "regexen" as the mildly comedic
plural. We will also use the term "pattern" to mean a part or a whole
regular expression.

=head1 Funny Characters

You may have noticed funny characters in procmail recipes. For
example, does this sort of recipe give you happy thoughts or
bewilderment?

    :0:
    * ^Content-Type: multipart/[^;]+;[	]*boundary="?\/[^"]+
    mime

This condition contains a variety of regular expression
I<tokens>L<[2]|proctut2.pod/Note_2>, including: B<^>, B<[>, B<]>,
B<+>, B<*>, and B<?>. These characters have special meaning to
procmail; they don't just match their literal ASCII
equivalentsL<[3]|proctut2.pod/Note_3>.

In the following sections, we'll cover the basics of some of these
regular expression tokens. By the end, you should be familiar with
them, if not yet comfortable using them in your recipes.

To avoid distraction, we will I<not> include the leading '*' that
denotes a procmail recipe condition. This means instead of this:

    * nice regex!

we will simply type:

    nice regex!

And now, the funny characters.

=head2 The Dot

The dot (".") or period (for you grammarian purists) is the most
commonly used regular expression character. It represents
anyL<[4]|proctut2.pod/Note_4> character.  Consider the following regular expression:

    My ...'s name is Larry

This expression will match the following sentences:

    My dog's name is Larry
    My cat's name is Larry
    My ear's name is Larry
    My @ B's name is Larry

The dot matches I<any> characterL<[4]|proctut2.pod/Note_4>. Three dots match any
three characters, as our example illustrates. The last example in our
list contains an 'at' symbol and a space; these are also matched by
the dot.

A dot's power is extended greatly when combined with quantifiers.

=head2 Quantifiers

Having the dot is useful because we can match a variety of things:

    What the .... is going on here

will match a variety of four letter words:

    What the heck is going on here
    What the fish is going on here
    What the toot is going on here

But now we want to match 3 letter words too:

    What the ... is going on here

Wouldn't it be nice if we could write a regex that would match both 3
and 4 letter words without having to write two expressions? With
quantifiers we can:

    What the ....? is going on here

A I<quantifier> specifies "how many" of whatever character the
quantifier follows. In this example, the question mark serves as a
quantifier for the dot immediately preceding it. More on quantifiers
below.

=over 4

=item The Question Mark: B<?>

This introduces our first quantifier: the question mark ("?"). As with
all quantifiers, the question mark by itself does not match anything;
it is a I<quantifier>.  It only has meaning when placed I<after>
another characterL<[5]|proctut2.pod/Note_5>.

The question mark indicates that the preceding character may match
I<zero or one times>. In our example, the question mark is applied to
the last dot in our expression: I<....?>

If we were to read this regex in English, it would say: "Match any
character, followed by any character, followed by any character,
I<optionally> followed by any character" or more briefly, "Match any
three characters followed by zero or one additional characters."

Now our expression matches:

    What the sam is going on here
    What the hill is going on here

We can use multiple question marks to get exactly what we want:

    What the ..?.?.?.?.?.?.? is going on here

can be read: "Any character followed by zero to seven additional
characters," that is to say one, two, three, four, five, six, seven,
or eight letter words:

    What the samhill is going on here
    What the gol durn is going on here
    What the c is going on here
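
As a minimal sketch, here is the same pattern inside a complete
recipe. It assumes the phrase appears in the message body (hence the
B<B> flag) and uses a hypothetical mailbox named F<exclaim>. Dialects
such as Perl's would write the middle part as I<.{1,8}>; procmail has
no such range quantifier, so the optional dots are spelled out:

    # "exclaim" is an illustrative mailbox name, not a procmail default
    :0 B:
    * What the ..?.?.?.?.?.?.? is going on here
    exclaim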

The question mark quantifier is sometimes forgotten because of its
limited application compared to the other quantifiers, but this should
not be.  There are entire classes of regular expression problems that
can only be solved with a question mark.

For example, if we want to match a sentence where a plural may or may
not occur, we need a question mark:

    Please let the dogs? out

This will match:

    Please let the dogs out

and:

    Please let the dog out

It will not match more than one 's':

    Please let the dogsss out

The question mark means "zero or one of the preceding character"
(sometimes it is read "with an optional x"). The power of the question
mark is extended if the preceding character is a special regular
expression character (e.g., the dot).

=item The Asterisk: B<*>

The asterisk ("*") is the next quantifier we will discuss. The
asterisk means "zero or more" of the preceding character. This way,
we can match "whatever". In fact, most regular expressions we will
encounter will have I<.*> in them, meaning "and anything else (except
newlines)".

Here is a pattern that will match Subject lines that have 'ADV' and
'mortgage' in them:

    ^Subject: ADV.* mortgage

Let's read it in English: The word "Subject:"L<[6]|proctut2.pod/Note_6>, followed by
a space, followed by the three letters ADV, followed by anything
else (including nothing else), followed by a space and the word
"mortgage".

This is useful because it will match subjects like theseL<[7]|proctut2.pod/Note_7>:

    Subject: ADV mortgage rates have dropped
    Subject: ADV          mortgages have never been lower!
    Subject: Advertisement: lower your mortgage payment
    Subject: advanced mortgage seminars

In the first example, I<.*> matches nothing; the asterisk exercises its
ability to match zero of the preceding characters (a dot, in this
case). In the second example, I<.*> matches all the spaces after 'ADV'
(except the space preceding "mortgages" because we put that literal
space in our pattern).  In the third example, I<.*> matches
"ertisement: lower your". In the fourth example, I<.*> matches "anced"
as part of "advanced".
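
To see the pattern at work, here is a sketch of two alternative
recipes; in practice you would use one or the other. The mailbox names
F<ads> and F<ads-strict> are hypothetical. The second recipe adds the
B<D> flag, which makes the match case-sensitive. With it, only the
first two subjects above would match, because "Advertisement" and
"advanced" do not contain an uppercase "ADV":

    # "ads" is an illustrative mailbox name
    :0:
    * ^Subject: ADV.* mortgage
    ads

    # Same condition, but case-sensitive because of the D flag
    :0 D:
    * ^Subject: ADV.* mortgage
    ads-strict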

B<Another example>

We can see how powerful the dot and asterisk can be together; but the
asterisk needn't apply only to dots.  It can apply to literal
characters as well. Consider the following regular expression:

    Subject: .*!!!*

This regex makes use of the familiar "dot-star" pattern we've seen
before. We match the word "Subject:" followed by a space, followed by
"anything" (there's our dot-star combo), followed by two exclamation
points, followed by zero or more exclamation points.

So, our pattern will match the following lines:

    Subject: dude!!
    Subject: Hey!!!!!

In the first case, the dot-star (I<.*>) in our pattern matches
"dude" and the next two exclamation points in our pattern match the
two exclamation points following "dude". The final I<!*> of the regex
matches I<nothing>, because there are no more exclamation points that
haven't been matched already.

In the second case, the dot-star (I<.*>) in our pattern matches
"Hey!!!". The next two exclamation points in our pattern match the
final two exclamation points that the dot-star didn't catch; the final
I<!*> in our pattern matches nothing again.

Was that a surprise? Ah! Likely, we forgot that a dot matches
I<anything> and when combined with a quantifier it matches as much as
it needs to for the expression to be "true".

Ok, so these weren't very good examples of how to apply the asterisk
to a literal character (but we learned something useful anyway). We'll
have a better example ready as we discuss the next quantifier.

=item The Plus Sign: B<+>

The plus ("+") is the last quantifier we'll discuss in this tutorial.
The plus is closely related to the asterisk, but it means "one or
more" rather than "zero or more". This means that there must be at
least one instance of the preceding character. Consider this example:

    Subject: .*!!!+

This is similar to our previous example with asterisk, except that
we've exchanged the final asterisk for a plus. If we were to
reconsider our previous example lines:

    Subject: dude!!
    Subject: Hey!!!!!

We will get different results than with asterisk. For starters, the
first line will not match our pattern. The dot-star (I<.*>) matches
"dude" and then we have two literal exclamation points. The final
I<!+> of our regex has nothing left to match, and since the plus must
match I<at least one> exclamation point, our regex fails for this
line.

The second example I<does> match, but it does so in a different way
than the asterisk example did.  The leading dot-star matches "Hey!!",
instead of "Hey!!!" (one fewer exclamation point). The next two
literal exclamation points match, and the final bang-plus (I<!+>) of
our regex matches the trailing exclamation point.

Why does the first dot-star in our regex only match "Hey!!" instead
of "Hey!!!" like it did earlier with the asterisk? Quantifiers match
as much as they can while still allowing subsequent parts of the
pattern to match: the regex engine I<wants> to find matches. In order
to allow the bang-plus to match, the dot-star had to match less than
it did before.
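
Put into a recipe, the plus version might look like the sketch below.
It files any message whose Subject contains three or more consecutive
exclamation points into a hypothetical mailbox named F<shouting>; the
leading caret is added here so the match starts at the beginning of
the header line:

    # "shouting" is an illustrative mailbox name
    :0:
    * ^Subject: .*!!!+
    shouting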

=back

As a final example, we'll cover the three quantifiers one last time.

=over 4

=item B<Question Mark>

Let's take our pattern and make it more explicit:

    Subject: Hey!?

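This matches both of the following. The question mark never consumes
more than one exclamation point, although a line with more of them
still contains a match because the pattern is not anchored at its end: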

    Subject: Hey
    Subject: Hey!

The question mark when applied to the exclamation point means "match
zero or one exclamation points".

=item B<Asterisk>

Now we'll use the asterisk:

    Subject: Hey!*

This will match:

    Subject: Hey
    Subject: Hey!
    Subject: Hey!!
    Subject: Hey!!!!!
    etc.

The asterisk applied to the exclamation point means "match zero or
more exclamation points".

=item B<Plus>

Finally, the plus:

    Subject: Hey!+

This matches:

    Subject: Hey!
    Subject: Hey!!
    Subject: Hey!!!!!
    etc.

It will not match:

    Subject: Hey

because the plus sign indicates that there I<must> be at least one of
the preceding character (in this case, when applied to the
exclamation point, it means "match one or more exclamation points").

=back

=head1 SUMMARY

Procmail regular expressions are much like other common egrep-like
regular expression languages. Regular expressions are simply string
matches in which certain characters have special meanings. For
example, the dot (".") matches any characterL<[4]|proctut2.pod/Note_4>. Combined with
I<quantifiers>, the dot makes a potent regular expression character.

Quantifiers do not match anything by themselves; they only determine
"how many" of the preceding character to match. The most common
I<quantifiers> are the question mark ("?"), which matches zero or one
of the preceding character; the asterisk ("*"), which matches zero or
more of the preceding character; and the plus ("+"), which matches one
or more of the preceding character.

=head1 NOTES

=over 4

=item Note 1

"Regex" is pronounced "reg-ex" as in "REGular EXpression" with a hard
'G'. You may come across "regexp" in your personal studies.  This
contraction is notoriously difficult (and highly discouraged) to say
aloud without emitting spittle and is likewise frowned upon in
writing. The opinions in this article reflect exactly those views of
the author and, for reasons of world peace, should be considered
authoritative.

=item Note 2

For a complete list, see L<procmailrc(5)/Extended Regular Expressions>.

=item Note 3

There is a way to match the literal tokens by I<escaping> them with a
backslash. Here, for example, is how to match a literal period
character: I<\.>
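
As a small illustration, assume we want to catch a hypothetical
version string "1.2" in a Subject line and file it in a hypothetical
mailbox named F<versions>. Compare the escaped and unescaped forms; you
would use one or the other:

    # Escaped: the period must be a literal "." ("version 1.2")
    :0:
    * ^Subject:.*version 1\.2
    versions

    # Unescaped: the dot also accepts "version 1x2", "version 1-2", etc.
    :0:
    * ^Subject:.*version 1.2
    versions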

=item Note 4

Er, almost any character. It doesn't match a newline character.  See
L<procmailrc(5)/Extended Regular Expressions>.

=item Note 5

Yet another half-truth. Quantifiers have meaning when placed after
I<characters>, I<groups>, and I<character classes>. We can call these
things collectively "entities", "objects", "units", or my personal
preference (betraying my Perl background), "thingies".  In this
tutorial, we simply say "character", but we really mean "thingies".

=item Note 6

The caret means "at the beginning of the line." We'll cover that more
thoroughly in another tutorial.

=item Note 7

Procmail, it should be mentioned, matches I<case-insensitively>, that
is, without regard to upper or lowercase letters. For case-sensitive
matching, you'll need to enable the B<D> flag in your
L<recipe|proctut1.pod>.

=back

=head1 PREVIOUS

L<Anatomy of a Procmail Recipe, Part I|proctut1.pod>

=head1 NEXT

L<Simple Regular Expressions, Part II|proctut3.pod>

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1),
Jeffrey Friedl's I<Mastering Regular Expressions> (O'Reilly)

=head1 REALLY SEE ALSO

Jeffrey Friedl's I<Mastering Regular Expressions> (O'Reilly) is a
masterful and thorough work on regular expressions. Anyone serious
about becoming a competent regular expression writer should read this
book. It covers a lot of technical ground, but the examples are
excellent and most of it applies directly to procmail regular
expressions (one important exception being procmail's lack of a
numerical range quantifier like Perl's {n,m} syntax).

=head1 AUTHOR

Scott Wiersdorf <scott@perlcode.org>

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut2.pod,v 1.12 2003/10/15 04:35:46 deep Exp $