=head1 NAME

proctut2 - Simple Regular Expressions, Part I

=head1 SYNOPSIS

Procmail regular expressions are used chiefly in condition lines and are integral to writing successful, succinct recipes. A knowledge of regular expressions carries over into other domains, including shell scripting, general programming, and other text processing utilities. Familiarity with regular expressions will allow you to read and understand other people's regular expressions as well as craft your own to make your mail processing precise and efficient.

=head1 DESCRIPTION

Procmail uses I<regular expressions> to "find things" in email messages. For example, here is a simple procmail recipe that uses a regular expression in the condition for finding the phrase "table tennis":

    :0 HB:
    * table tennis
    /var/mail/pingpong

Any email message with the phrase "table tennis" appearing in either the header or the body will match this condition and the recipe's action line will trigger (delivering the mail to F</var/mail/pingpong>).

"But wait," you say, "that's just a simple string match." Yes; simple string matching is but a taste of what regular expressions can accomplish.

For the remainder of this document we will freely interchange the phrase "regular expression" with the more economical "regex"L<[1]|proctut2.pod/Note_1> and "regexen" as the mildly comedic plural. We will also use the term "pattern" to mean a part or a whole regular expression.

=head1 Funny Characters

You may have noticed funny characters in procmail recipes. For example, does this sort of recipe give you happy thoughts or bewilderment?

    :0:
    * ^Content-Type: multipart/[^;]+;[ ]*boundary="?\/[^"]+
    mime

This condition contains a variety of regular expression I<tokens>L<[2]|proctut2.pod/Note_2>, including: B<^>, B<.>, B<[>, B<]>, B<+>, B<*>, and B<?>. These characters have special meaning to procmail; they don't just match their literal ASCII equivalentsL<[3]|proctut2.pod/Note_3>. Sometimes we call a regular expression token a I<metacharacter>.
In the following sections, we'll cover the basics of some of these regular expression tokens. By the end, you should be familiar with them, if not yet comfortable using them in your recipes.

To avoid distraction, we will I<not> include the leading '*' that denotes a procmail recipe condition. This means instead of this:

    * nice regex!

we will simply type:

    nice regex!

And now, the funny characters.

=head2 The Dot

The dot (".") or period (for you grammarian purists) is the most commonly used regular expression character. It represents anyL<[4]|proctut2.pod/Note_4> character. Consider the following regular expression:

    My ...'s name is Larry

This expression will match the following sentences:

    My dog's name is Larry
    My cat's name is Larry
    My ear's name is Larry
    My @ B's name is Larry

The dot matches I<any> characterL<[4]|proctut2.pod/Note_4>. Three dots match any three characters, as our example illustrates. The last example in our list contains an 'at' symbol and a space; these are also matched by the dot.

A dot's power is extended greatly when combined with quantifiers.

=head2 Quantifiers

Having the dot is useful because we can match a variety of things:

    What the .... is going on here

will match a variety of four-letter words:

    What the heck is going on here
    What the fish is going on here
    What the toot is going on here

But now we want to match three-letter words too:

    What the ... is going on here

Wouldn't it be nice if we could write a regex that would match both three- and four-letter words without having to write two expressions? With quantifiers we can:

    What the ....? is going on here

A I<quantifier> means "how many" of whatever character the quantifier follows. In this example, the question mark serves as a quantifier for the dot immediately preceding it. More on quantifiers below.

=over 4

=item The Question Mark: B<?>

This introduces our first quantifier: the question mark ("?"). As with all quantifiers, the question mark by itself does not match anything; it is a I<modifier>.
It only has meaning when placed I<after> another characterL<[5]|proctut2.pod/Note_5>. The question mark indicates that the preceding character may match I<zero or one times>. In our example, the question mark is applied to the last dot in our expression: I<....?>

If we were to read this regex in English, it would say: "Match any character, followed by any character, followed by any character, I<optionally> followed by any character" or more briefly, "Match any three characters followed by zero or one additional characters." Now our expression matches:

    What the sam is going on here
    What the hill is going on here

We can use multiple question marks to get exactly what we want:

    What the ..?.?.?.?.?.?.? is going on here

can be read: "Any character followed by zero to seven additional characters," that is to say one-, two-, three-, four-, five-, six-, seven-, or eight-letter words:

    What the samhill is going on here
    What the gol durn is going on here
    What the c is going on here

The question mark quantifier is sometimes forgotten because of its limited application compared to the other quantifiers, but it should not be. There are entire classes of regular expression problems that are solved most simply with a question mark. For example, if we want to match a sentence where a plural may or may not occur, a question mark does the job:

    Please let the dogs? out

This will match:

    Please let the dogs out

and:

    Please let the dog out

It will not match more than one 's':

    Please let the dogsss out

The question mark means "zero or one of the preceding character" (sometimes it is read "with an optional x"). The power of the question mark is extended if the preceding character is a special regular expression character (e.g., the dot).

=item The Asterisk: B<*>

The asterisk ("*") is the next quantifier we will discuss. The asterisk means "zero or more" of the preceding character. This way, we can match "whatever". In fact, most regular expressions we will encounter will have I<.*> in them, meaning "and anything else (except newlines)".
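If you'd like to poke at the question-mark patterns above without touching your mail, a quick sketch in Python can stand in for procmail. This is only an approximation: Python's C<re> module is not procmail's regex engine, the C<matches> helper is our own invention, and we pass C<re.IGNORECASE> on the assumption that it mimics procmail's case-insensitive default. The sketch only reports whether each line matches or not:

```python
import re


def matches(pattern, line):
    """Return True if pattern is found anywhere in line, ignoring case.

    Hypothetical helper: it approximates an unanchored, case-insensitive
    procmail condition with Python's re module (an assumption; this is
    not procmail's own engine).
    """
    return re.search(pattern, line, re.IGNORECASE) is not None


# Three characters plus one optional character: 3 or 4 characters.
three_or_four = r"What the ....? is going on here"
# One character plus seven optional ones: 1 to 8 characters.
one_to_eight = r"What the ..?.?.?.?.?.?.? is going on here"
# An optional plural 's'.
optional_s = r"Please let the dogs? out"

checks = [
    (three_or_four, "What the sam is going on here"),
    (three_or_four, "What the hill is going on here"),
    (three_or_four, "What the samhill is going on here"),
    (one_to_eight, "What the samhill is going on here"),
    (one_to_eight, "What the gol durn is going on here"),
    (one_to_eight, "What the c is going on here"),
    (optional_s, "Please let the dogs out"),
    (optional_s, "Please let the dog out"),
    (optional_s, "Please let the dogsss out"),
]

for pattern, line in checks:
    verdict = "match" if matches(pattern, line) else "no match"
    print(f"{verdict:8}  {pattern}  <-  {line}")
```

Swapping in other lines is a cheap way to test a hunch before committing a pattern to a recipe, though the final word always belongs to procmail itself.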
Here is a pattern that will match Subject lines that have 'ADV' and 'mortgage' in them:

    ^Subject: ADV.* mortgage

Let's read it in English: The word "Subject:"L<[6]|proctut2.pod/Note_6>, followed by a space, followed by the three letters ADV, followed by anything else (including nothing else), followed by a space and the word "mortgage". This is useful because it will match subjects like theseL<[7]|proctut2.pod/Note_7>:

    Subject: ADV mortgage rates have dropped
    Subject: ADV     mortgages have never been lower!
    Subject: Advertisement: lower your mortgage payment
    Subject: advanced mortgage seminars

In the first example, I<.*> matches nothing; the asterisk exercises its ability to match zero of the preceding character (a dot, in this case).

In the second example, I<.*> matches all the spaces after 'ADV' (except the space preceding "mortgages" because we put that literal space in our pattern).

In the third example, I<.*> matches "ertisement: lower your".

In the fourth example, I<.*> matches "anced" as part of "advanced".

We can see how powerful the dot and asterisk can be together, but the asterisk needn't apply only to dots. It can apply to literal characters as well. Consider the following regular expression:

    Subject: .*!!!*

This regex makes use of the familiar "dot-star" pattern we've seen before. We match the word "Subject:" followed by a space, followed by "anything" (there's our dot-star combo), followed by two exclamation points, followed by zero or more exclamation points. So, our pattern will match the following lines:

    Subject: dude!!
    Subject: Hey!!!!!

In the first case, the dot-star (I<.*>) in our pattern matches "dude" and the next two exclamation points in our pattern match the two exclamation points following "dude". The final I<!*> of the regex matches I<nothing>, because there are no more exclamation points that haven't been matched already.

In the second case, the dot-star (I<.*>) in our pattern matches "Hey!!!".
The next two exclamation points in our pattern match the final two exclamation points that the dot-star didn't catch; the final I<!*> in our pattern matches nothing again.

Was that a surprise? Ah! Likely, we forgot that a dot matches I<anything>, exclamation points included, and when combined with a quantifier it matches as much as it needs to for the expression to be "true".

OK, so these weren't very good examples of how to apply the asterisk to a literal character (but we learned something useful anyway). We'll have a better example ready as we discuss the next quantifier.

=item The Plus Sign: B<+>

The plus ("+") is the last quantifier we'll discuss in this tutorial. The plus is closely related to the asterisk, but it means "one or more" rather than "zero or more". This means that there must be at least one instance of the preceding character. Consider this example:

    Subject: .*!!!+

This is similar to our previous example with the asterisk, except that we've exchanged the final asterisk for a plus. If we were to reconsider our previous example lines:

    Subject: dude!!
    Subject: Hey!!!!!

we would get different results than with the asterisk. For starters, the first line will not match our pattern. The dot-star (I<.*>) matches "dude" and then we have two literal exclamation points. The final I<!+> of our regex has nothing left to match, and since the plus must match I<at least one> exclamation point, our regex fails for this line.

The second example I<does> match, but it does so in a different way than the asterisk example did. The leading dot-star matches "Hey!!", instead of "Hey!!!" (one fewer exclamation point). The next two literal exclamation points match, and the final bang-plus (I<!+>) of our regex matches the trailing exclamation point.

Why does the first dot-star in our regex only match "Hey!!" instead of "Hey!!!" like it did earlier with the asterisk? Quantifiers match as much as they need to in order to allow subsequent parts of the pattern to match also: the regex engine I<tries> to find matches.
In order to allow the bang-plus to match, the dot-star had to match less than it did before.

=back

As a final example, we'll cover the three quantifiers one last time.

=over 4

=item B<Question Mark>

Let's take our pattern and make it more explicit:

    Subject: Hey!?

This will now match both of the following:

    Subject: Hey
    Subject: Hey!

The question mark when applied to the exclamation point means "match zero or one exclamation points".

=item B<Asterisk>

Now we'll use the asterisk:

    Subject: Hey!*

This will match:

    Subject: Hey
    Subject: Hey!
    Subject: Hey!!
    Subject: Hey!!!!!
    etc.

The asterisk applied to the exclamation point means "match zero or more exclamation points".

=item B<Plus Sign>

Finally, the plus:

    Subject: Hey!+

This matches:

    Subject: Hey!
    Subject: Hey!!
    Subject: Hey!!!!!
    etc.

It will not match:

    Subject: Hey

because the plus sign indicates that there I<must> be at least one of the preceding character (in this case, when applied to the exclamation point, it means "match one or more exclamation points").

=back

=head1 SUMMARY

Procmail regular expressions are much like other common egrep-like regular expression languages. Regular expressions are simply string matches with certain characters meaning special things; for example, the dot (".") matches any characterL<[4]|proctut2.pod/Note_4>. Combined with I<quantifiers>, the dot makes a potent regular expression character.

Quantifiers do not match anything by themselves; they only determine "how many" of the preceding character to match. The most common I<quantifiers> are the question mark ("?"), which matches zero or one of the preceding character; the asterisk ("*"), which matches zero or more of the preceding character; and the plus ("+"), which matches one or more of the preceding character.

=head1 NOTES

=over 4

=item Note 1

"Regex" is pronounced "reg-ex" as in "REGular EXpression" with a hard 'G'. You may come across "regexp" in your personal studies. This contraction is notoriously difficult (and highly discouraged) to say aloud without emitting spittle and is likewise frowned upon in writing.
The opinions in this article reflect exactly those views of the author and, for reasons of world peace, should be considered authoritative.

=item Note 2

For a complete list, see L.

=item Note 3

There is a way to match the literal tokens by I<escaping> them with a backslash. Here, for example, is how to match a literal period character: I<\.>

=item Note 4

Er, almost any character. It doesn't match a newline character. See L.

=item Note 5

Yet another half-truth. Quantifiers have meaning when placed after I<characters>, I<character classes>, and I<groups>. We can call these things collectively "entities", "objects", "units", or my personal preference (betraying my Perl background), "thingies". In this tutorial, we simply say "character", but we really mean "thingy".

=item Note 6

The caret means "at the beginning of the line." (We'll cover that more thoroughly in another tutorial.)

=item Note 7

Procmail, it should be mentioned, matches I<case-insensitively>, that is, without regard to upper or lowercase letters. For case-sensitive matching, you'll need to enable the B<D> flag in your recipe.

=back

=head1 PREVIOUS

L

=head1 NEXT

L

=head1 SEE ALSO

procmail(1), procmailrc(5), procmailex(5), procmailsc(5), egrep(1), Jeffrey Friedl's I<Mastering Regular Expressions> (O'Reilly)

=head1 REALLY SEE ALSO

Jeffrey Friedl's I<Mastering Regular Expressions> (O'Reilly) is a masterful and thorough work on regular expressions. Anyone serious about becoming a competent regular expression writer should read this book. It covers a lot of technical ground, but the examples are excellent and most of it applies directly to procmail regular expressions (one important exception being procmail's lack of a numerical range quantifier like Perl's {n,m} syntax).

=head1 AUTHOR

Scott Wiersdorf

=head1 COPYRIGHT

Copyright (c) 2003 Scott Wiersdorf. All rights reserved.

=head1 REVISION

$Id: proctut2.pod,v 1.12 2003/10/15 04:35:46 deep Exp $