
Regular Expressions

  • Jody
    Message 1 of 2 , May 18, 2002
      Hi Jody,

      This is from my ISP's daily newsletter. Thought you might be interested.
      If not, there's always the recycle bin.

      -Wren


      ===========================================================================
      Today on The World Vol. 4 #264 Wednesday, November 4, 1998
      ===========================================================================

      This may at first seem like another UNIX-specific topic, but it turns out
      that the subject under discussion (regexps) is relevant to many Windows
      and Macintosh programs for accessing Usenet, screening mail, or editing
      text files.

      INTRO: Regular Expressions ("regexps")

      ---------------------------------------------------------------------------

      INTRO: Regular Expressions ("regexps")

      The term "regular expression" simply means "an abbreviation for a pattern".
      For instance, "(Mon|Tues|Fri)day" means "Monday, Tuesday, or Friday".

      Regular expressions -- often called "regexps" -- are one of the fundamental
      features of many UNIX programs (grep, procmail, trn, etc.), but they are
      not a UNIX-specific feature. Regexps are supported in many popular programs
      for Windows or Mac OS, especially Usenet news programs (for filtering)
      and text editors (for searching.)


      HISTORY OF REGEXPS


      A brief detour into the history of regexps before I explain how they work --

      In the earliest days of UNIX, the standard text-editor was a very unfriendly
      little program named "ed". (There were only two prompts available in ed:
      "*" for "OK" and "?" for "error".) The ed program introduced regular
      expressions for its function that allowed you to print out all lines of
      the current file which matched a regular expression: "g/regexp/p".
      ("g" for "global", "p" for "print". You do not need to know this.)
      Apparently the abbreviation "regexp" was too verbose for the real UNIX
      weenies, so it was also abbreviated "re", as in "g/re/p". This is why,
      when a program was written to search multiple text files for regular
      expressions, it was named "grep".

      So, you will hear the term "grepping" used as a synonym for "searching
      for a regular expression", because "g/re/p" was the command in an old
      text-editing program under UNIX.

      And a second digression --

      "ed" and "grep" were the first two programs to use regular expressions.
      They supported a fairly small set of features compared to more recent
      programs -- other programs that do regexps have much more flexible
      regexps. However, because the regexp support in the early programs was
      lacking, when regexps were enhanced in newer programs everyone did it
      slightly differently, so the regexp "language" differs a bit from
      one program to another. As a result, you may need to check the manual
      for the particular program you're using to see which of the options
      I discuss are supported. (For instance, egrep supports almost all this
      stuff; agrep supports most of it; grep supports some of it. trn
      supports some of it. MT-NewsWatcher supports most of it.)



      WHAT DO REGEXPS DO?


      A regular expression is simply a pattern which represents a possible
      range of words. For instance, ".ez" represents "fez" and "Pez" (as well
      as "bez", "qez", and "zez".) You can see that regexps make it
      easy to construct rules for filtering the spam out of your incoming
      mail or Usenet:

      "discard all articles whose subject line contains the regexp
      [0-9][0-9][0-9][-. ]?[0-9][0-9][0-9][0-9]"

      --and presto, nothing with a phone number in it slips through your filter!
      ("[0-9]" means "any digit", and the "[-. ]?" part means "a hyphen, a
      period, a space, or nothing" -- we'll talk about this later.)
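
A minimal sketch of that filter rule, using Python's "re" module as one regexp flavor among many. The pattern is the one quoted above; the subject lines are invented for illustration.

```python
import re

# The newsletter's phone-number pattern: three digits, an optional
# hyphen, period, or space, then four digits.
PHONE = re.compile(r"[0-9][0-9][0-9][-. ]?[0-9][0-9][0-9][0-9]")

# Hypothetical subject lines, made up for this example.
subjects = [
    "Call 555-1234 now for a free offer",
    "Meeting notes from Tuesday",
    "Dial 555.1234 or 5551234 today",
]

# Keep only the subjects in which the pattern is found nowhere.
kept = [s for s in subjects if not PHONE.search(s)]
print(kept)  # ['Meeting notes from Tuesday']
```

Note that the same rule would also discard a subject containing any seven-digit run, such as an order number, so a real filter may need something stricter.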

      By the way, although I like to type quotes around every weird symbol
      I'm discussing here, you don't normally type quotes around a regexp.
      (In manuals, sometimes they're shown like /this/ as well.)

      If you've used DOS or the UNIX command line, you may be familiar with
      the concept that "?" represents any character, and "*" means "anything
      goes". So, for instance, in DOS, the command "DEL *.*" would delete
      all your files, and from the UNIX command line, "rm *" would delete all
      your files. Well, forget that stuff. DOS and the UNIX command line
      (the various UNIX shells) have a simple language for representing
      "wildcards" in filenames ("?" for one letter, "*" for any name) but
      this is NOT the same as the regexp language.

      I REPEAT, FILENAME WILDCARDS AND REGULAR EXPRESSIONS ARE DIFFERENT.

      For instance:

                          DOS     UNIX command line     regexp

      any character        ?              ?                .

      anything            *.*             *                .*


      There. Now that I've thoroughly confused you, you're ready to
      forget everything you ever knew about asterisks and question marks
      and move on to regular expressions.
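
The difference can be seen side by side in a short sketch. It assumes Python, where the standard "fnmatch" module does shell-style wildcards and "re" does regexps; the sample words are arbitrary.

```python
import fnmatch
import re

# As filename wildcards: "?" is any one character, "*" is any run of characters.
wild_q = fnmatch.fnmatchcase("cat", "c?t")            # True
wild_star = fnmatch.fnmatchcase("fantastic", "fan*")  # True

# The regexp spellings of the same two ideas are "." and ".*".
re_dot = re.fullmatch(r"c.t", "cat") is not None              # True
re_dotstar = re.fullmatch(r"fan.*", "fantastic") is not None  # True

# Reusing the wildcard spelling as a regexp means something else:
# "c?t" reads as "an optional c, then t", so it does not match "cat".
misread = re.fullmatch(r"c?t", "cat") is not None  # False

print(wild_q, wild_star, re_dot, re_dotstar, misread)
```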


      ELEMENTARY REGEXPS


      The first thing to learn about the regexp language is ".", which
      means "any one character" (a character is any letter, number, symbol,
      or space, by the way.) "." represents exactly one character, no
      more, no less.


      match any one character: "."

      "c.t" matches "cat" and "cot", but not "cart" or "ct"

      "k.bo" matches "kibo" and "kybo"


      Of course, if you're using regexps, you're probably trying to look
      for patterns that are more flexible than "a word with one letter
      that might change". This is where the symbols for repeating
      things come in. "*" means "zero or more occurrences of the previous
      symbol".


      match zero or more occurrences of previous symbol: "*"

      "a*b" matches "b", "ab", "aab", "aaab", etc., but not "bb" or "cb"


      If "*" (repeat) follows "." (wildcard), that doesn't mean identical
      letters have to repeat -- it just means "zero or more of ANYTHING":


      "a.*b" matches "ab", "axylophoneb", "anougatb", "a12345b", etc.

      "fan.*" matches "fan", "fantastic", "fangs", "fancy pants"...

      ".*" matches EVERYTHING!


      (The last example would even match a blank line, so you could say that
      ".*" matches both everything and nothing.)

      ".*" is one of the most important regexp patterns to remember. It will
      match absolutely anything of any length. It will also match nothing ("").
      So, for instance, if you wanted to look for the word "TIP" followed
      by the word "FTP" anywhere on the same line:


      "TIP.*FTP" matches "I have a TIP about FTP for you," etc.

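
The examples so far can be checked with a few lines of Python (an assumption; any regexp-capable tool would do). The word lists "match" in the sense that the pattern accounts for the whole word, so the sketch uses re.fullmatch for those and re.search for the line-level "TIP.*FTP" case.

```python
import re

def whole(pattern, text):
    """True if the pattern accounts for the entire text, start to end."""
    return re.fullmatch(pattern, text) is not None

# "." stands for exactly one character.
dot = [w for w in ["cat", "cot", "cart", "ct"] if whole(r"c.t", w)]

# "*" repeats the previous symbol zero or more times.
star = [w for w in ["b", "ab", "aaab", "bb", "cb"] if whole(r"a*b", w)]

# ".*" is zero or more of anything, including nothing at all.
dot_star = [w for w in ["ab", "axylophoneb", "a12345b", "a"] if whole(r"a.*b", w)]
matches_empty = whole(r".*", "")

# When filtering lines, the pattern only has to occur somewhere in the line.
found = re.search(r"TIP.*FTP", "I have a TIP about FTP for you") is not None

print(dot)                   # ['cat', 'cot']
print(star)                  # ['b', 'ab', 'aaab']
print(dot_star)              # ['ab', 'axylophoneb', 'a12345b']
print(matches_empty, found)  # True True
```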

      "*" represents zero to infinity occurrences of whatever is before
      it (as in ".*".) What if we want to match fewer occurrences?


      match one or more occurrences of previous symbol: "+"

      "a+b" matches "ab", "aab", "aaab", etc., but NOT "b"

      ".+" matches "a", "zzz", "xylophone", but not ""

      [note: some programs do not support "+"]


      match zero or one occurrences of previous symbol: "?"

      "a?b" matches "b" and "ab", nothing else

      "colou?r" matches "color" and "colour" only

      "739-?0202" matches "7390202" and "739-0202" only

      "739.?0202" matches "7390202", "739-0202", "739*0202", etc.

      "cat ?food" matches "cat food" and "catfood" only

      [note: some programs do not support "?"]
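
A quick check of the "+" and "?" examples, again assuming Python's "re" module (which supports both symbols). The helper name "matching" is made up for this sketch.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

plus = matching(r"a+b", ["b", "ab", "aab"])
dot_plus = matching(r".+", ["", "a", "zzz", "xylophone"])
optional = matching(r"a?b", ["b", "ab", "aab"])
colour = matching(r"colou?r", ["color", "colour", "colouur"])

numbers = ["7390202", "739-0202", "739*0202"]
hyphen_only = matching(r"739-?0202", numbers)
any_separator = matching(r"739.?0202", numbers)

print(plus)           # ['ab', 'aab']
print(dot_plus)       # ['a', 'zzz', 'xylophone']
print(optional)       # ['b', 'ab']
print(colour)         # ['color', 'colour']
print(hyphen_only)    # ['7390202', '739-0202']
print(any_separator)  # ['7390202', '739-0202', '739*0202']
```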


      Now you can see why regexps are useful to have in a program that
      can filter your mail or Usenet newsgroups, or search text files --
      with a single regular expression you can represent a range of
      different ways of spelling "color" or typing a phone number.
      But the best is yet to come.


      LISTS OF CHARACTERS


      All right, "." means "any character". What if we only want to
      allow some characters? For instance, let's say I'm Vanna White and
      I like vowels. How do I look for vowels?


      match any one character from a list: "[xyz]"

      "[aeiou]" matches "a", "e", "i", "o", or "u" only

      "c[aou]t" matches "cat", "cot", "cut" only

      match any character from a range: "[x-z]"

      "[a-z]" matches any lowercase letter only

      "[b-d]at" matches "bat", "cat", "dat" only

      match any character not in list or range: "[^xyz]" or "[^x-z]"

      "c[^a]t" matches "cot", "cut", etc., but NOT "cat"


      Ready for a complicated example made by sticking "[]" and "?" together?


      "739[-. ]?0202" matches only
      "739-0202", "739.0202", "739 0202", and "7390202"


      Wait, didn't "-" have a special meaning inside "[x-z]" above?
      And doesn't "." have a special meaning in regexps in general?

      This is where it gets tricky. Whatever program is interpreting your
      regexps is supposed to be smart enough to know that you obviously
      mean "a period" when you use "." inside brackets and "a wildcard"
      when you use "." outside brackets. Similarly, "[-x]" means "a hyphen
      or the letter x" but "[a-x]" means "a through x".
      So "[-. ]" means "hyphen, period, or space" and "?" means "zero or
      one of those". In other words, "hyphen, period, space, or nothing".
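
The bracket rules above, sketched in Python's "re" module (an assumption; your program's bracket handling may differ in the details). The extra test words such as "cit" and "739x0202" are invented to show what gets rejected.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# A list, a range, and a negated list.
listed = matching(r"c[aou]t", ["cat", "cot", "cut", "cit"])
ranged = matching(r"[b-d]at", ["bat", "cat", "dat", "eat"])
negated = matching(r"c[^a]t", ["cat", "cot", "cut"])

# Inside brackets, a leading "-" and a "." are ordinary characters;
# the "?" after the brackets makes the separator optional.
phone = matching(
    r"739[-. ]?0202",
    ["739-0202", "739.0202", "739 0202", "7390202", "739x0202"],
)

print(listed)   # ['cat', 'cot', 'cut']
print(ranged)   # ['bat', 'cat', 'dat']
print(negated)  # ['cot', 'cut']
print(phone)    # ['739-0202', '739.0202', '739 0202', '7390202']
```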

      What if we wanted to look for all phone numbers except 739-0202?


      "[0-689][0-24-9][0-8][-. ]?[1-9][013-9][1-9][013-9]" matches
      "666-7777", "1234567", "000 9999", etc., but not "739-0202"


      The first part of the regexp, "[0-689]", could also have been written
      as "[012345689]", meaning "any digit except 7". But wait! The example
      above has a horrible, horrible, rotten, mistake! Sure, it will
      find lots of phone numbers that aren't 739-0202. But it will also
      not find 777-7777, because its first digit is seven!

      There are ways around this, but they require taking advantage of
      tricky stuff we haven't looked at yet. Suffice it to say that
      there are ways of doing things as weird as "any phone number but
      739-0202" with regexps but they may involve a lot of thinking.
      For now, let's look at some more elementary stuff.
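
To see the mistake in action, here is a small sketch in Python (an assumption, as is the extra number "555-0100") that runs the flawed pattern over a handful of numbers.

```python
import re

# The flawed "anything but 739-0202" pattern from the text above.
FLAWED = re.compile(r"[0-689][0-24-9][0-8][-. ]?[1-9][013-9][1-9][013-9]")

numbers = ["666-7777", "1234567", "000 9999", "739-0202", "777-7777", "555-0100"]

accepted = [n for n in numbers if FLAWED.fullmatch(n)]
rejected = [n for n in numbers if not FLAWED.fullmatch(n)]

print(accepted)  # ['666-7777', '1234567', '000 9999']
print(rejected)  # ['739-0202', '777-7777', '555-0100']
```

The pattern excludes every number that shares even one digit position with 739-0202, which is why perfectly good numbers like 777-7777 and 555-0100 get thrown out too.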


      GROUPS


      So far we've only looked at ways to pick out single characters.
      What if we want to apply "?" or "*" to something other than "."?


      group some stuff together: "(xyz)"

      "abc(xyz)?" matches "abc" and "abcxyz" only

      "abcxyz?" matches "abcxy" and "abcxyz" only

      "(yow)+" matches "yow", "yowyow", "yowyowyowyowyow", etc.

      [note: Some programs use "\(xyz\)" instead. Some
      don't support parentheses at all, and some break if you
      do very complex things with them.]


      If you're lucky enough to have a program that lets you use parentheses
      to group letters into words (most regexp-capable programs do, these days)
      you can also probably do something even more powerful with them:


      any of these alternatives: "(abc|xyz)"

      "(abc|xyz)" matches "abc" or "xyz" only

      "(Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day" matches a day


      You can see that by combining things you can make some pretty weird
      regular expressions:


      "Vote (for|against) (Bill |Ross )?(Clinton|Perot)"

      matches "Vote for Clinton", "Vote for Bill Clinton",
      "Vote against Bill Clinton", "Vote for Perot", and
      even "Vote for Bill Perot".


      Okay, there's no Bill Perot. So we could do it the right way:


      "Vote (for|against) ((Bill )?Clinton|(Ross )?Perot)"


      Now that's complicated. Some regexp-capable programs would choke
      on it (some aren't too bright about nested parentheses) but we've
      just made something that can find eight different phrases, all
      of which are equally sensible to someone. (Whereas "Vote for
      Bill Perot" isn't sensible to anyone.) If that one broke my
      program, I could just write it out longhand:


      "Vote (for|against) (Clinton|Bill Clinton|Perot|Ross Perot)"


      That is the same as the one above. It might work more rapidly
      because it's simpler (in a technical sense) even though it's longer.
      You can see that there is often more than one way of doing something.
      (You will also note that I put the spaces in specific places
      in the last three examples. Spaces are characters like any others,
      and if I had left them out or put in extras, these regexps would
      start looking for "BillClinton" or "Bill  Clinton" (two spaces).)
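
The three "Vote" patterns can be compared by brute force. This sketch assumes Python's "re" module, which handles nested parentheses; it builds every combination of the pieces and counts which ones each pattern accepts. The variable names are made up for the example.

```python
import re
from itertools import product

LOOSE = r"Vote (for|against) (Bill |Ross )?(Clinton|Perot)"
NESTED = r"Vote (for|against) ((Bill )?Clinton|(Ross )?Perot)"
LONGHAND = r"Vote (for|against) (Clinton|Bill Clinton|Perot|Ross Perot)"

# Every combination of the pieces, sensible or not (12 phrases).
candidates = [
    f"Vote {side} {first}{last}"
    for side, first, last in product(
        ["for", "against"], ["", "Bill ", "Ross "], ["Clinton", "Perot"]
    )
]

def accepted(pattern):
    """The set of candidate phrases that the pattern matches in full."""
    return {c for c in candidates if re.fullmatch(pattern, c)}

print(len(accepted(LOOSE)))                    # 12
print(len(accepted(NESTED)))                   # 8
print(accepted(NESTED) == accepted(LONGHAND))  # True
print("Vote for Bill Perot" in accepted(NESTED))  # False
```

The equality check only covers these twelve phrases, but that is enough to confirm the count of eight and that the longhand version behaves the same on them.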

      Let's put together another regexp: Suppose people keep posting a scam
      under the subject "MAKE.MONEY.FAST". Or sometimes "FAST-CASH".
      And sometimes "BIG*MONEY".


      "(BIG|MAKE|FAST).(MONEY|CASH)"
      matches "MAKE.MONEY", "BIG*MONEY", "FAST-CASH", etc.


      Note that, depending on what program you're using, it may or may not
      also match lowercase versions of those same words. (UNIX programs
      are normally "case-sensitive", meaning they won't match capitals to
      lowercase, but programs for Windows and Mac OS usually AREN'T case-
      sensitive. There will be an option somewhere in your program to
      specify that you want to match or not match capitals and lowercase.)

      Do you think my MAKE.MONEY.FAST detector is too good to be true?
      Watch as I test it (without case-sensitivity) on my own mailbox!


      % egrep -i '(BIG|MAKE|FAST).(MONEY|CASH)' mbox

      Subject: Make Money Like A Porn King!
      Learn how to make money selling and trading them.
      To: Make@...
      To: Make@...
      Would you like to make money with your computer? I may have a
      > but instead they were planned so big money could be made by the bankers
      > > but instead they were planned so big money could be made by the bankers


      ("egrep", by the way, is "extended grep", meaning that it can do
      the parentheses and stuff. Regular grep can't.)

      Notice how using a nice flexible regexp like this saves me having to
      search for "MAKE MONEY", "MAKE.MONEY", "FAST.MONEY", and all the
      other permutations -- and if I structure it right it'll even find ones
      nobody's thought of. (It caught "Make@..."!) The trick is
      to generalize your regexp enough to find all the stuff you're looking for
      (you may be looking for good stuff to read or bad stuff to throw away)
      without accidentally including any "false positives".
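
A rough Python equivalent of that egrep run, as a sketch: re.IGNORECASE plays the role of "-i". The first two lines come from the output above; the last two are invented, and one of them is a deliberate false positive.

```python
import re

SPAM = re.compile(r"(BIG|MAKE|FAST).(MONEY|CASH)", re.IGNORECASE)

lines = [
    "Would you like to make money with your computer? I may have a",
    "> but instead they were planned so big money could be made by the bankers",
    "Subject: Lunch on Friday?",          # hypothetical, should not match
    "Try our new breakfast cashew bars",  # hypothetical false positive
]

hits = [line for line in lines if SPAM.search(line)]
for line in hits:
    print(line)
# The lunch line is skipped, but the cashew line is caught:
# "breakFAST CASHew" fits FAST, any one character, CASH.
```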


      MORE STUFF


      By the way, what if you really DO want to search for a period,
      and not the magical thing that "." represents in regexp-land?
      Let's say we want to search for things that contain "..." so we
      can find all the people who drone on and on...


      prevent a symbol from doing something special: "\"

      "c.t" matches "cat", "cot", "c.t", etc.

      "c\.t" matches only "c.t"


      ...so how would you look for a backslash? "\\". The first one
      makes the second one a real backslash. ("\" is also used in
      some regexps for other special functions, as we'll see.)
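
A sketch of escaping in Python's "re" module (an assumption; the sample strings are invented). Raw strings (r"...") keep Python itself from interpreting the backslashes before the regexp engine sees them. re.escape is a standard helper that adds the backslashes for you.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

wildcard = matching(r"c.t", ["cat", "cot", "c.t"])
literal = matching(r"c\.t", ["cat", "cot", "c.t"])

# Three escaped periods find the people who drone on and on...
drone = re.search(r"\.\.\.", "and on and on...") is not None
terse = re.search(r"\.\.\.", "Short and done.") is not None

# "\\" in the regexp stands for one real backslash in the text.
backslash = re.search(r"\\", "C:\\TEMP") is not None

# re.escape() builds the escaped pattern from plain text.
escaped = re.search(re.escape("..."), "and on and on...") is not None

print(wildcard)  # ['cat', 'cot', 'c.t']
print(literal)   # ['c.t']
print(drone, terse, backslash, escaped)  # True False True True
```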

      So far, we've only searched for individual words and phrases.
      How could we search for expressions such as "a whole line in all
      capitals"? Saying "[A-Z]" (assuming we're using a case-sensitive
      program, of course!) would simply match any one capital letter,
      so if we wanted to discard all mail that was in all capitals, it
      would discard anything that had at least ONE capital -- very bad.
      And "[A-Z]+" would match anything that had ONE OR MORE capitals,
      but it still wouldn't rule out the rest of the line being lowercase.
      Wouldn't it be great if there were special symbols that could let
      us demand that the regexp cover the whole line?


      match the beginning of the line: "^"

      match the end of the line: "$"

      "^Re:" matches "Re: Your Mama" but not "I sent mail Re: Pez"

      "food$" matches any line ending with "food"

      "^[A-Z]*$" matches anything consisting of capitals from
      end to end


      ...whoops, that last one still won't work right. It will find
      lines like "THISISINALLCAPITALS" but we didn't list " " or "."
      in our list of characters. Better:


      "^[A-Z. ]*$" matches anything consisting of capitals, spaces,
      or periods from end to end


      But that still won't match hyphens, slashes, numbers, and --
      oh, heck with it, let's just do it the right way:


      "^[^a-z]*$" matches anything having no lowercase letters


      Notice that "^" has a different meaning inside and outside brackets,
      as we saw for ".". That last pattern will match any line where the
      beginning of the line is followed by zero or more capitals, digits,
      spaces, or punctuation marks, all the way to the end of the line.
      (Because "*" allows zero characters, it also matches a blank line.)
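
The progression of all-capitals patterns, sketched in Python (an assumption). The sample lines are made up; each pattern is tried against every line with re.search, since the anchors already pin it to both ends.

```python
import re

lines = [
    "THISISINALLCAPITALS",
    "STOP SHOUTING. PLEASE.",
    "CALL 555-1234 NOW!",
    "Mostly lowercase here",
]

def hits(pattern):
    """Return the sample lines in which the pattern is found."""
    return [line for line in lines if re.search(pattern, line)]

unanchored = hits(r"[A-Z]+")     # any line with at least one capital
caps_only = hits(r"^[A-Z]*$")    # capitals from end to end
caps_dots = hits(r"^[A-Z. ]*$")  # capitals, spaces, periods
no_lower = hits(r"^[^a-z]*$")    # no lowercase letters at all

# "^" ties a pattern to the start of the line.
reply = re.search(r"^Re:", "Re: Your Mama") is not None
not_reply = re.search(r"^Re:", "I sent mail Re: Pez") is not None

# "*" allows zero characters, so an empty line matches as well.
blank_ok = re.search(r"^[^a-z]*$", "") is not None

print(len(unanchored))   # 4
print(caps_only)         # ['THISISINALLCAPITALS']
print(caps_dots)         # ['THISISINALLCAPITALS', 'STOP SHOUTING. PLEASE.']
print(len(no_lower))     # 3
print(reply, not_reply, blank_ok)  # True False True
```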

      Similarly, you can "anchor" your pattern to the beginning or end
      of a word, as well as the ends of the line:


      match the beginning of a word: "\<"
      match the end of a word: "\>"

      [some programs use "\b" instead to represent either end.
      Some don't do either.]

      "\<cat" (or "\bcat") matches only lines which contain
      a word starting with "cat".

      "s\>" or "s\b" matches only lines which contain a
      word ending with "s".


      In other words, if we searched for "cat", we could find lines
      which contained the words "mercator", "catalog", and "cat".
      If we searched for "\<cat", we wouldn't find "mercator", but we
      would still find "catalog" and "cat". "\<cat\>" would find only
      "cat". See your program's manual to find out if it uses "\<" and
      "\>" or just "\b".

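
Python's "re" module is one of the programs that uses "\b" for either end of a word and does not accept "\<" or "\>". A sketch of the "cat" example in that flavor (the two test sentences are invented):

```python
import re

words = ["mercator", "catalog", "cat"]

# Plain "cat" is found anywhere inside a word.
anywhere = [w for w in words if re.search(r"cat", w)]

# "\bcat" requires a word to start with "cat".
starts = [w for w in words if re.search(r"\bcat", w)]

# "\bcat\b" requires "cat" to be a whole word by itself.
whole_word = [w for w in words if re.search(r"\bcat\b", w)]

# "s\b" looks for a word ending in "s".
plural = re.search(r"s\b", "two cats sat") is not None
no_plural = re.search(r"s\b", "a sat cat") is not None

print(anywhere)    # ['mercator', 'catalog', 'cat']
print(starts)      # ['catalog', 'cat']
print(whole_word)  # ['cat']
print(plural, no_plural)  # True False
```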
      There's only one more basic regexp symbol to mention -- braces:


      match exactly n occurrences of preceding: "{n}"
      match n or more of preceding: "{n,}"
      match n through n2 of preceding: "{n,n2}"

      "[0-9]{2,5}-[0-9]{2,5}" would match "00-00" through "99999-99999",
      as well as lots of other phone-number-like things.

      "^-{60,80}$" would match any line which is completely full
      of hyphens and 60 to 80 characters long.

      [many programs don't have the functions involving braces.]


      I didn't mention "{n}" when I was talking about "*" and "+" because
      most programs can't do "{n}".
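
Python's "re" module does support braces, so the two examples can be tried there (an assumption about your tool; the sample strings are invented). The first check uses re.fullmatch because a plain search would happily find "23456-12" inside "123456-12".

```python
import re

PHONEISH = r"[0-9]{2,5}-[0-9]{2,5}"
samples = ["00-00", "99999-99999", "1-22", "123456-12"]
phoneish = [s for s in samples if re.fullmatch(PHONEISH, s)]

# A line made of nothing but 60 to 80 hyphens.
RULE = r"^-{60,80}$"
rules = {n: re.search(RULE, "-" * n) is not None for n in (59, 60, 75, 80, 81)}

# "{n}" is exactly n; "{n,}" is n or more.
exactly_four = re.fullmatch(r"[0-9]{4}", "0202") is not None
two_or_more = re.fullmatch(r"a{2,}", "aaaa") is not None
too_few = re.fullmatch(r"a{2,}", "a") is not None

print(phoneish)  # ['00-00', '99999-99999']
print(rules)     # {59: False, 60: True, 75: True, 80: True, 81: False}
print(exactly_four, two_or_more, too_few)  # True True False
```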

      Regexp-capable programs sometimes support other interesting features.
      See your program's documentation for a list of what it can or can't do.

      So, back to an earlier problem: How would we look for all phone numbers
      except 739-0202? I will leave this as an experiment for
      you folks (I'm sure it will stimulate discussion on wstd.general,
      as it's a tricky problem to solve. Who can find the most elegant solution?)
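
For anyone who would rather sidestep the puzzle than solve it: if the regexps live inside a program, one pragmatic route is to match every phone-like number first and then throw out the unwanted one in a second step. This is only a sketch under that assumption, written in Python; the function name and sample text are made up, and it is not the single-regexp answer the challenge asks for.

```python
import re

# Step 1: a general phone-like pattern (three digits, optional separator, four digits).
PHONE = re.compile(r"[0-9]{3}[-. ]?[0-9]{4}")

def other_numbers(text):
    """Return phone-like numbers in text, except 739-0202 in any spelling."""
    found = [m.group(0) for m in PHONE.finditer(text)]
    # Step 2: strip the separators and compare the bare digits.
    return [n for n in found if re.sub(r"[-. ]", "", n) != "7390202"]

result = other_numbers("Call 739-0202, 777-7777 or 666 7777")
print(result)  # ['777-7777', '666 7777']
```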
    • h.paulissen@facburfdcw.unimaas.nl
      Message 2 of 2 , Sep 4, 2002
        Alec, I brought this OT...

        >
        > Same thing on my Win98 ...NtbPro 4.91:
        > "\d" - works, "[\d]" finds "d". Looks like \d isn't special inside
        > a character class, hence gets interpreted as \+<any char> = <any
        > char>

        I guess so... Although Alan C. stated that there is a different
        meaning for \d and [\d] I can't see the reason for that. \d *to me*
        means 'any digit', where [\d] stands for '[any digit]'. Especially
        if you're looking for more complex patterns, the meaning of \d (and
        so forth) should not be altered when used in a character class.

        (NoteTab _does_ find a space with this pattern: [\s]. Help has even
        a better example:
        "\b any blank (white) space including space, tab, form-feed, etc.
        Equivalent to [\s\t\f\n\r]"
        But, if you try to run the latter you will find that it finds an f
        too).

        > the > TCL kit Alec proposed << (Visual regexp) finds the digits
        > and only the digits with either "\d" or "[\d]".

        Same goes for the tool that you couldn't install. And
        ^[\d|\s|\.]+. finds "0 0 0. h" at the beginning of a line...
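
For comparison, here is how one other engine treats these patterns. The sketch uses Python's "re" module, which says nothing about what NoteTab's engine does or should do; the test strings are invented apart from "0 0 0. h".

```python
import re

text = "d7 x42"

# In Python's flavor, \d keeps its "any digit" meaning inside brackets.
bare = re.findall(r"\d", text)
in_class = re.findall(r"[\d]", text)

# Likewise \f inside brackets is a form feed, not the letter "f".
whitespace = re.findall(r"[\s\t\f\n\r]", "a f\tb")

# Inside brackets "|" is an ordinary character, not "or".
m = re.match(r"[\d|\s|\.]+.", "0 0 0. h")
matched = m.group(0) if m else None
pipe_is_literal = re.fullmatch(r"[\d|\s|\.]+", "1|2") is not None

print(bare, in_class)   # ['7', '4', '2'] ['7', '4', '2']
print(whitespace)       # [' ', '\t']
print(repr(matched))    # '0 0 0. h'
print(pipe_is_literal)  # True
```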

        > I think I ran into something similar some time ago.
        >
        > I've adopted the habit of using [0123456789] or [0-9] instead of
        > \d and [a-zA-Z] etc when I need or want to use group character
        > classes. Though usually I make the mistake, guess what the
        > problem is and only then enumerate the class ;-)

        In NoteTab [0-9] inside a character class does not work either;
        been through that...

        > I guess this close to v5 the answer is:
        > It does whatever it does, if it's not what it SHOULD be - sorry
        > about your luck ;-)

        <G> Right!

        About Padgett's Regular Expression Manager...
        > > See for yourself at:
        > > http://www.vbcity.com/pubs/article.asp?alias=regexp
        >
        > I d/l'd, installed and tried to run this.
        >
        > <offtopic now ....
        > When I tried to run this I got a warning box saying:
        > Run-time error '429':
        > ActiveX component can't create object.
        >
        > The write-up for the program says its still in "beta". Last
        > comments were about six months ago. I signed up to the bulletin
        > board but have to wait for the confirmation e-mail before I can
        > attempt to ask the author what's likely to be wrong. By your
        > comments, I assume this did not occur to you ?

        No, I can run the app without problems and it does a fairly decent
        job. I don't know what prevents you from running the install
        program. Saying something about missing drivers or components would
        be a very wild guess from my side, possibly leading you further away
        from a solution <g>.


        Hugo