Regex help

  • Malcolm Mill
    Message 1 of 7 , Nov 2, 2004
      Hi,
      Newbie here having trouble with regex.

      I'm trying to parse an html file saved as text to find all instances of time.

      The parsed file is called myFindTime2.txt and contains text like....
      =============================================
      <tr>
      <td align="right" valign="top">
      <font size="2" color="#000000" class="listings">
      11:30 am<br />
      </font>
      </td>
      </tr>

      ==============================================

      The script is called myFindTime2.pl and contains the lines....
      ==============================================
      foreach (<>) {
      s/([\t][\d2][:][\d2][\s])/\1/;
      print $_;
      }
      ==============================================

      The command I use is.....
      perl myFindTime2.pl myFindTime2.txt

      Any suggestions? Is what I'm trying to do here with
      s/search/replace_backreference/ valid?
    • Charles K. Clarkson
      Message 2 of 7 , Nov 2, 2004
        From: Malcolm Mill <mailto:malcolm.mill@...> wrote:

        : I'm trying to parse an html file saved as text to find all
        : instances of time.
        :
        : The parsed file is called myFindTime2.txt and contains text
        : like.... =============================================
        : <tr>
        : <td align="right" valign="top">
        : <font size="2" color="#000000" class="listings">
        : 11:30 am<br />
        : </font>
        : </td>
        : </tr>
        :
        : ==============================================
        :
        : The script is called myFindTime2.pl and contains the lines....
        : ==============================================
        : foreach (<>) {
        : s/([\t][\d2][:][\d2][\s])/\1/;
        : print $_;
        : }
        : ==============================================
        :
        : The command I use is.....
        : perl myFindTime2.pl myFindTime2.txt
        :
        : Any suggestions? Is what I'm trying to do here with
        : s/search/replace_backreference/ valid?


        Depends on your definition of valid. It will run.
        Without warnings turned on, it will run without any
        warnings. With warnings turned on, you'll see a message
        to use $1 instead. \1 is used inside the pattern (the
        left-hand side of the s/// operator). $1 is used everywhere
        else, including the replacement.
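
A minimal sketch of where each form belongs, not taken from the thread: the sub name collapse_doubles and the sample word are made up for illustration. \1 refers back to group 1 while the pattern is still matching; $1 carries the captured text into the replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse any doubled word character.
# \1 (inside the pattern) means "the same text group 1 just matched";
# $1 (in the replacement) is the text that group 1 captured.
sub collapse_doubles {
    my ($text) = @_;
    $text =~ s/(\w)\1/$1/g;
    return $text;
}

print collapse_doubles("bookkeeper"), "\n";   # prints "bokeper"
```

Writing \1 in the replacement instead of $1 still runs, but with warnings enabled Perl suggests $1.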

        Your regex may not match what you think. Try this

        #!/usr/bin/perl

        use strict;
        use warnings;

        foreach ( <DATA> ) {
        print if m/[\t][\d2][:][\d2][\s]/;
        }
        __END__
        <tr>
        <td align="right" valign="top">
        <font size="2" color="#000000" class="listings">
        11:30 am<br />
        </font>
        </td>
        </tr>


        Nothing prints because you are using character classes
        incorrectly. In your original script you can't tell, because
        it prints every line regardless of whether there was a match.

        [\t] will match a single tab and is the same as \t
        [\d2] will match any single digit or 2 which is the same as \d
        [:] will match a colon which is the same as :
        [\s] will match white space and is the same as \s

        /[\t][\d2][:][\d2][\s]/ is equivalent to /\t\d:\d\s/ which
        does not occur in the file sample.

        You probably wanted /\t\d{2}:\d{2}\s/.
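
The difference between the two patterns can be sketched side by side. This assumes the time line in the file really starts with a tab, which the archived sample does not show; the sub names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample line: the leading tab is a guess about the real file.
my $line = "\t11:30 am<br />\n";

# What the original character classes boil down to: one digit each side.
sub matches_one_digit  { return $_[0] =~ /\t\d:\d\s/       ? 1 : 0 }

# The quantified version: exactly two digits each side.
sub matches_two_digits { return $_[0] =~ /\t\d{2}:\d{2}\s/ ? 1 : 0 }

printf "one digit each side:  %d\n", matches_one_digit($line);
printf "two digits each side: %d\n", matches_two_digits($line);
```

The first pattern fails because the tab must be followed by a single digit and then the colon, but the text has "11:" after the tab.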

        HTH,

        Charles K. Clarkson
        --
        Mobile Homes Specialist
        254 968-8328
      • Jeff 'japhy' Pinyan
        Message 3 of 7 , Nov 2, 2004
          On Nov 2, Malcolm Mill said:

          ><tr>
          > <td align="right" valign="top">
          > <font size="2" color="#000000" class="listings">
          > 11:30 am<br />
          > </font>
          > </td>
          ></tr>

          > s/([\t][\d2][:][\d2][\s])/\1/;

          That regex has SEVERAL things wrong with it. First of all, backreferences
          on the right-hand side should be $1, not \1. Second, quantifiers on parts
          of a regex are done with {...}, so two digits is \d{2}. Third, NONE of
          those character classes is necessary:

          s/(\t\d{2}:\d{2}\s)/$1/;

          But why are you matching something and replacing it with itself? Perhaps
          you want to do:

          print $1 if /\t(\d{2}:\d{2})\s/;
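
A short sketch of the two approaches, again assuming a tab before the time; extract_time is a hypothetical helper name. Substituting a match with its own capture leaves the line as it was, while a plain match hands back just the captured time.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample line (the leading tab is a guess about the real file).
my $line = "\t11:30 am<br />\n";

# Replacing the match with its own capture changes nothing.
my $copy = $line;
$copy =~ s/(\t\d{2}:\d{2}\s)/$1/;
print "substitution left the line unchanged\n" if $copy eq $line;

# Matching with a capture group returns only the part of interest.
sub extract_time {
    my ($text) = @_;
    return $text =~ /\t(\d{2}:\d{2})\s/ ? $1 : undef;
}

my $time = extract_time($line);
print "$time\n" if defined $time;   # prints "11:30"
```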

          --
          Jeff "japhy" Pinyan % How can we ever be the sold short or
          RPI Acacia Brother #734 % the cheated, we who for every service
          http://japhy.perlmonk.org/ % have long ago been overpaid?
          http://www.perlmonks.org/ % -- Meister Eckhart
        • Jenda Krynicky
          Message 4 of 7 , Nov 2, 2004
            From: Malcolm Mill <malcolm.mill@...>
            > Hi,
            > Newbie here having trouble with regex.
            >
            > I'm trying to parse an html file saved as text to find all instances
            > of time.
            >
            > The parsed file is called myFindTime2.txt and contains text like....
            > ============================================= <tr>
            > <td align="right" valign="top">
            > <font size="2" color="#000000" class="listings">
            > 11:30 am<br />
            > </font>
            > </td>
            > </tr>
            >
            > ==============================================

            It's generally not a good idea to parse HTML with regexps, unless
            the HTML was generated by a program and you can be sure it looks
            (and will keep looking) the way you expect.

            > The script is called myFindTime2.pl and contains the lines....
            > ==============================================
            > foreach (<>) {
            > s/([\t][\d2][:][\d2][\s])/\1/;

            Did you actually read a regexp tutorial or something? [] denotes a
            character class. [abcd] means "any of the characters a, b, c and d".

            > print $_;

            If you are searching for something you should not mangle the data,
            just capture what you need:

            /\t(\d{2}:\d{2})\s/ and print "$1\n";


            > }

            Jenda
            ===== Jenda@... === http://Jenda.Krynicky.cz =====
            When it comes to wine, women and song, wizards are allowed
            to get drunk and croon as much as they like.
            -- Terry Pratchett in Sourcery
          • Malcolm Mill
            Message 5 of 7 , Nov 2, 2004
              On Tue, 02 Nov 2004 23:48:27 +0100, Jenda Krynicky <jenda@...> wrote:
              > From: Malcolm Mill <malcolm.mill@...>
              >
              >
              > > Hi,
              > > Newbie here having trouble with regex.
              > >
              > > I'm trying to parse an html file saved as text to find all instances
              > > of time.
              > >
              > > The parsed file is called myFindTime2.txt and contains text like....
              > > ============================================= <tr>
              > > <td align="right" valign="top">
              > > <font size="2" color="#000000" class="listings">
              > > 11:30 am<br />
              > > </font>
              > > </td>
              > > </tr>
              > >
              > > ==============================================
              >
              > It's generally not a good idea to parse HTML with regexps, unless
              > the HTML was generated by a program and you can be sure it looks
              > (and will keep looking) the way you expect.
              >
              > > The script is called myFindTime2.pl and contains the lines....
              > > ==============================================
              > > foreach (<>) {
              > > s/([\t][\d2][:][\d2][\s])/\1/;
              >
              > Did you actually read a regexp tutorial or something? [] denotes a
              > character class. [abcd] means "any of the characters a, b, c and d".
              >

              Hi Jenda,
              The book I'm working from is "Apache, MySQL, and PHP Web Development:
              For Dummies". It is a 7-books-in-1 title, and has sections on Perl and
              Regular Expressions. I looked for explicit examples of what I want to
              do but couldn't find any, so I have been messing around with the given
              examples to try to get the result I want.

              > > print $_;
              >
              > If you are searching for something you should not mangle the data,
              > just capture what you need:
              >
              > /\t(\d{2}:\d{2})\s/ and print "$1\n";
              >
              >
              > > }
              >

              This code, from another list, does what I eventually wanted to do,
              which was to capture not only two-digit hours (\d{2}) but also
              one- or two-digit hours (\d{1,2}) followed by [ap]m.

              foreach (<>) {
              print "$1\n" while ($_ =~ /(\d{1,2}\:\d{2} [ap]m)/gi);
              }

              What exactly do { } and g do?
              I understand \d{1,2} means match one or two digits. Why
              couldn't I use [ ]? Are they only for characters?

              > Jenda
              > ===== Jenda@... === http://Jenda.Krynicky.cz =====
              > When it comes to wine, women and song, wizards are allowed
              > to get drunk and croon as much as they like.
              > -- Terry Pratchett in Sourcery
            • Malcolm Mill
              Message 6 of 7 , Nov 2, 2004
                On Tue, 2 Nov 2004 16:57:42 -0500 (EST), Jeff 'japhy' Pinyan
                <japhy@...> wrote:
                > On Nov 2, Malcolm Mill said:
                >
                >
                >
                > ><tr>
                > > <td align="right" valign="top">
                > > <font size="2" color="#000000" class="listings">
                > > 11:30 am<br />
                > > </font>
                > > </td>
                > ></tr>
                >
                > > s/([\t][\d2][:][\d2][\s])/\1/;
                >
                > That regex has SEVERAL things wrong with it. First of all, backreferences
                > on the right-hand side should be $1, not \1. Second, quantifiers on parts
                > of a regex are done with {...}, so two digits is \d{2}. Third, NONE of
                > those character classes is necessary:
                >
                > s/(\t\d{2}:\d{2}\s)/$1/;

                Hi, Japhy.
                I wasn't sure how to use the backreference. The example from my book I
                based my regex on was

                $name = "Ingelbert Inguishable";
                $name =~ s/([A-Z]\w+)\b ([A-Z]\w+)\b/\2, \1/;
                print "$name\n";
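
For comparison, a sketch of the same book example written with $2 and $1 in the replacement, which runs cleanly under "use warnings". The sub name swap_name is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the book's substitution, using $2 and $1
# in the replacement instead of \2 and \1.
sub swap_name {
    my ($name) = @_;
    $name =~ s/([A-Z]\w+)\b ([A-Z]\w+)\b/$2, $1/;
    return $name;
}

print swap_name("Ingelbert Inguishable"), "\n";   # prints "Inguishable, Ingelbert"
```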

                >
                > But why are you matching something and replacing it with itself? Perhaps
                > you want to do:
                >
                > print $1 if /\t(\d{2}:\d{2})\s/;

                Yep, that is what I wanted to do. I made a test script

                if ("06:00" =~ /[\d2][:][\d2]/) {
                print "Time found.\n";
                }


                This worked and I assumed I could use it in a larger expression. Why
                was I trying to match something and replace it with itself? I didn't
                know any other way. I am just working from a basic book with limited
                examples, trying to extrapolate where there are no explicit examples
                for what I want.
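
A sketch of why that test script reported success: the pattern is not anchored, so it is satisfied by the substring "6:0" in the middle of "06:00". The sub names loose_match and strict_match are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The test pattern, with a capture group added to show what it matched.
sub loose_match {
    my ($s) = @_;
    return $s =~ /([\d2][:][\d2])/ ? $1 : undef;
}

# An anchored pattern that insists on two digits on each side.
sub strict_match {
    my ($s) = @_;
    return $s =~ /^(\d{2}:\d{2})$/ ? $1 : undef;
}

print loose_match("06:00"), "\n";    # prints "6:0"
print strict_match("06:00"), "\n";   # prints "06:00"
print "strict pattern rejects 6:0\n" unless defined strict_match("6:0");
```

Once \t and \s are placed directly around the single-digit pattern, that middle substring no longer fits, because in the file the tab is followed by two digits rather than one.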

                This code (from a response to this query submitted to another list)
                does what I ultimately wanted to do.

                foreach (<>) {
                print "$1\n" while ($_ =~ /(\d{1,2}\:\d{2} [ap]m)/gi);
                }

                Thanks,
                Malcolm.

                > --
                > Jeff "japhy" Pinyan % How can we ever be the sold short or
                > RPI Acacia Brother #734 % the cheated, we who for every service
                > http://japhy.perlmonk.org/ % have long ago been overpaid?
                > http://www.perlmonks.org/ % -- Meister Eckhart
              • Jenda Krynicky
                Message 7 of 7 , Nov 3, 2004
                  From: Malcolm Mill <malcolm.mill@...>
                  > Hi Jenda,
                  > The book I'm working from is "Apache, MySQL, and PHP Web Development:
                  > For Dummies". It is a 7 books in 1 title, and has sections on Perl and
                  > Regular Expressions. I looked for explicit examples of what I want to
                  > do but couldn't find any, so I have been messing around with the given
                  > examples to try to get the result I want.

                  Try to read "perlretut".
                  Either run
                  perldoc perlretut
                  or go to http://www.perldoc.com/perl5.8.4/pod/perlretut.html

                  > > > print $_;
                  > >
                  > > If you are searching for something you should not mangle the data,
                  > > just capture what you need:
                  > >
                  > > /\t(\d{2}:\d{2})\s/ and print "$1\n";
                  > >
                  > >
                  > > > }
                  > >
                  >
                  > This code, from another list does what I eventually wanted to do,
                  > which was capture not only d2 numbers, but d1,2 and [ap]m.
                  >
                  > foreach (<>) {
                  > print "$1\n" while ($_ =~ /(\d{1,2}\:\d{2} [ap]m)/gi);
                  > }
                  >
                  > What exactly do { } and g do?

                  The number(s) inside {} specify how many occurrences of the
                  previous character or group to match.
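
A sketch of both pieces together, using made-up sample text and a hypothetical sub name find_times. {1,2} lets the hour be one or two digits; the g modifier makes the match resume where the previous one ended, so a while loop visits every time on the line instead of only the first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect every time in the text. With /g in scalar context, each pass
# of the while loop continues from the end of the previous match.
sub find_times {
    my ($text) = @_;
    my @times;
    while ($text =~ /(\d{1,2}:\d{2} [ap]m)/gi) {
        push @times, $1;
    }
    return @times;
}

# Made-up sample text for illustration.
my $text = "Doors 9:05 am, show 11:30 AM, ends 1:15 pm";

print "$_\n" for find_times($text);   # 9:05 am, 11:30 AM, 1:15 pm (one per line)

# Without /g, only the first match is found.
my ($first) = $text =~ /(\d{1,2}:\d{2} [ap]m)/i;
print "first only: $first\n";         # prints "first only: 9:05 am"
```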

                  > I understand \d{1,2} means match one or two digits. Why
                  > couldn't I use [ ]? Are they only for characters?

                  Inside [] most special characters are not special anymore. The []
                  specifies a group of characters (a character class) that may be
                  accepted at the current place. This means that
                  [\d2]
                  means "match either a digit or 2" and is equivalent to
                  \d
                  or
                  [0-9]
                  or
                  [0123456789]
                  Also
                  [\d{1,2}]
                  is equivalent to
                  [}{,\d]
                  that is, it means "match an opening or closing curly brace, a
                  comma, or a digit".

                  Also /[:]/ is the same as /:/, /[\t]/ is the same as /\t/, /[\s]/ is
                  the same as /\s/.

                  So if I just simplified your regexp
                  s/([\t][\d2][:][\d2][\s])/\1/;
                  I'd get
                  s/(\t\d:\d\s)/\1/;
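
A small sketch to check the character-class reading above; class_matches is a hypothetical helper name. The pattern is anchored so that it accepts exactly one character from the class.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Inside [], the text {1,2} is not a quantifier: the braces and the comma
# are just more members of the class, alongside \d.
sub class_matches {
    my ($s) = @_;
    return $s =~ /^[\d{1,2}]$/ ? 1 : 0;
}

for my $s ('7', '{', ',', '}', 'x', '12') {
    printf "%-3s %s\n", $s, class_matches($s) ? "matches" : "does not match";
}
```

The single characters 7, {, comma and } match; x does not, and neither does 12, since the class stands for one character only.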

                  Jenda
                  ===== Jenda@... === http://Jenda.Krynicky.cz =====
                  When it comes to wine, women and song, wizards are allowed
                  to get drunk and croon as much as they like.
                  -- Terry Pratchett in Sourcery