
Regular Expression matches but captures are undefined

  • warrengallin
    Message 1 of 7 , Jan 19, 2013
      I am having a problem with some lines of text matching my regular expression, but the captured parts of the match are not defined.

      Below is a minimal example of my problem. The condition of the if statement is met and the match value is printed, but the four individual captures are undefined and are not printed. Perl 5.12 running on OS X Mountain Lion. Note that this is one line from a large file; most lines are handled as expected, but several fail in the same way.

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $temp_in = "WJG2983 jShaw2 ORF Forward Redesigned for longer overlap, Tm=60 from oligoCalc Web site, note, one C shorter than WJG2981, avoid in frame ORF w/ beta gal promoter IF you are not cutting for ligation into pXT7 CAA CTT TGG CAG ATC GGT ACC GAA TTCTCGAGCCACCatgtcggcagcaagaaatct ";

      if ($temp_in =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/){
      print "Match is:\n $&\n";
      my $sequence = $4;
      $sequence =~ tr/[a-z]/[A-Z]/;
      $sequence =~ s/\s//;
      my $title = $1;
      my $comment1 = $2;
      my $comment2 =$3;
      print "Found a Match\n$title\n$comment1\n$comment2\n$sequence\n";
      }
      else{

      print "Did not match.\n";

      }

      exit;
    • timothy adigun
      Message 2 of 7 , Jan 19, 2013
        Hi warrengallin,

        Please check my comments below:
        On Sun, Jan 20, 2013 at 12:45 AM, warrengallin <wgallin@...> wrote:

        > I am having a problem with some lines of text matching my regular
        > expression, but the captured parts of the match are not defined.
        >
        > Attached is a minimal example of my problem. The match condition for the
        > if condition is met, the match value is printed, but the four individual
        > captures are undefined and are not printed. Perl 5.12 running on OSX
        > Mountain Lion. Note, this is one line from a large file, most of which are
        > handled as expected, but several of which fail in the same way.
        >
        > #!/usr/bin/perl
        > use strict;
        > use warnings;
        >
        > my $temp_in = "WJG2983 jShaw2 ORF Forward Redesigned for longer overlap,
        > Tm=60 from oligoCalc Web site, note, one C shorter than WJG2981, avoid in
        > frame ORF w/ beta gal promoter IF you are not cutting for ligation into
        > pXT7 CAA CTT TGG CAG ATC GGT ACC GAA TTCTCGAGCCACCatgtcggcagcaagaaatct ";
        >
        > if ($temp_in =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/){
        > print "Match is:\n $&\n";
        > my $sequence = $4;
        > $sequence =~ tr/[a-z]/[A-Z]/;
        > $sequence =~ s/\s//;
        > my $title = $1;
        > my $comment1 = $2;
        > my $comment2 =$3;
        > print "Found a Match\n$title\n$comment1\n$comment2\n$sequence\n";
        > }
        > else{
        >
        > print "Did not match.\n";
        >
        > }
        >
        Instead of using a regex to capture each desired substring, I would rather
        suggest using Perl's split function, like so, to solve this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $temp_in =
        "WJG2983 jShaw2 ORF Forward Redesigned for longer overlap, Tm=60 from
        oligoCalc Web site, note, one C shorter than WJG2981, avoid in frame ORF w/
        beta gal promoter IF you are not cutting for ligation into pXT7 CAA CTT TGG
        CAG ATC GGT ACC GAA TTCTCGAGCCACCatgtcggcagcaagaaatct ";

        my @string_array = split /\t/, $temp_in, 4;

        if ( $temp_in =~ m/^WJG\d{4}/ and @string_array == 4 ) {
            my ( $title, $comment1, $comment2, $sequence ) = @string_array;
            $sequence =~ tr/a-z/A-Z/;    # tr/// takes character lists, so no brackets are needed
            print "Found a Match\n\nTitle: ", $title,
                "\n\nComment 1: ", $comment1, "\n\nComment 2: ", $comment2,
                "\n\nSequence: ", $sequence, "\n\n";
        }
        else {
            print "Did not match.\n";
        }
        __END__

        OUTPUT:
        Found a Match

        Title: WJG2983

        Comment 1: jShaw2 ORF Forward

        Comment 2: Redesigned for longer overlap, Tm=60 from oligoCalc Web site,
        note, one C shorter than WJG2981, avoid in frame ORF w/ beta gal promoter
        IF you are not cutting for ligation into pXT7

        Sequence: CAA CTT TGG CAG ATC GGT ACC GAA TTCTCGAGCCACCATGTCGGCAGCAAGAAATCT

        NOTE:
        If you still want to print the matched string, you can print the
        original "$temp_in" inside the "if" block, so that it is printed only
        when the condition is met.

        For more on split, see *perldoc -f split*.
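
As a rough illustration of the LIMIT argument used above, here is a minimal sketch. The sample line is made up (it is not the poster's data), and stripping all whitespace from the sequence is an assumption about what the original s/\s// was meant to do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample line: four tab-separated fields, an extra tab inside the
# last field, and a trailing tab.
my $line = "WJG2983\tcomment one\tcomment two\tcaa ctt\tgg\t";

# With LIMIT 4 the line is cut into at most four fields; everything after
# the third tab, further tabs included, stays in the last field.
my @limited = split /\t/, $line, 4;

# Without a LIMIT every tab separates fields, and trailing empty fields
# are dropped.
my @unlimited = split /\t/, $line;

# Upper-case the sequence and strip all whitespace (note the /g flag;
# without it only the first whitespace character would be removed).
( my $sequence = uc $limited[3] ) =~ s/\s+//g;

print scalar(@limited), " fields with LIMIT, ", scalar(@unlimited), " without\n";
print "Sequence: $sequence\n";
```

Whether LIMIT is the right choice depends on the data: it tolerates stray tabs in the last field, but it also hides them, so checking the field count without a LIMIT may be preferable if the file format is strict.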


        > exit;
        >
        >
        >



        --
        Tim


      • Oral Akkan
        Message 3 of 7 , Jan 19, 2013
          Hi

          Just use this one, and write again if it does not work. I cannot test it for you, because the original text must contain tab characters (\t) and I see only spaces here. Keep in mind that $1, $2, $3, ... are short-lived: the next successful match overwrites them, so you must save them in variables immediately after matching.

          ...
          ...
          if ($temp_in =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/){
              my ($title,$comment1,$comment2,$sequence) = ($1,$2,$3,$4);
              print "Match is:\n $&\n";
              $sequence =~ tr/[a-z]/[A-Z]/;
              $sequence =~ s/\s//;
              print "Found a Match\n$title\n$comment1\n$comment2\n$sequence\n";
          }
          else{
              print "Did not match.\n";
          }





        • Warren Gallin
          Message 4 of 7 , Jan 19, 2013
            Tim,

            That worked perfectly. Although I am curious about why the regex approach failed, your suggestion makes my script work, which is the most important thing.

            Thanks,

            Warren

          • Jenda Krynicky
            Message 5 of 7 , Jan 20, 2013
              From: "warrengallin" <wgallin@...>
              > I am having a problem with some lines of text matching my regular expression, but the captured parts of the match are not defined.
              >
              > Attached is a minimal example of my problem. The match condition for
              > the if condition is met, the match value is printed, but the four
              > individual captures are undefined and are not printed. Perl 5.12
              > running on OSX Mountain Lion. Note, this is one line from a large
              > file, most of which are handled as expected, but several of which fail
              > in the same way.
              >
              > #!/usr/bin/perl
              > use strict;
              > use warnings;
              >
              > my $temp_in = "WJG2983 jShaw2 ORF Forward Redesigned for longer overlap, Tm=60 from oligoCalc Web site, note, one C shorter than WJG2981, avoid in frame ORF w/ beta gal promoter IF you are not cutting for ligation into pXT7 CAA CTT TGG CAG ATC GGT ACC GAA TTCTCGAGCCACCatgtcggcagcaagaaatct ";
              >
              > if ($temp_in =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/){
              > print "Match is:\n $&\n";
              > my $sequence = $4;
              > $sequence =~ tr/[a-z]/[A-Z]/;
              > $sequence =~ s/\s//;

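              The line above is the problem. $1 and friends contain the data
              from the last successful regexp match, and s/.../.../ is a regex match
              and replace. When it succeeds it resets them, and since this pattern
              has no capture groups, $1 to $4 become undefined. A failed s/// leaves
              them alone, which is why only the lines whose fourth field contains
              whitespace are affected.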

              > my $title = $1;
              > my $comment1 = $2;
              > my $comment2 =$3;
              > print "Found a Match\n$title\n$comment1\n$comment2\n$sequence\n";

              You should copy the data from $1, $2, ... to ordinary variables as
              soon as possible, before something overwrites them.
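
A minimal sketch of the behaviour described above, for anyone who wants to see it in isolation. The two-field sample line and the %still_defined hash are made up for illustration; only the WJG prefix and the clean-up statements come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: a title, a tab, and a lower-case sequence with spaces.
my $line = "WJG2983\tcaa ctt tgg";

# Records whether $1 is still defined after each clean-up step.
my %still_defined;

if ( $line =~ m/^(WJG\d{4})\t([^\t]*)/ ) {
    my $sequence = $2;

    # tr/// is a transliteration, not a regex match: captures are untouched.
    $sequence =~ tr/a-z/A-Z/;
    $still_defined{tr} = defined($1) ? 1 : 0;

    # This substitution fails (there is no "X"), so captures are untouched.
    $sequence =~ s/X//;
    $still_defined{failed_subst} = defined($1) ? 1 : 0;

    # This substitution succeeds and has no capture groups, so $1 is reset.
    $sequence =~ s/\s//;
    $still_defined{successful_subst} = defined($1) ? 1 : 0;
}

for my $step (qw(tr failed_subst successful_subst)) {
    print "$step: \$1 is ", ( $still_defined{$step} ? "defined" : "undefined" ), "\n";
}
```

All three steps are deliberately kept in the same block as the original match: capture variables are restored when an inner block or a called subroutine finishes, so a match made there would not show the problem.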

              Jenda
              ===== Jenda@... === http://Jenda.Krynicky.cz =====
              When it comes to wine, women and song, wizards are allowed
              to get drunk and croon as much as they like.
              -- Terry Pratchett in Sourcery
            • Warren Gallin
              Message 6 of 7 , Jan 20, 2013
                Thanks, that explains it - I'll keep this in mind in the future.

                Warren Gallin

              • afbach1
                Message 7 of 7 , Jan 21, 2013
                  if ($temp_in =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/){
                  print "Match is:\n $&\n";
                  my $sequence = $4;
                  $sequence =~ tr/[a-z]/[A-Z]/;
                  $sequence =~ s/\s//;
                  my $title = $1;
                  my $comment1 = $2;
                  my $comment2 =$3;
                  print "Found a Match\n$title\n$comment1\n$comment2\n$sequence\n";
                  }

                  Your successful match against "s/\s//" resets the capture vars (the
                  "tr" does not, since tr/// is not a regex operation). The advantage of
                  splitting over an RE depends upon how confident you are in the data
                  formatting.

                  else{

                  print "Did not match.\n";

                  }

                  It is worth adding the input line number ("$.") and the input line itself
                  to the error message (and maybe using "warn" instead of "print"), to make
                  any data problems easier to track down.
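
One way that suggestion might look, as a minimal sketch. The two sample records and the in-memory filehandle are made up so that the example is self-contained; in the real script the loop would read from the actual data file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample data: the second record has too few tab-separated fields.
my $data = "WJG0001\ta\tb\tACGT\t\nWJG0002\tonly two fields\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my ( @good, @bad );
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ m/^(WJG\d{4})\t([^\t]*)\t([^\t]*)\t([^\t]*)\t/ ) {
        # Copy the captures at once, before anything can overwrite them.
        push @good, [ $1, $2, $3, $4 ];
    }
    else {
        # $. is the current input line number; warn writes to STDERR,
        # so the report does not get mixed into the normal output.
        warn "Line $.: did not match: '$line'\n";
        push @bad, $.;
    }
}
close $fh;

print scalar(@good), " matched, ", scalar(@bad), " failed\n";
```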

                  a
                  ----------------------
                  Andy Bach
                  Systems Mangler
                  Internet: andy_bach@...
                  Voice: (608) 261-5738, Cell: (608) 658-1890

                  "If Java had true garbage collection, most programs would delete
                  themselves upon execution."
                  Robert Sewell.