Re: [NTS] Trying to perfect RegExp to match various numbers

  • Don
    Message 1 of 14, Mar 29, 2011
      On 3/29/2011 11:04 AM, Sheri wrote:
      > Hi Joy,
      >
      > A caret inside a character class negates the character class. Outside of
      > a character class it indicates BOL (beginning of the line). Avoid
      > making the very first character of a pattern a caret, because NoteTab
      > Pro (but not Std or Lite) sometimes advances the cursor before testing
      > the regex if it is.
      >
      > Sounds like you wanted to match whole lines with a number in them, so you could use
      >
      > (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$
      >
      > Otherwise, try
      >
      > (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
      >
      > Hope that helps. I've escaped the plus and the period in character
      > classes above, which is my habit but isn't strictly necessary. A hyphen
      > in a character class does need escaping unless it comes first or last.
      >
      > Regards,
      > Sheri
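
For readers who want to try the second pattern outside NoteTab, here is a minimal sketch in Python. It rests on one assumption: Python's built-in `re` module has no `\K`, so the lookbehind `(?<!\S)` stands in for `(?:^|\s)\K` and `(?!\S)` stands in for `(?=\s|$)`. The names `NUMBER`, `find_numbers` and `sample` are made up for the illustration.

```python
import re

# Assumption: Python's built-in `re` lacks \K, so lookarounds stand in for it.
# (?<!\S) = "not preceded by a non-space", i.e. start of text or whitespace.
# (?!\S)  = "not followed by a non-space", i.e. whitespace or end of text.
NUMBER = re.compile(r"(?<!\S)[+-]?[0-9,.]+(?!\S)")


def find_numbers(text):
    """Return each whitespace-delimited run of digits, commas and periods,
    including an optional leading sign."""
    return NUMBER.findall(text)


sample = (
    "12345 xxx456 456xx\n"
    "-12,345 +12,345.01\n"
    "www.45.67.hmt2\n"
    "there are 0 lines\n"
    "1 or -2 more."
)
print(find_numbers(sample))
# ['12345', '-12,345', '+12,345.01', '0', '1', '-2']
```

Like the original, this accepts any mix of digits, commas and periods, so a lone "." or "," surrounded by spaces would also be reported.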


      Hi Sheri,

      Can you explain those to us? I am trying to grow here a bit ...
      If I search and replace this:
      File 67817 Id.ppt

      Using:
      ^(.*?)\t([0-9,]+)\t(.*)
      and
      first: $1; second: $2; third $3
      I get:
      first: File; second: 67817; third Id.ppt

      If I use:
      ^(.*?\K)\t([0-9,]+)\t(.*)
      I get this:
      Filefirst: File; second: 67817; third Id.ppt
      I didn't capture the word File, but then it is still in the parentheses
      -- I could drop the parens around the first .*? netting this:
      Filefirst: 67817; second: Id.ppt; third $3

      But if I use this:
      ^.*?\K\t([0-9,]+)\t(.*\K)
      I get nothing found. Why does my last \K not work for me?

      Thanks for helping me understand both what you are doing and my inferior
      attempt.

      Don
    • Sheri
      Message 2 of 14, Mar 29, 2011
        Hi Don,

        Let me know if you have a specific question about my patterns; I don't
        see anything there that should be hard to follow. If a subpattern starts
        with ?: that makes it non-capturing, if that threw you.

        On 3/29/2011 11:27 AM, Don wrote:
        >
        > ^.*?\K\t([0-9,]+)\t(.*\K)
        > I get nothing found. why does my last \K not work for me?
        >
        > Thanks for helping me understand both what you are doing and my inferior
        > attempt.

        \K defines a split point in the pattern. Anything matched before
        the \K is dropped from the reported match. So if a pattern ends with
        \K, the most it could match is the empty string that follows
        everything that has been discarded.

        I think the only time it might make sense to have more than one \K in a
        pattern would be if they were parts of different alternatives (where
        alternatives are separated by vertical bars).

        Regards,
        Sheri
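
The split-point behaviour described above can be imitated in Python, with the caveat that the built-in `re` module does not support `\K`. This sketch therefore emulates it: the text before the split point is matched as group 1 and put back unchanged in front of the replacement. The names `PATTERN`, `TEMPLATE` and `keep_prefix` are invented for the example.

```python
import re

line = "File\t67817\tId.ppt"
PATTERN = re.compile(r"^(.*?)\t([0-9,]+)\t(.*)")
TEMPLATE = r"first: \1; second: \2; third \3"

# Ordinary replace: the whole match is replaced.
plain = PATTERN.sub(TEMPLATE, line)


def keep_prefix(m):
    # Emulates ^(.*?\K)\t...: group 1 is matched but not part of the
    # replaced span, so it survives in front of the replacement text.
    return m.group(1) + m.expand(TEMPLATE)


with_split = PATTERN.sub(keep_prefix, line)

# A \K at the very end of the pattern leaves only the text between the
# split point and the end of the match, which is always empty.
m = PATTERN.search(line)
after_trailing_split = line[m.end(3):m.end()]

print(plain)       # first: File; second: 67817; third Id.ppt
print(with_split)  # Filefirst: File; second: 67817; third Id.ppt
```

Both printed lines agree with the results Don reported for his first two patterns.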
      • mycroftj
        Message 3 of 14, Mar 30, 2011
          I'm terribly sorry for not being clear. I branched into quite a few directions at once.

          The goal was to pad all numbers in a document with spaces or zeros so they would sort correctly.

          Instead of writing a clip, I thought it might be done in one mighty regexp replace (777->000777 and 77->000077) but that would involve calculating lengths and taking decimal points into consideration so I don't see how that can be done. OR CAN IT?

          I DID want to learn how to pick out just the numbers and Sheri seems to have hit on that perfectly with
          (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
          which picks out just the numbers so I can manipulate the selected text in a clip. (Thanks again, Sheri!)

          The lines containing FILE were from actual data, whereas the rest was just a made-up assortment of numbers I was experimenting with.

          Regexps are one of the most useful things I've stumbled upon in years but SO frustrating. I'll keep trying and I really do appreciate the help from everyone.

          Joy
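
On the "OR CAN IT?" question: a plain search-and-replace cannot compute lengths, but a replace with a callback can. The following is only a sketch in Python, not a NoteTab clip. The function name `pad_numbers`, the default width of 6 and the decision to ignore comma-grouped numbers are all assumptions made for the illustration.

```python
import re

# Whitespace-delimited number: optional sign, integer part, optional fraction.
STANDALONE = re.compile(r"(?<!\S)([+-]?)(\d+)(\.\d*)?(?!\S)")


def pad_numbers(text, width=6):
    """Zero-pad the integer part of each standalone number to `width` digits."""

    def pad(m):
        sign, whole = m.group(1), m.group(2)
        frac = m.group(3) or ""  # group 3 is None when there is no fraction
        return sign + whole.zfill(width) + frac

    return STANDALONE.sub(pad, text)


print(pad_numbers("777 77"))              # 000777 000077
print(pad_numbers("File 705 56968.mbs"))  # File 000705 56968.mbs
```

Digits inside file names such as 56968.mbs are left alone because they are not followed by whitespace.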


          --- In ntb-scripts@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
          >
          > After reading through this several times, I could not determine the actual goal. To get good assistance, you should
          > provide the starting data, the result that is wanted, and the rules you want, as well as identifying any data you don't
          > want.
          >
          > So, is the goal to sort? By numbers only? Padded numbers? Or is the actual data not of interest in your message? It
          > seems that the first part of the message and the last part are not on the same subject to me.
          >
          > Regards,
          > John
          >
          >
          > From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On Behalf Of mycroftj
          > Sent: Monday, March 28, 2011 16:10
          > To: ntb-scripts@yahoogroups.com
          > Subject: [NTS] Trying to perfect RegExp to match various numbers
          >
          >
          > My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
          >
          > Actual data looks something like
          >
          > File 67817 Id.ppt
          > File 691037 20dat.sys
          > File 69870 Lock.doc
          > File 705 56968.mbs
          > File 70537 Jil.xls
          > File 71168 Gas.jpg
          >
          > I then became interested in trying to find a regexp that will match numbers surrounded by BOL, EOL, spaces and tabs with
          > signs, decimal point and commas optional.
          >
          > In the following test data, the numbers starting with 123 should be matched as well as the integers 0, 1 and 2.
          > xxx456 and 456xx should NOT be matched.
          >
          > I have something that mostly works but it also matches x456 and the t2 in hmt2. WHY IS THAT?
          > It also misses 0, 1 and 2 although it does (correctly) pick up -2.
          >
          > I put in the caret because it was not matching numbers at the start of a line. Is that how it's done?
          >
          > Thanks for your help. I ordered the Regular Expressions Cookbook today. Hope it's as good as the reviews say!
          >
          > Joy
          >
          > What I have so far [^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)
          >
          >
          > 12345
          > 12345.678
          > 1234567.90
          > xxx456
          > 456xx
          >
          > xx456,123.34
          > www.45.67.hmt2
          >
          > 12,345
          > -12,345
          > 12,345.
          > +12,345.01
          >
          > 12345
          > +123,45.678
          > 1234567.90
          >
          > there are 0 lines
          > 1 or 2 more.
          >
          > 12345
          > -12345.678
          > 1234567.90
          > xxx456
          > +456xx
          >
          > xx456.34
          > www.45.67.hmt2
          >
          > +12345
          >
          > 12345
          > -12345.678
          > 1234567.90
          > xxx456
          > 456xx
          >
          > there are 0 lines. The zero should match as should the following one and negative two.
          > 1 or -2 more.
          >
          >
          >
          >
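
The "WHY IS THAT?" in the quoted post has a short answer: `[^\s]` is a negated character class, so it consumes one non-space character in front of the digits. That is why x456 and t2 match, and why a lone 0, 1 or 2 fails (the digit itself is used up by `[^\s]`, leaving nothing for `\d+`). In -2 the sign is consumed instead, so it still matches. A small Python sketch shows both the symptom and one possible adjustment. The `ADJUSTED` pattern, which swaps the class for a lookbehind, is a suggestion rather than anything posted in the thread.

```python
import re

# Pattern as posted: [^\s] consumes one non-space character before the digits.
ORIGINAL = re.compile(r"[^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)")

# Hypothetical adjustment: assert "no non-space before" without consuming it.
ADJUSTED = re.compile(r"(?<!\S)[\-\+]?\d+,*\d*\.?\d*(?=\s)")

sample = "xxx456 hmt2 0 1 -2 12345 \n"

print(ORIGINAL.findall(sample))  # ['x456', 't2', '-2', '12345']
print(ADJUSTED.findall(sample))  # ['0', '1', '-2', '12345']
```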
        • Eb
          Message 4 of 14, Mar 30, 2011
            I recall a post by Diodeom in the Clips group, with a bit of razzle-dazzle, that might do what you want. Perhaps Dio would know what I'm talking about?

            I do not remember the topic, but I believe it had to do with sorting a table of numbers, numerically, even though the numbers were left-justified.

            Cheers

          • Alec Burgess
            Message 5 of 14, Mar 30, 2011
              cc ntb-clips (see note at end)

              On 2011-03-30 15:21, mycroftj wrote:
              > I'm terribly sorry for not being clear. I branched into quite a few
              > directions at once.
              >
              > The goal was to pad all numbers in a document with spaces or zeros so
              > they would sort correctly.
              >
              > Instead of writing a clip, I thought it might be done in one mighty
              > regexp replace (777->000777 and 77->000077) but that would involve
              > calculating lengths and taking decimal points into consideration so I
              > don't see how that can be done. OR CAN IT?
              >
              >
              The following clip enforces 5 digits (zero-padded) before the optional
              decimal point and 4 after:

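              H=test B3-30 leading / trailing zeros
              ; Alec Burgess 2011-03-30
              ; currently enforces 5 digits before (optional) decimal and 4 after
              ; 1. prepend five zeros and append four; "===" keeps $2 apart from the zeros
              ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2===0000" rwais
              ; 2. remove the "===" marker
              ^!replace "===" >> "" rwais
              ; 3. trim surplus zeros to exactly 5 digits before and 4 after the point
              ^!replace "\b0*(\d{5})\.(\d{4})0*\b" >> "$1.$2" rwais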

              Note - I wanted the first replace to be just "00000$1.$20000" but I
              haven't figured out how to prevent clip replace from confusing $2 with
              $20 (i.e. a non-existent 20th or 20000th sub-pattern).

              Does anyone know how to do this? As is, just make sure "===" is a
              string which does not exist in the input.

              Note - adding the line ^!replace "\.0+(?!\d)" >> "" rwais to the above
              will remove a decimal part that consists only of padding zeros, along
              with its decimal point.

              sample input
              1
              123
              123.
              123.1
              123.123

              resulting output
              00001.0000
              00123.0000
              00123.0000.
              00123.1000
              00123.1230
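
The three replaces can be checked outside NoteTab. Below is a rough Python translation, offered as a sketch: the function name `pad_fixed` is invented, `$1`/`$2` become `\1`/`\2`, and the clip's `rwais` options are assumed to amount to "regex, whole document, all occurrences".

```python
import re


def pad_fixed(text):
    """Pad every number to 5 digits before and 4 after the decimal point."""
    # Step 1: prepend five zeros and append a marker plus four zeros.
    text = re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\1.\2===0000", text)
    # Step 2: drop the marker that kept group 2 apart from the zeros.
    text = text.replace("===", "")
    # Step 3: trim the surplus zeros on both sides.
    return re.sub(r"\b0*(\d{5})\.(\d{4})0*\b", r"\1.\2", text)


sample = "1\n123\n123.\n123.1\n123.123"
print(pad_fixed(sample))
```

Run on the sample input, this prints the same five lines as the resulting output listed above, including the stray trailing period on the third line.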

              btw: the ntb-clips group would be a better place for this discussion than
              ntb-scripts. As originally intended, ntb-scripts was for discussion of
              things like using Perl and JavaScript in clip code. Its readership is
              much smaller than that of ntb-clips, though I assume everyone who follows
              ntb-scripts also follows ntb-clips :-)

              Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
            • Eb
              Message 6 of 14, Apr 1 1:08 PM
                Alec,

                I'm not sure this will work, but the variable ought to break up the output pattern:

                ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais


                Cheers


                Eb

                --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                ...
                > Does anyone know how to do this? As is just make sure "===" is any
                > string which does not exist in the input.
              • Alec Burgess
                Message 7 of 14, Apr 1 2:45 PM
                  On 2011-04-01 16:08, Eb wrote:
                  >
                  >
                  > I'm not sure this will work, but the variable ought to break up the
                  > output pattern:
                  >
                  > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                  It does allow the $2 to be substituted but ^%empty% does not appear to
                  get translated.
                  I get results like this:
                  45.6 ==> 0000045.6^%empty%0000
                  --
                  Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                • Eb
                  Message 8 of 14, Apr 4 6:18 AM
                    Ok, try this (hex code '\x30' for the first zero):

                    $2\x30000

                    Eb


                  • Alec Burgess
                    Message 9 of 14, Apr 4 2:07 PM
                      Thanks Eb - \x30 works.
                      When I was messing around with this I had tried the same thing, but I
                      realize now that I was trying (the meaningless) uppercase \X30 instead
                      of the correct \x30. Oops! :-[
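
The same "group 2 followed by a literal zero" ambiguity exists in other regex flavours. As a point of comparison only, and not something available in NoteTab clips as far as this thread shows, Python's `re` replacement syntax offers `\g<2>` to mark where the group number ends. A minimal sketch, with `pad_one_pass` as a made-up name:

```python
import re


def pad_one_pass(text):
    # \g<2> ends the group reference explicitly, so the zeros that follow
    # are literal text and no "===" marker or \x30 escape is needed.
    return re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\g<1>.\g<2>0000", text)


print(pad_one_pass("45.6"))  # 0000045.60000
```

The output still needs the trimming replace from the clip to cut it down to 5 digits before and 4 after the decimal point.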

                      --
                      Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)

