Re: [NTS] Trying to perfect RegExp to match various numbers

  • Sheri
    Message 1 of 14 , Mar 29, 2011
      Hi Joy,

      A caret inside a character class negates the character class. Outside of
      a character class it indicates BOL (beginning of the line). Avoid
      making the very first character of a pattern a caret, because NoteTab
      Pro (but not Std or Lite) sometimes advances the cursor before testing
      the regex if it is.

      Sounds like you wanted to match whole lines with a number in them, so you could do

      (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$

      Otherwise, try

      (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)

      Hope that helps. I've escaped the plus and the period in character
      classes above, which is my habit but isn't strictly necessary. A hyphen
      in a character class does need escaping unless it is the first or last
      character of the class.

      Regards,
      Sheri
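
A minimal sketch of the same idea in Python rather than NoteTab, for readers who want to experiment outside a clip. Python's standard re module has no \K, so this version assumes lookarounds instead: (?<!\S) stands in for "(?:^|\s)\K" and (?!\S) for "(?=\s|$)". The names NUMBER, JOY_ORIGINAL and sample are made up for the illustration. It also shows why the earlier attempt picked up x456: the leading [^\s] is a negated class that consumes one non-whitespace character, whatever it is.

```python
import re

# Assumed Python equivalent of (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$):
# the number must be preceded and followed by whitespace or a string edge.
NUMBER = re.compile(r"(?<!\S)[+-]?[0-9,.]+(?!\S)")

# The earlier attempt from the thread, unchanged.
JOY_ORIGINAL = re.compile(r"[^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)")

sample = "\n".join([
    "12345",
    "xxx456",
    "456xx",
    "xx456,123.34",
    "www.45.67.hmt2",
    "-12,345",
    "12,345.",
    "+12,345.01",
    "there are 0 lines",
    "1 or -2 more.",
])

matches = NUMBER.findall(sample)
print(matches)

# [^\s] swallows the last "x" of "xxx456", and it needs a character in
# front of the digits, so a lone "0" is missed while "-2" is found.
old_matches = JOY_ORIGINAL.findall("xxx456 0 -2 ")
print(old_matches)
```

Like the original character class, [0-9,.]+ is deliberately loose: it would also accept a stray comma or period standing on its own between spaces.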
    • Don
      Message 2 of 14 , Mar 29, 2011
        Hi Sheri,

        Can you explain those to us? I am trying to grow here a bit ...
        If I search and replace this:
        File 67817 Id.ppt

        Using:
        ^(.*?)\t([0-9,]+)\t(.*)
        and
        first: $1; second: $2; third $3
        I get:
        first: File; second: 67817; third Id.ppt

        If I use:
        ^(.*?\K)\t([0-9,]+)\t(.*)
        I get this:
        Filefirst: File; second: 67817; third Id.ppt
        The word File was left out of the match, but it is still inside the
        parentheses, so $1 still holds it. I could drop the parens around the
        first .*?, netting this:
        Filefirst: 67817; second: Id.ppt; third $3

        But if I use this:
        ^.*?\K\t([0-9,]+)\t(.*\K)
        I get nothing found. Why does my last \K not work for me?

        Thanks for helping me understand both what you are doing and my inferior
        attempt.

        Don
      • Sheri
        Message 3 of 14 , Mar 29, 2011
          Hi Don,

          Let me know if you have a specific question about my patterns; I don't
          see anything there that should be hard to follow. If a subpattern starts
          with ?: that makes it non-capturing, in case that threw you.

          On 3/29/2011 11:27 AM, Don wrote:
          >
          > ^.*?\K\t([0-9,]+)\t(.*\K)
          > I get nothing found. why does my last \K not work for me?
          >
          > Thanks for helping me understand both what you are doing and my inferior
          > attempt.

          \K defines a split point in the pattern. Whatever was matched before
          the \K is discarded from the reported match. So if a pattern ends with
          \K, the most it could match would be the empty string that follows all
          the stuff that's been discarded.

          I think the only time it might make sense to have more than one \K in a
          pattern would be if they were parts of different alternatives (where
          alternatives are separated by vertical bars).

          Regards,
          Sheri
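
A rough Python sketch of what the split point does to a replacement, using Don's tab-separated line. Python's standard re module does not support \K, so the effect is emulated here (an assumption of this sketch, not NoteTab behaviour) by writing the text before the split point back into the output with \1. The variable names are invented for the example.

```python
import re

line = "File\t67817\tId.ppt"
pattern = r"^(.*?)\t([0-9,]+)\t(.*)"

# Ordinary replace: the whole match, including "File", is replaced.
plain = re.sub(pattern, r"first: \1; second: \2; third \3", line)
print(plain)

# Emulating ^(.*?\K)\t... : text matched before \K is not part of the
# reported match, so it survives in front of the replacement. Here that
# is mimicked by re-emitting group 1 at the start of the replacement.
kept_prefix = re.sub(pattern, r"\1first: \1; second: \2; third \3", line)
print(kept_prefix)
```

The second result has the same shape as the "Filefirst: File; ..." output reported above: the prefix is kept, and $1 still holds it because the capture group closed around it.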
        • mycroftj
          Message 4 of 14 , Mar 30, 2011
            I'm terribly sorry for not being clear. I branched into quite a few directions at once.

            The goal was to pad all numbers in a document with spaces or zeros so they would sort correctly.

            Instead of writing a clip, I thought it might be done in one mighty regexp replace (777->000777 and 77->000077) but that would involve calculating lengths and taking decimal points into consideration so I don't see how that can be done. OR CAN IT?

            I DID want to learn how to pick out just the numbers and Sheri seems to have hit on that perfectly with
            (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
            which picks out just the numbers so I can manipulate the selected text in a clip. (Thanks again, Sheri!)

            The lines containing FILE were from actual data, whereas the rest was just a made-up assortment of numbers I was experimenting with.

            Regexps are one of the most useful things I've stumbled upon in years but SO frustrating. I'll keep trying and I really do appreciate the help from everyone.

            Joy
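
On the "OR CAN IT?" question: a plain search-and-replace has no way to compute lengths, but a language whose replace accepts a callback can pad in one pass. The following is only a sketch in Python, not a NoteTab clip; pad_integers and the default width of 6 are assumptions chosen to reproduce the 777->000777 example, and it handles whole integers only, not signs or decimals.

```python
import re

# Whole integers bounded by whitespace or the edges of the text.
INTEGER = re.compile(r"(?<!\S)\d+(?!\S)")

def pad_integers(text, width=6):
    """Left-pad every standalone integer in text with zeros to width."""
    return INTEGER.sub(lambda m: m.group(0).zfill(width), text)

print(pad_integers("777 and 77"))
print(pad_integers("File 705 56968.mbs"))
```

Numbers glued to other text, such as the 56968 in 56968.mbs, are left alone because they are not followed by whitespace.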


            --- In ntb-scripts@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
            >
            > After reading through this several times, I could not determine the actual goal. To get good assistance, you should
            > provide the starting data, the result that is wanted, and the rules you want, as well as identifying any data you don't
            > want.
            >
            > So, is the goal to sort? By numbers only? Padded numbers? Or is the actual data not of interest in your message? It
            > seems that the first part of the message and the last part are not on the same subject to me.
            >
            > Regards,
            > John
            >
            >
            > From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On Behalf Of mycroftj
            > Sent: Monday, March 28, 2011 16:10
            > To: ntb-scripts@yahoogroups.com
            > Subject: [NTS] Trying to perfect RegExp to match various numbers
            >
            >
            > My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
            >
            > Actual data looks something like
            >
            > File 67817 Id.ppt
            > File 691037 20dat.sys
            > File 69870 Lock.doc
            > File 705 56968.mbs
            > File 70537 Jil.xls
            > File 71168 Gas.jpg
            >
            > I then became interested in trying to find a regexp that will match numbers surrounded by BOL, EOL, spaces and tabs with
            > signs, decimal point and commas optional.
            >
            > In the following test data, the numbers starting with 123 should be matched as well as the integers 0, 1 and 2.
            > xxx456 and 456xx should NOT be matched.
            >
            > I have something that mostly works but it also matches x456 and the t2 in hmt2. WHY IS THAT?
            > It also misses 0, 1 and 2 although it does (correctly) pick up -2.
            >
            > I put in the caret because it was not matching numbers at the start of a line. Is that how it's done?
            >
            > Thanks for your help. I ordered the Regular Expressions Cookbook today. Hope it's as good as the reviews say!
            >
            > Joy
            >
            > What I have so far [^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)
            >
            >
            > 12345
            > 12345.678
            > 1234567.90
            > xxx456
            > 456xx
            >
            > xx456,123.34
            > www.45.67.hmt2
            >
            > 12,345
            > -12,345
            > 12,345.
            > +12,345.01
            >
            > 12345
            > +123,45.678
            > 1234567.90
            >
            > there are 0 lines
            > 1 or 2 more.
            >
            > 12345
            > -12345.678
            > 1234567.90
            > xxx456
            > +456xx
            >
            > xx456.34
            > www.45.67.hmt2
            >
            > +12345
            >
            > 12345
            > -12345.678
            > 1234567.90
            > xxx456
            > 456xx
            >
            > there are 0 lines. The zero should match as should the following one and negative two.
            > 1 or -2 more.
            >
            >
            >
            >
          • Eb
            Message 5 of 14 , Mar 30, 2011
              I recall a post by Diodeom in the Clips group, with a bit of razzle-dazzle, that might do what you want. Perhaps Dio would know what I'm talking about?

              I do not remember the topic, but I believe it had to do with sorting a table of numbers, numerically, even though the numbers were left-justified.

              Cheers

            • Alec Burgess
              Message 6 of 14 , Mar 30, 2011
                cc ntb-clips (see note at end)

                On 2011-03-30 15:21, mycroftj wrote:
                > I'm terribly sorry for not being clear. I branched into quite a few
                > directions at once.
                >
                > The goal was to pad all numbers in a document with spaces or zeros so
                > they would sort correctly.
                >
                > Instead of writing a clip, I thought it might be done in one mighty
                > regexp replace (777->000777 and 77->000077) but that would involve
                > calculating lengths and taking decimal points into consideration so I
                > don't see how that can be done. OR CAN IT?
                >
                >
                Following will enforce 5 digits (zero-padded) before optional decimal
                and 4 after

                H=test B3-30 leading / trailing zeros
                ; Alec Burgess 2011-03-30
                ; currently enforces 5 digits before (optional) decimal and 4 after
                ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2===0000" rwais
                ^!replace "===" >> "" rwais
                ^!replace "\b0*(\d{5})\.(\d{4})0*\b" >> "$1.$2" rwais

                Note - I wanted the first replace to be just "00000$1.$20000" but I
                haven't figured out how to prevent clip replace from confusing $2 with
                $20 (i.e. a non-existent 20th or 20000th sub-pattern).

                Does anyone know how to do this? As it stands, just make sure "===" is a
                string which does not exist in the input.

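                Note - adding the line ^!replace "\.0+\b" >> "" rwais to the above will
                remove the decimal point and its padding when the fractional part is
                all zeros. (It has to be 0+ rather than 0*, otherwise the pattern also
                matches the bare decimal point in front of a nonzero digit, as in 123.1.)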

                sample input
                1
                123
                123.
                123.1
                123.123

                resulting output
                00001.0000
                00123.0000
                00123.0000.
                00123.1000
                00123.1230

                btw: the ntb-clips group would be a better place for this discussion than
                ntb-scripts. As originally intended, ntb-scripts was for discussion of
                things like using Perl and JavaScript in clip code. Its readership is
                much smaller than that of ntb-clips, though I assume everyone who follows
                ntb-scripts also follows ntb-clips :-)

                Regards ... Alec (buralex@gmail& WinLiveMess - alec.m.burgess@skype)
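
For anyone who wants to try the three-step approach outside NoteTab, here is a sketch of the same sequence in Python. It assumes Python's re.sub behaves like the clip's regex replace for these patterns, and the function name pad_like_clip is hypothetical. As in the clip, "===" must not occur in the input, and numbers with more than 5 integer digits or more than 4 decimals are not trimmed by the last step.

```python
import re

def pad_like_clip(text):
    """Three replaces mirroring the clip: over-pad, drop marker, trim."""
    # Step 1: put five zeros in front and four behind every number.
    # The "===" marker keeps the second group reference apart from the
    # literal zeros that follow it.
    text = re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\1.\2===0000", text)
    # Step 2: remove the marker again.
    text = text.replace("===", "")
    # Step 3: keep exactly 5 digits before and 4 after the decimal point.
    return re.sub(r"\b0*(\d{5})\.(\d{4})0*\b", r"\1.\2", text)

sample = ["1", "123", "123.", "123.1", "123.123"]
result = pad_like_clip("\n".join(sample)).split("\n")
print(result)
```

Run on the sample input above, this reproduces the listed output, including the leftover period after 00123.0000 for the input "123.".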
              • Eb
                Message 7 of 14 , Apr 1, 2011
                  Alec,

                  I'm not sure this will work, but the variable ought to break up the output pattern:

                  ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais


                  Cheers


                  Eb

                  --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                  ...
                  > Does anyone know how to do this? As is just make sure "===" is any
                  > string which does not exist in the input.
                • Alec Burgess
                  Message 8 of 14 , Apr 1, 2011
                    On 2011-04-01 16:08, Eb wrote:
                    >
                    >
                    > I'm not sure this will work, but the variable ought to break up the
                    > output pattern:
                    >
                    > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                    It does allow the $2 to be substituted but ^%empty% does not appear to
                    get translated.
                    I get results like this:
                    45.6 ==> 0000045.6^%empty%0000
                    --
                    Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                  • Eb
                    Message 9 of 14 , Apr 4, 2011
                      Ok, try this (hex code '\x30' for the first zero):

                      $2\x30000

                      Eb


                    • Alec Burgess
                      Message 10 of 14 , Apr 4, 2011
                        Thanks Eb - \x30 works.
                        When I was messing around with this I had tried the same thing, but I
                        realize now that I was trying (the meaningless) uppercase \X30 instead
                        of the correct \x30. Oops! :-[

                        --
                        Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
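
The same "group reference followed by a literal digit" problem turns up in other regex flavours. As a comparison only (this is Python, not NoteTab clip syntax), Python's replacement strings provide \g<2> to end the reference explicitly, which plays the role that \x30 plays in the clip. The name pad_once is made up for this sketch.

```python
import re

PATTERN = re.compile(r"\b(\d+)\.?(\d*)\b")

def pad_once(text):
    # \g<2> closes the group reference, so the zeros after it are
    # literal text and cannot be read as part of a group number.
    return PATTERN.sub(r"00000\g<1>.\g<2>0000", text)

print(pad_once("45.6"))
```

With this form no "===" marker and no second replace are needed to separate $2 from the padding zeros.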

