Re: [NTS] Re: Trying to perfect RegExp to match various numbers

  • Don
    Message 1 of 14 , Mar 29, 2011
      I'll be honest, I read it and did not understand. Do you just want to
      sort the middle column in that set of numbers?
      Will they always be in columns like this, and do you only want to catch
      the middle one? What is the ===== in File=====number=====filename?
      Tabs? Spaces?
      It appears to be tabs. Are there only three things on each line?

      On 3/29/2011 8:26 AM, mycroftj wrote:
      > During more thinking in the middle of the night...
      > I could do without the code for the commas. This would make it much easier for me to understand and also be easier for anyone who will be taking the time to help me with this.
      >> My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
      >>
      >> Actual data looks something like
      >>
      >> File 67817 Id.ppt
      >> File 691037 20dat.sys
      >> File 69870 Lock.doc
      >> File 705 56968.mbs
      >> File 70537 Jil.xls
      >> File 71168 Gas.jpg
    • Sheri
      Message 2 of 14 , Mar 29, 2011
        Hi Joy,

        A caret as the first character inside a character class negates the
        class. Outside of a character class it indicates BOL (beginning of the
        line). Avoid making the very first character of a pattern a caret,
        because NoteTab Pro (but not Std or Lite) sometimes advances the cursor
        before testing the regex if it is.

        Sounds like you wanted to match whole lines with a number in them, so you could do

        (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$

        Otherwise, try

        (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)

        Hope that helps. I've escaped the plus and the period in the character
        classes above, which is my habit but isn't strictly necessary. A hyphen
        in a character class does need escaping unless it is the first or last
        character of the class.

        Regards,
        Sheri
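
A minimal sketch of the difference between the two approaches, written in Python rather than NoteTab clip code. Python's `re` module has no `\K`, so the second pattern is an assumed stand-in that uses lookarounds ("no non-whitespace character on either side") in place of `(?:^|\s)\K ... (?=\s|$)`. The sample text is a shortened version of the test data from the original post.

```python
import re

# Shortened test data from the original post (trailing newline included so
# the (?=\s) lookahead can succeed on the last line).
SAMPLE = (
    "12345\nxxx456\n456xx\nwww.45.67.hmt2\n"
    "-12,345\n+12,345.01\nthere are 0 lines\n1 or -2 more.\n"
)

# The first attempt: [^\s] is a negated class, so it must consume one
# non-whitespace character in front of the digits.
ORIGINAL = re.compile(r"[^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)")

# Stand-in for the \K pattern: the lookbehind and lookahead consume nothing,
# so only the number itself is matched.
STANDALONE = re.compile(r"(?<!\S)[+-]?[0-9,.]+(?!\S)")

print(ORIGINAL.findall(SAMPLE))
# ['12345', 'x456', 't2', '-12,345', '+12,345.01', '-2']
print(STANDALONE.findall(SAMPLE))
# ['12345', '-12,345', '+12,345.01', '0', '1', '-2']
```

The first result shows the reported symptoms: `x456` and `t2` are matched because `[^\s]` swallows the `x` and the `t`, and the single digits `0` and `1` are missed because `[^\s]` uses up the only digit and leaves nothing for `\d+`.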
      • Don
        Message 3 of 14 , Mar 29, 2011
          On 3/29/2011 11:04 AM, Sheri wrote:
          > Hi Joy,
          >
          > A caret inside a character class negates the character class. Outside of
          > a character class it indicates BOL (beginning of the line). Avoid
          > making the very first character of a pattern a caret, because NoteTab
          > Pro (but not Std or Lite) sometimes advances the cursor before testing
          > the regex if it is.
          >
          > Sounds like you wanted to match whole lines with one in, so you could do
          >
          > (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$
          >
          > Otherwise, try
          >
          > (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
          >
          > Hope that helps. I've escaped the plus and the period in character
          > classes above, which is my habit but isn't strictly necessary. It is
          > necessary to escape a hyphen in a character class.
          >
          > Regards,
          > Sheri


          Hi Sheri,

          Can you explain those to us? I am trying to grow here a bit ...
          If I search and replace this:
          File 67817 Id.ppt

          Using:
          ^(.*?)\t([0-9,]+)\t(.*)
          and
          first: $1; second: $2; third $3
          I get:
          first: File; second: 67817; third Id.ppt

          If I use:
          ^(.*?\K)\t([0-9,]+)\t(.*)
          I get this:
          Filefirst: File; second: 67817; third Id.ppt
          I didn't want to capture the word File, but it is still inside the
          parentheses -- I could drop the parens around the first .*?, netting this:
          Filefirst: 67817; second: Id.ppt; third $3

          But if I use this:
          ^.*?\K\t([0-9,]+)\t(.*\K)
          I get nothing found. Why does my last \K not work for me?

          Thanks for helping me understand both what you are doing and my inferior
          attempt.

          Don
        • Sheri
          Message 4 of 14 , Mar 29, 2011
            Hi Don,

            Let me know if you have a specific question about my patterns; I don't
            see anything there that should be hard to follow. If a subpattern starts
            with ?: that makes it non-capturing, if that threw you.

            On 3/29/2011 11:27 AM, Don wrote:
            >
            > ^.*?\K\t([0-9,]+)\t(.*\K)
            > I get nothing found. why does my last \K not work for me?
            >
            > Thanks for helping me understand both what you are doing and my inferior
            > attempt.

            \K defines a split point in the pattern. Anything matched before
            the \K is discarded from the reported match. So if a pattern ends with
            \K, the most it could match would be the empty string that follows all
            the stuff that's been discarded.

            I think the only time it might make sense to have more than one \K in a
            pattern would be if they were parts of different alternatives (where
            alternatives are separated by vertical bars).

            Regards,
            Sheri
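
A rough way to see this behaviour outside NoteTab, sketched in Python. Python's `re` module does not support `\K`, so this sketch emulates it under an assumption: an empty named group `k` (a hypothetical marker, not part of the original patterns) records where `\K` would sit, and the "reported" match is taken to run from that position to the end of the real match. Tabs between the three fields are also assumed.

```python
import re

LINE = "File\t67817\tId.ppt"  # assumes the fields are tab-separated

def kept_span(match):
    # Emulates \K: the reported match starts at the empty group "k".
    return match.start("k"), match.end()

# Equivalent of ^(.*?\K)\t([0-9,]+)\t(.*) with \K replaced by (?P<k>).
m = re.search(
    r"^(?P<first>.*?(?P<k>))\t(?P<second>[0-9,]+)\t(?P<third>.*)", LINE
)
start, end = kept_span(m)
kept = LINE[start:end]  # the tab, the number, the tab, and the file name
replaced = (
    LINE[:start]
    + f"first: {m['first']}; second: {m['second']}; third {m['third']}"
    + LINE[end:]
)
print(replaced)  # Filefirst: File; second: 67817; third Id.ppt

# Equivalent of ^.*?\K\t([0-9,]+)\t(.*\K): the marker is at the very end,
# so nothing is left in the reported match.
m_end = re.search(r"^.*?\t(?P<second>[0-9,]+)\t(?P<third>.*(?P<k>))", LINE)
start, end = kept_span(m_end)
kept_end = LINE[start:end]
print(repr(kept_end))  # ''
```

The first case shows why "File" stays in front of the replacement text while still being available as a capture group: the split point only moves the start of the reported match. In the second case the reported match is empty, which is consistent with "nothing found" if the editor does not report empty matches.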
          • mycroftj
            Message 5 of 14 , Mar 30, 2011
              I'm terribly sorry for not being clear. I branched into quite a few directions at once.

              The goal was to pad all numbers in a document with spaces or zeros so they would sort correctly.

              Instead of writing a clip, I thought it might be done in one mighty regexp replace (777->000777 and 77->000077) but that would involve calculating lengths and taking decimal points into consideration so I don't see how that can be done. OR CAN IT?

              I DID want to learn how to pick out just the numbers and Sheri seems to have hit on that perfectly with
              (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
              which picks out just the numbers so I can manipulate the selected text in a clip. (Thanks again, Sheri!)

              The lines containing FILE were from actual data, whereas the rest was just a made-up assortment of numbers I was experimenting with.

              Regexps are one of the most useful things I've stumbled upon in years but SO frustrating. I'll keep trying and I really do appreciate the help from everyone.

              Joy
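
For comparison, in a language where the replacement can be computed by a function, the length calculation is straightforward. The following is only a sketch in Python, not NoteTab clip code. The function name `pad_numbers`, its `width` and `zeros` parameters, and the tab-separated sample rows are assumptions made for illustration. The number pattern is a lookaround-based stand-in for the `\K` pattern quoted above, since Python's `re` module has no `\K`.

```python
import re

# Standalone numbers: optional sign, then digits, commas, and periods,
# with whitespace or a line edge on both sides.
NUMBER = re.compile(r"(?<!\S)[+-]?[0-9,.]+(?!\S)")

def pad_numbers(text, width=6, zeros=True):
    """Left-pad every standalone number in text to a fixed width."""
    def pad(match):
        number = match.group(0)
        # zfill keeps a leading sign in front of the zeros;
        # rjust pads with spaces instead.
        return number.zfill(width) if zeros else number.rjust(width)
    return NUMBER.sub(pad, text)

rows = [
    "File\t67817\tId.ppt",
    "File\t691037\t20dat.sys",
    "File\t705\t56968.mbs",
]
for row in sorted(pad_numbers(r) for r in rows):
    print(row)
```

This pads each number to a total width and makes no attempt to line up decimal points, so mixed integer and decimal data would still need separate handling of the part after the point.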


              --- In ntb-scripts@yahoogroups.com, "John Shotsky" <jshotsky@...> wrote:
              >
              > After reading through this several times, I could not determine the actual goal. To get good assistance, you should
              > provide the starting data, the result that is wanted, and the rules you want, as well as identifying any data you don't
              > want.
              >
              > So, is the goal to sort? By numbers only? Padded numbers? Or is the actual data not of interest in your message? It
              > seems that the first part of the message and the last part are not on the same subject to me.
              >
              > Regards,
              > John
              >
              >
              > From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On Behalf Of mycroftj
              > Sent: Monday, March 28, 2011 16:10
              > To: ntb-scripts@yahoogroups.com
              > Subject: [NTS] Trying to perfect RegExp to match various numbers
              >
              >
              > My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
              >
              > Actual data looks something like
              >
              > File 67817 Id.ppt
              > File 691037 20dat.sys
              > File 69870 Lock.doc
              > File 705 56968.mbs
              > File 70537 Jil.xls
              > File 71168 Gas.jpg
              >
              > I then became interested in trying to find a regexp that will match numbers surrounded by BOL, EOL, spaces and tabs with
              > signs, decimal point and commas optional.
              >
              > In the following test data, the numbers starting with 123 should be matched as well as the integers 0, 1 and 2.
              > xxx456 and 456xx should NOT be matched.
              >
              > I have something that mostly works but it also matches x456 and the t2 in hmt2. WHY IS THAT?
              > It also misses 0, 1 and 2 although it does (correctly) pick up -2.
              >
              > I put in the caret because it was not matching numbers at the start of a line. Is that how it's done?
              >
              > Thanks for your help. I ordered the Regular Expressions Cookbook today. Hope it's as good as the reviews say!
              >
              > Joy
              >
              > What I have so far [^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)
              >
              >
              > 12345
              > 12345.678
              > 1234567.90
              > xxx456
              > 456xx
              >
              > xx456,123.34
              > www.45.67.hmt2
              >
              > 12,345
              > -12,345
              > 12,345.
              > +12,345.01
              >
              > 12345
              > +123,45.678
              > 1234567.90
              >
              > there are 0 lines
              > 1 or 2 more.
              >
              > 12345
              > -12345.678
              > 1234567.90
              > xxx456
              > +456xx
              >
              > xx456.34
              > www.45.67.hmt2
              >
              > +12345
              >
              > 12345
              > -12345.678
              > 1234567.90
              > xxx456
              > 456xx
              >
              > there are 0 lines. The zero should match as should the following one and negative two.
              > 1 or -2 more.
              >
              >
              >
              >
            • Eb
              Message 6 of 14 , Mar 30, 2011
                I recall a post by Diodeom in the Clips group, with a bit of razzle-dazzle, that might do what you want. Perhaps Dio would know what I'm talking about?

                I do not remember the topic, but I believe it had to do with sorting a table of numbers, numerically, even though the numbers were left-justified.

                Cheers

              • Alec Burgess
                Message 7 of 14 , Mar 30, 2011
                  cc ntb-clips (see note at end)

                  On 2011-03-30 15:21, mycroftj wrote:
                  > I'm terribly sorry for not being clear. I branched into quite a few
                  > directions at once.
                  >
                  > The goal was to pad all numbers in a document with spaces or zeros so
                  > they would sort correctly.
                  >
                  > Instead of writing a clip, I thought it might be done in one mighty
                  > regexp replace (777->000777 and 77->000077) but that would involve
                  > calculating lengths and taking decimal points into consideration so I
                  > don't see how that can be done. OR CAN IT?
                  >
                  >
                  Following will enforce 5 digits (zero-padded) before optional decimal
                  and 4 after

                  H=test B3-30 leading / trailing zeros
                  ; Alec Burgess 2011-03-30
                  ; currently enforces 5 digits before (optional) decimal and 4 after
                  ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2===0000" rwais
                  ^!replace "===" >> "" rwais
                  ^!replace "\b0*(\d{5})\.(\d{4})0*\b" >> "$1.$2" rwais

                  Note - I wanted the first replace to be just "00000$1.$20000" but I
                  haven't figured out how to prevent clip replace from confusing $2 with
                  $20 (i.e. a non-existent 20th or 20000th sub-pattern).

                  Does anyone know how to do this? As it stands, just make sure "===" is
                  a string that does not exist in the input.

                  Note - adding the line ^!replace "\.0+\b" >> "" rwais to the above will
                  strip a decimal part that is all zeros, decimal point included. (It
                  needs 0+ rather than 0*, because \.0*\b would also match the bare
                  point in a value like 00123.1000 and remove it.)

                  sample input
                  1
                  123
                  123.
                  123.1
                  123.123

                  resulting output
                  00001.0000
                  00123.0000
                  00123.0000.
                  00123.1000
                  00123.1230
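
The same two-stage idea (over-pad, then trim) can be sketched in Python for anyone who wants to test it outside NoteTab. This is an illustration only: the function name `pad_fixed` is made up, and `\g<2>` is Python's own replacement syntax for keeping a group number apart from the digits that follow it. It says nothing about what NoteTab's replace accepts, so the "===" placeholder step is simply not needed here.

```python
import re

def pad_fixed(text):
    # Stage 1: over-pad every number with five leading zeros, a decimal
    # point, and four trailing zeros. \g<2> keeps the group reference
    # separate from the literal zeros after it.
    padded = re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\g<1>.\g<2>0000", text)
    # Stage 2: trim back to exactly 5 digits before the point and 4 after.
    return re.sub(r"\b0*(\d{5})\.(\d{4})0*\b", r"\g<1>.\g<2>", padded)

for line in ["1", "123", "123.", "123.1", "123.123"]:
    print(line, "->", pad_fixed(line))
# 1 -> 00001.0000
# 123 -> 00123.0000
# 123. -> 00123.0000.
# 123.1 -> 00123.1000
# 123.123 -> 00123.1230
```

The stray trailing point on the third line appears because `\b` cannot match after a final ".", so the pattern backs off and matches only "123", leaving the original point in place.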

                  btw: the ntb-clips group would be a better place for this discussion than
                  ntb-scripts. As originally intended, ntb-scripts was for discussion of
                  things like using Perl and JavaScript in clip code. Its readership is
                  much smaller than that of ntb-clips, though I assume everyone who follows
                  ntb-scripts also follows ntb-clips :-)

                  Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                • Eb
                  Message 8 of 14 , Apr 1, 2011
                    Alec,

                    I'm not sure this will work, but the variable ought to break up the output pattern:

                    ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais


                    Cheers


                    Eb

                    --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                    ...
                    > Does anyone know how to do this? As is just make sure "===" is any
                    > string which does not exist in the input.
                  • Alec Burgess
                    Message 9 of 14 , Apr 1, 2011
                      On 2011-04-01 16:08, Eb wrote:
                      >
                      >
                      > I'm not sure this will work, but the variable ought to break up the
                      > output pattern:
                      >
                      > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                      It does allow the $2 to be substituted but ^%empty% does not appear to
                      get translated.
                      I get results like this:
                      45.6 ==> 0000045.6^%empty%0000
                      --
                      Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                    • Eb
                      Message 10 of 14 , Apr 4, 2011
                        Ok, try this (hex code '\x30' for the first zero):

                        $2\x30000

                        Eb


                        --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                        >
                        > > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                        > It does allow the $2 to be substituted but ^%empty% does not appear to
                        > get translated.
                        > I get results like this:
                        > 45.6 ==> 0000045.6^%empty%0000
                        > --
                        > Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                        >
                      • Alec Burgess
                        Message 11 of 14 , Apr 4, 2011
                          Thanks Eb - \x30 works.
                          When I was messing around with this I had tried the same thing, but I
                          realize now that I was trying the (meaningless) uppercase \X30 instead
                          of the correct \x30. Ooops ! :-[

                          On 2011-04-04 09:18, Eb wrote:
                          > Ok, try this (hex code '\x30' for the first zero):
                          >
                          > $2\x30000
                          >
                          > Eb
                          >
                          > --- In ntb-scripts@yahoogroups.com
                          > <mailto:ntb-scripts%40yahoogroups.com>, Alec Burgess <buralex@...> wrote:
                          > >
                          > > > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                          > > It does allow the $2 to be substituted but ^%empty% does not appear to
                          > > get translated.
                          > > I get results like this:
                          > > 45.6 ==> 0000045.6^%empty%0000

                          --
                          Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)

