
Re: Trying to perfect RegExp to match various numbers

  • mycroftj
    Message 1 of 14 , Mar 29, 2011
      During more thinking in the middle of the night...
      I could do without the code for the commas. This would make it much easier for me to understand and also be easier for anyone who will be taking the time to help me with this.

      Thanks in advance.

      Joy


      --- In ntb-scripts@yahoogroups.com, "mycroftj" <mycroftj@...> wrote:
      >
      > My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
      >
      > Actual data looks something like
      >
      > File 67817 Id.ppt
      > File 691037 20dat.sys
      > File 69870 Lock.doc
      > File 705 56968.mbs
      > File 70537 Jil.xls
      > File 71168 Gas.jpg
    • John Shotsky
      Message 2 of 14 , Mar 29, 2011
        After reading through this several times, I could not determine the actual goal. To get good assistance, you should
        provide the starting data, the result that is wanted, and the rules you want, as well as identifying any data you don't
        want.

        So, is the goal to sort? By numbers only? Padded numbers? Or is the actual data not of interest in your message? It
        seems that the first part of the message and the last part are not on the same subject to me.

        Regards,
        John


        From: ntb-scripts@yahoogroups.com [mailto:ntb-scripts@yahoogroups.com] On Behalf Of mycroftj
        Sent: Monday, March 28, 2011 16:10
        To: ntb-scripts@yahoogroups.com
        Subject: [NTS] Trying to perfect RegExp to match various numbers


        My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.

        Actual data looks something like

        File 67817 Id.ppt
        File 691037 20dat.sys
        File 69870 Lock.doc
        File 705 56968.mbs
        File 70537 Jil.xls
        File 71168 Gas.jpg

        I then became interested in trying to find a regexp that will match numbers surrounded by BOL, EOL, spaces and tabs with
        signs, decimal point and commas optional.

        In the following test data, the numbers starting with 123 should be matched as well as the integers 0, 1 and 2.
        xxx456 and 456xx should NOT be matched.

        I have something that mostly works but it also matches x456 and the t2 in hmt2. WHY IS THAT?
        It also misses 0, 1 and 2 although it does (correctly) pick up -2.

        I put in the caret because it was not matching numbers at the start of a line. Is that how it's done?

        Thanks for your help. I ordered the Regular Expressions Cookbook today. Hope it's as good as the reviews say!

        Joy

        What I have so far [^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)


        12345
        12345.678
        1234567.90
        xxx456
        456xx

        xx456,123.34
        www.45.67.hmt2

        12,345
        -12,345
        12,345.
        +12,345.01

        12345
        +123,45.678
        1234567.90

        there are 0 lines
        1 or 2 more.

        12345
        -12345.678
        1234567.90
        xxx456
        +456xx

        xx456.34
        www.45.67.hmt2

        +12345

        12345
        -12345.678
        1234567.90
        xxx456
        456xx

        there are 0 lines. The zero should match as should the following one and negative two.
        1 or -2 more.



      • Don
        Message 3 of 14 , Mar 29, 2011
          I'll be honest, I read it and did not understand. Do you just want to sort
          the middle column in that set of numbers?
          Will they always be in columns like this, and do you only want to catch the
          middle one? What is the ===== between the File=====number=====filename?
          Tabs? Spaces?
          It appears to be tabs. Are there only three things on each line?

          On 3/29/2011 8:26 AM, mycroftj wrote:
          > During more thinking in the middle of the night...
          > I could do without the code for the commas. This would make it much easier for me to understand and also be easier for anyone who will be taking the time to help me with this.
          >> My ultimate goal was to create a script that (left) space or zero pads numbers to a fixed length for sorting.
          >>
          >> Actual data looks something like
          >>
          >> File 67817 Id.ppt
          >> File 691037 20dat.sys
          >> File 69870 Lock.doc
          >> File 705 56968.mbs
          >> File 70537 Jil.xls
          >> File 71168 Gas.jpg
        • Sheri
          Message 4 of 14 , Mar 29, 2011
            Hi Joy,

            A caret at the start of a character class negates the class. Outside of
            a character class it indicates BOL (beginning of the line). Avoid
            making the very first character of a pattern a caret, because NoteTab
            Pro (but not Std or Lite) sometimes advances the cursor before testing
            the regex if it is.

            It sounds like you wanted to match whole lines that contain a number, so you could do

            (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$

            Otherwise, try

            (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)

            Hope that helps. I've escaped the plus and the period in the character
            classes above, which is my habit but isn't strictly necessary. A hyphen
            in a character class does need escaping unless it is the first or last
            character of the class.
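
For anyone trying this outside NoteTab, here is a minimal sketch in Python. It assumes the standard `re` module, which has no `\K`, so the lookarounds `(?<!\S)` and `(?!\S)` are a stand-in for `(?:^|\s)\K` and `(?=\s|$)` rather than the exact pattern above. The names `ORIGINAL`, `NUMBER` and `sample` are made up for illustration. It also shows why the first attempt caught x456 and t2: `[^\s]` means "any one non-space character", so it consumes the letter in front of the digits.

```python
import re

# The first attempt from the thread: [^\s] is a negated class, so it swallows
# one non-space character (a letter, a sign or a digit) before the number.
ORIGINAL = re.compile(r"[^\s][\-\+]?\d+,*\d*\.?\d*(?=\s)")

# Assumed stand-in for the \K pattern: Python's re has no \K, so "not
# preceded / not followed by a non-space" is written with lookarounds.
NUMBER = re.compile(r"(?<!\S)[+\-]?[0-9,.]+(?!\S)")

sample = (
    "12345\n"
    "xxx456\n"
    "456xx\n"
    "www.45.67.hmt2\n"
    "-12,345\n"
    "there are 0 lines\n"
    "1 or -2 more.\n"
)

original_hits = ORIGINAL.findall(sample)
number_hits = NUMBER.findall(sample)

print(original_hits)  # includes the unwanted 'x456' and 't2', misses 0 and 1
print(number_hits)    # only the whitespace-delimited numbers
```

One caveat of the sketch (shared with any `[0-9,.]+` class): a lone comma or period surrounded by spaces would also count as a "number".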

            Regards,
            Sheri
          • Don
            Message 5 of 14 , Mar 29, 2011
              On 3/29/2011 11:04 AM, Sheri wrote:
              > Hi Joy,
              >
              > A caret at the start of a character class negates the class. Outside of
              > a character class it indicates BOL (beginning of the line). Avoid
              > making the very first character of a pattern a caret, because NoteTab
              > Pro (but not Std or Lite) sometimes advances the cursor before testing
              > the regex if it is.
              >
              > It sounds like you wanted to match whole lines that contain a number, so you could do
              >
              > (?:^|.*?\s)[\+\-]?[0-9,\.]+(?=\s).*$
              >
              > Otherwise, try
              >
              > (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
              >
              > Hope that helps. I've escaped the plus and the period in the character
              > classes above, which is my habit but isn't strictly necessary. A hyphen
              > in a character class does need escaping unless it is the first or last
              > character of the class.
              >
              > Regards,
              > Sheri


              Hi Sheri,

              Can you explain those to us? I am trying to grow here a bit ...
              If I search and replace this:
              File 67817 Id.ppt

              Using:
              ^(.*?)\t([0-9,]+)\t(.*)
              and
              first: $1; second: $2; third $3
              I get:
              first: File; second: 67817; third Id.ppt

              If I use:
              ^(.*?\K)\t([0-9,]+)\t(.*)
              I get this:
              Filefirst: File; second: 67817; third Id.ppt
              I didn't capture the word File, but then it is still in the parenthesis
              -- I could drop the parens around the first .*? netting this:
              Filefirst: 67817; second: Id.ppt; third $3

              But if I use this:
              ^.*?\K\t([0-9,]+)\t(.*\K)
              I get nothing found. why does my last \K not work for me?

              Thanks for helping me understand both what you are doing and my inferior
              attempt.

              Don
            • Sheri
              Message 6 of 14 , Mar 29, 2011
                Hi Don,

                Let me know if you have a specific question about my patterns; I don't
                see anything there that should be hard to follow. If a subpattern starts
                with ?: that makes it non-capturing, in case that threw you.

                On 3/29/2011 11:27 AM, Don wrote:
                >
                > ^.*?\K\t([0-9,]+)\t(.*\K)
                > I get nothing found. why does my last \K not work for me?
                >
                > Thanks for helping me understand both what you are doing and my inferior
                > attempt.

                \K is for defining a split point in the pattern. Matching stuff before
                the \K is discarded. So if a pattern ends with \K, the most it could
                match would be the empty string that follows all the stuff that's been
                discarded.

                I think the only time it might make sense to have more than one \K in a
                pattern would be if they were parts of different alternatives (where
                alternatives are separated by vertical bars).
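
To make the split-point idea concrete, here is a hedged sketch in Python. The standard `re` module does not support `\K`, so the second call only emulates it: the text before the split point is captured in a group (named `kept` here, a made-up name) and written back unchanged by the replacement. The variable names and the tab-separated test line are assumptions for illustration.

```python
import re

line = "File\t67817\tId.ppt"

# Don's first search: three capture groups, all re-emitted by the replacement.
plain = re.sub(
    r"^(.*?)\t([0-9,]+)\t(.*)",
    r"first: \1; second: \2; third \3",
    line,
)

# Emulating ^.*?\K\t([0-9,]+)\t(.*): whatever precedes the split point is
# left alone, so it is captured as "kept" and put back before the new text.
# With a real \K the number and file name would be $1 and $2; in this
# emulation they are groups 2 and 3 because "kept" takes slot 1.
emulated = re.sub(
    r"^(?P<kept>.*?)\t([0-9,]+)\t(.*)",
    r"\g<kept>first: \2; second: \3",
    line,
)

print(plain)
print(emulated)
```

The second result starts with "File" glued to the replacement text, which is the same effect Don saw: the part before `\K` is not part of the match, so it is neither replaced nor available as a capture.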

                Regards,
                Sheri
              • mycroftj
                Message 7 of 14 , Mar 30, 2011
                  I'm terribly sorry for not being clear. I branched into quite a few directions at once.

                  The goal was to pad all numbers in a document with spaces or zeros so they would sort correctly.

                  Instead of writing a clip, I thought it might be done in one mighty regexp replace (777->000777 and 77->000077) but that would involve calculating lengths and taking decimal points into consideration so I don't see how that can be done. OR CAN IT?

                  I DID want to learn how to pick out just the numbers and Sheri seems to have hit on that perfectly with
                  (?:^|\s)\K[\+\-]?[0-9,\.]+(?=\s|$)
                  which picks out just the numbers so I can manipulate the selected text in a clip. (Thanks again, Sheri!)
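
On the "one mighty regexp replace" question: a fixed replacement string cannot count digits, but engines that accept a replacement function can compute the padding for each match. The following is a sketch in Python, not NoteTab clip code. `pad_numbers` and `width` are hypothetical names, and it only pads unsigned whole numbers delimited by whitespace; signs, commas and decimals are deliberately left out.

```python
import re

# Whole numbers that stand alone between whitespace (or line start/end).
WHOLE_NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")


def pad_numbers(text, width=6):
    """Left-pad every standalone whole number in text with zeros to width."""
    return WHOLE_NUMBER.sub(lambda m: m.group().zfill(width), text)


print(pad_numbers("777 and 77"))
print(pad_numbers("File 705 56968.mbs"))
```

Numbers already longer than `width` are left as they are, since `str.zfill` never truncates.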

                  The lines containing FILE were from actual data, whereas the rest was just a made-up assortment of numbers I was experimenting with.

                  Regexps are one of the most useful things I've stumbled upon in years but SO frustrating. I'll keep trying and I really do appreciate the help from everyone.

                  Joy


                • Eb
                  Message 8 of 14 , Mar 30, 2011
                    I recall a post by Diodeom in the Clips group, with a bit of razzle-dazzle, that might do what you want. Perhaps Dio would know what I'm talking about?

                    I do not remember the topic, but I believe it had to do with sorting a table of numbers, numerically, even though the numbers were left-justified.

                    Cheers

                  • Alec Burgess
                    Message 9 of 14 , Mar 30, 2011
                      cc ntb-clips (see note at end)

                      On 2011-03-30 15:21, mycroftj wrote:
                      > I'm terribly sorry for not being clear. I branched into quite a few
                      > directions at once.
                      >
                      > The goal was to pad all numbers in a document with spaces or zeros so
                      > they would sort correctly.
                      >
                      > Instead of writing a clip, I thought it might be done in one mighty
                      > regexp replace (777->000777 and 77->000077) but that would involve
                      > calculating lengths and taking decimal points into consideration so I
                      > don't see how that can be done. OR CAN IT?
                      >
                      >
                      Following will enforce 5 digits (zero-padded) before optional decimal
                      and 4 after

                      H=test B3-30 leading / trailing zeros
                      ; Alec Burgess 2011-03-30
                      ; currently enforces 5 digits before (optional) decimal and 4 after
                      ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2===0000" rwais
                      ^!replace "===" >> "" rwais
                      ^!replace "\b0*(\d{5})\.(\d{4})0*\b" >> "$1.$2" rwais

                      Note - I wanted the first replace to be just "00000$1.$20000" but I
                      haven't figured out how to prevent clip replace from confusing $2 with
                      $20 (i.e. a non-existent 20th or 20000th sub-pattern).

                      Does anyone know how to do this? As it stands, just make sure "===" is a
                      string which does not exist in the input.

                      Note - adding the line ^!replace "\.0+\b" >> "" rwais to the above will
                      remove the decimal point and zero padding from numbers with no fractional part.

                      sample input
                      1
                      123
                      123.
                      123.1
                      123.123

                      resulting output
                      00001.0000
                      00123.0000
                      00123.0000.
                      00123.1000
                      00123.1230
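
For readers who want to check the three steps outside NoteTab, here is a rough Python port, offered as a sketch rather than a replacement for the clip. `pad_fixed` and `sample` are made-up names, and the clip's `rwais` options are not modelled; Python's `re.sub` simply replaces every match in the string.

```python
import re


def pad_fixed(text):
    """Zero-pad numbers to 5 digits before the decimal point and 4 after."""
    # Step 1: prefix five zeros, then append a marker and four zeros.
    # The "===" marker keeps the group reference apart from the zeros.
    text = re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\1.\2===0000", text)
    # Step 2: drop the marker (it must not occur in the input).
    text = text.replace("===", "")
    # Step 3: keep the last five integer digits and the first four decimals.
    return re.sub(r"\b0*(\d{5})\.(\d{4})0*\b", r"\1.\2", text)


sample = "1\n123\n123.\n123.1\n123.123"
print(pad_fixed(sample))
```

The `123.` line keeps its trailing period because `\b` cannot match between the period and the end of the line, so only `123` is matched in step 1.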

                      btw: the ntb-clips group would be a better place for this discussion than
                      ntb-scripts. As originally intended, ntb-scripts was for discussion of
                      things like using Perl and JavaScript in clip code. Its readership is
                      much smaller than that of ntb-clips, though I assume everyone who follows
                      ntb-scripts also follows ntb-clips :-)

                      Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                    • Eb
                      Message 10 of 14 , Apr 1, 2011
                        Alec,

                        I'm not sure this will work, but the variable ought to break up the output pattern:

                        ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais


                        Cheers


                        Eb

                        --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                        ...
                        > Does anyone know how to do this? As is just make sure "===" is any
                        > string which does not exist in the input.
                      • Alec Burgess
                        Message 11 of 14 , Apr 1, 2011
                          On 2011-04-01 16:08, Eb wrote:
                          >
                          >
                          > I'm not sure this will work, but the variable ought to break up the
                          > output pattern:
                          >
                          > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                          It does allow the $2 to be substituted but ^%empty% does not appear to
                          get translated.
                          I get results like this:
                          45.6 ==> 0000045.6^%empty%0000
                          --
                          Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                        • Eb
                          Message 12 of 14 , Apr 4, 2011
                            Ok, try this (hex code '\x30' for the first zero):

                            $2\x30000
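
The same "group 2 followed by a literal zero" ambiguity exists in other regex flavours. As a comparison only (this is Python, not NoteTab, and the variable name `padded` is made up), Python's `re` module lets the replacement spell the group as `\g<2>` so the following zeros cannot be read as part of the group number:

```python
import re

# \g<2> names group 2 explicitly, so the zeros after it stay literal.
padded = re.sub(r"\b(\d+)\.?(\d*)\b", r"00000\1.\g<2>0000", "45.6")

print(padded)
```

With that form the "===" marker and the second replace become unnecessary in Python; whether NoteTab offers a comparable syntax is not something this sketch assumes.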

                            Eb


                            --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                            >
                            > > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                            > It does allow the $2 to be substituted but ^%empty% does not appear to
                            > get translated.
                            > I get results like this:
                            > 45.6 ==> 0000045.6^%empty%0000
                            > --
                            > Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)
                            >
                          • Alec Burgess
                            Message 13 of 14 , Apr 4, 2011
                              Thanks Eb - \x30 works.
                              When I was messing around with this I had tried the same thing, but I
                              realize now that I was trying (the meaningless) uppercase \X30 instead
                              of the correct \x30. Oops! :-[

                              On 2011-04-04 09:18, Eb wrote:
                              > Ok, try this (hex code '\x30' for the first zero):
                              >
                              > $2\x30000
                              >
                              > Eb
                              >
                              > --- In ntb-scripts@yahoogroups.com, Alec Burgess <buralex@...> wrote:
                              > >
                              > > > ^!replace "\b(\d+)\.?(\d*)\b" >> "00000$1.$2^%empty%0000" rwais
                              > > It does allow the $2 to be substituted but ^%empty% does not appear to
                              > > get translated.
                              > > I get results like this:
                              > > 45.6 ==> 0000045.6^%empty%0000

                              --
                              Regards ... Alec (buralex@gmail & WinLiveMess - alec.m.burgess@skype)

