Line frequency analysis.

  • John Fitzsimons
    Message 1 of 15 , Nov 2, 2008
      Hi,

      I start with a newsgroup list like......

      0.verizon.windows2000
      0.verizon.windowsxp
      0.verizon.windowsxp
      0.verizon.windowsxp
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      24hoursupport.helpdesk
      alt.computer
      alt.computer
      alt.computer
      alt.computer
      alt.computer
      alt.computer
      alt.computer
      alt.computer


      I want to end up with a list like......

      000001,0.verizon.windows2000
      000003,0.verizon.windowsxp
      000012,24hoursupport.helpdesk
      000008,alt.computer

      Is there an existing way/clip to do this? If not, can someone
      provide the needed code to produce this result, please?

      It doesn't matter if the numbers follow the newsgroup name, but I
      would need a delimiter of some sort between them, e.g. a comma.


      Regards, John.
    • Sheri
      Message 2 of 15 , Nov 2, 2008
        --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:

        >
        > I want to end up with a list like......
        >
        > 000001,0.verizon.windows2000
        > 000003,0.verizon.windowsxp
        > 000012,24hoursupport.helpdesk
        > 000008,alt.computer
        >
        > Is there an existing way/clip to do this ? If not then can someone
        > provide the needed code to produce this result please ?
        >

        This will do it exactly as above, but version 5+ is required:

        ^!SetScreenUpdate Off
        ^!Jump Doc_End
        ^!If ^$GetCol$>1 Next Else Skip
        ^!InsertText ^P
        ^!Jump Doc_Start
        :Loop
        ^!Find "^(.+\r\n)\1*" RS
        ^!IfError Quit
        ^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
        ^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
        ^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
        ^!Goto Loop
        :Quit
        ^!ClearVariable %count%
        ^!ClearVariable %fill%
        ;end of clip
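
For comparison outside NoteTab, the same result can be sketched in Python. This is only an illustration, not part of the clip: the names `count_runs` and `width` are made up here, and like the clip it assumes identical lines are already adjacent (i.e. the list is sorted).

```python
from itertools import groupby


def count_runs(lines, width=6):
    """Collapse consecutive identical lines into 'count,line' records."""
    records = []
    for text, run in groupby(lines):
        count = sum(1 for _ in run)
        # Zero-pad the count, like the StrFill/Calc pair in the clip.
        records.append(f"{count:0{width}d},{text}")
    return records


sample = (
    ["0.verizon.windows2000"]
    + ["0.verizon.windowsxp"] * 3
    + ["24hoursupport.helpdesk"] * 12
    + ["alt.computer"] * 8
)

for record in count_runs(sample):
    print(record)
```

With the sample list from the first message this prints the four lines John asked for.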
      • Don - HtmlFixIt.com
        Message 3 of 15 , Nov 3, 2008
          Sheri wrote:
          > --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:
          >
          >> I want to end up with a list like......
          >>
          >> 000001,0.verizon.windows2000
          >> 000003,0.verizon.windowsxp
          >> 000012,24hoursupport.helpdesk
          >> 000008,alt.computer
          >>
          >> Is there an existing way/clip to do this ? If not then can someone
          >> provide the needed code to produce this result please ?
          >>
          >
          > This will do it exactly as above, but version 5+ is required:
          >
          > ^!SetScreenUpdate Off
          > ^!Jump Doc_End
          > ^!If ^$GetCol$>1 Next Else Skip
          > ^!InsertText ^P
          > ^!Jump Doc_Start
          > :Loop
          > ^!Find "^(.+\r\n)\1*" RS
          > ^!IfError Quit
          > ^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
          > ^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
          > ^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
          > ^!Goto Loop
          > :Quit
          > ^!ClearVariable %count%
          > ^!ClearVariable %fill%
          > ;end of clip


          I have done that much less efficiently in the past.

          I'll try to break it down ...
          set screen update off just speeds life up
          jump doc end takes us to the end
          I think you are adding a blank line at the end next ... interesting way
          -- although you could have multiple blank lines at the end and you
          aren't removing them
          Loop ... finds basically anything of one character or more followed by a
          new line
          if there is an error it quits ... so if there were a blank line in the
          middle of the list, it would quit because it is less than one in length?

          PLEASE EXPLAIN THIS PART:
          It apparently counts the number of times the string occurs in the
          document? -- this part confused me a little. I assume that the \1* is
          the key part because it finds all instances of that term, highlights
          them and deletes them to replace them with your final product.
          THANKS

          I get the $1 being the () part of the find.
          From there it essentially just cycles.


          Sheri, even though his data was sorted, should we not do a sort to begin?

          What about word wrap. Could it affect your outcome at all?


          Here is how I did something similar without regex -- I bet yours is faster.
          In my case I am taking one element out of a delimited list vs his
          example that has just one element, ie, the entire line. Here is mine:

          :NewTeam
          ;first time set team
          ^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
          ;^!Info ^%GrabField2%
          ^!Set %Team%=^$GetSelection$
          ^!Set %TeamCount%=0
          ;^!Info ^%TeamCount%
          :Loop
          ;get team for this line
          ^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
          ;be sure that this line is for current team
          ;if not, go to ProcessTeam
          ;otherwise continue here
          ^!If "^%Team%" <> "^$GetSelection$" ProcessTeam
          ^!Set %TeamCount%=^$Calc(^%TeamCount%+1;0)$
          ;^!Info ^%TeamCount%
          ^!Jump +1
          ^!GoTo Loop

          :ProcessTeam
          ;There is more here that does something with the info
          ^!GoTo NewTeam
        • Sheri
          Message 4 of 15 , Nov 3, 2008
            --- In ntb-clips@yahoogroups.com, "Don - HtmlFixIt.com" <don@...> wrote:
            >
            > Sheri wrote:
            > > --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@> wrote:
            > >
            > >> I want to end up with a list like......
            > >>
            > >> 000001,0.verizon.windows2000
            > >> 000003,0.verizon.windowsxp
            > >> 000012,24hoursupport.helpdesk
            > >> 000008,alt.computer
            > >>
            > >> Is there an existing way/clip to do this ? If not then can
            > >> someone provide the needed code to produce this result please ?
            > >>
            > >
            > > This will do it exactly as above, but version 5+ is required:
            > >
            > > ^!SetScreenUpdate Off
            > > ^!Jump Doc_End
            > > ^!If ^$GetCol$>1 Next Else Skip
            > > ^!InsertText ^P
            > > ^!Jump Doc_Start
            > > :Loop
            > > ^!Find "^(.+\r\n)\1*" RS
            > > ^!IfError Quit
            > > ^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
            > > ^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
            > > ^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
            > > ^!Goto Loop
            > > :Quit
            > > ^!ClearVariable %count%
            > > ^!ClearVariable %fill%
            > > ;end of clip
            >
            >
            > I have done that much less efficiently in the past.
            >
            > I'll try to break it down ...
            > set screen update off just speeds life up
            > jump doc end takes us to the end

            > I think you are adding a blank line at the end next ...
            > interesting way -- although you could have multiple blank lines
            > at the end and you aren't removing them

            Doesn't matter to the clip if there are multiple blank lines at the
            end -- only thing that matters is, the last line with content needs to
            have a CRLF aka ^%NL% at the end of it.

            > Loop ... finds basically anything of one character or more
            > followed by a new line

            > if there is an error it quits ... so if there were a blank line
            > in the middle of the list, it would quit because it is less than
            > one in length?

            Find finds next, it skips what doesn't match. So a blank line in the
            middle is skipped over. Only problem with a blank line in the middle
            would be if it occurs in the middle of repeated content. Each set of
            repeated content needs to be consecutive, so a blank line would cause
            multiple count outputs for that content/line.

            >
            > PLEASE EXPLAIN THIS PART:
            > It apparently counts the number of times the string occurs in the
            > document? -- this part confused me a little. I assume that the
            > \1* is the key part because it finds all incidents of that term,
            > highlights them and deletes them to replace them with your final
            > product.

            \1 Matches the same thing that matched for substring 1, i.e., the part
            in parentheses (.+\r\n)

            IOW, a repetition of the whole line.

            The asterisk after \1 says it can match zero or more times.

            After the find, the repeated lines are selected. So to find the number
            of repetitions, I just count the number of ^%NL%'s in the selection
            (aka highlight).

            >
            > I get the $1 being the () part of the find.
            > From there it essentially just cycles.
            >
            >
            > Sheri, even though his data was sorted, should we not do a sort
            > to begin?

            To do its job, the repetitions need to be consecutive, so unless
            already sorted, yes.

            > What about word wrap. Could it affect your outcome at all?

            No, word wrap does not add ^%NL%'s

            >
            >
            > Here is how I did something similar without regex -- I bet yours
            > is faster. In my case I am taking one element out of a delimited
            > list vs his example that has just one element, ie, the entire
            > line.

            > Here is mine:

            > :NewTeam
            > ;first time set team
            > ^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
            > ;^!Info ^%GrabField2%
            > ^!Set %Team%=^$GetSelection$
            > ^!Set %TeamCount%=0
            > ;^!Info ^%TeamCount%
            > :Loop
            > ;get team for this line
            > ^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
            > ;be sure that this line is for current team
            > ;if not, go to ProcessTeam
            > ;otherwise continue here
            > ^!If "^%Team%" <> "^$GetSelection$" ProcessTeam
            > ^!Set %TeamCount%=^$Calc(^%TeamCount%+1;0)$
            > ;^!Info ^%TeamCount%
            > ^!Jump +1
            > ^!GoTo Loop
            >
            > :ProcessTeam
            > ;There is more here that does something with the info
            > ^!GoTo NewTeam
            >

            That looks fine; John may prefer it since he was still using 4.95 in a
            previous posting this year. Setting screenupdate off might make it faster.

            John, there is a free Light version of 5.7b (the latest version) that
            includes clipcode and regex capability.

            Regards,
            Sheri
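
The backreference explanation above can be tried out in Python, whose `re` module handles `\1` the same way. This is a sketch under assumptions: `RUN` and `runs_with_counts` are names invented for the example, and `[^\r\n]` is used in place of `.` because Python's dot would otherwise swallow the CR.

```python
import re

# ^(.+\r\n)\1* : one whole line including its CRLF, then zero or more
# immediate repetitions of exactly that line.
RUN = re.compile(r"^([^\r\n]+\r\n)\1*", re.MULTILINE)


def runs_with_counts(text):
    """Return (line, repetitions) for each run of identical adjacent lines."""
    # The number of CRLFs inside the match equals the number of lines in it.
    return [
        (m.group(1).rstrip("\r\n"), m.group(0).count("\r\n"))
        for m in RUN.finditer(text)
    ]


# A blank line between runs is simply skipped over.
print(runs_with_counts("a.b\r\nc.d\r\nc.d\r\nc.d\r\n\r\ne.f\r\n"))

# A blank line inside a run splits it into two separate counts.
print(runs_with_counts("x\r\nx\r\n\r\nx\r\n"))

# A last line without a CRLF is not matched, which is why the clip
# first makes sure the document ends with a line break.
print(runs_with_counts("a\r\na"))
```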
          • Alec Burgess
            Message 5 of 15 , Nov 3, 2008
              Sheri (silvermoonwoman@...) wrote (in part) (on 2008-11-03 at
              08:15):
              > That looks fine, John may prefer it since he was still using 4.95 in a
              > previous posting this year. Setting screenupdate off might make it
              > faster.

              Sheri -

              Version 5.7 of Notetab Standard $29.95 US
              Version 5.7 of Notetab Pro ......... $19.95 US
              Version 5.7 of Notetab Light ...... free
              Keeping track of which version every past poster to yahoo-groups is
              using ... Priceless!

              > John, there is a free Light version of 5.7b (latest version) that
              > includes clipcode and regex capability.



            • John Fitzsimons
              Message 6 of 15 , Nov 3, 2008
                On Mon, 03 Nov 2008 04:48:45 -0000, Sheri wrote:

                >--- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:

                Hi Sheri,

                >> I want to end up with a list like......

                >> 000001,0.verizon.windows2000
                >> 000003,0.verizon.windowsxp
                >> 000012,24hoursupport.helpdesk
                >> 000008,alt.computer

                >> Is there an existing way/clip to do this ? If not then can someone
                >> provide the needed code to produce this result please ?

                >This will do it exactly as above, but version 5+ is required:

                Thanks for the mention. I downloaded 5+ to try it out.

                >^!SetScreenUpdate Off
                >^!Jump Doc_End
                >^!If ^$GetCol$>1 Next Else Skip
                >^!InsertText ^P
                >^!Jump Doc_Start
                >:Loop
                >^!Find "^(.+\r\n)\1*" RS
                >^!IfError Quit
                >^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
                >^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
                >^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
                >^!Goto Loop
                >:Quit
                >^!ClearVariable %count%
                >^!ClearVariable %fill%
                >;end of clip

                Excellent ! I started with a file with more than 200K lines and it did
                the job very very quickly. Many thanks.

                I did a quick check of some of the totals and although they didn't all
                match the "Occurrences" I did with S/R they were very close.
                Certainly fine for what I wanted.

                You are very clever for having done that. I wish I were as smart.
                I have done programming BUT reg exps still intimidate me greatly.


                Regards, John.
              • John Fitzsimons
                Message 7 of 15 , Nov 3, 2008
                  On Mon, 03 Nov 2008 07:01:02 -0500, Don - HtmlFixIt.com wrote:

                  Hi Don,

                  < snip >

                  >Here is how I did something similar without regex -- I bet yours is faster.
                  >In my case I am taking one element out of a delimited list vs his
                  >example that has just one element, ie, the entire line. Here is mine:

                  >:NewTeam
                  >;first time set team
                  >^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
                  >;^!Info ^%GrabField2%
                  >^!Set %Team%=^$GetSelection$
                  >^!Set %TeamCount%=0
                  >;^!Info ^%TeamCount%
                  >:Loop
                  >;get team for this line
                  >^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
                  >;be sure that this line is for current team
                  >;if not, go to ProcessTeam
                  >;otherwise continue here
                  >^!If "^%Team%" <> "^$GetSelection$" ProcessTeam
                  >^!Set %TeamCount%=^$Calc(^%TeamCount%+1;0)$
                  >;^!Info ^%TeamCount%
                  >^!Jump +1
                  >^!GoTo Loop

                  >:ProcessTeam
                  >;There is more here that does something with the info
                  >^!GoTo NewTeam

                  Thanks. I did give it a go in 4.95 but unfortunately it didn't finish.
                  It took about half an hour to go through my text file and then kept
                  doing something (without producing any output) for another half
                  an hour before I stopped it.

                  Regards, John.
                • Sheri
                  Message 8 of 15 , Nov 4, 2008
                    --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:

                    > I did a quick check of some of the totals and although they didn't all
                    > match the "Occurrences" I did with S/R they were very close.
                    > Certainly fine for what I wanted.

                    Can you explain the differences? Might there be spaces at the end of
                    some lines or something?

                    Regards,
                    Sheri
                  • Flo
                    Message 9 of 15 , Nov 5, 2008
                      --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                      >
                      > --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@> wrote:
                      >
                      > >
                      > > I want to end up with a list like......
                      > >
                      > > 000001,0.verizon.windows2000
                      > > 000003,0.verizon.windowsxp
                      > > 000012,24hoursupport.helpdesk
                      > > 000008,alt.computer
                      > >
                      > > Is there an existing way/clip to do this ? If not then can someone
                      > > provide the needed code to produce this result please ?
                      > >
                      >
                      > This will do it exactly as above, but version 5+ is required:
                      >
                      > ^!SetScreenUpdate Off
                      > ^!Jump Doc_End
                      > ^!If ^$GetCol$>1 Next Else Skip
                      > ^!InsertText ^P
                      > ^!Jump Doc_Start
                      > :Loop
                      > ^!Find "^(.+\r\n)\1*" RS
                      > ^!IfError Quit
                      > ^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
                      > ^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
                      > ^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
                      > ^!Goto Loop
                      > :Quit
                      > ^!ClearVariable %count%
                      > ^!ClearVariable %fill%
                      > ;end of clip


                      Sheri,

                      Just another idea concerning your solution:

                      In the past, we have used that pattern...

                      ^(.+\r\n)\1*

                      quite often for finding duplicate lines. I think we could also use...

                      ^(.+)(\r\n\1)*

                      The advantage is that, with this pattern, we don't have to worry
                      about a CRLF at the end of the list.

                      In your clip, a final NL is (also) needed for counting the selected
                      (duplicate) lines. But, instead of counting the NLs, we could
                      write...

                      ^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$

                      So I think the clip could be slightly shortened like this...


                      ^!Jump Doc_Start
                      :Loop
                      ^!Find "^(.+)(\r\n\1)*" RS
                      ^!IfError End
                      ^!Set %Count%=^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
                      ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
                      ^!Replace "(.+)(\r\n\1)*" >> "^$StrFill("0";^%Fill%)$^%Count%,$1" RS
                      ^!Goto Loop

                      The only minor disadvantage: Empty lines within the list will produce
                      wrong results. So if there are any, we have to remove them with an
                      additional command line.

                      Do you agree with this solution?

                      Regards,
                      Flo
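
The variant pattern can be checked in Python as well. This is a sketch, not a NoteTab test: `FLO`, `GUARDED` and `count_matches` are names made up for the example, and `[^\r\n]` again stands in for `.`. One thing it brings out is that `\1` without a following line break will also match the start of a longer line, so the second pattern adds a lookahead as one possible guard. Whether NoteTab behaves identically has not been verified here.

```python
import re

# Flo's pattern: the line break sits in front of each repetition, so no
# trailing CRLF is needed after the last line.
FLO = re.compile(r"^([^\r\n]+)(?:\r\n\1)*", re.MULTILINE)

# Hypothetical guard: each repetition must be followed by a line break or
# the end of the text, so "alt.computer" cannot match inside a longer name.
GUARDED = re.compile(r"^([^\r\n]+)(?:\r\n\1(?![^\r\n]))*", re.MULTILINE)


def count_matches(pattern, text):
    """Return (line, count); count is the number of line breaks plus one."""
    return [(m.group(1), m.group(0).count("\r\n") + 1) for m in pattern.finditer(text)]


# Works without a CRLF after the final line.
print(count_matches(FLO, "a\r\na\r\nb"))

# A line that is a prefix of the next line is over-counted...
tricky = "alt.computer\r\nalt.computer\r\nalt.computer.hardware"
print(count_matches(FLO, tricky))

# ...unless each repetition has to end at a line boundary.
print(count_matches(GUARDED, tricky))
```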
                       
                    • Sheri
                      Message 10 of 15 , Nov 5, 2008
                        --- In ntb-clips@yahoogroups.com, "Flo" <flo.gehrke@...> wrote:
                        >
                        > Sheri,
                        >
                        > Just another idea concerning your solution:
                        >
                        > In the past, we have used that pattern...
                        >
                        > ^(.+\r\n)\1*
                        >
                        > quite often for finding duplicate lines. I think we could also use...
                        >
                        > ^(.+)(\r\n\1)*
                        >
                        > The advantage is that, with this pattern, we don't have to care
                        > for CRNL at the end of the list. In your clip, a final NL is
                        > (also) needed for counting the selected (duplicate) lines. But,
                        > instead of calculating the NL, we could write...
                        >
                        > ^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
                        >
                        > So I think the clip could be slightly shortened like this...
                        >
                        >
                        > ^!Jump Doc_Start
                        > :Loop
                        > ^!Find "^(.+)(\r\n\1)*" RS
                        > ^!IfError End
                        > ^!Set %Count%=^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
                        > ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
                        > ^!Replace "(.+)(\r\n\1)*" >> "^$StrFill("0";^%Fill%)$^%Count%,$1" RS
                        > ^!Goto Loop
                        >
                        > The only minor disadvantage: Empty lines within the list will
                        > provide wrong results. So if there are any, we have to remove
                        > them with an additional command line.
                        >
                        > Do you agree with this solution?
                        >
                        > Regards,
                        > Flo
                        >  
                        >

                        That works fine. Here's one that ignores trailing white space and
                        empty lines in the data (but goes back to counting strings for count):

                        ^!SetScreenUpdate off
                        ^!Jump Doc_Start
                        :Loop
                        ^!Find "^(.+)(\s*\r\n\1)*" RS
                        ^!IfError Out
                        ^!SetArray %farray%=^$GetReSubstrings$
                        ^!Set %Count%=^$StrCount("^%NL%";"^$GetDocReplaceAll("\s*(\r\n)+|(?<=.)\z";"\r\n")$";YES;YES)$
                        ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
                        ^!InsertText ^$StrFill("0";^%fill%)$^%Count%,^%farray1%
                        ^!Goto Loop
                        :Out
                        ^!Set %farray%=""
                        ^!ClearVariable %farray%
                        ^!ClearVariable %Count%
                        ^!ClearVariable %Fill%
                        ;end of clip
                      • Flo
                        Message 11 of 15 , Nov 6, 2008
                          --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                          >
                          > That works fine. Here's one that ignores trailing white space and
                          > empty lines in the data (but goes back to counting strings for count):
                          >
                          > ^!SetScreenUpdate off
                          > ^!Jump Doc_Start
                          > :Loop
                          > ^!Find "^(.+)(\s*\r\n\1)*" RS
                          > ^!IfError Out
                          > ^!SetArray %farray%=^$GetReSubstrings$
                          > ;begin long line
                          > ^!Set
                          > %Count%=^$StrCount("^%NL%";"^$GetDocReplaceAll("\s*(\r\n)+|(?<=.)\z";"\r\n")$";YES;YES)$
                          > ;end long line
                          > ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
                          > ^!InsertText ^$StrFill("0";^%fill%)$^%Count%,^%farray1%
                          > ^!Goto Loop
                          > :Out
                          > ^!Set %farray%=""
                          > ^!ClearVariable %farray%
                          > ^!ClearVariable %Count%
                          > ^!ClearVariable %Fill%
                          > ;end of clip

                          Sheri,

                          I think there's a problem with this solution. It ignores trailing
                          blanks and empty lines in the counting of duplicate lines but not in
                          the ^!Find command. So if we have got...

                          0.verizon.windowsxp <-- trailing blank
                          0.verizon.windowsxp
                          0.verizon.windowsxp

                          for example, the clip doesn't find three duplicates but interprets
                          this as a single line plus two duplicates of another line. That is,
                          it works fine only on the condition that all three duplicates end
                          with a trailing blank.

                          So wouldn't it be better to remove trailing blanks prior to
                          ^!Find?

                          Regards,
                          Flo

                          P.S. By the way: It makes no difference -- but what about...

                          ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$

                          (since NT v5.0, \s matches "any white space", including CRLF).
                           
                        • Sheri
                          Message 12 of 15 , Nov 6, 2008
                            Flo wrote:
                            > Sheri,
                            >
                            > I think there's a problem with this solution. It ignores trailing
                            > blanks and empty lines in the counting of duplicate lines but not in
                            > the ^!Find command. So if we have got...
                            >
                            > 0.verizon.windowsxp <-- trailing blank
                            > 0.verizon.windowsxp
                            > 0.verizon.windowsxp
                            >
                            > for example, the clip doesn't find three duplicates but interprets
                            > this as a singular line plus two duplicates of another line. That is,
                            > it works fine only on the condition that all three duplicates end
                            > with a trailing blank.
                            >
                            You're right.
                            > So it may be better to remove trailing blanks prior to ^!Find. Isn't
                            > it?
                            >
                            Not necessary if we change the ^!Find to this:

                            ^!Find "^(.+\S)(\s*\r\n\1)*" RS

                            > P.S. By the way: It makes no difference -- but what about...
                            >
                            > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
                            >
                            > (since NT v.5.0 the \s matches "any white space", including CRNL).
                            >
                            Haven't tested that and it may work fine. However it looks suspicious
                            because: we are using the PCRE multiline option by default in NoteTab so
                            $ matches at line ends (and line ends by definition are followed by \r\n
                            when showing in a NoteTab document) but \s+ matches across line breaks.
                            So how can it match $ if it picks up all the \r\n's? In order to match $
                            it would need to backtrack. It might create an issue if it backtracks
                            between the \r and \n.

                            I'll try to test it later.

                            Regards,
                            Sheri
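
The backtracking concern about `\s+$` can be looked at outside NoteTab with Python's `re` module, where `$` in multiline mode likewise matches just before a line feed. This is only a sketch: Python is not NoteTab's PCRE build, and the variable names are invented for the example. The second substitution uses the first branch of the earlier `\s*(\r\n)+` pattern for comparison.

```python
import re

text = "abc \r\ndef\r\n"

# \s+ first grabs " \r\n", cannot satisfy $ there, and backs off to " \r";
# $ then matches before the \n, so the CR is consumed but the LF is left.
with_dollar = re.sub(r"\s+$", "\r\n", text, flags=re.MULTILINE)
print(repr(with_dollar))

# Requiring the complete CRLF in the pattern keeps the pair together.
with_crlf = re.sub(r"\s*(?:\r\n)+", "\r\n", text)
print(repr(with_crlf))
```

In this Python run the first result contains a stray extra line feed after the first line, while the second keeps clean CRLF line ends.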
                          • Sheri
                            Message 13 of 15 , Nov 6, 2008
                              Sheri wrote:
                              >
                              > Not necessary if we change the ^!Find to this:
                              >
                              > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
                              >
                              Hmm, maybe it should be

                              ^!Find "^(.*\S)(\s*\r\n\1)*" RS

                              just in case there is only one visible character on the line.
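
A small sketch of that difference, again assuming Python's re module as a stand-in for NoteTab's PCRE engine (the sample text is made up):

```python
import re

# .+\S needs at least two characters before the line break;
# .*\S is satisfied by a single visible character.
PLUS = re.compile(r"^(.+\S)(\s*\r\n\1)*", re.MULTILINE)
STAR = re.compile(r"^(.*\S)(\s*\r\n\1)*", re.MULTILINE)

text = "x\r\nx\r\nx\r\n"   # three duplicate one-character lines

plus_hits = [m.group(0) for m in PLUS.finditer(text)]
star_hits = [m.group(0) for m in STAR.finditer(text)]

print(plus_hits)   # []
print(star_hits)   # ['x\r\nx\r\nx']
```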

                              Regards,
                              Sheri
                            • Sheri
                              Message 14 of 15 , Nov 7, 2008
                                --- In ntb-clips@yahoogroups.com, Sheri <silvermoonwoman@...> wrote:
                                >
                                > Flo wrote:
                                > > Sheri,
                                > >
                                > > I think there's a problem with this solution. It ignores trailing
                                > > blanks and empty lines in the counting of duplicate lines but not in
                                > > the ^!Find command. So if we have got...
                                > >
                                > > 0.verizon.windowsxp <-- trailing blank
                                > > 0.verizon.windowsxp
                                > > 0.verizon.windowsxp
                                > >
                                > > for example, the clip doesn't find three duplicates but
                                > > interprets this as a singular line plus two duplicates of another
                                > > line. That is, it works fine only on the condition that all three
                                > > duplicates end with a trailing blank.
                                > >
                                > You're right.
                                > > So it may be better to remove trailing blanks prior to ^!Find.
                                > > Isn't it?
                                > >
                                > Not necessary if we change the ^!Find to this:
                                >
                                > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
                                >
                                > > P.S. By the way: It makes no difference -- but what about...
                                > >
                                > > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
                                > >
                                > > (since NT v.5.0 the \s matches "any white space", including
                                > > CRLF).
                                > >
                                > Haven't tested that and it may work fine. However it looks
                                > suspicious because: we are using the PCRE multiline option by
                                > default in NoteTab so $ matches at line ends (and line ends by
                                > definition are followed by \r\n when showing in a NoteTab
                                > document) but \s+ matches across line breaks. So how can it match
                                > $ if it picks up all the \r\n's? In order to match $ it would
                                > need to backtrack. It might create an issue if it backtracks
                                > between the \r and \n.
                                >
                                > I'll try to test it later.

                                Indeed, that pattern matches in the middle of \r\n, so after
                                replacement we end up with \r\n\n. It makes no difference to the
                                outcome because we are only counting "\r\n" in the string that
                                ^$GetDocReplaceAll$ returns. You would see a problem if that string
                                were inserted into the document. (Something else happens on
                                insertion, however: \r\n\n becomes \r\n\r\n, because the input
                                control needs line breaks to be \r\n. You can check this by comparing
                                the string size before insertion with the size of the selection
                                after insertion.)

                                I think it is preferable to greedily replace white space that precedes
                                "\r\n" with "". The white space does include other CRLFs.
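
Both behaviours can be sketched in Python, assuming its re module behaves like NoteTab's PCRE engine on this point. Python spells PCRE's \z as \Z, and the lookahead pattern \s+(?=\r\n) is only one possible reading of "white space that precedes \r\n", not the clip's actual code.

```python
import re

doc = "abc \r\ndef\r\n"   # made-up sample: trailing blank on the first line

# Flo's pattern. \s+ first swallows " \r\n", then backtracks until $ can
# match, which is just before the \n, so the match ends between \r and \n.
flo = re.sub(r"\s+$|(?<=.)\Z", "\r\n", doc, flags=re.MULTILINE)
print(repr(flo))          # 'abc\r\n\ndef\r\n'
print(flo.count("\r\n"))  # 2, still one "\r\n" per line

# Alternative: delete white space (including earlier CRLFs) that is
# followed by a CRLF, leaving each final CRLF intact.
tidy = re.sub(r"\s+(?=\r\n)", "", doc)
print(repr(tidy))         # 'abc\r\ndef\r\n'
```

The stray \n in the first result does not change the "\r\n" count, which is why the line totals come out the same either way.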

                                Regards,
                                Sheri
                              • Flo
                                Message 15 of 15 , Nov 8, 2008
                                  --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                                  >
                                  > Indeed that pattern (Flo: ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$)
                                  > matches in the middle of \r\n. So after replacement, we end up with
                                  > \r\n\n...The reason it makes no difference...

                                  Yes, I can see now what happens here. I think your explanation is in
                                  better accordance with the PCRE Documentation than the NoteTab Help
                                  on RegEx. The latter says: "$ assert end of string (or line, in
                                  multiline mode)".

                                  The PCRESYNTAX Documentation from PCRE 7.7 is more detailed: "$ end
                                  of subject, also before newline at end of subject, also before
                                  internal newline in multiline mode."

                                  Thanks again, Sheri! I can clearly see the difference now and
                                  why "^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$" doesn't affect the
                                  result.

                                  Flo
                                   