Re: [Clip] Re: Line frequency analysis.

  • John Fitzsimons
    Message 1 of 15, Nov 3, 2008
      On Mon, 03 Nov 2008 04:48:45 -0000, Sheri wrote:

      >--- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:

      Hi Sheri,

      >> I want to end up with a list like......

      >> 000001,0.verizon.windows2000
      >> 000003,0.verizon.windowsxp
      >> 000012,24hoursupport.helpdesk
      >> 000008,alt.computer

      >> Is there an existing way/clip to do this ? If not then can someone
      >> provide the needed code to produce this result please ?

      >This will do it exactly as above, but version 5+ is required:

      Thanks for the mention. I downloaded 5+ to try it out.

      >^!SetScreenUpdate Off
      >^!Jump Doc_End
      >^!If ^$GetCol$>1 Next Else Skip
      >^!InsertText ^P
      >^!Jump Doc_Start
      >:Loop
      >^!Find "^(.+\r\n)\1*" RS
      >^!IfError Quit
      >^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
      >^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
      >^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
      >^!Goto Loop
      >:Quit
      >^!ClearVariable %count%
      >^!ClearVariable %fill%
      >;end of clip

      Excellent! I started with a file with more than 200K lines and it did
      the job very quickly. Many thanks.

      I did a quick check of some of the totals and although they didn't all
      match the "Occurrences" I did with S/R they were very close.
      Certainly fine for what I wanted.

      You are very clever for having done that. I wish I were as smart.
      I have done programming BUT reg exps still intimidate me greatly.


      Regards, John.
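For readers who want to check the counting logic outside NoteTab, here is a minimal Python sketch of what the clip above appears to do: collapse each run of identical adjacent lines into one line prefixed with a six-digit, zero-padded count. The function name `count_adjacent_runs` and the `width` parameter are hypothetical, and the sketch assumes the input is already sorted and has no blank lines. Only adjacent repeats are counted, so totals for unsorted data would differ from a global occurrence count.

```python
from itertools import groupby


def count_adjacent_runs(text: str, width: int = 6) -> str:
    """Collapse runs of identical adjacent lines into 'COUNT,line' rows."""
    lines = text.splitlines()  # splits on \r\n as well as \n
    rows = []
    for line, group in groupby(lines):
        count = len(list(group))  # length of this run of equal lines
        rows.append(f"{count:0{width}d},{line}")
    return "\r\n".join(rows)


sample = "alt.computer\r\nalt.test\r\nalt.test\r\nalt.test\r\n"
result = count_adjacent_runs(sample)
```

This is not a translation of the clip, only an illustration of the expected output format.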
    • John Fitzsimons
      Message 2 of 15, Nov 3, 2008
        On Mon, 03 Nov 2008 07:01:02 -0500, Don - HtmlFixIt.com wrote:

        Hi Don,

        < snip >

        >Here is how I did something similar without regex -- I bet yours is faster.
        >In my case I am taking one element out of a delimited list vs his
        >example that has just one element, ie, the entire line. Here is mine:

        >:NewTeam
        >;first time set team
        >^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
        >;^!Info ^%GrabField2%
        >^!Set %Team%=^$GetSelection$
        >^!Set %TeamCount%=0
        >;^!Info ^%TeamCount%
        >:Loop
        >;get team for this line
        >^!Set %GrabField2%=^$GetField(^$GetRow$;^%TeamField%)$
        >;be sure that this line is for current team
        >;if not, go to ProcessTeam
        >;otherwise continue here
        >^!If "^%Team%" <> "^$GetSelection$" ProcessTeam
        >^!Set %TeamCount%=^$Calc(^%TeamCount%+1;0)$
        >;^!Info ^%TeamCount%
        >^!Jump +1
        >^!GoTo Loop

        >:ProcessTeam
        >;There is more here that does something with the info
        >^!GoTo NewTeam

        Thanks. I did give it a go in 4.95 but unfortunately it didn't finish.
        It took about half an hour to go through my text file and then kept
        doing something (without producing any output) for another half
        hour before I stopped it.

        Regards, John.
      • Sheri
        Message 3 of 15, Nov 4, 2008
          --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@...> wrote:

          > I did a quick check of some of the totals and although they didn't all
          > match the "Occurrences" I did with S/R they were very close.
          > Certainly fine for what I wanted.

          Can you explain the differences? Might there be spaces at the end of
          some lines or something?

          Regards,
          Sheri
        • Flo
          Message 4 of 15, Nov 5, 2008
            --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
            >
            > --- In ntb-clips@yahoogroups.com, John Fitzsimons <johnf@> wrote:
            >
            > >
            > > I want to end up with a list like......
            > >
            > > 000001,0.verizon.windows2000
            > > 000003,0.verizon.windowsxp
            > > 000012,24hoursupport.helpdesk
            > > 000008,alt.computer
            > >
            > > Is there an existing way/clip to do this ? If not then can someone
            > > provide the needed code to produce this result please ?
            > >
            >
            > This will do it exactly as above, but version 5+ is required:
            >
            > ^!SetScreenUpdate Off
            > ^!Jump Doc_End
            > ^!If ^$GetCol$>1 Next Else Skip
            > ^!InsertText ^P
            > ^!Jump Doc_Start
            > :Loop
            > ^!Find "^(.+\r\n)\1*" RS
            > ^!IfError Quit
            > ^!Set %count%=^$StrCount("^%NL%";"^$GetSelection$";Yes;Yes)$
            > ^!Set %fill%=^$Calc(6-^$StrSize(^%count%)$)$
            > ^!Replace "(.+\r\n)\1*" >> "^$StrFill("0";^%fill%)$^%count%,$1" RHS
            > ^!Goto Loop
            > :Quit
            > ^!ClearVariable %count%
            > ^!ClearVariable %fill%
            > ;end of clip


            Sheri,

            Just another idea concerning your solution:

            In the past, we have used that pattern...

            ^(.+\r\n)\1*

            quite often for finding duplicate lines. I think we could also use...

            ^(.+)(\r\n\1)*

            The advantage is that, with this pattern, we don't have to care about
            a CRLF at the end of the list.

            In your clip, a final NL is (also) needed for counting the selected
            (duplicate) lines. But, instead of calculating the NL, we could
            write...

            ^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$

            So I think the clip could be slightly shortened like this...


            ^!Jump Doc_Start
            :Loop
            ^!Find "^(.+)(\r\n\1)*" RS
            ^!IfError End
            ^!Set %Count%=^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
            ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
            ^!Replace "(.+)(\r\n\1)*" >> "^$StrFill("0";^%Fill%)$^%Count%,$1" RS
            ^!Goto Loop

            The only minor disadvantage: Empty lines within the list will provide
            wrong results. So if there are any, we have to remove them with an
            additional command line.

            Do you agree with this solution?

            Regards,
            Flo
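The effect of moving the line break inside the repeated group can be tried outside NoteTab with a small Python sketch. It is an approximation under stated assumptions: Python's `.` matches `\r`, so `[^\r\n]+` stands in for `.+`, and the count is taken from the number of `\r\n` in the match plus one instead of from `^$GetRowEnd$` and `^$GetRowStart$`. The names `RUN` and `tally` are hypothetical.

```python
import re

# Analogue of ^(.+)(\r\n\1)* with [^\r\n]+ standing in for .+
RUN = re.compile(r"^([^\r\n]+)(\r\n\1)*", re.MULTILINE)


def tally(text: str) -> str:
    """Replace each run of identical lines with 'NNNNNN,line'."""
    def repl(match: re.Match) -> str:
        # One line per \r\n inside the match, plus the first line.
        count = match.group(0).count("\r\n") + 1
        return f"{count:06d},{match.group(1)}"
    return RUN.sub(repl, text)


# No CRLF after the last line, and the final run is still counted.
no_final_break = tally("a\r\nb\r\nb\r\nb")
```

One thing the sketch makes visible: because nothing anchors the end of `\1`, a line that merely starts with the previous line's text is partly absorbed. In this Python version, `tally("abc\r\nabcd")` returns `000002,abcd`. Whether NoteTab's engine behaves the same way has not been checked here.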
             
          • Sheri
            Message 5 of 15, Nov 5, 2008
              --- In ntb-clips@yahoogroups.com, "Flo" <flo.gehrke@...> wrote:
              >
              > Sheri,
              >
              > Just another idea concerning your solution:
              >
              > In the past, we have used that pattern...
              >
              > ^(.+\r\n)\1*
              >
              > quite often for finding duplicate lines. I think we could also use...
              >
              > ^(.+)(\r\n\1)*
              >
              > The advantage is that, with this pattern, we don't have to care
              > for CRLF at the end of the list. In your clip, a final NL is
              > (also) needed for counting the selected (duplicate) lines. But,
              > instead of calculating the NL, we could write...
              >
              > ^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
              >
              > So I think the clip could be slightly shortened like this...
              >
              >
              > ^!Jump Doc_Start
              > :Loop
              > ^!Find "^(.+)(\r\n\1)*" RS
              > ^!IfError End
              > ^!Set %Count%=^$Calc(1+^$GetRowEnd$-^$GetRowStart$)$
              > ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
              > ^!Replace "(.+)(\r\n\1)*" >> "^$StrFill("0";^%Fill%)$^%Count%,$1" RS
              > ^!Goto Loop
              >
              > The only minor disadvantage: Empty lines within the list will
              > provide wrong results. So if there are any, we have to remove
              > them with an additional command line.
              >
              > Do you agree with this solution?
              >
              > Regards,
              > Flo
              >  
              >

              That works fine. Here's one that ignores trailing white space and
              empty lines in the data (but goes back to counting strings for count):

              ^!SetScreenUpdate off
              ^!Jump Doc_Start
              :Loop
              ^!Find "^(.+)(\s*\r\n\1)*" RS
              ^!IfError Out
              ^!SetArray %farray%=^$GetReSubstrings$
              ;begin long line
              ^!Set %Count%=^$StrCount("^%NL%";"^$GetDocReplaceAll("\s*(\r\n)+|(?<=.)\z";"\r\n")$";YES;YES)$
              ;end long line
              ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
              ^!InsertText ^$StrFill("0";^%fill%)$^%Count%,^%farray1%
              ^!Goto Loop
              :Out
              ^!Set %farray%=""
              ^!ClearVariable %farray%
              ^!ClearVariable %Count%
              ^!ClearVariable %Fill%
              ;end of clip
            • Flo
              Message 6 of 15, Nov 6, 2008
                --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                >
                > That works fine. Here's one that ignores trailing white space and
                > empty lines in the data (but goes back to counting strings for
                > count):
                >
                > ^!SetScreenUpdate off
                > ^!Jump Doc_Start
                > :Loop
                > ^!Find "^(.+)(\s*\r\n\1)*" RS
                > ^!IfError Out
                > ^!SetArray %farray%=^$GetReSubstrings$
                > ;begin long line
                > ^!Set %Count%=^$StrCount("^%NL%";"^$GetDocReplaceAll("\s*(\r\n)+|(?<=.)\z";"\r\n")$";YES;YES)$
                > ;end long line
                > ^!Set %Fill%=^$Calc(6-^$StrSize(^%Count%)$)$
                > ^!InsertText ^$StrFill("0";^%fill%)$^%Count%,^%farray1%
                > ^!Goto Loop
                > :Out
                > ^!Set %farray%=""
                > ^!ClearVariable %farray%
                > ^!ClearVariable %Count%
                > ^!ClearVariable %Fill%
                > ;end of clip

                Sheri,

                I think there's a problem with this solution. It ignores trailing
                blanks and empty lines in the counting of duplicate lines but not in
                the ^!Find command. So if we have got...

                0.verizon.windowsxp <-- trailing blank
                0.verizon.windowsxp
                0.verizon.windowsxp

                for example, the clip doesn't find three duplicates but interprets
                this as a singular line plus two duplicates of another line. That is,
                it works fine only on the condition that all three duplicates end
                with a trailing blank.

                So it may be better to remove trailing blanks prior to ^!Find. Isn't
                it?

                Regards,
                Flo

                P.S. By the way: It makes no difference -- but what about...

                ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$

                (since NT v.5.0 the \s matches "any white space", including CRLF).
                 
              • Sheri
                Message 7 of 15, Nov 6, 2008
                  Flo wrote:
                  > Sheri,
                  >
                  > I think there's a problem with this solution. It ignores trailing
                  > blanks and empty lines in the counting of duplicate lines but not in
                  > the ^!Find command. So if we have got...
                  >
                  > 0.verizon.windowsxp <-- trailing blank
                  > 0.verizon.windowsxp
                  > 0.verizon.windowsxp
                  >
                  > for example, the clip doesn't find three duplicates but interprets
                  > this as a singular line plus two duplicates of another line. That is,
                  > it works fine only on the condition that all three duplicates end
                  > with a trailing blank.
                  >
                  You're right.
                  > So it may be better to remove trailing blanks prior to ^!Find. Isn't
                  > it?
                  >
                  Not necessary if we change the ^!Find to this:

                  ^!Find "^(.+\S)(\s*\r\n\1)*" RS

                  > P.S. By the way: It makes no difference -- but what about...
                  >
                  > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
                  >
                  > (since NT v.5.0 the \s matches "any white space", including CRLF).
                  >
                  Haven't tested that and it may work fine. However it looks suspicious
                  because: we are using the PCRE multiline option by default in NoteTab so
                  $ matches at line ends (and line ends by definition are followed by \r\n
                  when showing in a NoteTab document) but \s+ matches across line breaks.
                  So how can it match $ if it picks up all the \r\n's? In order to match $
                  it would need to backtrack. It might create an issue if it backtracks
                  between the \r and \n.

                  I'll try to test it later.

                  Regards,
                  Sheri
                • Sheri
                  Message 8 of 15, Nov 6, 2008
                    Sheri wrote:
                    >
                    > Not necessary if we change the ^!Find to this:
                    >
                    > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
                    >
                    Hmm, maybe it should be

                    ^!Find "^(.*\S)(\s*\r\n\1)*" RS

                    just in case there is only one visible character on the line.

                    Regards,
                    Sheri
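The difference between `.+\S` and `.*\S`, and the way `\s*` lets the run skip trailing blanks and empty lines, can be checked with a short Python sketch. As before this is an approximation: `[^\r\n]` stands in for `.` in the run pattern because Python's `.` matches `\r`, and the names `STRICT`, `LOOSE`, `RUN` and `run_lengths` are hypothetical.

```python
import re

STRICT = re.compile(r"^(.+\S)", re.MULTILINE)  # needs at least two characters
LOOSE = re.compile(r"^(.*\S)", re.MULTILINE)   # one visible character is enough

# Analogue of ^(.*\S)(\s*\r\n\1)* for counting runs
RUN = re.compile(r"^([^\r\n]*\S)(?:\s*\r\n\1)*", re.MULTILINE)


def run_lengths(text: str) -> list[tuple[str, int]]:
    """Return (line, count) pairs, ignoring trailing blanks and empty lines."""
    pairs = []
    for match in RUN.finditer(text):
        # Count only the non-empty lines inside the matched run.
        lines = [ln for ln in match.group(0).split("\r\n") if ln.strip()]
        pairs.append((match.group(1), len(lines)))
    return pairs


single_strict = STRICT.match("x")         # no match: .+ and \S each need a character
single_loose = LOOSE.match("x").group(1)  # matches the lone character
pairs = run_lengths("x \r\nx\r\n\r\nx\r\ny\r\n")
```

In this sketch the three `x` lines are counted as one run of three even though the first has a trailing blank and an empty line sits in the middle.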
                  • Sheri
                    Message 9 of 15, Nov 7, 2008
                      --- In ntb-clips@yahoogroups.com, Sheri <silvermoonwoman@...> wrote:
                      >
                      > Flo wrote:
                      > > Sheri,
                      > >
                      > > I think there's a problem with this solution. It ignores trailing
                      > > blanks and empty lines in the counting of duplicate lines but not in
                      > > the ^!Find command. So if we have got...
                      > >
                      > > 0.verizon.windowsxp <-- trailing blank
                      > > 0.verizon.windowsxp
                      > > 0.verizon.windowsxp
                      > >
                      > > for example, the clip doesn't find three duplicates but
                      > > interprets this as a singular line plus two duplicates of another
                      > > line. That is, it works fine only on the condition that all three
                      > > duplicates end with a trailing blank.
                      > >
                      > You're right.
                      >
                      > > So it may be better to remove trailing blanks prior to ^!Find.
                      > > Isn't it?
                      > >
                      > Not necessary if we change the ^!Find to this:
                      >
                      > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
                      >
                      > > P.S. By the way: It makes no difference -- but what about...
                      > >
                      > > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
                      > >
                      > > (since NT v.5.0 the \s matches "any white space", including
                      > > CRLF).
                      > >
                      > Haven't tested that and it may work fine. However it looks
                      > suspicious because: we are using the PCRE multiline option by
                      > default in NoteTab so $ matches at line ends (and line ends by
                      > definition are followed by \r\n when showing in a NoteTab
                      > document) but \s+ matches across line breaks. So how can it match
                      > $ if it picks up all the \r\n's? In order to match $ it would
                      > need to backtrack. It might create an issue if it backtracks
                      > between the \r and \n.
                      >
                      > I'll try to test it later.

                      Indeed that pattern matches in the middle of \r\n. So after
                      replacement, we end up with \r\n\n. The reason it makes no difference
                      to the outcome is because we are counting "\r\n" after applying this
                      in ^$GetDocReplaceAll$. You would be able to see a problem if that
                      were getting inserted in the document (however, something else happens
                      when you insert it, \r\n\n becomes \r\n\r\n because the input control
                      needs line breaks to be \r\n -- you can test the string size prior to
                      insertion vs testing the size of a selection after insertion).

                      I think it is preferable to greedily replace white space that precedes
                      "\r\n" with "". The white space does include other CRLFs.

                      Regards,
                      Sheri
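The backtracking described above can be reproduced in Python, whose multiline `$` also matches just before a `\n`. This is a sketch under stated assumptions, not a test of NoteTab itself: only the `\s+$` half of the pattern is exercised (the `\z` alternative is left out), and the function names are hypothetical.

```python
import re


def replace_ws_before_eol(text: str) -> str:
    """Analogue of replacing \\s+$ with CRLF in multiline mode."""
    return re.sub(r"\s+$", "\r\n", text, flags=re.MULTILINE)


def strip_ws_before_crlf(text: str) -> str:
    """Greedily remove white space (including extra CRLFs) that precedes a CRLF."""
    return re.sub(r"\s+(?=\r\n)", "", text)


data = "a \r\nb \r\n"

# \s+ first swallows " \r\n", then backs off to " \r" so that $ can match
# before the \n. The replacement leaves a stray \n behind.
split_result = replace_ws_before_eol(data)

# The count of "\r\n" is still one per line, which is why the clip's total
# is unaffected.
crlf_count = split_result.count("\r\n")

# The alternative keeps every line break intact.
clean_result = strip_ws_before_crlf(data)
```

Here `split_result` is `"a\r\n\nb\r\n"`, which contains the `\r\n\n` sequence mentioned above, while `clean_result` is `"a\r\nb\r\n"`.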
                    • Flo
                      Message 10 of 15, Nov 8, 2008
                        --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
                        >
                        > Indeed that pattern (Flo: ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$)
                        > matches in the middle of \r\n. So after replacement, we end up with
                        > \r\n\n...The reason it makes no difference...

                        Yes, I can see now what happens here. I think your explanation is in
                        better accordance with the PCRE Documentation than the NoteTab Help
                        on RegEx. The latter says: "$ assert end of string (or line, in
                        multiline mode)".

                        The PCRESYNTAX Documentation from PCRE 7.7 is more detailed: "$ end
                        of subject, also before newline at end of subject, also before
                        internal newline in multiline mode."

                        Thanks again, Sheri! I can clearly see the difference now and
                        why "^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$" doesn't affect the
                        result.

                        Flo