Re: [Clip] Re: Line frequency analysis.

  • Sheri
    Message 1 of 15, Nov 6, 2008
      Flo wrote:
      > Sheri,
      >
      > I think there's a problem with this solution. It ignores trailing
      > blanks and empty lines in the counting of duplicate lines but not in
      > the ^!Find command. So if we have got...
      >
      > 0.verizon.windowsxp <-- trailing blank
      > 0.verizon.windowsxp
      > 0.verizon.windowsxp
      >
      > for example, the clip doesn't find three duplicates but interprets
      > this as a singular line plus two duplicates of another line. That is,
      > it works fine only on the condition that all three duplicates end
      > with a trailing blank.
      >
      You're right.
      > So it may be better to remove trailing blanks prior to ^!Find. Isn't
      > it?
      >
      Not necessary if we change the ^!Find to this:

      ^!Find "^(.+\S)(\s*\r\n\1)*" RS
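A rough way to check the revised pattern outside NoteTab is sketched below. It uses Python's `re` module as a stand-in for NoteTab's PCRE engine, which is an assumption: the two engines are not identical, although both let `\s` match CR and LF. The sample text is Flo's three lines, the first one with a trailing blank, and the variable names are only illustrative.

```python
import re

# Flo's example: three copies of a line, the first with a trailing blank.
text = (
    "0.verizon.windowsxp \r\n"
    "0.verizon.windowsxp\r\n"
    "0.verizon.windowsxp\r\n"
)

# Same pattern as in the ^!Find above; re.MULTILINE stands in for
# NoteTab's default multiline behaviour (an assumption of this sketch).
pattern = re.compile(r"^(.+\S)(\s*\r\n\1)*", re.MULTILINE)

m = pattern.search(text)
line = m.group(1)                      # the line without trailing white space
copies = m.group(0).count("\r\n") + 1  # line breaks inside the match, plus one

print(repr(line), copies)
```

In this sketch the trailing blank ends up outside group 1, so all three lines fall into a single match.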

      > P.S. By the way: It makes no difference -- but what about...
      >
      > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
      >
      > (since NT v.5.0 the \s matches "any white space", including CRNL).
      >
      Haven't tested that and it may work fine. However it looks suspicious
      because: we are using the PCRE multiline option by default in NoteTab so
      $ matches at line ends (and line ends by definition are followed by \r\n
      when showing in a NoteTab document) but \s+ matches across line breaks.
      So how can it match $ if it picks up all the \r\n's? In order to match $
      it would need to backtrack. It might create an issue if it backtracks
      between the \r and \n.

      I'll try to test it later.

      Regards,
      Sheri
    • Sheri
      Message 2 of 15, Nov 6, 2008
        Sheri wrote:
        >
        > Not necessary if we change the ^!Find to this:
        >
        > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
        >
        Hmm, maybe it should be

        ^!Find "^(.*\S)(\s*\r\n\1)*" RS

        just in case there is only one visible character on the line.
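The difference between the two patterns can be sketched with a line that has only one visible character. As before, Python's `re` module is used here as a stand-in for NoteTab's PCRE engine (an assumption), and the names `plus_pattern` and `star_pattern` are made up for this illustration.

```python
import re

# Two copies of a line that has a single visible character.
text = "x\r\nx\r\n"

# ".+\S" needs at least two characters on the line; ".*\S" needs only one.
plus_pattern = re.compile(r"^(.+\S)(\s*\r\n\1)*", re.MULTILINE)
star_pattern = re.compile(r"^(.*\S)(\s*\r\n\1)*", re.MULTILINE)

plus_match = plus_pattern.search(text)
star_match = star_pattern.search(text)

print(plus_match)                 # no match object for the ".+" version
print(repr(star_match.group(0)))  # both lines in one match for ".*"
```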

        Regards,
        Sheri
      • Sheri
        Message 3 of 15, Nov 7, 2008
          --- In ntb-clips@yahoogroups.com, Sheri <silvermoonwoman@...> wrote:
          >
          > Flo wrote:
          > > Sheri,
          > >
          > > I think there's a problem with this solution. It ignores trailing
          > > blanks and empty lines in the counting of duplicate lines but not in
          > > the ^!Find command. So if we have got...
          > >
          > > 0.verizon.windowsxp <-- trailing blank
          > > 0.verizon.windowsxp
          > > 0.verizon.windowsxp
          > >
          > > for example, the clip doesn't find three duplicates but
          > > interprets this as a singular line plus two duplicates of another
          > > line. That is, it works fine only on the condition that all three
          > > duplicates end with a trailing blank.
          > >
          > You're right.
          > > So it may be better to remove trailing blanks prior to ^!Find. Isn't
          > > it?
          > >
          > Not necessary if we change the ^!Find to this:
          >
          > ^!Find "^(.+\S)(\s*\r\n\1)*" RS
          >
          > > P.S. By the way: It makes no difference -- but what about...
          > >
          > > ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$
          > >
          > > (since NT v.5.0 the \s matches "any white space", including
          > > CRNL).
          > >
          > Haven't tested that and it may work fine. However it looks
          > suspicious because: we are using the PCRE multiline option by
          > default in NoteTab so $ matches at line ends (and line ends by
          > definition are followed by \r\n when showing in a NoteTab
          > document) but \s+ matches across line breaks. So how can it match
          > $ if it picks up all the \r\n's? In order to match $ it would
          > need to backtrack. It might create an issue if it backtracks
          > between the \r and \n.
          >
          > I'll try to test it later.

          Indeed, that pattern matches in the middle of \r\n, so after
          replacement we end up with \r\n\n. It makes no difference to the
          outcome because we are only counting "\r\n" in the result of
          ^$GetDocReplaceAll$. You would be able to see a problem if that
          result were inserted into the document. (Something else happens
          when you insert it, though: \r\n\n becomes \r\n\r\n because the input
          control needs line breaks to be \r\n. You can check this by comparing
          the string size prior to insertion with the size of the selection
          after insertion.)
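The stray \n can be reproduced in a small sketch. It assumes Python's `re` module behaves like NoteTab's PCRE engine for this pattern, which is not guaranteed; `\Z` is used because that is Python's spelling of PCRE's `\z` (very end of the subject). The two-line sample text is made up.

```python
import re

# Two lines; the first has a trailing blank before its CRLF.
text = "a \r\nb\r\n"

# Flo's pattern, with \Z standing in for PCRE's \z.
pattern = re.compile(r"\s+$|(?<=.)\Z", re.MULTILINE)
result = pattern.sub("\r\n", text)

# "$" matches just before the "\n", so "\s+" gives back the "\n" and
# consumes only " \r"; the original "\n" is left behind after the
# replacement text.
print(repr(result))
print(result.count("\r\n"), text.count("\r\n"))
```

In this sketch the first line ends in \r\n\n after the replacement, yet the number of "\r\n" sequences is unchanged, which is why the count is not affected.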

          I think it is preferable to greedily replace white space that precedes
          "\r\n" with "". The white space does include other CRLFs.
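One possible reading of that suggestion is sketched below. The pattern `\s+(?=\r\n)` is an assumption, not something given in this thread, and Python's `re` module again stands in for NoteTab's PCRE engine. The sample text has a trailing blank and an empty line.

```python
import re

# A trailing blank after "a", then an empty line, then "b".
text = "a \r\n\r\nb\r\n"

# Assumed pattern: greedy white space followed by a CRLF, which is kept
# because the lookahead does not consume it.
pattern = re.compile(r"\s+(?=\r\n)")
cleaned = pattern.sub("", text)

print(repr(cleaned))
```

Because the lookahead requires a complete \r\n, this version cannot split a line break in two. White space at the very end of a text that lacks a final CRLF would not be touched by this sketch.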

          Regards,
          Sheri
        • Flo
          Message 4 of 15, Nov 8, 2008
            --- In ntb-clips@yahoogroups.com, "Sheri" <silvermoonwoman@...> wrote:
            >
            > Indeed that pattern (Flo: ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$)
            > matches in the middle of \r\n. So after replacement, we end up with
            > \r\n\n...The reason it makes no difference...

            Yes, I can see now what happens here. I think your explanation is in
            better accordance with the PCRE Documentation than the NoteTab Help
            on RegEx. The latter says: "$ assert end of string (or line, in
            multiline mode)".

            The PCRESYNTAX Documentation from PCRE 7.7 is more detailed: "$ end
            of subject, also before newline at end of subject, also before
            internal newline in multiline mode."
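The three cases in that description can be illustrated with a short sketch. It uses Python's `re` module, whose `$` is documented to behave in a similar way; treating it as equivalent to PCRE here is an assumption.

```python
import re

subject = "a\nb\n"

# Without multiline mode: "$" matches before the final newline and at
# the very end of the subject.
default_positions = [m.start() for m in re.finditer(r"$", subject)]

# With multiline mode: "$" also matches before the internal newline.
multiline_positions = [
    m.start() for m in re.finditer(r"$", subject, re.MULTILINE)
]

print(default_positions)
print(multiline_positions)
```

With a CRLF line break, the position "before the newline" is the one between \r and \n, which is where the replacement discussed above splits the line break.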

            Thanks again, Sheri! I can clearly see the difference now and
            why ^$GetDocReplaceAll("\s+$|(?<=.)\z";"\r\n")$ doesn't affect the
            result.

            Flo
             