Re: Creation of clip

  • Flo
    Message 1 of 30, Jun 21, 2007
      Sheri wrote...

      > How many keywords? If not more than a few hundred could
      > possibly use something like this (uses regular expression
      > matching).
      >
      > ^!SetListDelimiter ^P
      > ; next is one long line
      > ^!Set %linesout%=^$GetDocMatchAll("(?-i)^.*(comprehensive|switch|system).*^%dollar%";0)$
      > ; end long line
      > ^!Toolbar New Document
      > ^!InsertText ^%linesout%

      In fact, the alternation that can be used with ^$GetDocMatchAll$ seems
      to be limited. Testing with a file of 250 keywords and a text of
      16,000 lines works fine. It fails when those 250 keywords are taken
      as the text and the 16,000 words as keywords. NT5 reacts with the
      message...

      "Regex error: internal error: overran compiling workspace".

      (You may test it with the files at
      http://flogehrke.homepage.t-online.de/491/ntf-wordlist.zip that we
      used for testing another clip some months ago.)

      Is this limitation definable in any way?

      Flo
       
    • Sheri
      Message 2 of 30, Jun 21, 2007
        Flo wrote:
        > "Regex error: internal error: overran compiling workspace".
        > [...]
        > Is this limitation definable in any way?

        Hi Flo,

        I don't think it is definable per se. You could test generated patterns
        in clips with ^!IfRegexOK. You can retrieve the error message (if not
        OK) with ^$GetRegexErrorMsg$. A clip could possibly take corrective
        action for some errors (like reducing the number of alternatives
        processed at one time).

        PCRE 7.2 was just released, and it says it corrected this:

        "A pattern with a very large number of alternatives (more than several
        hundred) was running out of internal workspace during the pre-compile
        phase, where pcre_compile() figures out how much memory will be needed.
        A bit of new cunning has reduced the workspace needed for groups with
        alternatives. The 1000-alternative test pattern now uses 12 bytes of
        workspace instead of running out of the 4096 that are available."

        I don't think it will be too long before NoteTab incorporates the
        update. However, there are other factors besides "internal workspace"
        that affect how many alternatives will work. When working on the stop
        list clip, I remember an error message saying that the regular
        expression was "too long". In one of the stop list clips, I applied the
        keywords in chunks of approximately 10K, and that worked at that time
        (I think it was PCRE 6.7 then).
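
As a rough illustration outside NoteTab, the "fewer alternatives at a time" idea can be sketched in Python. This is only a sketch under assumptions: Python's re module stands in for PCRE (it has no fixed compile workspace), the helper names and the chunk size are made up for the example, and the try/except plays the role that ^!IfRegexOK and ^$GetRegexErrorMsg$ play in a clip.

```python
import re

def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def compile_in_chunks(keywords, chunk_size=500):
    """Compile one alternation pattern per chunk of keywords.

    Returns (patterns, errors). A chunk that fails to compile is reported
    in `errors` instead of aborting the whole run.
    """
    patterns, errors = [], []
    for chunk in chunked(keywords, chunk_size):
        source = "|".join(re.escape(word) for word in chunk)
        try:
            patterns.append(re.compile(source))
        except re.error as exc:
            errors.append(str(exc))
    return patterns, errors

def drop_matching_lines(lines, patterns):
    """Keep only the lines that match none of the chunk patterns."""
    return [line for line in lines
            if not any(p.search(line) for p in patterns)]

# Hypothetical sample data; chunk_size=2 just forces more than one chunk.
keywords = ["comprehensive", "switch", "system"]
patterns, errors = compile_in_chunks(keywords, chunk_size=2)
text = ["a comprehensive list", "nothing to see", "the system is up"]
remaining = drop_matching_lines(text, patterns)
```

Each chunk is applied separately, so no single pattern ever holds the full keyword list.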

        Regards,
        Sheri
      • paulmaser
        Message 3 of 30, Jun 21, 2007
          You could probably replace the first two lines below with a single
          command, something like this:
          ^!Replace "(\r\n)+" >> "|" AWRS


          > ^!Replace "(\r\n){2,}" >> "\r\n" AWRS
          > ^!Replace "\r\n" >> "|" AWRS
          > ^!Replace "\|\Z" >> "" AWRS
        • Sheri
          Message 4 of 30, Jun 21, 2007
            Hi again,

            I haven't been following this thread in detail, but if he just wants
            to remove lines having a keyword, wouldn't it be better to use a
            replace command (replacing keyword lines with "") instead of using
            ^$GetDocMatchAll$?

            Seems to me the stop word task was more complicated because you wanted
            to not only delete lines matching a stop word, but also eliminate
            duplicates that were not stop words.

            Using ^!Replace all(s) would be fast (though you still have to keep
            your alternates lists reasonably sized for PCRE).

            Regards,
            Sheri
          • Flo
            Message 5 of 30, Jun 21, 2007
              "paulmaser" <paul@...> wrote:
              >
              > You could probably replace the first two lines below with one
              > command, that would look something like this:
              > ^!Replace "(\r\n)+" >> "|" AWRS
              >
              >
              > > ^!Replace "(\r\n){2,}" >> "\r\n" AWRS
              > > ^!Replace "\r\n" >> "|" AWRS
              > > ^!Replace "\|\Z" >> "" AWRS

              Thanks, Paul. You are right, this single line will do the job:

              ^!Replace "(\r\n)+" >> "|" AWRS

              By the way: the clip wouldn't even need to open and process the
              keyword list if we make sure from the outset that it doesn't contain
              any empty lines. Thus we could replace all the lines from
              "^!Open ^%Keywords%" to "^!Close ^%Keywords% Discard" with...

              ^!SetClipboard ^$GetFileText("^%Keywords%")$
              ^!SetClipboard ^$StrReplace("^%NL%";"|";"^$GetClipboard$";0;0)$
              ^!Set %Search%=^$GetClipboard$

              This could speed up the clip even more ;-)

              Flo
               
              • Flo
                Message 7 of 30, Jun 21, 2007
                  Thanks for that information, Sheri!

                  I remember those "10K chunks". Members who want to read up on that
                  issue will find it in message #15213 (see ^!Select +10000...).

                  Flo
                   
                • Flo
                  Message 8 of 30, Jun 21, 2007
                    Sheri wrote...

                    > I haven't been following this thread in detail, but if he just wants
                    > to remove lines having a keyword, wouldn't it be better to use a
                    > replace command (replacing keyword lines with "") instead of using
                    > getdocmatchall?

                    Indeed - why not this way...


                    ^!SetScreenUpdate Off
                    ^!SetHintInfo Working...
                    ^!Set %Doc%=^$GetDocIndex$
                    ^!Set %Keywords%=^?[(T=O;F="Textfiles (*.txt)|*.txt")Choose Keyword File:]
                    ^!Set %Case%=^?[Case-sensitive search:==Yes^=(?-i)|_No^=(?i)]
                    ^!Open ^%Keywords%
                    ^!Replace "(\r\n)+" >> "|" AWRS
                    ^!Replace "\|\Z" >> "" AWRS
                    ^!Replace "\A\|" >> "" AWRS
                    ^!Set %Search%=^$GetText$
                    ^!Close ^%Keywords% Discard
                    ^!SetDocIndex ^%Doc%
                    ^!Menu Edit/Copy All
                    ^!Menu Edit/Paste New
                    ^!Replace "^%Case%^.*(^%Search%).*\r\n" >> "" AWRS
                    ^!Info Finished!


                    Regards,
                    Flo
                     
                  • Sheri
                    Message 9 of 30, Jun 22, 2007
                      Great! If interested in making further improvements, here are a few
                      more enhancements to consider.

                      When a clip makes use of the clipboard, it's nice to restore its
                      original contents at the end.

                      You are closing the keyword document before navigating to the
                      original document. You need to be sure the keyword document was not
                      already open when the clip was started. If it gets closed from a lower
                      docindex than the starting document, you would not return to the
                      original document when you set your docindex. You'd have to navigate
                      to the original docindex first and then close (discard) the keywords
                      document.

                      Normally it would be a good idea to reverse sort alternates when
                      constructing a regular expression, but since whole lines containing
                      alternates are being deleted, in this case that wouldn't make any
                      difference. The reason they should normally be reverse sorted is that
                      alternates are tried from left to right. If there's a keyword "be"
                      and a keyword "before", "be|before" will never match "before" in the
                      text as a whole; the match stops at "be". Using \b's before and after
                      the alternates would also work, if the keywords are meant to be whole
                      words only.
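
The left-to-right behaviour is easy to check outside NoteTab. The snippet below is a small sketch in Python, whose re module tries alternatives in the same leftmost-first order; the sample word and variable names are only for illustration.

```python
import re

text = "before"

# Alternatives are tried left to right, so "be" wins at position 0.
first = re.search(r"be|before", text).group()

# Reverse sorting puts "before" ahead of its own prefix "be".
keywords = sorted(["be", "before"], reverse=True)
longest = re.search("|".join(keywords), text).group()

# Word boundaries force a whole-word match regardless of the order.
bounded = re.search(r"\b(?:be|before)\b", text).group()
```

With the boundaries in place the engine backtracks out of "be" because no word boundary follows it, which is why the order stops mattering.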

                      If there are any characters that might get interpreted by the regex
                      engine as metacharacters in the keyword document, they should be
                      escaped with a backslash prior to using them in the alternates.

                      When constructing a regular expression with code, it's probably a good
                      idea to check ^!IfRegexOK before using the expression in a "real"
                      statement. If there is an error, you'd have an opportunity to show a
                      message and still do cleanup tasks (like restoring the clipboard).

                      Regards,
                      Sheri
                    • Flo
                      Message 10 of 30, Jun 23, 2007
                        Hi Sheri,

                        I'm grateful to you for all these recommendations, and I tried to
                        apply them to this clip...

                        > When a clip makes use of the clipboard, its nice to restore its
                        > original contents at the end.

                        That isn't an issue here, is it? But I think it could easily be done
                        by saving the clipboard contents in a variable and afterwards writing
                        them back to the clipboard, like...

                          ^!Set %Var%=^$GetClipboard$ ... ^!SetClipboard ^%Var%

                        > You'd have to navigate to the original docindex and then close
                        > discard the keywords document.

                        I changed the order of these command lines.

                        By the way: Isn't it even safer to work with the document name? Given
                        that the clip always gets started from the original document, we
                        could replace...

                          ^!Set %Doc%=^$GetDocIndex$  with  ^!Set %Doc%=^$GetDocName$

                        and

                          ^!SetDocIndex ^%Doc%  with  ^!Open ^%Doc%

                        (According to the help file, I suppose that ^!Open also selects a
                        document that is open already.)

                        > Normally it would be a good idea to reverse sort alternates...

                        See lines 8 and 9 of the clip now.

                        > metacharacters in the keyword document...should be escaped
                        > with a backslash

                        Certainly, this would be a professional solution. In message # 15199
                        you created a subclip GetRegEscape that would do this job.

                        > its probably a good idea to check ^!IfRegexOK before using the
                        > expression in a "real" statement.

                        I hope I've done it the right way.

                        > Using \b's before and after the alternates would also work, if
                        > the keywords are meant to be whole words only.

                        This has been added too.

                        In addition to that, I've combined the \b's with a negative
                        lookbehind and lookahead. They do not allow certain characters
                        directly before or after a search word that is being treated as a
                        whole word. This is mainly aimed at words joined with a hyphen -
                        (ANSI 45) or the apostrophe ' (ANSI 39). For example: if "McDonald"
                        is defined as a keyword, it normally matches "McDonald's" too, even
                        if enclosed in \b, since - and ' are interpreted as word delimiters.
                        Consequently, the clip would delete a line like...

                            "eating a hamburger at McDonald's"

                        although it isn't really matched by "McDonald" as a whole word.
                        Likewise, "self-service" would be matched by both "self" and
                        "service", although they are possibly regarded only as substrings
                        of "self-service". It depends, of course, on the way you look at
                        "lexical problems" like that, and also on the sort of text to be
                        processed. Certainly, this construction needs some more testing...
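
The effect of the lookarounds can be tried outside NoteTab with a small Python sketch. It rests on two assumptions: Python's re module does not support POSIX classes such as [[:punct:]], so the narrower class [-'] is swapped in, and the keyword is hard-coded for the example. Note that a full punctuation class would also reject a keyword that is directly followed by a period or comma.

```python
import re

# Plain whole-word match: the apostrophe counts as a word boundary.
PLAIN = re.compile(r"\bMcDonald\b")

# Guarded match: no hyphen or apostrophe directly before or after.
GUARDED = re.compile(r"\b(?<![-'])McDonald(?![-'])\b")

lines = [
    "eating a hamburger at McDonald's",
    "Ronald McDonald sings",
]
plain_hits = [bool(PLAIN.search(line)) for line in lines]
guarded_hits = [bool(GUARDED.search(line)) for line in lines]
```

Only the guarded pattern leaves the "McDonald's" line alone.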

                        How should we deal with compound nouns written with a space (ANSI
                        32)? For example: "Express" would delete "American Express" although
                        we possibly don't regard it as a match of that compound. The only
                        solution I can see for that is to enter "American Express" with a
                        non-breaking space (ANSI 160) in order to distinguish it from the
                        normal space (ANSI 32). With regard to this, we could extend the
                        lookarounds with \xA0 in order to match ANSI 160. Maybe there's a
                        better solution (or even more problems)...

                        Regards,
                        Flo


                        ^!SetScreenUpdate Off
                        ^!SetHintInfo Working...
                        ^!Set %Doc%=^$GetDocIndex$
                        ^!Set %Keywords%=^?[(T=O;F="Textfiles (*.txt)|*.txt")Choose Keyword File:]
                        ^!Set %Case%=^?[Case-sensitive search:==Yes^=(?-i)|_No^=(?i)]
                        ^!Set %Substr%=^?[Search whole words only:==Yes^=1|_No^=0]
                        ^!Open ^%Keywords%
                        ^!Select All
                        ^$StrSort("^$GetSelection$";0;0;1)$
                        ^!Replace "(\r\n)+" >> "|" AWRS
                        ^!Replace "\|\Z" >> "" AWRS
                        ^!Replace "\A\|" >> "" AWRS
                        ^!Set %Search%=^$GetText$
                        ^!SetDocIndex ^%Doc%
                        ^!Close ^%Keywords% Discard
                        ^!IfTrue ^%Substr% Next Else Skip_2
                        ;^!Set %Expr%="^%Case%^.*\b(^%Search%)\b.*\r\n"
                        ; start of long line
                        ^!Set %Expr%="^%Case%^.*\b(?<![[:punct:]])(^%Search%)(?![[:punct:]])\b.*\r\n"
                        ; end of long line
                        ^!Goto Skip
                        ^!Set %Expr%="^%Case%^.*(^%Search%).*\r\n"
                        ; Try next line for testing RegEx error ;-)
                        ;^!Set %Expr%="[[:punkt:]]+"
                        ^!IfRegExOK "^%Expr%" Next Else Message
                        ^!Menu Edit/Copy All
                        ^!Menu Edit/Paste New
                        ^!Replace "^%Expr%" >> "" AWRS
                        ^!Info Finished!
                        ^!Goto End

                        :Message
                        ^!Prompt ^$GetRegexErrorMsg$
                      • Sheri
                        Message 11 of 30, Jun 24, 2007
                          Hi Flo,

                          --- In ntb-clips@yahoogroups.com, "Flo" <flo.gehrke@...> wrote:
                          >
                          > I'm grateful to you for all these recommendations, and I tried to
                          > apply them to this clip...
                          >
                          > > When a clip makes use of the clipboard, its nice to restore its
                          > > original contents at the end.
                          >
                          > That's not given here, isn't it?

                          Well you do "^!Menu Edit/Copy All" near the end so you can paste the
                          result to a new document. As is, that ends up remaining on the
                          clipboard after the clip has finished.

                          > But I think it could easily be done by saving its contents in a
                          > variable, and afterwards pasting it back to the clipboard like..
                          >
                          > ^!Set %Var%=^$GetClipboard$ ... ^!SetClipboard ^%Var%

                          See ^!ClipboardSave and ^!ClipboardRestore

                          >
                          > > You'd have to navigate to the original docindex and then close
                          > > discard the keywords document.
                          >
                          > I changed the order of these command lines.
                          >
                          > By the way: Isn't it even safer to work with the document name?
                          > Given that the clip always gets started from the original
                          > document, we could replace...

                          >
                          > ^!Set %Doc%=^$GetDocIndex$ with ^!Set %Doc%=^$GetDocName$
                          >
                          > and
                          >
                          > ^!SetDocIndex ^%Doc% with ^!Open ^%Doc%

                          Yes, that should work. But then NoteTab has to look up the docindex
                          itself, so it may be slightly faster to save and restore the docindex
                          yourself.

                          >
                          > (According to the help file, I suppose that ^!Open also selects a
                          > document that is open already.)
                          >
                          > > Normally it would be a good idea to reverse sort alternates...
                          >
                          > See line #8, and 9 now
                          >
                          > > metacharacters in the keyword document...should be escaped
                          > > with a backslash
                          >
                          > Certainly, this would be a professional solution. In message # 15199
                          > you created a subclip GetRegEscape that would do this job.

                          Since you're using a document buffer, you could use a single ^!Replace
                          that matches any metacharacter (written as a set of alternates; be
                          sure to escape them in the pattern) and replaces it with "\\$0". The
                          GetRegEscape clip approach is necessary only when acting on a string
                          instead of a document. There is currently no provision in NoteTab to
                          do regex string operations.
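
The "\\$0" idea (put a backslash in front of whatever was matched) can be sketched in Python as follows. This is an illustration only: a character class is swapped in for the list of alternates, the metacharacter set is an assumption, and the function name is made up for the example.

```python
import re

# Character class listing common regex metacharacters; each one is either
# escaped or placed so that it is literal inside the class.
METACHARS = re.compile(r"[\\^$.|?*+()\[\]{}]")

def escape_metachars(text):
    r"""Prefix every metacharacter with a backslash (the "\\$0" idea)."""
    return METACHARS.sub(r"\\\g<0>", text)

# Hypothetical keyword line containing several metacharacters.
escaped = escape_metachars("C++ (draft)|a.b")
```

The escaped string can then be used as one alternate that matches the original keyword literally.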

                          >
                          > > its probably a good idea to check ^!IfRegexOK before using the
                          > > expression in a "real" statement.
                          >
                          > I hope I've done it the right way.

                          Haven't tried it, but it looks good to me :)

                          I haven't made use of classes like punct before myself, so you're
                          blazing a trail :)

                          >
                          > > Using \b's before and after the alternates would also work, if
                          > > the keywords are meant to be whole words only.
                          >
                          > This has been added too.
                          >
                          > In addition to that, I've combined the \b's with a negative
                          > lookbehind and lookahead. They do not allow certain characters
                          > before or behind a search word that is being treated as a whole
                          > word. This is mainly aiming at words hyphenated with - (ANSI 45)
                          > and the apostrophe ' (ANSI 39). For example: If "McDonald" is
                          > defined as a keyword it normally matches "McDonald's" too even if
                          > embraced with \b
                          > since - and ' are interpreted as word delimiters. Consequently, the
                          > clip would delete a line like...
                          >
                          > "eating a hamburger at McDonald's"
                          >
                          > although it isn't really matched by "McDonald" as a whole word.
                          > Or "self-service" would be matched by "self" and "service" as well
                          > although they possibly are regarded as substrings of "self-service"
                          > only. It depends, of course, on the way you look at "lexical
                          > problems" like that, and also on the sort of text to be processed.
                          > Certainly, this construction needs some more testing...

                          > How to deal with compound nouns written with a space (ANSI 32)?
                          > For example: "Express" would delete "American Express" although
                          > we possibly don't regard it as a match of that compound. The only
                          > solution I can see for that is to enter "American Express" with a
                          > protected space (ANSI 160) in order to distinguish it from the
                          > normal space (ANSI 32). With regard to this, we could extend the
                          > Lookarounds with \xA0 in order to match ANSI 160. Maybe there's a
                          > better solution (or even more problems)...

                          Hmm, you bring up some interesting points. "American Express" would
                          be its own keyword, as would "Express". The "Express" alternate could
                          use a negative lookbehind to make sure it is not preceded by
                          "American\x20". Obviously, customizing the keywords or alternates to
                          that extent would require some fine tuning before applying them.
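
A minimal Python sketch of that lookbehind, assuming the two sample lines below (Python, like PCRE of that era, needs a fixed-width lookbehind, which "American\x20" is):

```python
import re

# "Express" as a whole word, unless it directly follows "American ".
EXPRESS = re.compile(r"\b(?<!American\x20)Express\b")

lines = [
    "paid with American Express",
    "took the Express train",
]
hits = [bool(EXPRESS.search(line)) for line in lines]
```

Only the second line would be treated as containing the keyword "Express".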

                          Regards,
                          Sheri

                        • hsavage
                          ... tried to apply them to this clip... ... Flo, If you re insistent about restoring the clipboard to its previous state after running a clip you might want to
                          Message 12 of 30 , Jun 25, 2007
                            Flo wrote:
                            > Hi Sheri,
                            >
                            > I'm grateful to you for all these recommendations, and I
                            > tried to apply them to this clip...
                            >
                            >> When a clip makes use of the clipboard, it's nice to
                            >> restore its original contents at the end.
                            >
                            > That's not given here, is it? But I think it could
                            > easily be done by saving its contents in a variable, and
                            > afterwards pasting it back to the clipboard like...
                            >
                            > ^!Set %Var%=^$GetClipboard$ ... ^!SetClipboard ^%Var%

                            Flo,

                            If you're insistent about restoring the clipboard to its previous state
                            after running a clip you might want to check into the following 2 clip
                            commands.

                            ^!ClipBoardSave
                            ^!ClipBoardRestore [+]


                            ºvº SL-6-199 -created- 2007.06.25 - 19.48.24

                            "Party Etiquette; Drinking Your Fair Share."
                            ¤ ø ¤ hrs ø hsavage@...
                          • Flo
                            Message 13 of 30 , Jun 27, 2007
                              The latest version of this clip splits the keyword list into chunks
                              of 500 lines to stay within the limits of the alternation. In
                              my tests, the error message mentioned above appeared from 818
                              keywords on. Now it works with any number of keywords. It is
                              designed to delete certain keywords (i.e. stopwords) from a word list,
                              or complete lines in a list that contain these keywords. In full
                              text it deletes whole paragraphs containing the keyword (or
                              substring).

                              Also metacharacters in the keyword list are escaped now (e.g.,
                              replace ? with \?).

                              H=Delete Keywords
                              ^!SetScreenUpdate Off
                              ^!SetHintInfo Working...
                              ; Save clipboard, and restore it later on (recommended by Sheri)
                              ^!ClipBoardSave
                              ; Store the index of active document
                              ^!Set %Doc%=^$GetDocIndex$
                              ; Choose keyword (stopword) file, case, and whole words
                              ^!Set %Keywords%=^?[(T=O;F="Textfiles (*.txt)|*.txt")Choose Keyword File:]
                              ^!Set %Case%=^?[Case-sensitive search:==Yes^=(?-i)|_No^=(?i)]
                              ^!Set %WholeWords%=^?[Search whole words only:==Yes^=1|_No^=0]
                              ^!Open ^%Keywords%
                              ; Reverse sort of keywords (to put longer words before shorter words)
                              ^!Select All
                              ^$StrSort("^$GetSelection$";0;0;1)$
                              ; Escape metacharacters (next one long line)
                              ^!Replace "\\|\^|\!|\$|\?|\.|\*|\<|\>|\+|\(|\)|\[|\]|\{|\}|\=|\||\:" >> "\\$0" AWRST
                              ; Divide document into chunks of 500 lines to meet the
                              ; restrictions of alternation
                              ^!Set %ChunkIndex%=1
                              ^!Jump 1

                              :Loop_1
                              ^!Select 500
                              ^!Toolbar Copy
                              ; Make alternation by replacing NL with vertical bar
                              ^!SetClipboard ^$StrReplace(^%NL%;|;^$GetClipboard$;0;0)$
                              ; Remove vertical bar at end of string to avoid an empty
                              ; alternative; note: (A|B|) matches A, B, or the empty string,
                              ; so it would match every line.
                              ; You may do the same at start of string, or watch for empty lines
                              ; at the start of the keyword list
                              ^!IfSame "^$StrCopyRight(^$GetClipboard$;1)$" "|" Next Else Skip
                              ^!SetClipboard ^$StrDeleteRight(^$GetClipboard$;1)$
                              ; Save chunks in variables %Chunk1%, %Chunk2%, etc.
                              ^!Set %Chunk^%ChunkIndex%%=^$GetClipboard$
                              ^!Jump +1
                              ^!If ^$GetRow$=^$GetLineCount$ Replace
                              ^!Inc %ChunkIndex%
                              ^!Goto Loop_1

                              :Replace
                              ; Return to active document
                              ^!SetDocIndex ^%Doc%
                              ; Close keyword file and copy active document to new document
                              ^!Close ^%Keywords% Discard
                              ^!Menu Edit/Copy All
                              ^!Menu Edit/Paste New
                              ^!Set %RepIndex%=1

                              :Loop_2
                              ^!If ^%RepIndex% > ^%ChunkIndex% Finish
                              ; Grab %Chunk1%, %Chunk2%, etc. for search
                              ^!Set %Search%=^%Chunk^%RepIndex%%
                              ; If "whole words", use word delimiters in RegEx; lookarounds
                              ; prevent hyphenated words from being deleted
                              ^!IfTrue ^%WholeWords% Next Else Skip_2
                              ^!Set %Expr%="^%Case%^.*\b(?<![-])(^%Search%)(?![-])\b.*(\r\n|\z)"
                              ^!Goto Skip
                              ^!Set %Expr%="^%Case%^.*(^%Search%).*(\r\n|\z)"
                              ; Check syntax of RegEx
                              ^!IfRegExOK "^%Expr%" Next Else Message
                              ; Delete matching words and lines
                              ^!Replace "^%Expr%" >> "" AWRS
                              ^!Inc %RepIndex%
                              ^!Goto Loop_2

                              :Finish
                              ^!Info Finished!
                              ^!ClipBoardRestore
                              ^!Goto End

                              :Message
                              ^!Prompt ^$GetRegexErrorMsg$
                              ; end of clip


                              The clip prevents terms hyphenated with - (ANSI 45) from being
                              deleted by a whole-word keyword, e.g. "self" would not delete
                              "self-catering" (unless you choose to match substrings).
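
For readers who want to try the chunking idea outside NoteTab, here is a minimal Python sketch of the same approach: escape the keywords, sort longer ones first, build one alternation per chunk of 500, and delete every line that matches. It is not a translation of the clip. The function names and parameters are made up for illustration, and Python's `re` module is a different engine from the one NoteTab uses, so the 818-keyword limit observed above does not carry over.

```python
import re

def build_patterns(keywords, chunk_size=500, whole_words=True, ignore_case=False):
    """Compile one line-deleting regex per chunk of escaped keywords."""
    # Longest first, so a longer keyword is tried before its own prefix.
    words = sorted({k for k in keywords if k}, key=lambda w: (-len(w), w))
    flags = re.MULTILINE | (re.IGNORECASE if ignore_case else 0)
    patterns = []
    for i in range(0, len(words), chunk_size):
        # re.escape plays the role of the clip's metacharacter escaping.
        alternation = "|".join(re.escape(w) for w in words[i:i + chunk_size])
        if whole_words:
            # Word boundaries plus lookarounds that reject a neighbouring "-".
            core = r"\b(?<!-)(?:" + alternation + r")(?!-)\b"
        else:
            core = "(?:" + alternation + ")"
        # Match the whole line including its line break (or end of text).
        patterns.append(re.compile(r"^.*" + core + r".*(?:\r?\n|\Z)", flags))
    return patterns

def delete_matching_lines(text, keywords, **options):
    """Remove every line of text that contains one of the keywords."""
    for pattern in build_patterns(keywords, **options):
        text = pattern.sub("", text)
    return text
```

Called as `delete_matching_lines(text, ["self"])`, this keeps a line reading "self-catering" and drops a line reading "self help". With `whole_words=False` both lines are dropped. As in the clip, a keyword that ends in a non-word character (such as "what?") only matches in substring mode, because `\b` cannot succeed after the "?" when a space or line break follows.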

                              Regarding apostrophes and compound nouns with space I've been on the
                              wrong track. This issue is much more complicated, and I don't think
                              it could be solved by a general RegEx that would match all
                              eventualities. The apostrophe, for example, is used in a company name
                              like "McDonald's". This name will be deleted by a substring "Mc", and
                              by "McDonald" defined as a whole word as well since the apostrophe is
                              interpreted as a word delimiter. On the other hand, it indicates the
                              genitive of a lemma that possibly should be deleted, e.g. "Dickens'
                              works".

                              Another idea is to process the source file with the following clip
                              before running the "Delete Keywords" clip (of course, it also may be
                              integrated into "Delete Keywords").

                              Look at the following company names...

                              McDonald's
                              General Electric
                              Bank of America

                              In order to protect these names from being deleted
                              by "McDonald", "electric", or "bank", the Protect Keywords clip
                              replaces the apostrophe and space with _apo_ and _spc_ (even more
                              characters may be added that function as word delimiters). Thus the
                              names are interpreted as whole words. After running "Delete Keywords"
                              we can reverse this replacement.

                              First of all, you have to create a PROTECT.TXT file that contains a
                              list of terms like those three company names mentioned above.

                              Please note that "Protect Keywords" is meant to be run on the source
                              file, not on the keyword (or stopword) list!


                              H=Protect Keywords
                              ^!SetScreenUpdate Off
                              ^!SetHintInfo Working...
                              ^!Goto=^?[Choose action:==Protect Words^=Protect|Remove Protection^=Remove]

                              :Protect
                              ^!Set %Doc%=^$GetDocIndex$
                              ; Choose the list of words to be protected, e.g. PROTECT.TXT
                              ^!Set %ProFile%=^?{(T=O;F="Textfiles (*.txt)|*.txt")Choose Protected List:}
                              ^!Open ^%ProFile%
                              ^!Jump Doc_End
                              ^!IfFalse ^$IsEmpty(^$GetLine$)$ Next Else Skip
                              ^!InsertText ^%NL%
                              ^!Set %LineIndex%=^$GetTextLineCount$

                              :Loop_1
                              ^!Jump ^%LineIndex%
                              ^!SetClipboard ^$StrReplace("'";"_apo_";"^$GetLine$";0;0)$
                              ^!SetClipboard ^$StrReplace("^%Space%";"_spc_";"^$GetClipboard$";0;0)$
                              ^!Jump Line_End
                              ^!InsertText "^P^$GetClipboard$"
                              ^!If ^%LineIndex%=1 Replace
                              ^!Dec %LineIndex%
                              ^!Goto Loop_1

                              :Replace
                              ^!Select All
                              ^!SetListDelimiter ^p
                              ^!SetArray %Except%=^$GetSelection$
                              ^!SetDocIndex ^%Doc%
                              ^!Close ^%ProFile% Discard
                              ^!Jump 1
                              ^!Set %Count%=1

                              :Loop_2
                              ^!If ^%Count%=^%Except0% End
                              ^!Set %Search%="^%Except^%Count%%"
                              ^!Inc %Count%
                              ^!Set %Repl%="^%Except^%Count%%"
                              ^!Replace "^%Search%" >> "^%Repl%" AWRS
                              ^!Inc %Count%
                              ^!Goto Loop_2

                              :Remove
                              ^!Replace "_spc_" >> "^%Space%" AWST
                              ^!Replace "_apo_" >> "'" AWST

                              :End
                              ^!Info Finished!
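
The protect/unprotect round trip can also be sketched in a few lines of Python. This is only an illustration of the idea, not the clip itself: the function names are hypothetical, and it uses plain literal replacement instead of NoteTab's search-and-replace options.

```python
def protect(text, protected_terms):
    """Mask apostrophes and spaces inside each protected term."""
    # Longest terms first, so a long name is masked before a shorter overlap.
    for term in sorted(protected_terms, key=len, reverse=True):
        masked = term.replace("'", "_apo_").replace(" ", "_spc_")
        text = text.replace(term, masked)
    return text

def unprotect(text):
    """Reverse the masking after the keywords have been deleted."""
    return text.replace("_spc_", " ").replace("_apo_", "'")
```

Because the underscore counts as a word character, a masked name such as "General_spc_Electric" behaves as a single word for `\b`, so a whole-word search for "electric" no longer hits it. Note that `unprotect` would also convert any literal "_spc_" or "_apo_" that was already in the source text.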


                              Regards,
                              Flo
                               
                            • ebbtidalflats
                              Message 14 of 30 , Jun 28, 2007
                                Hi Flo,

                                I'm curious about a line in your clips, where you replace the text in
                                the document with ^$StrSort.

                                I see what you're doing, but am wondering why you chose the function,
                                rather than the menu command?

                                ^!Menu Modify/Lines/Sort/Descending

                                to select and sort all in one step, instead of using three different
                                functions.

                                > ^!Select All
                                > ^$StrSort("^$GetSelection$";0;0;1)$

                                Also, why sort the short words to the bottom? I know you put a lot of
                                effort into this, but didn't the original poster's example (we haven't
                                heard from him for some time) call for finding partial words? If
                                so, wouldn't finding the partials first speed up the search by
                                eliminating a lot of lines from the search for the longer words?

                                Just curious.


                                One more Question. Do you have a specific use in mind for this keyword
                                manipulation? Is this a comparison of two keyword lists, or what? Or
                                was this just a clipcoding exercise?


                                Thanks,


                                Eb
                              • Flo
                                Message 15 of 30 , Jun 29, 2007
                                  --- In ntb-clips@yahoogroups.com, "ebbtidalflats" <ebbtidalflats@...>
                                  wrote:
                                  >
                                  > Hi Flo,
                                  >
                                  > I'm curious about a line in your clips,...

                                  Eb,

                                  > ...why you chose the function, rather than the menu command?

                                  The menu command follows the settings in "Options | Tools".
                                  ^$StrSort$ lets you define the sorting independently of these
                                  settings.

                                  > Also, why sort the short words to the bottom?

                                  This has been described by Sheri before. Sheri also explained why
                                  this isn't really necessary when running the clip on word lists and
                                  lines.

                                  > wouldn't finding the partials speed up the search

                                  I think it isn't a matter of speed, and the difference would scarcely
                                  be measurable. What really matters is what you want to achieve.
                                  That's why you can choose substrings or whole words.

                                  > Do you have a specific use in mind for this keyword
                                  > manipulation? Is this a comparison of two keyword lists, or
                                  > what?

                                  One use, I suppose, has sufficiently been described (protection of
                                  certain terms and word forms from being deleted by substrings). There
                                  are many more applications I could think of. Why not compare two
                                  word lists, e.g. by subtracting list A from list B in order to get
                                  the difference? For me, dealing with word lists is mainly related to
                                  Text Retrieval and indexing of text databases, and NT has become an
                                  indispensable tool in this field.
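
As a rough illustration of the "subtract list A from list B" idea, here is a small Python sketch. The function name and the `ignore_case` switch are assumptions for the example, and it compares whole entries only, with none of the substring or hyphen handling of the clip.

```python
def subtract_list(list_b, list_a, ignore_case=False):
    """Return the entries of list_b that do not occur in list_a, in order."""
    normalize = str.casefold if ignore_case else (lambda s: s)
    # A set makes each membership test cheap, even for long word lists.
    exclude = {normalize(w.strip()) for w in list_a}
    return [w for w in list_b if normalize(w.strip()) not in exclude]
```

For example, `subtract_list(["alpha", "beta", "gamma"], ["beta"])` leaves the entries "alpha" and "gamma" in their original order.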

                                  Several members have contributed to this thread. I just tried to find
                                  out how these proposals could be integrated into this clip. It isn't
                                  more than a box of building blocks. Maybe you could pick out some
                                  ideas matching your own needs...

                                  Flo
                                   
                                • ebbtidalflats
                                  Message 16 of 30 , Jun 30, 2007
                                    Flo,

                                    --- In ntb-clips@yahoogroups.com, "Flo" <flo.gehrke@...> wrote:
                                    >
                                    > > Also, why sort the short words to the bottom?
                                    >
                                    > This has been described by Sheri before. Sheri also explained why
                                    > this isn't really necessary when running the clip on word lists and
                                    > lines.

                                    I asked, because that approach is counter to the original request.
                                    Not that there was a whole lot of input from the requester.

                                    However, he did furnish an example, that specifically searched for
                                    partial words. Hence my curiosity.


                                    > are many more applications I could think of. Why not compare two
                                    > word lists, e.g. by subtracting list A from list B in order to get
                                    > the difference?

                                    Ahh! Good idea.

                                    > For me, dealing with word lists is mainly related to
                                    > Text Retrieval and indexing of text databases, and NT has become an
                                    > indispensable tool in this field.

                                    Hm, mine is more in the area of glossaries, but NT is just as
                                    indispensable to me.


                                    Thanks for your comments.


                                    Eb