16675 Re: Creation of clip

  • Flo
    Jun 27 7:04 AM
      The latest version of this clip splits the keyword list into chunks
      of 500 lines to stay within the size limit of a regex alternation.
      In my tests, the error message mentioned above appeared once the
      list reached 818 keywords. The clip now works with any number of
      keywords. It is designed to delete certain keywords (i.e. stopwords)
      from a word list, or to delete complete lines of a list that contain
      these keywords. In full text it deletes whole paragraphs that
      contain a keyword (or a substring).
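The chunking idea can be sketched outside NoteTab as well. The following Python snippet is only an illustration, not the clip itself: the function name `build_chunk_patterns` and the sample keywords are made up, and Python's `re` module is not the regex engine NoteTab uses.

```python
import re

def build_chunk_patterns(keywords, chunk_size=500):
    """Return one alternation string per chunk of at most chunk_size keywords."""
    # Reverse sort, as in the clip, so that a longer word comes before
    # any shorter word that is its prefix ("bankrupt" before "bank").
    # Empty entries are dropped to avoid an empty alternative.
    ordered = sorted((k for k in keywords if k), reverse=True)
    patterns = []
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        patterns.append("|".join(re.escape(k) for k in chunk))
    return patterns

# Hypothetical keyword list with 1200 entries
keywords = [f"word{i}" for i in range(1200)]
patterns = build_chunk_patterns(keywords)
print(len(patterns))  # 3 chunks: 500 + 500 + 200 keywords
```

Each returned string would then be used in its own search-and-replace pass, which is what the second loop of the clip does with %Chunk1%, %Chunk2%, and so on.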

      Metacharacters in the keyword list are now escaped as well (e.g.,
      ? is replaced with \?).
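As a rough illustration of the escaping step, here is a Python sketch. The character set is copied from the clip's ^!Replace line further down; the name `escape_keyword` is hypothetical.

```python
import re

# The characters the clip escapes: \ ^ ! $ ? . * < > + ( ) [ ] { } = | :
CLIP_METACHARS = r"\^!$?.*<>+()[]{}=|:"

def escape_keyword(keyword):
    """Put a backslash in front of every listed metacharacter."""
    char_class = "[" + re.escape(CLIP_METACHARS) + "]"
    return re.sub(char_class, lambda m: "\\" + m.group(0), keyword)

print(escape_keyword("What?"))      # What\?
print(escape_keyword("C++ (new)"))  # C\+\+ \(new\)
```

Without this step, a keyword such as "What?" would be read as a pattern ("Wha" followed by an optional "t") rather than as literal text.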

      H=Delete Keywords
      ^!SetScreenUpdate Off
      ^!SetHintInfo Working...
      ; Save clipboard, and restore it later on (recommended by Sheri)
      ^!ClipBoardSave
      ; Store the index of active document
      ^!Set %Doc%=^$GetDocIndex$
      ; Choose keyword (stopword) file, case, and whole words
      ^!Set %Keywords%=^?[(T=O;F="Textfiles (*.txt)|*.txt")Choose Keyword File:]
      ^!Set %Case%=^?[Case-sensitive search:==Yes^=(?-i)|_No^=(?i)]
      ^!Set %WholeWords%=^?[Search whole words only:==Yes^=1|_No^=0]
      ^!Open ^%Keywords%
      ; Reverse sort of keywords (to put longer words before shorter words)
      ^!Select All
      ^$StrSort("^$GetSelection$";0;0;1)$
      ; Escape metacharacters (next one long line)
      ^!Replace "\\|\^|\!|\$|\?|\.|\*|\<|\>|\+|\(|\)|\[|\]|\{|\}|\=|\||\:" >> "\\$0" AWRST
      ; Divide document into chunks of 500 lines to meet the
      ; restrictions of alternation
      ^!Set %ChunkIndex%=1
      ^!Jump 1

      :Loop_1
      ^!Select 500
      ^!Toolbar Copy
      ; Make alternation by replacing NL with vertical bar
      ^!SetClipboard ^$StrReplace(^%NL%;|;^$GetClipboard$;0;0)$
      ; Remove vertical bar at end of string to avoid an empty
      ; alternative; note: (A|B|) matches A, B, or the empty string.
      ; You may do the same at start of string, or watch for empty
      ; lines at the start of the keyword list
      ^!IfSame "^$StrCopyRight(^$GetClipboard$;1)$" "|" Next Else Skip
      ^!SetClipboard ^$StrDeleteRight(^$GetClipboard$;1)$
      ; Save chunks in variables %Chunk1%, %Chunk2%, etc.
      ^!Set %Chunk^%ChunkIndex%%=^$GetClipboard$
      ^!Jump +1
      ^!If ^$GetRow$=^$GetLineCount$ Replace
      ^!Inc %ChunkIndex%
      ^!Goto Loop_1

      :Replace
      ; Return to active document
      ^!SetDocIndex ^%Doc%
      ; Close keyword file and copy active document to new document
      ^!Close ^%Keywords% Discard
      ^!Menu Edit/Copy All
      ^!Menu Edit/Paste New
      ^!Set %RepIndex%=1

      :Loop_2
      ^!If ^%RepIndex% > ^%ChunkIndex% Finish
      ; Grab %Chunk1%, %Chunk2%, etc. for search
      ^!Set %Search%=^%Chunk^%RepIndex%%
      ; If "whole words", use word delimiters in RegEx; lookarounds
      ; prevent hyphenated words from being deleted
      ^!IfTrue ^%WholeWords% Next Else Skip_2
      ^!Set %Expr%="^%Case%^.*\b(?<![-])(^%Search%)(?![-])\b.*(\r\n|\z)"
      ^!Goto Skip
      ^!Set %Expr%="^%Case%^.*(^%Search%).*(\r\n|\z)"
      ; Check syntax of RegEx
      ^!IfRegExOK "^%Expr%" Next Else Message
      ; Delete matching words and lines
      ^!Replace "^%Expr%" >> "" AWRS
      ^!Inc %RepIndex%
      ^!Goto Loop_2

      :Finish
      ^!Info Finished!
      ^!ClipBoardRestore
      ^!Goto End

      :Message
      ^!Prompt ^$GetRegexErrorMsg$
      ; end of clip


      In whole-word mode, the clip prevents terms hyphenated with - (ANSI 45)
      from being deleted by one of their parts, e.g. "self" does not delete
      "self-catering" (unless you choose to match substrings).
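The effect of the lookarounds can be checked with a small Python sketch. It mirrors the two expressions built in the clip, under some assumptions: Python's `re` stands in for NoteTab's engine, lines end in \n rather than \r\n, and \Z is used where the clip has \z. The name `delete_lines` is made up.

```python
import re

def delete_lines(text, alternation, whole_words=True, ignore_case=True):
    """Delete every line of text that contains one of the keywords."""
    if whole_words:
        # The lookarounds reject a keyword that touches a hyphen on either side.
        core = r"\b(?<!-)(?:" + alternation + r")(?!-)\b"
    else:
        core = "(?:" + alternation + ")"
    flags = re.MULTILINE | (re.IGNORECASE if ignore_case else 0)
    # \Z (end of text) stands in for the \z used in the clip.
    pattern = re.compile("^.*" + core + r".*(?:\n|\Z)", flags)
    return pattern.sub("", text)

text = "self\nself-catering\nmyself\nother\n"
print(repr(delete_lines(text, "self")))                     # 'self-catering\nmyself\nother\n'
print(repr(delete_lines(text, "self", whole_words=False)))  # 'other\n'
```

In whole-word mode only the line consisting of "self" is removed; in substring mode every line containing "self" goes, including "self-catering" and "myself".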

      Regarding apostrophes and compound nouns containing spaces, I have
      been on the wrong track. The issue is much more complicated, and I
      don't think it can be solved by a general RegEx that covers all
      eventualities. The apostrophe, for example, is used in a company name
      like "McDonald's". This name is deleted by the substring "Mc", and
      also by "McDonald" defined as a whole word, since the apostrophe is
      interpreted as a word delimiter. On the other hand, the apostrophe
      can mark the genitive of a lemma that possibly should be deleted,
      e.g. "Dickens' works".
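A quick way to see the word-delimiter behaviour is the following Python check (an illustration only; it assumes \b behaves in Python's `re` as it does in the engine used by NoteTab for these plain ASCII cases):

```python
import re

# \b sits between a word character and a non-word character,
# so the apostrophe in "McDonald's" ends the "word" McDonald.
whole_word = re.compile(r"\bMcDonald\b")

hits_company = bool(whole_word.search("McDonald's"))            # True
hits_plural = bool(whole_word.search("McDonalds"))              # False
hits_genitive = bool(re.search(r"\bDickens\b", "Dickens' works"))  # True

print(hits_company, hits_plural, hits_genitive)  # True False True
```

The same boundary rule that wrongly hits the company name is the one that correctly hits the genitive, which is why a single general expression cannot tell the two cases apart.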

      Another idea is to process the source file with the following clip
      before running the "Delete Keywords" clip (of course, it may also be
      integrated into "Delete Keywords").

      Look at the following company names...

      McDonald's
      General Electric
      Bank of America

      To protect these names from being deleted by "McDonald", "electric",
      or "bank", the Protect Keywords clip replaces the apostrophe and the
      space with _apo_ and _spc_ (other characters that act as word
      delimiters may be added). Thus each name is interpreted as a single
      whole word. After running "Delete Keywords" we can reverse this
      replacement.
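The round trip can be sketched in Python as follows. This is a simplified illustration of the idea, not a translation of the clip: the function names `protect` and `unprotect` and the sample sentence are made up, and plain string replacement is used instead of a regex replace.

```python
def protect(text, protected_terms):
    """Mask apostrophes and spaces inside each protected term."""
    for term in protected_terms:
        masked = term.replace("'", "_apo_").replace(" ", "_spc_")
        text = text.replace(term, masked)
    return text

def unprotect(text):
    """Reverse the placeholder replacement."""
    return text.replace("_spc_", " ").replace("_apo_", "'")

terms = ["McDonald's", "General Electric", "Bank of America"]
source = "Bank of America sued the bank near McDonald's."
masked = protect(source, terms)
print(masked)  # Bank_spc_of_spc_America sued the bank near McDonald_apo_s.
print(unprotect(masked) == source)  # True
```

Because the underscore counts as a word character, "Bank_spc_of_spc_America" is a single word to the regex engine, so a whole-word keyword "bank" no longer matches inside it, while the standalone "bank" is still found.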

      First of all, you have to create a PROTECT.TXT file that contains a
      list of terms like those three company names mentioned above.

      Please note that "Protect Keywords" is meant to be run on the source
      file, not on the keyword (or stopword) list!


      H=Protect Keywords
      ^!SetScreenUpdate Off
      ^!SetHintInfo Working...
      ^!Goto=^?[Choose action:==Protect Words^=Protect|Remove Protection^=Remove]

      :Protect
      ^!Set %Doc%=^$GetDocIndex$
      ; Choose the list of words to be protected, e.g. PROTECT.TXT
      ^!Set %ProFile%=^?{(T=O;F="Textfiles (*.txt)|*.txt")Choose Protected List:}
      ^!Open ^%ProFile%
      ^!Jump Doc_End
      ^!IfFalse ^$IsEmpty(^$GetLine$)$ Next Else Skip
      ^!InsertText ^%NL%
      ^!Set %LineIndex%=^$GetTextLineCount$

      :Loop_1
      ^!Jump ^%LineIndex%
      ^!SetClipboard ^$StrReplace("'";"_apo_";"^$GetLine$";0;0)$
      ^!SetClipboard ^$StrReplace("^%Space%";"_spc_";"^$GetClipboard$";0;0)$
      ^!Jump Line_End
      ^!InsertText "^P^$GetClipboard$"
      ^!If ^%LineIndex%=1 Replace
      ^!Dec %LineIndex%
      ^!Goto Loop_1

      :Replace
      ^!Select All
      ^!SetListDelimiter ^p
      ^!SetArray %Except%=^$GetSelection$
      ^!SetDocIndex ^%Doc%
      ^!Close ^%ProFile% Discard
      ^!Jump 1
      ^!Set %Count%=1

      :Loop_2
      ^!If ^%Count%=^%Except0% End
      ^!Set %Search%="^%Except^%Count%%"
      ^!Inc %Count%
      ^!Set %Repl%="^%Except^%Count%%"
      ^!Replace "^%Search%" >> "^%Repl%" AWRS
      ^!Inc %Count%
      ^!Goto Loop_2

      :Remove
      ^!Replace "_spc_" >> "^%Space%" AWST
      ^!Replace "_apo_" >> "'" AWST

      :End
      ^!Info Finished!


      Regards,
      Flo
       