12892 Re: [Clip] Re: Extracting words from a file

  • Alec Burgess
    Dec 1, 2004
      Franz
      >>I'm trying to create a clip that extracts all capitalized words from
      a file and stores them in a new file.
      <<

      Following Hugo's suggestion about changing the sort parameters, I tested
      this on a 475 KB file. It's not instantaneous ;-( , but in fairly
      acceptable time it produces a list of all the distinct capitalized words
      in the file.

      H=Just UpperCase words
      ; Alec Burgess 2004-12-01 (Wed)
      ;^!setdebug ON

      ; change spaces and tabs to new-lines
      ^!replace " " >> "^P" wsa
      ^!replace "^t" >> "^P" wsa

      ; Strip any leading run of non-alphanumeric characters from each line
      ; -- this one takes the longest to execute - less than 30 sec
      ; -- on my P-III 750 MHz, 256 MB RAM laptop
      ; Putting the + on the find clause makes it catch ";;;Asdf" in addition to
      ; -- just ";Asdf" - time taken was doubled to about a minute.

      ^!replace "^[^A-Za-z0-9]+" >> "" rwsa

      ^!select ALL

      ; sort ignore case, ascending, remove duplicates
      ^$StrSort("^$GetSelection$";False;True;True)$

      ; remove all lines that do *NOT* begin with an UPPER-CASE letter
      ; -- using do *NOT* ignore case might make it run either faster or slower
      ; -- by making it find more smaller groups but has no effect on final result
      ^!replace "(^[^A-Z].*\n)+" >> "" rwsa
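
For readers who want to check the logic outside NoteTab, here is a minimal Python sketch of the same pipeline. It is not NoteTab code and not part of the clip. The function name `capitalized_words` is hypothetical, and duplicates are removed by exact match, which may differ from how `^$StrSort$` treats words that differ only in case.

```python
import re

def capitalized_words(text):
    """Return the distinct words in text that begin with an upper-case letter."""
    # Steps 1-2 of the clip: break the text on spaces, tabs and newlines
    # so that every token stands alone.
    tokens = text.split()

    # Step 3: strip any leading run of non-alphanumeric characters,
    # so ";;;Asdf" becomes "Asdf". Trailing punctuation is left alone,
    # just as in the clip.
    stripped = [re.sub(r"^[^A-Za-z0-9]+", "", token) for token in tokens]

    # Step 5 (done early here): keep only tokens starting with A-Z.
    kept = [word for word in stripped if re.match(r"[A-Z]", word)]

    # Step 4: remove duplicates (exact match, an assumption) and sort
    # ascending while ignoring case.
    return sorted(set(kept), key=lambda word: (word.lower(), word))


if __name__ == "__main__":
    sample = "The quick ;;;Brown fox\tjumps over (Hewlett-Packard and the Lazy dog"
    print(capitalized_words(sample))
```

Because only leading characters are stripped, a hyphenated compound such as "Hewlett-Packard" comes through as a single word.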

      Regards ... Alec
      --


      ---- Original Message ----
      From: "Hugo Paulissen" <hugopaulissen@...>
      To: <ntb-clips@yahoogroups.com>
      Sent: Wednesday, December 01, 2004 14:42
      Subject: [gla: [Clip] Re: Extracting words from a file

      > franz,
      >
      > Are you using Pro or Light? That makes quite a difference
      > in speed.
      >
      > What about this approach? You can easily see for yourself
      > if this is of any help.
      >
      > 1. replace " " with "^P" - don't know how fast that would
      > be
      > 2. trim/left align the text (which should have most words
      > on a separate line by now)
      > 3. sort the document with [Case Sensitive Sorting] and
      > [Remove Duplicates] switched on (in options)
      >
      > Hugo
      >
      >
      >> Maybe a mixture of both models would be the best
      >> solution. That is, first to reduce the file by
      >> eliminating certain strings, and then extracting the
      >> words I need. (The use of all this is to produce an
      >> index or thesaurus of keywords in a text database.)
      >>
      >> I used the ^$IsAlphaNumeric$ operator you mentioned but
      >> this wouldn't select compounds with hyphen like
      >> "Hewlett-Packard" since the uppercase letter at the
      >> beginning is followed by another uppercase letter. So
      >> I'm working with ^$IsUppercase(^$StrIndex("Str";1)$.