12892 Re: [Clip] Re: Extracting words from a file
- Dec 1, 2004
Franz
>>I'm trying to create a clip that extracts all capitalized words from a file and stores them in a new file.
Following Hugo's suggestion about changing the sort parameters, I tested
this on a 475 KB file. It's not instantaneous ;-( but in fairly acceptable
time it produces a list of every distinct word in the file that begins with
an upper-case letter.
H=Just UpperCase words
; Alec Burgess 2004-12-01 (Wed)
; change spaces and tabs to new-lines
^!replace " " >> "^P" wsa
^!replace "^t" >> "^P" wsa
;Change every non-alphanumeric leading char string to null
; -- this one takes the longest to execute: less than 30 sec
; -- on my P-III 750 MHz, 256 MB RAM laptop
; putting the + on the find clause makes it catch ";;;Asdf" in addition to
; -- just ";Asdf", but the time taken doubled to about a minute.
^!replace "^[^A-Za-z0-9]+" >> "" rwsa
; sort ignore case, ascending, remove duplicates
; remove all lines that do *NOT* begin with an UPPER-CASE letter
; -- using do *NOT* ignore case might make it run either faster or slower
; -- by making it find more smaller groups but has no effect on final result
^!replace "(^[^A-Z].*\n)+" >> "" rwsa
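For anyone who wants to check the clip's output outside NoteTab, here is a rough Python sketch of the same steps. It is not NoteTab code; the function name `capitalized_words` is made up for illustration, and I am assuming duplicates are removed case-sensitively, which may not match what the editor's sort option does.

```python
import re

def capitalized_words(text):
    """Return the unique words in text that start with an upper-case ASCII letter."""
    # Clip steps 1-2: spaces and tabs become line breaks, i.e. one token per line.
    tokens = text.split()
    # Clip step 3: drop any run of leading non-alphanumerics (";;;Asdf" -> "Asdf").
    stripped = (re.sub(r"^[^A-Za-z0-9]+", "", tok) for tok in tokens)
    # Clip step 4: remove duplicates (case-sensitively, an assumption) and
    # sort ascending, ignoring case.
    unique = sorted(set(stripped), key=lambda w: (w.lower(), w))
    # Clip step 5: keep only the entries that begin with A-Z.
    return [w for w in unique if re.match(r"[A-Z]", w)]

sample = "the Quick ;;;Brown fox\tJumps over Hewlett-Packard, the Quick dog"
words = capitalized_words(sample)
```

Like the clip, this only strips leading punctuation, so a trailing comma or period stays attached to the word ("Hewlett-Packard," in the sample).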
Regards ... Alec
---- Original Message ----
From: "Hugo Paulissen" <hugopaulissen@...>
Sent: Wednesday, December 01, 2004 14:42
Subject: [gla: [Clip] Re: Extracting words from a file
> Are you using Pro or Light? That makes quite a difference
> in speed.
> What about this approach? You can easily see for yourself
> if this is of any help.
> 1. replace " " with "^P" - don't know how fast that would be
> 2. trim/left align the text (which should have most words
> on a separate line by now)
> 3. sort the document with [Case Sensitive Sorting] and
> [Remove Duplicates] switched on (in options)
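A small Python sketch of why the case-sensitive sort in step 3 helps, assuming the editor orders characters by code point the way Python does (I have not confirmed that for NoteTab). The name `sort_case_sensitive` is hypothetical.

```python
def sort_case_sensitive(words):
    # sorted() on str compares code points, so "A"-"Z" (65-90) come before
    # "a"-"z" (97-122); set() removes exact duplicates first.
    return sorted(set(words))

result = sort_case_sensitive(["pear", "Apple", "apple", "Zebra", "pear", "Banana"])
# result == ["Apple", "Banana", "Zebra", "apple", "pear"]
```

With that ordering all the capitalized words end up in one block at the top of the list, so the lower-case remainder can be cut off in a single pass.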
>> Maybe a mixture of both models would be the best
>> solution. That is, first to reduce the file by
>> eliminating certain strings, and then extracting the
>> words I need. (The use of all this is to produce an
>> index or thesaurus of keywords in a text database.)
>> I used the ^$IsAlphaNumeric$ operator you mentioned but
>> this wouldn't select compounds with hyphen like
>> "Hewlett-Packard" since the uppercase letter at the
>> beginning is followed by another uppercase letter. So
>> I'm working with ^$IsUppercase(^$StrIndex("Str";1)$)$.
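The first-character test described above, sketched in Python rather than clip code. `starts_uppercase` is a made-up name, and note that Python's `str.isupper()` also accepts non-ASCII capitals, which an `[A-Z]` pattern would not.

```python
def starts_uppercase(word):
    # Look only at the first character, so hyphenated compounds such as
    # "Hewlett-Packard" pass regardless of what follows the hyphen.
    # word[:1] is "" for an empty string, and "".isupper() is False.
    return word[:1].isupper()

checks = {w: starts_uppercase(w) for w in ["Hewlett-Packard", "hewlett", "3Com", ""]}
```

Testing only the first character sidesteps the problem with whole-word checks, since nothing after position 1 is ever examined.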