
Extracting words from a file

  • franz_sternbald
    Message 1 of 23 , Nov 30, 2004
      Hi,

      I'm trying to create a clip that extracts all capitalized words from
      a file and stores them in a new file.

      As a basis for that, I took the result of TOOLS | TEXT STATISTICS.
      Unfortunately, Text Statistics produces a sorted output that
      ignores case and removes duplicates, regardless of the options
      chosen in VIEW | OPTIONS | TOOLS. Consequently, it drops the
      capitalized version of any word that also occurs in lowercase. For
      example: if the words "Report" and "report" are both found in a file,
      Text Statistics outputs "report" only. Thus many capitalized words
      get lost.

      So far, I haven't managed to replace Text Statistics with a clip
      that provides a complete list of all capitalized words
      in a normal text file (no list). First, I tried it this way...

      :Loop
      ^!Find "[A-Z][A-Za-z\-]+" CRS
      ^!IfError Output
      ^!Set %Word%=^$GetSelection$
      ^!Append %Copy%=^%Word%^%NL%
      ^!Keyboard Right
      ^!GoTo Loop

      :Output
      ^!Toolbar New Document
      ^!InsertCode ^%Copy%

      In principle, this does the job. But processing a 500 KB file
      takes "hours" and ends with an "Out of memory" message.

      Do you know any better solution?

      Thanks,
      Franz
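
For comparison outside NoteTab, the same extraction can be sketched in Python. This is only an illustration, assuming the whole text fits in memory; the function name `capitalized_words` is made up here, and the pattern is the one used in the clip above. It collects every match in a single pass instead of re-running a search and appending to a growing variable.

```python
import re

# Same pattern as in the clip: an uppercase letter followed by one or more
# letters or hyphens (so a lone "A" or "I" is not matched).
CAP_WORD = re.compile(r"[A-Z][A-Za-z\-]+")

def capitalized_words(text):
    """Return the capitalized words in text, in order of appearance."""
    return CAP_WORD.findall(text)

sample = "The Report was filed; the report on Hewlett-Packard is late."
print(capitalized_words(sample))
```

Because the pattern has no word-boundary anchors, it would also pick up a capitalized tail inside a mixed-case token; that matches the clip's pattern rather than improving on it.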
    • Josh - CK
      Message 2 of 23 , Nov 30, 2004
        I'd say don't look for Capped words, just delete everything else and
        copy the result.
        (Save, delete stuff, copy result, reload)


        --
        Josh Simmons - CK
      • Hugo Paulissen
        Message 3 of 23 , Nov 30, 2004
          > I'm trying to create a clip that extracts all capitalized words
          > from a file and stores them in a new file.

          Hi Franz,

          For these tasks I use NoteTab's pasteboard feature. Create a
          new empty document (change the pasteboard divider if needed), set it
          to receive the contents of the clipboard (Shift+Ctrl+P) and run a
          simple clip with a loop. Anyhow, that's how I would start. Are you
          aware of the function ^$IsCapitalized$?

          ^!Jump DOC_START
          :GETCAP
          ^!Find "[A-Z]+" R
          ^!IfError END
          ^!IfTrue ^$IsCapitalized("^$GetWord$")$ ^!SetClipboard ^$GetWord$
          ^!GoTo GETCAP

          After you run this you'll get a message that the search string cannot
          be found anymore. Click NO (do not start again) and see what's in
          your pasteboard file...


          Good luck,

          Hugo
        • Jody
          Message 4 of 23 , Nov 30, 2004
            Hi Franz

            >So far, I didn't manage to substitute the Text Statistics with a clip
            >that provides a complete list of all capitalized words
            >in a normal text file (no list). First, I tried it this way...
            >
            >:Loop
            >^!Find "[A-Z][A-Za-z\-]+" CRS
            >^!IfError Output
            >^!Set %Word%=^$GetSelection$
            >^!Append %Copy%=^%Word%^%NL%
            >^!Keyboard Right
            >^!GoTo Loop
            >
            >:Output
            >^!Toolbar New Document
            >^!InsertCode ^%Copy%
            >
            >In principle, this is doing the job. But processing a file of 500 KB
            >is lasting "hours" and ends up in an "Out of memory" message.
            >
            >Do you know any better solution?

            That is a problem with the current regular expression engine. We
            hope to have a new regular expression engine in the next
            maintenance release. I know that doesn't help now, but I thought
            I would let you know where the problem lies.

            Happy Clip'n!
            Jody

          • Hugo Paulissen
            Message 5 of 23 , Dec 1, 2004
              Franz,

              As Jody mentioned, the regular expression engine can be temperamental.
              That's why I gave you a very simple regex. If it doesn't work,
              you could try to build another way to jump to the next word and have
              that checked for capitalization. Or else, just look for " A", " B",
              etc. and then do ^!SetClipboard ^$GetWord$.

              Hugo

              >
              > ^!Jump DOC_START
              > :GETCAP
              > ^!Find "[A-Z]+" R
              > ^!IfError END
              > ^!IfTrue ^$IsCapitalized("^$GetWord$")$ ^!SetClipboard ^$GetWord$
              > ^!GoTo GETCAP
              >
            • franz_sternbald
              Message 6 of 23 , Dec 1, 2004
                Hi,

                Thanks for the solutions you presented here...

                @Josh

                Actually, there are two different ways to do this job: 1. extract
                the words you want, or 2. delete the words you don't want.
                The problem with #2 is this: since I'm evaluating text
                databases of 500 KB, 1 MB or more, I would have to delete an enormous
                number of characters and strings that don't match the search
                criteria. This would demand dozens of command lines and RegExes for
                reducing the file. So I tried it the other way round, i.e. by
                extracting the matching words only.

                @Hugo

                Using the pasteboard function is a clever solution! With files > 500
                KB, however, it takes an intolerably long time. So far, no error
                message has shown up, but I stopped the procedure after half an hour.

                Maybe a mixture of both models would be the best solution. That is,
                first reduce the file by eliminating certain strings, and then
                extract the words I need. (The purpose of all this is to produce an
                index or thesaurus of keywords in a text database.)

                I used the ^$IsCapitalized$ function you mentioned, but it wouldn't
                select hyphenated compounds like "Hewlett-Packard", since the
                uppercase letter at the beginning is followed by another uppercase
                letter. So I'm working with ^$IsUppercase(^$StrIndex("Str";1)$)$.
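
The difference between the two checks can be sketched in Python, outside NoteTab and purely as an illustration. Both helper names are made up, and `strictly_capitalized` only approximates the behaviour described above, where a second uppercase letter later in the word makes the test fail; it is not a statement about how NoteTab implements its function.

```python
def strictly_capitalized(word):
    """First character uppercase, every later letter lowercase."""
    letters = [c for c in word if c.isalpha()]
    return (
        bool(letters)
        and word[0].isupper()
        and all(c.islower() for c in letters[1:])
    )

def starts_uppercase(word):
    """Only the first character has to be an uppercase letter."""
    return word[:1].isupper()

for w in ["Report", "report", "Hewlett-Packard"]:
    print(w, strictly_capitalized(w), starts_uppercase(w))
```

Testing only the first character keeps hyphenated compounds, at the price of also accepting all-caps tokens.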

                Any more ideas would be highly appreciated...

                Regards,
                Franz

                PS Hi Jody! Thanks for your comment - still you see me
                working on that issue. Flo ;-)
              • Hugo Paulissen
                Message 7 of 23 , Dec 1, 2004
                  franz,

                  Are you using Pro or Light? That makes quite a difference in speed.

                  What about this approach? You can easily see for yourself if this is
                  of any help.

                  1. replace " " with "^P" - don't know how fast that would be
                  2. trim/left align the text (which should have most words on a
                  separate line by now)
                  3. sort the document with [Case Sensitive Sorting] and [Remove
                  Duplicates] switched on (in options)

                  Hugo


                • Jody
                  Message 8 of 23 , Dec 1, 2004
                    Hi Franz,

                    >PS Hi Jody! Thanks for your comment - still you see me working on
                    >that issue. Flo ;-)

                    I know you. :) Hugo has it under control for you. He is more than
                    competent in NoteTab. My guess is that his next step will be to
                    make you a Clip without RegExp once you let him know the manual
                    method works. You could probably do that yourself with a series
                    of ^!Replace "" >> "" OPTIONS commands, or checking every word
                    using the functions you have been using in a loop with some other
                    code. There's also ^$StrSort(...)$ that might run faster to find
                    the CAPS.

                    bcnu,
                    jody

                    I can only please one person a day.
                    Today is obviously not your day.
                    Tomorrow doesn't look good either. 8D
                    http://www.clean-funnies.com
                    http://www.fookes.com/regnow.html?2448 ;)
                    http://www.sojourner.us/software
                  • Alec Burgess
                    Message 9 of 23 , Dec 1, 2004
                      Franz
                      >>I'm trying to create a clip that extracts all capitalized words from
                      a file and stores them in a new file.
                      <<

                      Following Hugo's suggestion about changing the sort parameters, I tested
                      this on a 475 KB file. It's not instantaneous ;-( , but in a fairly
                      acceptable time it produces a list of all the individual upper-case
                      words in the file.

                      H=Just UpperCase words
                      ; Alec Burgess 2004-12-01 (Wed)
                      ;^!setdebug ON

                      ; change spaces and tabs to new-lines
                      ^!replace " " >> "^P" wsa
                      ^!replace "^t" >> "^P" wsa

                      ;Change every non-alphanumeric leading char string to null
                      ; -- this one takes the longest to execute - less than 30 sec
                      ; -- on my P-III 750 Mhz 256 MB ram laptop
                      ;putting the + on the find clause makes it catch ";;;Asdf" in addition to
                      ; -- just ";Asdf" - time taken was doubled to about a minute.

                      ^!replace "^[^A-Za-z0-9]+" >> "" rwsa

                      ^!select ALL

                      ; sort ignore case, ascending, remove duplicates
                      ^$StrSort("^$GetSelection$";False;True;True)$

                      ; remove all lines that do *NOT* begin with an UPPER-CASE letter
                      ; -- using do *NOT* ignore case might make it run either faster or slower
                      ; -- by making it find more smaller groups but has no effect on final result
                      ^!replace "(^[^A-Z].*\n)+" >> "" rwsa
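
The same sequence of steps can be sketched in Python for comparison. This is a rough illustration under stated assumptions, not a translation of the clip: `upper_case_words` is a made-up name, the text is assumed to fit in memory, and the sort here is case-sensitive so that "Report" and "report" are kept as separate entries before the filter.

```python
import re

def upper_case_words(text):
    """Mirror the clip's steps on an in-memory string."""
    # Split on spaces, tabs and line breaks: one word per entry.
    words = text.split()
    # Strip any run of leading non-alphanumeric characters.
    words = [re.sub(r"^[^A-Za-z0-9]+", "", w) for w in words]
    # Sort and remove duplicates (case-sensitive comparison), skipping
    # entries that became empty after stripping.
    unique = sorted(set(w for w in words if w))
    # Keep only entries that begin with an uppercase ASCII letter.
    return [w for w in unique if re.match(r"[A-Z]", w)]

print(upper_case_words("the Report and a report (Report"))
```

As in the clip, only leading punctuation is removed, so a word followed directly by a comma or bracket keeps that trailing character.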

                      Regards ... Alec
                      --


                    • Don - htmlfixit.com
                      Message 10 of 23 , Dec 1, 2004
                        Interesting way of going at it. Thought you might have a winner ... but
                        I tried it on a 181,000 word file and I got ... out of memory error.
                        > Following Hugo's suggestion about changing the sort parameters, I tested
                        > this on a 475 KB file. Its not instantaneous;-( , but the result in fairly
                        > acceptable time is a list of all individual upper case words in a file.
                        >
                      • Alec Burgess
                        Message 11 of 23 , Dec 1, 2004
                          Don: > I got ... out of memory error.

                          Checking my file with Text Statistics, it's:
                          chars=510116
                          Words= 76771

                          One time while debugging I got an out-of-memory error but closing and then
                          restarting Notetab and closing a couple of large programs that happened to
                          be running made it work.

                          The real pig is the line:
                          ^!replace "^[^A-Za-z0-9]+" >> "" rwsa

                          perhaps removing the + sign and wrapping it in a loop so it only removes 1
                          non A/N char at a time ... or ... determining the invalid chars and writing
                          one NON-regex replace line for each would speed it up
                          eg.
                          ^!replace "[" >> "" wsa
                          ^!replace "(" >> "" wsa
                          etc ...

                          or even splitting the file in three or more chunks, processing each and then
                          combining the results :-)
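
The chunking idea can be sketched in Python as follows. It is an illustration only: `caps_in_chunks` and its `chunk_size` parameter are invented for this example, and the pattern is the one used earlier in the thread. Chunks are cut on line boundaries so that no word is split between two chunks.

```python
import re

CAP_WORD = re.compile(r"[A-Z][A-Za-z\-]+")

def caps_in_chunks(lines, chunk_size=1000):
    """Collect capitalized words chunk by chunk and merge the results."""
    found = set()
    for start in range(0, len(lines), chunk_size):
        # Process only chunk_size lines at a time.
        chunk = "\n".join(lines[start:start + chunk_size])
        found.update(CAP_WORD.findall(chunk))
    # Combining the partial results is just a set union, sorted at the end.
    return sorted(found)

lines = ["Alpha beta", "Gamma alpha", "Alpha"]
print(caps_in_chunks(lines, chunk_size=1))
```

The merged result does not depend on the chunk size, which is what makes splitting the file a safe workaround for memory limits.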

                          Regards ... Alec
                          --


                        • hsavage
                          Message 12 of 23 , Dec 2, 2004
                            Franz,

                            I don't know if you've decided on anything yet but, here are 2 clips,
                            very similar, one uses appending words to a variable, the other uses
                            ^!AppendToFile.

                            The clips are set to give an audible signal on start and completion.
                            They also record the start and finish times in minutes and
                            seconds so you can compare the relative speed of the clips.

                            I prefer the ^!AppendToFile method, it seems to have an overall small
                            time benefit.

                            I'll include the clips below, in both forms, also, a record of the
                            filesize and number of Cap words found in XX time. The time counters
                            usually sort toward the top of list and they are normally adjacent.

                            If you try these, and want to keep one, the extras, sound, start/finish
                            time etc. can be removed.


                            H="Count Caps"
                            ^!Jump 1
                            ^!SetDebug 0
                            ^!SetWordWrap 0
                            ^!SetScreenUpdate 0
                            ^!Sound ^$GetLibraryPath$cawcaw.wav
                            ^!TextToFile "^$GetSpecialPath(Desktop)$CAPwords" ^$GetDate(< nn < ss)$^%nl%
                            :GETCAP
                            ^!Find [A-Z][A-Za-z\-]+ CRS
                            ^!IfError END
                            ^!AppendToFile "^$GetSpecialPath(Desktop)$CAPwords" ^$GetWord$^%nl%
                            ^!Goto GETCAP
                            :END
                            ^!AppendToFile "^$GetSpecialPath(Desktop)$CAPwords" ^$GetDate(> nn > ss)$
                            ^!Sound ^$GetLibraryPath$cawcaw.wav
                            ^!Open "^$GetSpecialPath(Desktop)$CAPwords"
                            ^!Select ALL
                            ^!Keyboard Shift+Ctrl+X
                            ^!Menu View/Line Numbers


                            H="Count Caps1"
                            ^!Jump 1
                            ^!SetDebug 0
                            ^!SetWordWrap 0
                            ^!SetScreenUpdate 0
                            ^!Sound ^$GetLibraryPath$cawcaw.wav
                            ^!Set %words%=^$GetDate(< nn < ss)$^%nl%
                            :GETCAP
                            ^!Find [A-Z][A-Za-z\-]+ CRS
                            ^!IfError END
                            ^!Set %word%=^$GetWord$
                            ^!Append %words%=^%word%^%nl%
                            ^!Goto GETCAP
                            :END
                            ^!Append %words%=^$GetDate(> nn > ss)$
                            ^!TextToFile "^$GetSpecialPath(Desktop)$CAPwords" ^%words%
                            ^!Sound ^$GetLibraryPath$cawcaw.wav
                            ^!Open "^$GetSpecialPath(Desktop)$CAPwords"
                            ^!Select ALL
                            ^!Keyboard Shift+Ctrl+X
                            ^!Menu View/Line Numbers


                            Cap words found 2,273
                            both methods

                            filesize 399,697

                            method used
                            appending word to variable
                            < 06 < 44
                            nn ss
                            > 20 > 27

                            same method
                            < 51 < 30
                            nn ss
                            > 56 > 27


                            method used
                            AppendToFile word
                            < 49 < 44
                            nn ss
                            > 53 > 14

                            same method
                            < 03 < 51
                            nn ss
                            > 11 > 42


                            ºvº
                            hrs <04-12-02> hsavage@...
                          • franz_sternbald
                            Message 13 of 23 , Dec 2, 2004
                              Hi all,

                              Thanks again for all your help! I tested all your proposals. My
                              conclusion is: I'm on the wrong track when trying to extract the
                              capitalized words from a file > 500 KB. Evidently, it's the ^!Find
                              command with a RegEx, in whatever combination, that ends in an "Out of
                              memory" message or forces me to terminate the procedure after 45
                              minutes or more. Even SEARCH | COUNT OCCURRENCES with
                              a RegEx like [A-Z][A-Za-z\-]+ ends up "out of memory" (at least on
                              my PC).

                              I think my only chance to execute that task with NoteTab (I'm using
                              the Pro version) is to reduce the file step by step until there's
                              (almost) nothing left but the words I'm searching for (remember Josh's
                              recommendation: "I'd say don't look for Capped words, just delete
                              everything else and copy the result.")

                              However, I'll try to reduce the file with a couple of command lines,
                              and then apply the clips presented by hsavage. Another work-around
                              could be what Alec said: "splitting the file in three or more
                              chunks..."

                              Regards,
                              Franz
                            • abairheart
                              Message 14 of 23 , Dec 3, 2004
                                --- In ntb-clips@yahoogroups.com, "franz_sternbald"
                                <franz_sternbald@y...> wrote:
                                > (The use of all this is to produce an
                                > index or thesaurus of keywords in a text database.)


                                Hi Franz,

                                I just happened across this thread. If I have understood your needs
                                correctly, why not just reduce the list to a single column of words
                                and sort them case-sensitively?

                                1. Replace all spaces in the document with "^P" to change the list to
                                individual words (ignore punctuation, if you like).

                                2. Sort the list CASE SENSITIVE

                                3. Delete the lower case words


                                500 K files should contain about 80,000 words or so. Shouldn't take
                                more than a few minutes to do this by hand. If you have a lot of
                                files you can always write down the keystrokes you use, then do the
                                sort by Menu commands (^!Menu Modify/...). I think there's a
                                configuration switch to change sorting behaviour (remove duplicates
                                or not; case sensitive or not).
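
Why step 3 is quick after a case-sensitive sort can be sketched in Python. The sketch assumes the sort uses plain character-code order, which may not be how NoteTab's sort behaves, and `uppercase_block` is a made-up name. In that order every word starting with A-Z lands above the first lowercase word, so the lowercase words form one contiguous block that can be cut off in a single step.

```python
import bisect

def uppercase_block(words):
    """Return the capitalized words from the top of a case-sensitive sort."""
    ordered = sorted(set(words))
    # In ASCII, 'A'-'Z' (65-90) come before 'a'-'z' (97-122), so the first
    # entry that is >= "a" marks where the lowercase block begins.
    cut = bisect.bisect_left(ordered, "a")
    # Entries above the cut may still start with digits or punctuation.
    return [w for w in ordered[:cut] if w[:1].isupper()]

print(uppercase_block(["the", "The", "report", "Report", "1st"]))
```

Words that begin with an accented capital fall outside the A-Z range in this ordering and would be lost by the cut, so text in languages that use such letters needs extra handling.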


                                Abair
                              • Don - htmlfixit.com
                                Message 15 of 23 , Dec 3, 2004
                                  Bingo Abair, with one exception that pertains to German, but not to
                                  English! It works and doesn't use regex. I tried it on the 500 lines
                                  sent by Franz and on my 181,000 word file I have been trying with all
                                  others (always an out of memory error until now). I used a clip to do
                                  it as shown below. There is one problem however ... the German
                                  characters with two dots over them (is that an umlaut?) are treated as
                                  coming after the equivalent lower case letter .... so how do we deal
                                  with that? Currently as written it deletes them as lower case. Maybe I
                                  have to go one line at a time to delete? Does a German version of
                                  NoteTab sort these correctly? Is it a bug in the sorting engine? Is it
                                  just good old ASCII ordering? Are only certain letters umlauted, or
                                  whatever the double dots are called, in German?
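
A quick way to see why the umlauted capitals land after the lowercase letters is to look at raw code point order. The Python sketch below is only an illustration with made-up sample words; whether NoteTab's sort really compares characters this way is an assumption, not something confirmed in this thread.

```python
# Sample words chosen for illustration only.
words = ["Zebra", "apple", "Ärger", "Apfel"]

# sorted() on plain strings compares code points, much like a byte-wise
# sort in Latin-1 or Windows-1252, where "Ä" is 0xC4.
by_code_point = sorted(words)

# Code points of the characters involved: A=0x41, Z=0x5a, a=0x61, z=0x7a, Ä=0xc4
code_points = {ch: hex(ord(ch)) for ch in "AZazÄ"}

if __name__ == "__main__":
    print(by_code_point)
    print(code_points)
```

With this ordering "Ärger" sorts after every word that starts with "a" to "z", so a cut that deletes everything from the first lowercase word onward removes it too.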

                                  ; by don at htmlfixit.com
                                  ^!Menu Edit/Copy All
                                  ^!Toolbar Paste New
                                  ^!Replace "^P" >> " " ATIWS
                                  ^!Replace ")" >> " " ATIWS
                                  ^!Replace "(" >> " " ATIWS
                                  ^!Replace """ >> " " ATIWS
                                  ^!Replace "^T" >> " " ATIWS
                                  ^!Replace "," >> " " ATIWS
                                  ^!Replace "[" >> " " ATIWS
                                  ^!Replace "]" >> " " ATIWS
                                  ^!Replace "<" >> " " ATIWS
                                  ^!Replace ">" >> " " ATIWS
                                  ^!Replace "~" >> " " ATIWS
                                  ^!Replace "!" >> " " ATIWS
                                  ^!Replace "@" >> " " ATIWS
                                  ^!Replace "#" >> " " ATIWS
                                  ^!Replace "$" >> " " ATIWS
                                  ^!Replace "%" >> " " ATIWS
                                  ^!Replace "^" >> " " ATIWS
                                  ^!Replace "&" >> " " ATIWS
                                  ^!Replace "*" >> " " ATIWS
                                  ^!Replace "_" >> " " ATIWS
                                  ^!Replace "+" >> " " ATIWS
                                  ^!Replace "=" >> " " ATIWS
                                  ^!Replace "|" >> " " ATIWS
                                  ^!Replace "{" >> " " ATIWS
                                  ^!Replace "}" >> " " ATIWS
                                  ^!Replace "\" >> " " ATIWS
                                  ^!Replace "/" >> " " ATIWS
                                  ^!Replace "?" >> " " ATIWS
                                  ^!Replace "." >> " " ATIWS
                                  ^!Replace ";" >> " " ATIWS
                                  ^!Replace ":" >> " " ATIWS
                                  ^!Replace "" >> " " ATIWS
                                  ^!Replace "•" >> " " ATIWS
                                  ^!Replace "– " >> " " ATIWS
                                  ^!Replace "´" >> " " ATIWS
                                  ^!Replace "”" >> " " ATIWS
                                  ^!Replace "“" >> " " ATIWS
                                  ^!Replace "‘" >> " " ATIWS
                                  ^!Replace "`" >> " " ATIWS


                                  ^!Menu Modify/Spaces/Single Space
                                  ^!Replace " " >> "^P" ATIWS
                                  ^!Replace "^P’" >> "^P" ATIWS
                                  ^!Replace "^P-" >> "^P" ATIWS
                                  ^!Replace "^P " >> "^P" ATIWS
                                  ^!Menu Edit/Copy All
                                  ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
                                  ^!Select All
                                  ^!Toolbar Paste

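                                  ; After the case-sensitive sort, lines starting with digits come first.
                                  ; Step forward 10 lines at a time until a line no longer starts with a
                                  ; number, back up line by line to the last number line, then delete
                                  ; everything from there up to the top of the document.
                                  ^!Set %LineN%=0
                                  :DumpNumbers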
                                  ;^!SetDebug 1
                                  ^!Inc %LineN% 10
                                  ^!Jump ^%LineN%
                                  ^!IfTrue ^$IsEmpty("^$GetLine$")$ DumpNumbers
                                  ^!Select +1
                                  ^!If "^$IsNumber("^$GetSelection$")$" = "1" DumpNumbers ELSE NotNumber
                                  :NotNumber
                                  ^!Jump -1
                                  ^!Select +1
                                  ^!If "^$IsNumber("^$GetSelection$")$" = "0" NotNumber ELSE DeleteNumbers

                                  :DeleteNumbers
                                  ^!Jump +1
                                  ^!SelectTo 1:1
                                  ^!Continue is proper highlighted

                                  ^!Keyboard DELETE


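                                  ; Lowercase words sort last. Step back from the end 100 lines at a time
                                  ; until a line no longer starts with a lowercase letter, walk forward to
                                  ; the first lowercase line, then delete from there to the end of the
                                  ; document. Umlauted capitals sort after "z" and are deleted as well.
                                  ^!Set %LineN%=^$GetLineCount$
                                  :DumpLowers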
                                  ^!Inc %LineN% -100
                                  ^!Jump ^%LineN%
                                  ^!Select +1
                                  ^!If "^$IsUppercase("^$GetSelection$")$" = "0" DumpLowers ELSE NotLower
                                  :NotLower
                                  ^!Jump +1
                                  ^!Select +1
                                  ^!If "^$IsUppercase("^$GetSelection$")$" = "1" NotLower ELSE DeleteLowers

                                  :DeleteLowers
                                  ^!Jump Select_Start
                                  ^!Set %cursor_row%=^$GetRow$
                                  ^!Set %cursor_col%=^$GetCol$
                                  ^!Jump Doc_End
                                  ^!SelectTo ^%cursor_row%:^%cursor_col%
                                  ^!Continue Is Proper Highlighted
                                  ^!Keyboard DELETE
                                • Don - htmlfixit.com
                                  Message 16 of 23 , Dec 3, 2004
                                    Even better: this version keeps the German characters.

                                    ; by don at htmlfixit.com
                                    ; runs a text file and makes
                                    ; a list of all words that start
                                    ; with a capital letter
                                    ^!Menu Edit/Copy All
                                    ^!Toolbar Paste New
                                    ^!Replace "^P" >> " " ATIWS
                                    ^!Replace ")" >> " " ATIWS
                                    ^!Replace "(" >> " " ATIWS
                                    ^!Replace """ >> " " ATIWS
                                    ^!Replace "^T" >> " " ATIWS
                                    ^!Replace "," >> " " ATIWS
                                    ^!Replace "[" >> " " ATIWS
                                    ^!Replace "]" >> " " ATIWS
                                    ^!Replace "<" >> " " ATIWS
                                    ^!Replace ">" >> " " ATIWS
                                    ^!Replace "~" >> " " ATIWS
                                    ^!Replace "!" >> " " ATIWS
                                    ^!Replace "@" >> " " ATIWS
                                    ^!Replace "#" >> " " ATIWS
                                    ^!Replace "$" >> " " ATIWS
                                    ^!Replace "%" >> " " ATIWS
                                    ^!Replace "^" >> " " ATIWS
                                    ^!Replace "&" >> " " ATIWS
                                    ^!Replace "*" >> " " ATIWS
                                    ^!Replace "_" >> " " ATIWS
                                    ^!Replace "+" >> " " ATIWS
                                    ^!Replace "=" >> " " ATIWS
                                    ^!Replace "|" >> " " ATIWS
                                    ^!Replace "{" >> " " ATIWS
                                    ^!Replace "}" >> " " ATIWS
                                    ^!Replace "\" >> " " ATIWS
                                    ^!Replace "/" >> " " ATIWS
                                    ^!Replace "?" >> " " ATIWS
                                    ^!Replace "." >> " " ATIWS
                                    ^!Replace ";" >> " " ATIWS
                                    ^!Replace ":" >> " " ATIWS
                                    ^!Replace "" >> " " ATIWS
                                    ^!Replace "•" >> " " ATIWS
                                    ^!Replace "– " >> " " ATIWS
                                    ^!Replace "´" >> " " ATIWS
                                    ^!Replace "”" >> " " ATIWS
                                    ^!Replace "“" >> " " ATIWS
                                    ^!Replace "‘" >> " " ATIWS
                                    ^!Replace "`" >> " " ATIWS


                                    ^!Menu Modify/Spaces/Single Space
                                    ^!Replace " " >> "^P" ATIWS
                                    ^!Replace "^P’" >> "^P" ATIWS
                                    ^!Replace "^P-" >> "^P" ATIWS
                                    ^!Replace "^P " >> "^P" ATIWS
                                    ^!Menu Edit/Copy All
                                    ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
                                    ^!Select All
                                    ^!Toolbar Paste
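                                    ^!Jump 1

                                    ; Walk the sorted list one line at a time and test the first character.
                                    ; Empty lines are deleted; lines starting with a digit or a lowercase
                                    ; letter are deleted; lines starting with anything else are kept.
                                    ; The digit test comes first because IsUppercase also returns 1 for digits.
                                    :DumpBad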
                                    ^!Select +1
                                    ^!IfError END
                                    ^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
                                    ^!Keyboard DELETE
                                    ^!GoTo DumpBad

                                    ^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
                                    ^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
                                    ^!Select Eol
                                    ^!Keyboard DELETE
                                    ^!Keyboard DELETE
                                    ^!GoTo DumpBad

                                    :GoNext
                                    ^!Jump +1
                                    ^!GoTo DumpBad
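
For readers who want to check the clip's output against something independent, here is a rough Python equivalent. It is a sketch under assumptions, not a translation: the function name is hypothetical, the punctuation list is Python's `string.punctuation` (minus hyphen and apostrophe) rather than the exact characters replaced above, and duplicates are removed on the assumption that this is what the sort step in the clip does.

```python
import string


def capitalized_words(text: str) -> list[str]:
    """Return the sorted, de-duplicated words that start with a capital letter."""
    # Turn ASCII punctuation into spaces, but keep hyphens and apostrophes
    # inside words, as the clip does.
    drop = string.punctuation.replace("-", "").replace("'", "")
    table = str.maketrans(drop, " " * len(drop))
    words = text.translate(table).split()
    # Strip a leading hyphen or apostrophe, like the "^P-" and "^P’" replacements.
    words = [w.lstrip("-'’") for w in words]
    unique_sorted = sorted({w for w in words if w})
    # Test each word on its own first character instead of cutting the list,
    # so capitals such as "Ä" survive even though they sort after "z".
    return [w for w in unique_sorted if w[0].isalpha() and w[0].isupper()]


if __name__ == "__main__":
    sample = "Die Ärzte (und die Maus) lesen 3 Berichte; der Bericht -- Report/report."
    print(capitalized_words(sample))
```

Testing every line individually is slower than one cut in a sorted list, but it does not depend on where the sort places non-ASCII letters.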
                                  • Don - htmlfixit.com
                                    Message 17 of 23 , Dec 3, 2004
                                      ; by don at htmlfixit.com
                                      ; any character that is not lowercase, alphabetic or not,
                                      ; tests positive as Uppercase
                                      ^!SetArray %Original%="0";"1";"|";"?";"a";"@";"1";"+";"=";"F";"`";"~";"-";"q";"L";"[";"}";" ";"x"
                                      ^!Set %count%=0
                                      :Loop
                                      ^!Inc %count%
                                      ^!If "^%count%" > "^%Original0%" End

                                      ^!If "^$IsUppercase("^%Original^%count%%")$" = "1" UPPER ELSE NOTUPPER

                                      :UPPER
                                      ^!Info "^%Original^%count%%" is POSITIVE when tested as upper case -- even if it isn't a letter
                                      ^!GoTo Loop

                                      :NOTUPPER
                                      ^!Info "^%Original^%count%%" is negative when tested as upper case
                                      ^!GoTo Loop


                                      Most interesting! IsUppercase is really NotLowercase! You would think
                                      that IsUppercase would first verify that the character is alphabetic,
                                      but it doesn't. IsLowercase works the same way, so it is really
                                      NotUppercase. Because of this, I guess you first need to check that
                                      the character is alphabetic.

                                      These results are consistent, I guess, with what Help says:
                                      ^$IsUppercase("Str")$ (added in v4.8)
                                      Returns 1 if Str does not contain any lowercase characters, and 0 if it
                                      does.

                                      I would think it SHOULD be "does not contain any lowercase or
                                      non-alphabetic characters". But I guess you could have a contraction or
                                      a hyphenated word, etc., so maybe that isn't correct either. In any
                                      event, just be aware and code accordingly.
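
The difference between "contains no lowercase characters" and "is an uppercase letter" is easy to show outside NoteTab. The Python sketch below mimics the Help text quoted above; both function names are hypothetical, and the first is an emulation of the documented behaviour, not NoteTab's actual code.

```python
def no_lowercase(s: str) -> bool:
    """Emulates the quoted Help text: true if s contains no lowercase characters."""
    return not any(ch.islower() for ch in s)


def is_capital_letter(ch: str) -> bool:
    """Stricter test: the character must be alphabetic and uppercase."""
    return ch.isalpha() and ch.isupper()


if __name__ == "__main__":
    for ch in ["0", "|", "a", "F", "~", " ", "Ä"]:
        print(repr(ch), no_lowercase(ch), is_capital_letter(ch))
```

The first test accepts digits, punctuation and white space, which matches what the array clip reports; the second rejects them, which is the extra check the clip has to do by hand.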
                                    • Hugo Paulissen
                                      Message 18 of 23 , Dec 4, 2004
                                        >
                                        > I just happened across this thread. If I have understood your needs
                                        > correctly, why not just reduce the list to a single column of words
                                        > and sort them case sensitive?
                                        >
                                        > 1. Replace all spaces in the document with "^P" to change the list to
                                        > individual words (ignore punctuation, if you like).
                                        >
                                        > 2. Sort the list CASE SENSITIVE
                                        >
                                        > 3. Delete the lower case words
                                        >
                                        >
                                        > 500 K files should contain about 80,000 words or so. Shouldn't take
                                        > more than a few minutes to do this by hand. If you have a lot of
                                        > files you can always write down the keystrokes you use, then do the
                                        > sort by Menu commands (^!Menu Modify/...). I think there's a
                                        > configuration switch to change sorting behaviour (remove duplicates
                                        > or not; case sensitive or not).
                                        >
                                        >
                                        > Abair



                                        We're going around in circles...

                                        Isn't this what I proposed a few messages earlier?

                                        > What about this approach? You can easily see for yourself if this is
                                        > of any help.
                                        >
                                        > 1. replace " " with "^P" - don't know how fast that would be
                                        > 2. trim/left align the text (which should have most words on a
                                        > separate line by now)
                                        > 3. sort the document with [Case Sensitive Sorting] and [Remove
                                        > Duplicates] switched on (in options)
                                        >
                                        > Hugo
                                        >
                                      • Hugo Paulissen
                                        Message 19 of 23 , Dec 4, 2004
                                          Don,

                                          You wrote the kind of clip I had in mind and for which I didn't have the
                                          time. It was clear that NoteTab's regex was in the way... ;-). If I had the
                                          need for this clip I would definitely test it!

                                          Hugo

                                        • Don - htmlfixit.com
                                          Message 20 of 23 , Dec 4, 2004
                                            Hugo Paulissen wrote:
                                            > Don,
                                            >
                                            > You wrote the kind of clip I had in mind and for which I didn't have the
                                            > time. It was clear that NoteTab's regex was in the way... ;-). If I had the
                                            > need for this clip I would definitely test it!
                                            >
                                            > Hugo

                                            I used the ideas of three of you (plus, as usual, something out of the
                                            NoteBlock library). So it is a combination of three comments: yours,
                                            the one on deleting what was unneeded, and the one on sorting
                                            alphabetically. Come to think of it, putting the results in another
                                            file makes four. I think if one really wanted to do it with a regex,
                                            Perl would be a good choice, but for many that defeats the purpose
                                            because, even though Perl is easy to run on a PC from NoteTab, it is
                                            another whole thing.
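
To give an idea of what the regex route looks like in an external scripting language, here is a small sketch in Python rather than Perl. It reuses the pattern from the first message of this thread, extended with the German letters; the function name and the sample sentence are made up for illustration.

```python
import re

# Franz's original pattern "[A-Z][A-Za-z\-]+", extended with German letters.
CAPITALIZED = re.compile(r"\b[A-ZÄÖÜ][A-Za-zÄÖÜäöüß\-]+")


def find_capitalized(text: str) -> list[str]:
    """Return the distinct capitalized words in text, sorted by code point."""
    return sorted(set(CAPITALIZED.findall(text)))


if __name__ == "__main__":
    print(find_capitalized("Der Bericht und der Über-Bericht kamen aus Berlin."))
```

Like the original pattern, this only matches words of at least two characters, and it collects all matches in one pass instead of looping over the document with repeated finds.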

                                            I left screenupdate on so that you can kind of see it materialize and
                                            know it is working.
                                            I added a few special things that applied mainly to my test file, but
                                            won't hurt on another file (like deleting a leading hyphen).

                                            The most interesting thing I learned in the process is that IsUppercase
                                            means NOT Lowercase.
                                            Of course IsLowercase means NOT Uppercase as well. That is why I first
                                            test whether the character is a number.
                                            Any non-alphabetic character (white space, punctuation, numbers) tests
                                            positive under either function.

                                            I had also forgotten the SKIP_# feature which I noticed when ripping
                                            something off in the NoteBlock Library. That came in handy. I knew
                                            SKIP but forgot you could skip multiple lines.

                                            I usually put something at the top of my clips in a comment to describe
                                            where they came from ... even though of course they usually come from a
                                            combination of you all. It is NOT to claim some level of ownership or
                                            anything because it is most often plagiarized in essence to some degree
                                            or other, but rather to help if we have to later dissect it or make
                                            modifications.

                                            I have decided to start "keeping" some of the clips I work on here:
                                            http://htmlfixit.com/don_and_franki_news_blog_on_htmlfixit_dot_com/index.php?cat=14

                                            I figure that will help me find them in the future when I want to copy
                                            something from one of them for future use.
                                          • franz_sternbald
                                            Message 21 of 23 , Dec 4, 2004
                                              Hi all,

                                              @Don

                                              Thanks a lot - this was a great help to me! I've done some tests so
                                              far, and the results are perfect. The second solution (without
                                              DumpNumbers etc.) reduced a file of 68,000 words to 3,590 words within
                                              8 minutes. All are capitalized, and there is only very little stuff
                                              left that could easily be removed manually or with a few more Replace
                                              lines (for example single letters A, B, C etc., and some special
                                              characters like ¡¢£¤¥§©«).

                                              The first clip ran for a very long time on my PC. I stopped it
                                              after almost 2 hours. By then, it had reduced the same file to
                                              7,890 words. Then I took a file of only 2,000 lines and compared
                                              the output of both clips with another tool. There was no big
                                              difference (the second one saved a few more umlauts). So I think the
                                              second clip is the better solution.

                                              With both clips, I didn't run into "Out of memory". The only message
                                              I got was "Some paragraphs were too long and had to be split." I
                                              think that won't affect the result.

                                              The second clip finds German umlauts (ÄÖÜäöü) as well and
                                              distinguishes properly between uppercase and lowercase umlauts. They
                                              are ASCII-sorted, but that's something we have to live with.
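For anyone post-processing the list outside NoteTab, the ASCII ordering can be worked around. The following is only a sketch in Python, not something the clip does: it folds umlauts to their base letters for comparison (roughly the dictionary-style German ordering), and the helper name `german_sort` is made up for this example.

```python
# Fold umlauts and sharp s to base letters for comparison only;
# the words themselves are returned unchanged.
_FOLD = str.maketrans({
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ä": "a", "ö": "o", "ü": "u",
    "ß": "ss",
})


def german_sort(words):
    """Sort so that 'Ärger' lands next to 'Arger' instead of after 'Z'.

    The original word is used as a tie-breaker so the order is stable
    when two words fold to the same key.
    """
    return sorted(words, key=lambda w: (w.translate(_FOLD), w))
```

The sort stays case sensitive, so capitalized words still come before lowercase ones.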

                                              Your Lowercase/Uppercase Test Clip shows that numbers and special
                                              characters like @?+=[ etc. are interpreted as lowercase characters
                                              although they are not alphabetic.

                                              I'll let you know in case I still get into any trouble...

                                              @Abair & Hugo

                                              > I just happened across this thread. If I have understood your needs
                                              > correctly, why not just reduce the list to a single column of words,
                                              > and sort them case sensitive?

                                              The intention is to create an index of a text database, that is, a
                                              list of keywords (headwords). The databases are made with askSam (see
                                              http://www.asksam.com) and exported to a TXT file (the index function
                                              in askSam only produces words completely in capital LETTERS). These
                                              keywords are mainly nouns, which in German start with an uppercase
                                              letter ("the car/der Wagen"). However, about 50% of these
                                              "capitalized" words are not actually nouns: they are capitalized
                                              only because they are the first word in a sentence (for example
                                              conjunctions like "And/and", adverbs like "Very/very" etc.).
                                              So it won't be sufficient just to sort and copy the capitalized
                                              words. I created lists of conjunctions and adverbs, stored in an
                                              array, to be removed from the list of capitalized words.
                                              Furthermore, there is a lot of other stuff to be deleted too. So the
                                              intention is to do the whole job in one go with NoteTab...
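The stop-list step can be sketched like this in Python. It is only an illustration of the idea, not the NoteTab array approach itself; the function name `drop_stopwords` and the sample words are assumptions made for the example.

```python
def drop_stopwords(capitalized_words, stopwords):
    """Remove sentence-initial conjunctions, adverbs etc. from a word list.

    Comparison ignores case, so a stop list written in lowercase
    ("und", "sehr") also removes "Und" and "Sehr".
    """
    blocked = {word.lower() for word in stopwords}
    return [word for word in capitalized_words if word.lower() not in blocked]
```

For example, `drop_stopwords(["Und", "Wagen", "Sehr", "Bericht"], ["und", "sehr"])` keeps only the two nouns. A noun that happens to be spelled like a stop word would be removed as well, so the stop list needs some care.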

                                              Regards,
                                              Franz
                                            • Don - htmlfixit.com
                                              Message 22 of 23 , Dec 4, 2004
                                                franz_sternbald wrote:
                                                >
                                                > Hi all,
                                                >
                                                > @Don
                                                >
                                                > Thanks a lot - this was a great help to me! I've done some tests so
                                                > far, and it has led to perfect results. The second solution (without
                                                > DumpNumbers etc) reduced a file of 68,000 words to 3,590 words within
                                                > 8 minutes. All capitalized, and there's only very little stuff left
                                                > that could be easily removed manually or with a few more Replace-
                                                > lines (for example single letters A, B, C etc., some special
                                                > characters like ¡¢£¤¥§©«)
                                                >
                                                > The first clip seemed to work rather long on my PC. I stopped the
                                                > procedure after almost 2 hours. By then, it had reduced the same file
                                                > to 7,890 words. Then I took a file of 2,000 lines only and compared
                                                > the output of both clips with another tool. There was no big
                                                > difference (the second one saved a few more umlauts). So I think the
                                                > second clip is the better solution.
                                                >

                                                Only the second clip will save umlauts, so it is the only one that will
                                                work for you. The first one should have been much faster if umlauts
                                                weren't required. I think I could improve it significantly by tuning
                                                the bracketed searches (i.e. the 10- or 100-line jumps should actually
                                                be proportionate to the size of the file).

                                                Anyway, as umlauts are to be saved, I have modified version 2 and made a
                                                new version 3 with the following changes/enhancements:

                                                1. It adds Replace lines for the additional characters you highlighted.
                                                2. It removes all single characters (A, Ä and ß, for example) if they
                                                stand on a line all by themselves.

                                                Let me know how it works:

                                                ; by don at htmlfixit.com
                                                ; using a bunch of Hugo's ideas
                                                ; runs a text file and makes
                                                ; a list of all words that start
                                                ; with a capital letter
                                                ^!Menu Edit/Copy All
                                                ^!Toolbar Paste New
                                                ^!Replace "^P" >> " " ATIWS
                                                ^!Replace ")" >> " " ATIWS
                                                ^!Replace "(" >> " " ATIWS
                                                ^!Replace """ >> " " ATIWS
                                                ^!Replace "^T" >> " " ATIWS
                                                ^!Replace "," >> " " ATIWS
                                                ^!Replace "[" >> " " ATIWS
                                                ^!Replace "]" >> " " ATIWS
                                                ^!Replace "<" >> " " ATIWS
                                                ^!Replace ">" >> " " ATIWS
                                                ^!Replace "~" >> " " ATIWS
                                                ^!Replace "!" >> " " ATIWS
                                                ^!Replace "@" >> " " ATIWS
                                                ^!Replace "#" >> " " ATIWS
                                                ^!Replace "$" >> " " ATIWS
                                                ^!Replace "%" >> " " ATIWS
                                                ^!Replace "^" >> " " ATIWS
                                                ^!Replace "&" >> " " ATIWS
                                                ^!Replace "*" >> " " ATIWS
                                                ^!Replace "_" >> " " ATIWS
                                                ^!Replace "+" >> " " ATIWS
                                                ^!Replace "=" >> " " ATIWS
                                                ^!Replace "|" >> " " ATIWS
                                                ^!Replace "{" >> " " ATIWS
                                                ^!Replace "}" >> " " ATIWS
                                                ^!Replace "\" >> " " ATIWS
                                                ^!Replace "/" >> " " ATIWS
                                                ^!Replace "?" >> " " ATIWS
                                                ^!Replace "." >> " " ATIWS
                                                ^!Replace ";" >> " " ATIWS
                                                ^!Replace ":" >> " " ATIWS
                                                ^!Replace "" >> " " ATIWS
                                                ^!Replace "•" >> " " ATIWS
                                                ^!Replace "– " >> " " ATIWS
                                                ^!Replace "´" >> " " ATIWS
                                                ^!Replace "”" >> " " ATIWS
                                                ^!Replace "“" >> " " ATIWS
                                                ^!Replace "‘" >> " " ATIWS
                                                ^!Replace "`" >> " " ATIWS
                                                ^!Replace "¡" >> " " ATIWS
                                                ^!Replace "¢" >> " " ATIWS
                                                ^!Replace "£" >> " " ATIWS
                                                ^!Replace "¤" >> " " ATIWS
                                                ^!Replace "¥" >> " " ATIWS
                                                ^!Replace "§" >> " " ATIWS
                                                ^!Replace "©" >> " " ATIWS
                                                ^!Replace "«" >> " " ATIWS

                                                ^!Menu Modify/Spaces/Single Space
                                                ^!Replace " " >> "^P" ATIWS
                                                ^!Replace "^P’" >> "^P" ATIWS
                                                ^!Replace "^P-" >> "^P" ATIWS
                                                ^!Replace "^P " >> "^P" ATIWS
                                                ^!Menu Edit/Copy All
                                                ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
                                                ^!Select All
                                                ^!Toolbar Paste
                                                ^!Jump 1

                                                ; following is to dump all lines whose first character
                                                ; is a number or lowercase
                                                :DumpBad
                                                ^!If ^$GetRow$ = ^$GetLinecount$ Sort2
                                                ^!Select +1
                                                ^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
                                                ^!Keyboard DELETE
                                                ^!GoTo DumpBad

                                                ^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
                                                ^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
                                                ^!Select Eol
                                                ^!Keyboard DELETE
                                                ^!Keyboard DELETE
                                                ^!GoTo DumpBad

                                                :GoNext
                                                ^!Jump +1
                                                ^!GoTo DumpBad

                                                ; following is to eliminate single characters on one line
                                                :Sort2
                                                ^!Jump 1

                                                :Sort2a
                                                ^!Select Eol
                                                ^!IfError END
                                                ^!If ^$StrSize("^$GetSelection$")$ > 1 SKIP_2
                                                ^!Keyboard DELETE
                                                ^!Keyboard DELETE
                                                ^!Jump +1
                                                ^!GoTo Sort2a
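For comparison, the same steps (strip punctuation, one word per line, keep words starting with a capital, drop single characters, sort case sensitively) can be sketched in a few lines of Python. This is not a translation of the clip: the function name `capitalized_words` is made up, the regex is an assumption about what counts as a word, and duplicates are removed on the assumption that an index needs each word only once.

```python
import re


def capitalized_words(text: str) -> list:
    """Return the sorted, de-duplicated words that start with a capital letter.

    A "word" here is a run of letters (umlauts included), optionally
    joined by inner hyphens; digits and punctuation act as separators.
    """
    tokens = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", text)
    # Keep words longer than one character whose first letter is uppercase.
    keep = {tok for tok in tokens if len(tok) > 1 and tok[0].isupper()}
    # Plain sorted() orders by code point, so umlauts sort after "Z",
    # just like the ASCII-sorted output of the clip.
    return sorted(keep)
```

Because the test looks only at letters, numbers and special characters never reach the uppercase check, which avoids the IsUppercase pitfall discussed earlier in the thread.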
                                              • dpasseng
                                                Message 23 of 23 , Oct 17, 2008
                                                  Updated link:

                                                  http://htmlfixit.com/blog/index.php?cat=14

                                                  Hugo, I hope you will still add things from time to time.

                                                  This answers one of my own questions of today! Funny I forgot all
                                                  about it.