Re: Extracting words from a file

  • abairheart
    Message 1 of 23 , Dec 3, 2004
      --- In ntb-clips@yahoogroups.com, "franz_sternbald"
      <franz_sternbald@y...> wrote:
      > (The use of all this is to produce an
      > index or thesaurus of keywords in a text database.)


      Hi Franz,

      I just happened across this thread. If I have understood your needs
      correctly, why not just reduce the list to a single column of words
      and sort them case sensitive?

      1. Replace all spaces in the document with "^P" to change the list to
      individual words (ignore punctuation, if you like).

      2. Sort the list CASE SENSITIVE

      3. Delete the lower case words


      500 K files should contain about 80,000 words or so. Shouldn't take
      more than a few minutes to do this by hand. If you have a lot of
      files you can always write down the keystrokes you use, then do the
      sort by Menu commands (^!Menu Modify/...). I think there's a
      configuration switch to change sorting behaviour (remove duplicates
      or not; case sensitive or not).
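
For readers who want to try the three steps outside NoteTab, here is a minimal Python sketch. It is not part of the original suggestion: the function name `capitalized_words` and the sample sentence are made up for illustration, and Python's default string ordering (by code point) stands in for a case-sensitive sort.

```python
def capitalized_words(text):
    """Split text into words, sort case-sensitively, drop the lowercase tail."""
    # Step 1: one word per entry (split on any whitespace), duplicates removed.
    words = set(text.split())
    # Step 2: case-sensitive sort; in code-point order "A".."Z" precede "a".."z".
    ordered = sorted(words)
    # Step 3: everything from the first lowercase-initial word onward is dropped.
    for i, word in enumerate(ordered):
        if word[0].islower():
            return ordered[:i]
    return ordered


sample = "der Wagen und das Haus stehen am See"
print(capitalized_words(sample))  # ['Haus', 'See', 'Wagen']
```

Digits and most ASCII punctuation sort before the capitals, so words starting with them survive this cut and would need a separate pass.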


      Abair
    • Don - htmlfixit.com
      Message 2 of 23 , Dec 3, 2004
        Bingo Abair, with one exception that pertains to German, but not to
        English! It works and doesn't use regex. I tried it on the 500 lines
        sent by Franz and on my 181,000 word file I have been trying with all
        others (always an out of memory error until now). I used a clip to do
        it as shown below. There is one problem however ... the German
        characters with two dots over them (is that an umlaut?) are treated as
        coming after the equivalent lower case letter .... so how do we deal
        with that? Currently as written it deletes them as lower case. Maybe I
        have to go one line at a time to delete? Does a German version of
        NoteTab sort these correctly? Is it a bug in the sorting engine? Is it
        just good old ASCII ordering? Are only certain letters umlauted, or
        whatever the double dots are called, in German?

        ; by don at htmlfixit.com
        ^!Menu Edit/Copy All
        ^!Toolbar Paste New
        ^!Replace "^P" >> " " ATIWS
        ^!Replace ")" >> " " ATIWS
        ^!Replace "(" >> " " ATIWS
        ^!Replace """ >> " " ATIWS
        ^!Replace "^T" >> " " ATIWS
        ^!Replace "," >> " " ATIWS
        ^!Replace "[" >> " " ATIWS
        ^!Replace "]" >> " " ATIWS
        ^!Replace "<" >> " " ATIWS
        ^!Replace ">" >> " " ATIWS
        ^!Replace "~" >> " " ATIWS
        ^!Replace "!" >> " " ATIWS
        ^!Replace "@" >> " " ATIWS
        ^!Replace "#" >> " " ATIWS
        ^!Replace "$" >> " " ATIWS
        ^!Replace "%" >> " " ATIWS
        ^!Replace "^" >> " " ATIWS
        ^!Replace "&" >> " " ATIWS
        ^!Replace "*" >> " " ATIWS
        ^!Replace "_" >> " " ATIWS
        ^!Replace "+" >> " " ATIWS
        ^!Replace "=" >> " " ATIWS
        ^!Replace "|" >> " " ATIWS
        ^!Replace "{" >> " " ATIWS
        ^!Replace "}" >> " " ATIWS
        ^!Replace "\" >> " " ATIWS
        ^!Replace "/" >> " " ATIWS
        ^!Replace "?" >> " " ATIWS
        ^!Replace "." >> " " ATIWS
        ^!Replace ";" >> " " ATIWS
        ^!Replace ":" >> " " ATIWS
        ^!Replace "" >> " " ATIWS
        ^!Replace "•" >> " " ATIWS
        ^!Replace "– " >> " " ATIWS
        ^!Replace "´" >> " " ATIWS
        ^!Replace "”" >> " " ATIWS
        ^!Replace "“" >> " " ATIWS
        ^!Replace "‘" >> " " ATIWS
        ^!Replace "`" >> " " ATIWS


        ^!Menu Modify/Spaces/Single Space
        ^!Replace " " >> "^P" ATIWS
        ^!Replace "^P’" >> "^P" ATIWS
        ^!Replace "^P-" >> "^P" ATIWS
        ^!Replace "^P " >> "^P" ATIWS
        ^!Menu Edit/Copy All
        ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
        ^!Select All
        ^!Toolbar Paste

        ^!Set %LineN%=0
        :DumpNumbers
        ;^!SetDebug 1
        ^!Inc %LineN% 10
        ^!Jump ^%LineN%
        ^!IfTrue ^$IsEmpty("^$GetLine$")$ DumpNumbers
        ^!Select +1
        ^!If "^$IsNumber("^$GetSelection$")$" = "1" DumpNumbers ELSE NotNumber
        :NotNumber
        ^!Jump -1
        ^!Select +1
        ^!If "^$IsNumber("^$GetSelection$")$" = "0" NotNumber ELSE DeleteNumbers

        :DeleteNumbers
        ^!Jump +1
        ^!SelectTo 1:1
        ^!Continue is proper highlighted

        ^!Keyboard DELETE


        ^!Set %LineN%=^$GetLineCount$
        :DumpLowers
        ^!Inc %LineN% -100
        ^!Jump ^%LineN%
        ^!Select +1
        ^!If "^$IsUppercase("^$GetSelection$")$" = "0" DumpLowers ELSE NotLower
        :NotLower
        ^!Jump +1
        ^!Select +1
        ^!If "^$IsUppercase("^$GetSelection$")$" = "1" NotLower ELSE DeleteLowers

        :DeleteLowers
        ^!Jump Select_Start
        ^!Set %cursor_row%=^$GetRow$
        ^!Set %cursor_col%=^$GetCol$
        ^!Jump Doc_End
        ^!SelectTo ^%cursor_row%:^%cursor_col%
        ^!Continue Is Proper Highlighted
        ^!Keyboard DELETE
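
On the umlaut question: the behaviour described is what plain code-point ordering produces. In Latin-1 and Unicode, Ä, Ö and Ü have higher codes than z, so they sort after all unaccented lowercase letters (German puts umlauts on a, o and u only). Whether NoteTab's sort works exactly this way is an assumption here. The Python sketch below only illustrates the ordering, and why a "cut at the first lowercase word" step loses capitalized umlaut words; the word list is made up.

```python
words = ["Zug", "apfel", "Ärger", "zebra", "Haus", "über"]

# Python compares strings by code point, which matches Latin-1 for these letters.
ordered = sorted(words)
print(ordered)  # ['Haus', 'Zug', 'apfel', 'zebra', 'Ärger', 'über']

# Cutting at the first lowercase-initial word drops 'Ärger' along with the rest.
cut = next(i for i, w in enumerate(ordered) if w[0].islower())
kept = ordered[:cut]
print(kept)  # ['Haus', 'Zug']
```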
      • Don - htmlfixit.com
        Message 3 of 23 , Dec 3, 2004
          Even better: this version keeps the German characters.

          ; by don at htmlfixit.com
          ; runs a text file and makes
          ; a list of all words that start
          ; with a capital letter
          ^!Menu Edit/Copy All
          ^!Toolbar Paste New
          ^!Replace "^P" >> " " ATIWS
          ^!Replace ")" >> " " ATIWS
          ^!Replace "(" >> " " ATIWS
          ^!Replace """ >> " " ATIWS
          ^!Replace "^T" >> " " ATIWS
          ^!Replace "," >> " " ATIWS
          ^!Replace "[" >> " " ATIWS
          ^!Replace "]" >> " " ATIWS
          ^!Replace "<" >> " " ATIWS
          ^!Replace ">" >> " " ATIWS
          ^!Replace "~" >> " " ATIWS
          ^!Replace "!" >> " " ATIWS
          ^!Replace "@" >> " " ATIWS
          ^!Replace "#" >> " " ATIWS
          ^!Replace "$" >> " " ATIWS
          ^!Replace "%" >> " " ATIWS
          ^!Replace "^" >> " " ATIWS
          ^!Replace "&" >> " " ATIWS
          ^!Replace "*" >> " " ATIWS
          ^!Replace "_" >> " " ATIWS
          ^!Replace "+" >> " " ATIWS
          ^!Replace "=" >> " " ATIWS
          ^!Replace "|" >> " " ATIWS
          ^!Replace "{" >> " " ATIWS
          ^!Replace "}" >> " " ATIWS
          ^!Replace "\" >> " " ATIWS
          ^!Replace "/" >> " " ATIWS
          ^!Replace "?" >> " " ATIWS
          ^!Replace "." >> " " ATIWS
          ^!Replace ";" >> " " ATIWS
          ^!Replace ":" >> " " ATIWS
          ^!Replace "" >> " " ATIWS
          ^!Replace "•" >> " " ATIWS
          ^!Replace "– " >> " " ATIWS
          ^!Replace "´" >> " " ATIWS
          ^!Replace "”" >> " " ATIWS
          ^!Replace "“" >> " " ATIWS
          ^!Replace "‘" >> " " ATIWS
          ^!Replace "`" >> " " ATIWS


          ^!Menu Modify/Spaces/Single Space
          ^!Replace " " >> "^P" ATIWS
          ^!Replace "^P’" >> "^P" ATIWS
          ^!Replace "^P-" >> "^P" ATIWS
          ^!Replace "^P " >> "^P" ATIWS
          ^!Menu Edit/Copy All
          ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
          ^!Select All
          ^!Toolbar Paste
          ^!Jump 1

          :DumpBad
          ^!Select +1
          ^!IfError END
          ^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
          ^!Keyboard DELETE
          ^!GoTo DumpBad

          ^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
          ^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
          ^!Select Eol
          ^!Keyboard DELETE
          ^!Keyboard DELETE
          ^!GoTo DumpBad

          :GoNext
          ^!Jump +1
          ^!GoTo DumpBad
        • Don - htmlfixit.com
          Message 4 of 23 , Dec 3, 2004
            ; by don at htmlfixit.com
            ; any character that is not lowercase, even a
            ; non-alphabetic one, tests positive as uppercase
            ^!SetArray %Original%="0";"1";"|";"?";"a";"@";"1";"+";"=";"F";"`";"~";"-";"q";"L";"[";"}";" ";"x"
            ^!Set %count%=0
            :Loop
            ^!Inc %count%
            ^!If "^%count%" > "^%Original0%" End

            ^!If "^$IsUppercase("^%Original^%count%%")$" = "1" UPPER ELSE NOTUPPER

            :UPPER
            ^!Info "^%Original^%count%%" is POSITIVE when tested as upper case -- even if it isn't a letter
            ^!GoTo Loop

            :NOTUPPER
            ^!Info "^%Original^%count%%" is negative when tested as upper case
            ^!GoTo Loop


            Most interesting! IsUppercase is really NotLowercase. You would think
            that IsUppercase would first verify that the character is alphabetic,
            but it doesn't. IsLowercase works the same way, so it is really
            NotUppercase. Because of this, you first need to check that the
            character is alphabetic.

            These results are consistent I guess with what help says:
            ^$IsUppercase("Str")$ (added in v4.8)
            Returns 1 if Str does not contain any lowercase characters, and 0 if it
            does.

            I would think it SHOULD be "does not contain any lowercase or
            non-alphabetic characters". But then a contraction or a hyphenated
            word would fail, so maybe that isn't correct either. In any event,
            just be aware and code accordingly.
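
To make the difference concrete, here is a small Python sketch (not NoteTab code). `no_lowercase` mimics the rule quoted from the help text, and `starts_with_capital` is the stricter check one might have expected; both function names are invented for this example.

```python
def no_lowercase(s):
    """Mimics the documented rule: true if s contains no lowercase characters."""
    return not any(c.islower() for c in s)


def starts_with_capital(s):
    """Stricter check: the first character must be an alphabetic uppercase letter."""
    return bool(s) and s[0].isalpha() and s[0].isupper()


for sample in ["F", "q", "0", "@", "Ä"]:
    print(repr(sample), no_lowercase(sample), starts_with_capital(sample))
```

Under the first rule "0" and "@" pass as "uppercase"; under the second only "F" and "Ä" do.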
          • Hugo Paulissen
            Message 5 of 23 , Dec 4, 2004
              >
              > I just happened across this thread. If I have understood your needs
              > correctly, why not just reduce the list to a single column of words
              > and sort them case sensitive?
              >
              > 1. Replace all spaces in the document with "^P" to change the list to
              > individual words (ignore punctuation, if you like).
              >
              > 2. Sort the list CASE SENSITIVE
              >
              > 3. Delete the lower case words
              >
              >
              > 500 K files should contain about 80,000 words or so. Shouldn't take
              > more than a few minutes to do this by hand. If you have a lot of
              > files you can always write down the keystrokes you use, then do the
              > sort by Menu commands (^!Menu Modify/...). I think there's a
              > configuration switch to change sorting behaviour (remove duplicates
              > or not; case sensitive or not).
              >
              >
              > Abair



              We're going around in circles...

              Isn't this what I proposed a few messages earlier?

              > What about this approach? You can easily see for yourself if this is
              > of any help.
              >
              > 1. replace " " with "^P" - don't know how fast that would be
              > 2. trim/left align the text (which should have most words on a
              > separate line by now)
              > 3. sort the document with [Case Sensitive Sorting] and [Remove
              > Duplicates] switched on (in options)
              >
              > Hugo
              >
            • Hugo Paulissen
              Message 6 of 23 , Dec 4, 2004
                Don,

                You wrote the kind of clip I had in mind and for which I didn't have the
                time. It was clear that NoteTab's regex was in the way... ;-). If I had the
                need for this clip I would definitely test it!

                Hugo

                > -----Original message-----
                > From: Don - htmlfixit.com [mailto:don@...]
                > Sent: Saturday, 4 December 2004 3:37
                > To: ntb-clips@yahoogroups.com
                > Subject: Re: [Clip] Re: Extracting words from a file
              • Don - htmlfixit.com
                Message 7 of 23 , Dec 4, 2004
                  Hugo Paulissen wrote:
                  > Don,
                  >
                  > You wrote the kind of clip I had in mind and for which I didn't have the
                  > time. It was clear that NoteTab's regex was in the way... ;-). If I had the
                  > need for this clip I would definitely test it!
                  >
                  > Hugo

                  I used the ideas of three of you (plus, as usual, something out of
                  NoteBlock). So it is a combination of three comments: yours, the one
                  on deleting what was unneeded, and the one on sorting alphabetically.
                  Come to think of it, putting the results in another file makes four.
                  I think if one really wanted to do it with a regex, Perl would be a
                  good choice, but for many that defeats the purpose because, even
                  though Perl is easy to run on a PC from NoteTab, it is another whole thing.

                  I left screenupdate on so that you can kind of see it materialize and
                  know it is working.
                  I added a few special things that applied mainly to my test file, but
                  won't hurt on another file (like deleting a leading hyphen).

                  The most interesting thing I learned in the process is that IsUppercase
                  means NOT lowercase, and of course IsLowercase means NOT uppercase.
                  Either will test positive for a non-alphabetic character (white space,
                  punctuation, numbers -- all test positive under either of those),
                  which is why I first tested for a number.

                  I had also forgotten the SKIP_# feature which I noticed when ripping
                  something off in the NoteBlock Library. That came in handy. I knew
                  SKIP but forgot you could skip multiple lines.

                  I usually put something at the top of my clips in a comment to describe
                  where they came from ... even though of course they usually come from a
                  combination of you all. It is NOT to claim some level of ownership,
                  because the code is most often plagiarized, in essence, to some degree
                  or other; it is rather to help if we later have to dissect it or make
                  modifications.

                  I have decided to start "keeping" some of the clips I work on here:
                  http://htmlfixit.com/don_and_franki_news_blog_on_htmlfixit_dot_com/index.php?cat=14

                  I figure that will help me find them in the future when I want to copy
                  something from one of them for future use.
                • franz_sternbald
                  Message 8 of 23 , Dec 4, 2004
                    Hi all,

                    @Don

                    Thanks a lot - this was a great help to me! I've done some tests so
                    far, and it has led to perfect results. The second solution (without
                    DumpNumbers etc) reduced a file of 68,000 words to 3,590 words within
                    8 minutes. All capitalized, and there's only very little stuff left
                    that could be easily removed manually or with a few more Replace-
                    lines (for example single letters A, B, C etc., some special
                    characters like ¡¢£¤¥§©«)

                    The first clip ran for a very long time on my PC. I stopped it
                    after almost 2 hours. By then, it had reduced the same file
                    to 7,890 words. Then I took a file of only 2,000 lines and compared
                    the output of both clips with another tool. There was no big
                    difference (the second one saved a few more umlauts). So I think the
                    second clip is the better solution.

                    With both clips, I didn't run into "Out of memory". The only message
                    I got was "Some paragraphs were too long and had to be split." I
                    think that won't affect the result.

                    The second clip finds German umlauts (ÄÖÜäöü) as well and
                    distinguishes properly between uppercase and lowercase umlauts. They
                    are ASCII-sorted, but that's something we have to live with.
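
If the ASCII-style order becomes a nuisance in the finished index, one workaround outside NoteTab is to sort by a key that folds umlauts onto their base letters. This is a minimal Python sketch, assuming the simplified convention ä→a, ö→o, ü→u, ß→ss; the names `FOLD` and `german_key` and the word list are made up for this example.

```python
# Map umlauts and ß to base letters (a simplified dictionary-style convention).
FOLD = str.maketrans({"Ä": "A", "Ö": "O", "Ü": "U",
                      "ä": "a", "ö": "o", "ü": "u", "ß": "ss"})


def german_key(word):
    # Compare folded, case-insensitive forms first; the original word breaks ties.
    return (word.translate(FOLD).lower(), word)


words = ["Zug", "Ärger", "Apfel", "Öl", "Ofen", "Übung"]
print(sorted(words))                  # code-point order: umlauts come last
print(sorted(words, key=german_key))  # umlauts next to their base letters
```

The second call yields Apfel, Ärger, Ofen, Öl, Übung, Zug, while the first leaves the three umlaut words after Zug.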

                    Your Lowercase/Uppercase test clip shows that numbers and special
                    characters like @?+=[ etc. test positive as uppercase
                    although they are not alphabetic.

                    I'll let you know in case I still get into any trouble...

                    @Abair & Hugo

                    > I just happened across this thread. If I have understood your needs
                    > correctly, why not just reduce the list to a single column of words,
                    > and sort them case sensitive?

                    The intention is to create an index of a text database, that is, a
                    list of keywords (headwords). The databases are made with askSam (see
                    http://www.asksam.com) and exported to a TXT file (the index function
                    in askSam only produces words entirely in CAPITAL letters). These
                    keywords are mainly nouns, which in German start with an uppercase
                    letter ("the car/der Wagen"). However, about 50% of these
                    "capitalized" words are not actually nouns; they are capitalized
                    only because they are the first word in a sentence (for example
                    conjunctions like "And/and", adverbs like "Very/very" etc.).
                    So it won't be sufficient just to sort and copy the capitalized
                    words. I created lists of conjunctions and adverbs, stored in an
                    array, to be removed from that list of capitalized words.
                    Furthermore, there is a lot of other stuff to be deleted too. So the
                    intention is to do the whole job in one go with NoteTab...
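
The stop-list step can be sketched as follows. This is Python rather than a NoteTab array, and the function name `remove_stopwords` and the three stop-list entries are illustrative only, not the actual lists described above.

```python
def remove_stopwords(candidates, stopwords):
    """Drop capitalized words that appear (in any case) on the stop list."""
    # Compare case-insensitively so "Und" matches the stop-list entry "und".
    blocked = {w.lower() for w in stopwords}
    return [w for w in candidates if w.lower() not in blocked]


stop_list = ["und", "sehr", "aber"]  # illustrative entries only
capitalized = ["Aber", "Haus", "Sehr", "Und", "Wagen"]
print(remove_stopwords(capitalized, stop_list))  # ['Haus', 'Wagen']
```

One caveat with this kind of filter: a German word that is both a noun and an adverb (for example "Morgen"/"morgen") is dropped if it is on the stop list.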

                    Regards,
                    Franz
                  • Don - htmlfixit.com
                    Message 9 of 23 , Dec 4, 2004
                      franz_sternbald wrote:
                      >
                      > Hi all,
                      >
                      > @Don
                      >
                      > Thanks a lot - this was a great help to me! I've done some tests so
                      > far, and it has led to perfect results. The second solution (without
                      > DumpNumbers etc) reduced a file of 68,000 words to 3,590 words within
                      > 8 minutes. All capitalized, and there's only very little stuff left
                      > that could be easily removed manually or with a few more Replace-
                      > lines (for example single letters A, B, C etc., some special
                      > characters like ¡¢£¤¥§©«)
                      >
                      > The first clip seemed to work rather long on my PC. I stopped the
                      > procedure after almost 2 hours. By then, it had reduced the same file
                      > to 7,890 words. Then I took a file of 2,000 lines only and compared
                      > the output of both clips with another tool. There was no big
                      > difference (the second one saved a few more umlauts). So I think the
                      > second clip is the better solution.
                      >

                      Only the second clip will save umlauts, so it is the only one that will
                      work. The first one should have been much faster if umlauts weren't
                      required. I think I could improve it significantly by tuning the
                      bracketed searches (i.e., the 10- or 100-line jumps should actually be
                      proportionate to the size of the file).
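
One way to make the jump size scale with the file is to replace the fixed 10- or 100-line steps with repeated halving, that is, a binary search for the boundary line. This is a different technique from the clip's fixed-step bracketing, and it is shown in Python rather than clip code. It assumes, as the first clip does, that the sorted list has every non-lowercase line before every lowercase one, which umlauts break. The function name `first_lowercase_line` is made up.

```python
def first_lowercase_line(lines):
    """Return the index of the first lowercase-initial line in a sorted list.

    Assumes every line before the boundary starts with a non-lowercase
    character and every line from the boundary on starts with a lowercase one.
    """
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        if lines[mid][:1].islower():
            hi = mid        # boundary is at mid or earlier
        else:
            lo = mid + 1    # boundary is after mid
    return lo


lines = ["1984", "Haus", "See", "Wagen", "am", "das", "der", "und"]
cut = first_lowercase_line(lines)
print(cut, lines[:cut])  # 4 ['1984', 'Haus', 'See', 'Wagen']
```

With halving, a list of 100,000 lines needs roughly 17 probes to find the cut point.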

                      Anyway, as umlauts are to be saved, I have modified version 2 and made a
                      new version three with the following changes/enhancements:

                      1. it adds the additional characters you highlighted
                      2. it removes all single characters like A, Ä and ß for example if they
                      are all by themselves

                      Let me know:

                      ; by don at htmlfixit.com
                      ; using a bunch of Hugo's ideas
                      ; runs a text file and makes
                      ; a list of all words that start
                      ; with a capital letter
                      ^!Menu Edit/Copy All
                      ^!Toolbar Paste New
                      ^!Replace "^P" >> " " ATIWS
                      ^!Replace ")" >> " " ATIWS
                      ^!Replace "(" >> " " ATIWS
                      ^!Replace """ >> " " ATIWS
                      ^!Replace "^T" >> " " ATIWS
                      ^!Replace "," >> " " ATIWS
                      ^!Replace "[" >> " " ATIWS
                      ^!Replace "]" >> " " ATIWS
                      ^!Replace "<" >> " " ATIWS
                      ^!Replace ">" >> " " ATIWS
                      ^!Replace "~" >> " " ATIWS
                      ^!Replace "!" >> " " ATIWS
                      ^!Replace "@" >> " " ATIWS
                      ^!Replace "#" >> " " ATIWS
                      ^!Replace "$" >> " " ATIWS
                      ^!Replace "%" >> " " ATIWS
                      ^!Replace "^" >> " " ATIWS
                      ^!Replace "&" >> " " ATIWS
                      ^!Replace "*" >> " " ATIWS
                      ^!Replace "_" >> " " ATIWS
                      ^!Replace "+" >> " " ATIWS
                      ^!Replace "=" >> " " ATIWS
                      ^!Replace "|" >> " " ATIWS
                      ^!Replace "{" >> " " ATIWS
                      ^!Replace "}" >> " " ATIWS
                      ^!Replace "\" >> " " ATIWS
                      ^!Replace "/" >> " " ATIWS
                      ^!Replace "?" >> " " ATIWS
                      ^!Replace "." >> " " ATIWS
                      ^!Replace ";" >> " " ATIWS
                      ^!Replace ":" >> " " ATIWS
                      ^!Replace "" >> " " ATIWS
                      ^!Replace "•" >> " " ATIWS
                      ^!Replace "– " >> " " ATIWS
                      ^!Replace "´" >> " " ATIWS
                      ^!Replace "”" >> " " ATIWS
                      ^!Replace "“" >> " " ATIWS
                      ^!Replace "‘" >> " " ATIWS
                      ^!Replace "`" >> " " ATIWS
                      ^!Replace "¡" >> " " ATIWS
                      ^!Replace "¢" >> " " ATIWS
                      ^!Replace "£" >> " " ATIWS
                      ^!Replace "¤" >> " " ATIWS
                      ^!Replace "¥" >> " " ATIWS
                      ^!Replace "§" >> " " ATIWS
                      ^!Replace "©" >> " " ATIWS
                      ^!Replace "«" >> " " ATIWS

                      ^!Menu Modify/Spaces/Single Space
                      ^!Replace " " >> "^P" ATIWS
                      ^!Replace "^P’" >> "^P" ATIWS
                      ^!Replace "^P-" >> "^P" ATIWS
                      ^!Replace "^P " >> "^P" ATIWS
                      ^!Menu Edit/Copy All
                      ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
                      ^!Select All
                      ^!Toolbar Paste
                      ^!Jump 1

                      ; the following dumps all lines whose first
                      ; character is a number or a lowercase letter
                      :DumpBad
                      ^!If ^$GetRow$ = ^$GetLinecount$ Sort2
                      ^!Select +1
                      ^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
                      ^!Keyboard DELETE
                      ^!GoTo DumpBad

                      ^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
                      ^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
                      ^!Select Eol
                      ^!Keyboard DELETE
                      ^!Keyboard DELETE
                      ^!GoTo DumpBad

                      :GoNext
                      ^!Jump +1
                      ^!GoTo DumpBad

                      ; following is to eliminate single characters on one line
                      :Sort2
                      ^!Jump 1

                      :Sort2a
                      ^!Select Eol
                      ^!IfError END
                      ^!If ^$StrSize("^$GetSelection$")$ > 1 SKIP_2
                      ^!Keyboard DELETE
                      ^!Keyboard DELETE
                      ^!Jump +1
                      ^!GoTo Sort2a
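
For comparison, the two passes of this version (dump lines that start with a number or lowercase letter, then dump single characters) can be sketched in Python. This is not a translation of the clip: the function name `filter_index_lines` and the sample list are made up, and the check is stricter than the clip's because `str.isupper` is false for digits and symbols.

```python
def filter_index_lines(lines):
    """Keep lines that start with a capital letter and have more than one character."""
    kept = []
    for line in lines:
        word = line.strip()
        if not word:
            continue  # empty line
        if not word[0].isupper():
            continue  # starts with a digit, symbol or lowercase letter
        if len(word) == 1:
            continue  # single characters such as "A" or "Ä"
        kept.append(word)
    return kept


sample = ["", "1984", "A", "Haus", "Ä", "Ärger", "am", "ß", "über"]
print(filter_index_lines(sample))  # ['Haus', 'Ärger']
```

Because each line is tested on its own, capitalized umlaut words are kept regardless of where they land in the sort order.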
                    • dpasseng
                      Message 10 of 23 , Oct 17, 2008
                        Updated link:

                        http://htmlfixit.com/blog/index.php?cat=14

                        Hugo, I hope you will still add things from time to time.

                        This answers one of my own questions of today! Funny I forgot all
                        about it.