
Re: [Clip] Re: Extracting words from a file

  • Don - htmlfixit.com
    Message 1 of 23 , Dec 4, 2004
      Hugo Paulissen wrote:
      > Don,
      >
      > You wrote the kind of clip I had in mind and for which I didn't have the
      > time. It was clear that NoteTab's regex was in the way... ;-). If I had the
      > need for this clip I would definitely test it!
      >
      > Hugo

      I used the ideas of three of you (plus, as usual, something out of
      NoteBlock). So it is a combination of three comments, including yours:
      the one on deleting what was unneeded, the one on putting the results
      in another file (come to think of it, that makes four), and the one on
      sorting alphabetically. I think if one really wanted to do it with a
      regex, Perl would be a good choice, but for many that defeats the
      purpose because, even though Perl is easy to run on a PC from NoteTab,
      it is another whole thing.

      I left screenupdate on so that you can kind of see it materialize and
      know it is working.
      I added a few special things that applied mainly to my test file, but
      won't hurt on another file (like deleting a leading hyphen).

      The most interesting thing I learned in the process is that IsUppercase
      really means NOT lowercase, and likewise IsLowercase means NOT
      uppercase. That is why I first test whether the character is a number.
      Any non-alphabetic character (white space, punctuation, numbers) tests
      positive under either function.
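To illustrate the behaviour described above, here is a small Python sketch. It is not NoteTab code, and the helper names `is_not_lowercase`, `is_not_uppercase` and `is_capital_letter` are hypothetical. The first two model "IsUppercase means NOT lowercase" by checking whether case conversion changes the text, which is only an assumption about how such a test might work internally.

```python
def is_not_lowercase(text: str) -> bool:
    """Model of the described IsUppercase: true when upper-casing changes nothing."""
    return text == text.upper()


def is_not_uppercase(text: str) -> bool:
    """Model of the described IsLowercase: true when lower-casing changes nothing."""
    return text == text.lower()


def is_capital_letter(text: str) -> bool:
    """Stricter test: the first character must be an actual uppercase letter."""
    return bool(text) and text[0].isalpha() and text[0].isupper()


if __name__ == "__main__":
    for sample in ["A", "a", "7", "?", " ", "Ä"]:
        print(repr(sample),
              is_not_lowercase(sample),
              is_not_uppercase(sample),
              is_capital_letter(sample))
```

Under this model, digits and punctuation pass both of the first two tests, which is why a separate number check (or a stricter letter check like the third helper) is needed before trusting the result.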

      I had also forgotten the SKIP_# feature which I noticed when ripping
      something off in the NoteBlock Library. That came in handy. I knew
      SKIP but forgot you could skip multiple lines.

      I usually put something at the top of my clips in a comment to describe
      where they came from, even though of course they usually come from a
      combination of you all. It is NOT to claim some level of ownership or
      anything, because it is most often plagiarized in essence to some degree
      or other, but rather to help if we later have to dissect it or make
      modifications.

      I have decided to start "keeping" some of the clips I work on here:
      http://htmlfixit.com/don_and_franki_news_blog_on_htmlfixit_dot_com/index.php?cat=14

      I figure that will help me find them in the future when I want to copy
      something from one of them for future use.
    • franz_sternbald
      Message 2 of 23 , Dec 4, 2004
        Hi all,

        @Don

        Thanks a lot - this was a great help to me! I've done some tests so
        far, and it has led to perfect results. The second solution (without
        DumpNumbers etc) reduced a file of 68,000 words to 3,590 words within
        8 minutes. All capitalized, and there's only very little stuff left
        that could be easily removed manually or with a few more Replace-
        lines (for example single letters A, B, C etc., some special
        characters like ¡¢£¤¥§©«)

        The first clip seemed to work rather long on my PC. I stopped the
        procedure after almost 2 hours. By then, it had reduced the same file
        to 7,890 words. Then I took a file of 2,000 lines only and compared
        the output of both clips with another tool. There was no big
        difference (the second one saved a few more umlauts). So I think the
        second clip is the better solution.

        With both clips, I didn't run into "Out of memory". The only message
        I got was "Some paragraphs were too long and had to be split." I
        think that won't affect the result.

        The second clip finds German umlauts (ÄÖÜäöü) as well and
        distinguishes properly between uppercase and lowercase umlauts. They
        are ASCII-sorted, but that's something we have to live with.
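As a side note on the sort order, the following Python sketch (not NoteTab code) contrasts a plain code-point sort with a sort that folds umlauts onto their base letters. The helper `german_sort_key` and the sample word list are hypothetical, and the folding rule (ä as a, ö as o, ü as u, ß as ss) is just one common dictionary-style convention, assumed here for illustration.

```python
# Map umlauts and sharp s onto base letters for comparison purposes only.
_FOLD = str.maketrans({
    "Ä": "A", "Ö": "O", "Ü": "U",
    "ä": "a", "ö": "o", "ü": "u",
    "ß": "ss",
})


def german_sort_key(word: str) -> tuple[str, str]:
    """Sort by the folded, lower-cased form; the original word breaks ties."""
    return (word.translate(_FOLD).lower(), word)


words = ["Zebra", "Ärger", "Apfel", "Öl", "Ofen"]

# Code-point order: umlauts land after Z.
by_code_point = sorted(words)

# Folded order: umlauts sort next to their base letters.
by_folded_key = sorted(words, key=german_sort_key)

if __name__ == "__main__":
    print(by_code_point)
    print(by_folded_key)
```

A post-processing step like this could be run on the exported word list if the order produced by the clip turns out to be a problem.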

        Your Lowercase/Uppercase Test Clip shows that numbers and special
        characters like @?+=[ etc. are interpreted as lowercase characters
        although they are not alphabetic.

        I'll let you know in case I still get into any trouble...

        @Abair & Hugo

        > I just happened across this thread. If I have understood your needs
        > correctly, why not just reduce the list to a single column of words,
        > and sort them case sensitive?

        The intention is to create an index of a text database, that is a
        list of keywords (headwords). The databases are made with askSam (see
        http://www.asksam.com) and exported to a TXT file (the index function
        in askSam only produces words completely in capital LETTERS). These
        keywords mainly are represented by nouns which, in German, start with
        an uppercase letter ("the car/der Wagen"). However, about 50% of
        these "capitalized" words actually are no nouns since they are
        capitalized only because they are the first word in a sentence (for
        example conjunctions like "And/and", adverbs like "Very/very" etc.).
        So it won't be sufficient to sort and copy these capitalized words
        only. I created lists of conjunctions and adverbs, stored in an
        array, to be removed from that list of capitalized words.
        Furthermore, there is a lot of stuff to be deleted too. So the
        intention is to do the whole job in one go with NoteTab...
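The goal described above (collect capitalized words, then remove sentence-initial conjunctions and adverbs using a stored list) can be sketched in Python as follows. This is not the NoteTab clip; the function name `keyword_candidates`, the regular expression and the tiny stop list in the usage note are all assumptions made for illustration.

```python
import re

# A capital letter (including German umlauts) followed by at least one more
# letter, so single letters such as "A" are not matched.
CAPITALIZED = re.compile(r"\b[A-ZÄÖÜ][A-Za-zÄÖÜäöüß]+\b")


def keyword_candidates(text: str, stop_words: list[str]) -> list[str]:
    """Return the sorted, de-duplicated capitalized words not in the stop list."""
    stop = {word.lower() for word in stop_words}
    found = {match.group(0) for match in CAPITALIZED.finditer(text)}
    return sorted(word for word in found if word.lower() not in stop)


if __name__ == "__main__":
    sample = "Und der Wagen ist sehr schnell. Sehr oft fährt der Wagen."
    print(keyword_candidates(sample, ["und", "sehr"]))
```

The stop list is compared case-insensitively, so "Und" at the start of a sentence is dropped while the noun "Wagen" is kept. A real stop list would need to be far longer than the two words used here.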

        Regards,
        Franz
      • Don - htmlfixit.com
        Message 3 of 23 , Dec 4, 2004
          franz_sternbald wrote:
          >
          > Hi all,
          >
          > @Don
          >
          > Thanks a lot - this was a great help to me! I've done some tests so
          > far, and it has led to perfect results. The second solution (without
          > DumpNumbers etc) reduced a file of 68,000 words to 3,590 words within
          > 8 minutes. All capitalized, and there's only very little stuff left
          > that could be easily removed manually or with a few more Replace-
          > lines (for example single letters A, B, C etc., some special
          > characters like ¡¢£¤¥§©«)
          >
          > The first clip seemed to work rather long on my PC. I stopped the
          > procedure after almost 2 hours. By then, it had reduced the same file
          > to 7,890 words. Then I took a file of 2,000 lines only and compared
          > the output of both clips with another tool. There was no big
          > difference (the second one saved a few more umlauts). So I think the
          > second clip is the better solution.
          >

          Only the second clip will save umlauts, so it is the only one that will
          work. The first one should have been much faster if umlauts weren't
          required. I think I could improve it significantly by tuning the
          bracketed searches (i.e. the 10- or 100-line jumps should actually be
          proportionate to the size of the file).

          Anyway, as umlauts are to be saved, I have modified version 2 and made a
          new version three with the following changes/enhancements:

          1. it also strips the additional special characters you highlighted
          2. it removes all single characters (A, Ä and ß, for example) when
          they stand alone on a line

          Let me know:

          ; by don at htmlfixit.com
          ; using a bunch of Hugo's ideas
          ; runs a text file and makes
          ; a list of all words that start
          ; with a capital letter
          ^!Menu Edit/Copy All
          ^!Toolbar Paste New
          ^!Replace "^P" >> " " ATIWS
          ^!Replace ")" >> " " ATIWS
          ^!Replace "(" >> " " ATIWS
          ^!Replace """ >> " " ATIWS
          ^!Replace "^T" >> " " ATIWS
          ^!Replace "," >> " " ATIWS
          ^!Replace "[" >> " " ATIWS
          ^!Replace "]" >> " " ATIWS
          ^!Replace "<" >> " " ATIWS
          ^!Replace ">" >> " " ATIWS
          ^!Replace "~" >> " " ATIWS
          ^!Replace "!" >> " " ATIWS
          ^!Replace "@" >> " " ATIWS
          ^!Replace "#" >> " " ATIWS
          ^!Replace "$" >> " " ATIWS
          ^!Replace "%" >> " " ATIWS
          ^!Replace "^" >> " " ATIWS
          ^!Replace "&" >> " " ATIWS
          ^!Replace "*" >> " " ATIWS
          ^!Replace "_" >> " " ATIWS
          ^!Replace "+" >> " " ATIWS
          ^!Replace "=" >> " " ATIWS
          ^!Replace "|" >> " " ATIWS
          ^!Replace "{" >> " " ATIWS
          ^!Replace "}" >> " " ATIWS
          ^!Replace "\" >> " " ATIWS
          ^!Replace "/" >> " " ATIWS
          ^!Replace "?" >> " " ATIWS
          ^!Replace "." >> " " ATIWS
          ^!Replace ";" >> " " ATIWS
          ^!Replace ":" >> " " ATIWS
          ^!Replace "" >> " " ATIWS
          ^!Replace "•" >> " " ATIWS
          ^!Replace "– " >> " " ATIWS
          ^!Replace "´" >> " " ATIWS
          ^!Replace "”" >> " " ATIWS
          ^!Replace "“" >> " " ATIWS
          ^!Replace "‘" >> " " ATIWS
          ^!Replace "`" >> " " ATIWS
          ^!Replace "¡" >> " " ATIWS
          ^!Replace "¢" >> " " ATIWS
          ^!Replace "£" >> " " ATIWS
          ^!Replace "¤" >> " " ATIWS
          ^!Replace "¥" >> " " ATIWS
          ^!Replace "§" >> " " ATIWS
          ^!Replace "©" >> " " ATIWS
          ^!Replace "«" >> " " ATIWS

          ^!Menu Modify/Spaces/Single Space
          ^!Replace " " >> "^P" ATIWS
          ^!Replace "^P’" >> "^P" ATIWS
          ^!Replace "^P-" >> "^P" ATIWS
          ^!Replace "^P " >> "^P" ATIWS
          ^!Menu Edit/Copy All
          ^!SetClipboard ^$StrSort("^$GetClipboard$";1;1;1)$
          ^!Select All
          ^!Toolbar Paste
          ^!Jump 1

          ; the following dumps every line whose first
          ; character is a number or a lowercase letter
          :DumpBad
          ^!If ^$GetRow$ = ^$GetLinecount$ Sort2
          ^!Select +1
          ^!IfTrue ^$IsEmpty("^$GetLine$")$ NEXT ELSE SKIP_2
          ^!Keyboard DELETE
          ^!GoTo DumpBad

          ^!If "^$IsNumber("^$GetSelection$")$" = "1" SKIP
          ^!If "^$IsUppercase("^$GetSelection$")$" = "1" SKIP_4
          ^!Select Eol
          ^!Keyboard DELETE
          ^!Keyboard DELETE
          ^!GoTo DumpBad

          :GoNext
          ^!Jump +1
          ^!GoTo DumpBad

          ; following is to eliminate single characters on one line
          :Sort2
          ^!Jump 1

          :Sort2a
          ^!Select Eol
          ^!IfError END
          ^!If ^$StrSize("^$GetSelection$")$ > 1 SKIP_2
          ^!Keyboard DELETE
          ^!Keyboard DELETE
          ^!Jump +1
          ^!GoTo Sort2a
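For readers who want to follow the clip's steps outside NoteTab, here is a rough Python sketch of the same pipeline: replace punctuation with spaces, put one word per line, strip a leading apostrophe or hyphen, sort, drop entries that do not start with a capital letter, and drop single characters. It is an approximation, not a translation. The name `capitalized_words` is hypothetical, the punctuation set only approximates the Replace lines, and duplicate removal is an assumption about what the sort step is meant to do.

```python
# Approximation of the characters the clip replaces with spaces.
PUNCTUATION = "()\"\t,[]<>~!@#$%^&*_+=|{}\\/?.;:•´”“‘`¡¢£¤¥§©«"


def capitalized_words(text: str) -> list[str]:
    # Step 1: turn line breaks and punctuation into spaces, then split on
    # whitespace (this also collapses runs of spaces).
    table = str.maketrans({ch: " " for ch in PUNCTUATION + "\r\n"})
    tokens = text.translate(table).split()

    # Step 2: strip a leading apostrophe or hyphen from each token.
    tokens = [token.lstrip("’-") for token in tokens]

    # Step 3: sort case-sensitively and remove duplicates (assumed behaviour).
    unique = sorted(set(tokens))

    # Steps 4 and 5: keep only tokens longer than one character whose first
    # character is an uppercase letter; digits and lowercase starts are dropped.
    return [token for token in unique if len(token) > 1 and token[0].isupper()]


if __name__ == "__main__":
    sample = "Der Wagen (rot) fährt 3 km.\n-Und A – Ärger, der Wagen!"
    print(capitalized_words(sample))
```

Because the filter uses a real uppercase-letter test, no separate number check is needed here, unlike in the clip.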
        • dpasseng
          Message 4 of 23 , Oct 17, 2008
            Updated link:

            http://htmlfixit.com/blog/index.php?cat=14

            Hugo, I hope you will still add things from time to time.

            This answers one of my own questions of today! Funny I forgot all
            about it.