15242 [Clip] Re: Removing stopwords from word list

  • jonas_ramus
    Jul 16, 2006

      You wrote...

      > If you modify your test files by adding a few words to your
      > ntf-stopwords.txt file that are NOT contained in ntf-wordlist.txt,
      > then you will catch my error...

      So I did, and I repeated the test with a stop word list that is
      bigger than the file to be cleaned (using ntf-wordlist.txt as the
      stop word list and ntf-stopwords.txt as the file to be cleaned).
      Now newlist.txt contains those three additional words, which is
      correct. But it again also contains 248 of the 250 stop words
      (the Z-words in ntf-stopwords.txt), which is not correct, since
      this run should output only the three additional words.
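
The expected result of such a run can be stated as a plain set difference: every line of the file to be cleaned that also occurs in the stop word list is dropped, and everything else is kept in its original order. A minimal sketch in Python (not NoteTab clip code; the function name and the sample words other than "Zweifacher" are illustrative assumptions, not taken from the test files):

```python
def remove_stopwords(words, stopwords):
    """Return the entries of `words` that are not stop words, in input order.

    Both arguments are iterables of lines. Surrounding whitespace is
    stripped and empty lines are ignored on both sides.
    """
    stop = {s.strip() for s in stopwords if s.strip()}
    cleaned = (w.strip() for w in words)
    return [w for w in cleaned if w and w not in stop]


if __name__ == "__main__":
    # Hypothetical sample data: two stop words plus one extra word.
    to_clean = ["Zucker", "Zweifacher", "extra-word"]
    stop_list = ["Zucker", "Zweifacher", ""]
    print(remove_stopwords(to_clean, stop_list))
```

With a stop word list that covers every word except the additional ones, only the additional words should remain.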

      Next, I removed the three additional words from the stop words list
      again and added that "^!IfError ^!Jump 1" line. Again, the result of
      the same test isn't completely correct. When finished, the clip
      displays this message:

      Task complete.
      Original 250 lines in [path]\ntf-stopwords.txt
      were reduced to 0 in [path]\newlist.txt

      The newlist.txt, however, contains the following line


      This looks like a concatenation of shortened forms of two compounds
      in the stop words list: probably "Zucker-Aktiengesellschaft" or
      "Zucker-Marktordnung", and "Zwei-Tank-Systeme".

      Furthermore, it seems to me that you didn't deal with the "empty
      line problem" in the stop word list I mentioned before. Maybe we
      have to add something like...

      ^!Jump Doc_End
      ^!IfFalse ^$IsEmpty(^$GetLine$)$ Next Else Skip
      ^!Keyboard Enter

      to be executed in the opened stop word list to make sure that it
      ends with an empty line. Without that empty line, the
      newlist.txt now contains...


      As mentioned before, we now also get the last stop word "Zweifacher".
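
The "empty line problem" is the usual trailing-newline pitfall: if the last entry of a list has no line break after it, a line-by-line routine can miss it. A minimal sketch in Python of two ways around it (again not NoteTab clip code; both function names are hypothetical): either append the missing line break before processing, or read the entries in a way that does not depend on it.

```python
def ensure_trailing_newline(text):
    """Append a newline if non-empty text does not already end with one."""
    if text and not text.endswith("\n"):
        return text + "\n"
    return text


def read_entries(text):
    """Split text into non-empty lines, with or without a final newline."""
    return [line for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # The last stop word has no line break after it.
    raw = "Zucker\nZweifacher"
    print(repr(ensure_trailing_newline(raw)))
    print(read_entries(raw))
```

Either way, the last stop word is treated like every other entry.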
