15242 [Clip] Re: Removing stopwords from word list
Jul 16, 2006
Bob,
> If you modify your test files by adding a few words to your
> ntf-stopwords.txt file that are NOT contained in ntf-wordlist.txt,
> then you will catch my error...
So I did, and I repeated the test with a stop word list that is
bigger than the file to be cleaned (using ntf-wordlist.txt as stop
words and ntf-stopwords.txt as the file to be cleaned). Now
newlist.txt contains those three additional words, which is correct.
But again, it also contains 248 of the 250 stop words (the Z-words in
ntf-stopwords.txt), which is not correct, since this run
should output those three additional words only.
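To cross-check what newlist.txt should contain, the expected result can be computed outside NoteTab. The following is only a minimal Python sketch, not Clip code, and it assumes one word per line. The function name `remove_stopwords` and the sample words "Abend", "extra-one", "extra-two" and "extra-three" are made up for illustration; they are not taken from the real test files.

```python
def remove_stopwords(words, stopwords):
    """Return the words that do not appear in stopwords, keeping input order."""
    # Build a set for fast lookup; empty lines in the stop list are ignored.
    stop = {w.strip() for w in stopwords if w.strip()}
    return [w.strip() for w in words if w.strip() and w.strip() not in stop]


# Hypothetical sample data standing in for the two test files.
to_clean = [
    "Zucker-Marktordnung",
    "Zwei-Tank-Systeme",
    "Zweifacher",
    "extra-one",
    "extra-two",
    "extra-three",
]
stop_list = ["Abend", "Zucker-Marktordnung", "Zwei-Tank-Systeme", "Zweifacher"]

result = remove_stopwords(to_clean, stop_list)
print(result)
```

With a stop list that covers every Z-word, only the three additional words should remain, regardless of which of the two files is the longer one.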
Next, I removed the three additional words from the stop words list
again and added that "^!IfError ^!Jump 1" line. Again, the result of
the same test isn't completely correct. When finished, the message reads:
Original 250 lines in [path]\ntf-stopwords.txt
were reduced to 0 in [path]\newlist.txt
The newlist.txt, however, contains the following line
This appears as a concatenation and shortening of two compounds in
the stop words list: probably "Zucker-Aktiengesellschaft" or "Zucker-
Marktordnung", and "Zwei-Tank-Systeme".
Furthermore, it seems to me that you didn't deal with the "empty
line problem" in the stop word list I mentioned before. Maybe we
have to add something like...
^!IfFalse ^$IsEmpty(^$GetLine$)$ Next Else Skip
to be executed in the opened stop word list to make sure
that it ends with an empty line. Without that empty line,
newlist.txt now contains...
As mentioned before, we now also get the last stop word "Zweifacher".
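The "empty line problem" comes down to whether the last line of the stop word list is still read when no line break follows it. As a point of comparison only, here is a minimal Python sketch (again not Clip code) of a reader that treats both cases the same. The function name `read_words` and the two sample strings are assumptions for illustration.

```python
def read_words(text):
    """Split file content into non-empty words, one per line."""
    # splitlines() returns the final line whether or not a newline follows it,
    # and the filter drops empty lines anywhere in the list.
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical stop list content, once with and once without a final newline.
with_newline = "Zwei-Tank-Systeme\nZweifacher\n"
without_newline = "Zwei-Tank-Systeme\nZweifacher"

print(read_words(with_newline) == read_words(without_newline))
```

If the clip handled the end of the list this way, "Zweifacher" would be removed whether or not the stop word list ends with an empty line.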