Re: [LingPipe] token frequency in the document
- To get some background on how the LingPipe pieces fit
together, the best places to start are the word counting
tutorial and the sentence detection tutorial.
The easiest way to count tokens (or short sequences of tokens)
is to use a TokenizedLM. You can access its underlying counter
through callbacks, or query the counts for particular token
sequences. Use one TokenizedLM for the document-wide counts
and a separate one for each sentence.
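
As a rough illustration of the document-wide versus per-sentence
counting described above, here is a minimal sketch in plain Java.
It does not use LingPipe: the HashMap counters stand in for what a
TokenizedLM would hold, and the regex tokenizer and sentence
splitter are crude placeholders for a real TokenizerFactory and
sentence detector. The class name, the method names, and the
scoring rule (average document-wide count of a sentence's tokens)
are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of LingPipe.
final class SentenceScorer {

    // Placeholder tokenizer: lowercase, then split on runs of non-letter/digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Token -> count for one piece of text (the whole document or a single sentence).
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(text)) counts.merge(t, 1, Integer::sum);
        return counts;
    }

    // Placeholder sentence splitter: break after '.', '!' or '?' followed by whitespace.
    static List<String> splitSentences(String doc) {
        List<String> sentences = new ArrayList<>();
        for (String s : doc.split("(?<=[.!?])\\s+")) {
            if (!s.trim().isEmpty()) sentences.add(s.trim());
        }
        return sentences;
    }

    // Assumed scoring rule: average document-wide count over the sentence's tokens.
    static double score(String sentence, Map<String, Integer> docCounts) {
        Map<String, Integer> sentenceCounts = countTokens(sentence);
        int total = 0;
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : sentenceCounts.entrySet()) {
            total += e.getValue();
            sum += e.getValue() * docCounts.getOrDefault(e.getKey(), 0);
        }
        return total == 0 ? 0.0 : sum / total;
    }

    // Sentences ordered from highest to lowest score; ties keep document order.
    static List<String> rank(String doc) {
        Map<String, Integer> docCounts = countTokens(doc);
        List<String> sentences = splitSentences(doc);
        sentences.sort((a, b) -> Double.compare(score(b, docCounts), score(a, docCounts)));
        return sentences;
    }
}
```

Calling SentenceScorer.rank(text) returns the sentences of text
with those made of document-frequent tokens first. Averaging
rather than summing keeps long sentences from winning on length
alone; that is a design choice for the sketch, not a
recommendation from the tutorials.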
A TokenizedLM is configured with a TokenizerFactory, which breaks
the text into tokens. The IndoEuropeanTokenizerFactory is
a good default if the text is English or written like English.
You might want to apply a stop list (to remove common words and
punctuation) and to normalize case (so "The", "the", and "THE"
all count as the same word). The LingPipe book (free) has a long
tutorial on tokenizers and tokenizer factories.
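
To make the effect of a stop list and case normalization concrete,
here is a small sketch in plain Java. It is not LingPipe code; in
LingPipe the same filtering would be done by wrapping the base
TokenizerFactory in filtering factories, as covered in the book.
The class name, the tiny stop list, and the whitespace-based
tokenization are assumptions for the example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LingPipe.
final class NormalizingTokenizer {

    // Deliberately tiny stop list for illustration; a real one would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "to", "in", "is"));

    static List<String> tokenize(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            // Case normalization: "The", "the" and "THE" all become "the".
            String t = raw.toLowerCase(Locale.ROOT)
                    // Strip leading and trailing punctuation from the token.
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "");
            if (t.isEmpty()) continue;             // token was pure punctuation
            if (STOP_WORDS.contains(t)) continue;  // stop-listed word
            kept.add(t);
        }
        return kept;
    }
}
```

With this filter, the three spellings of "the" never reach the
counter at all, so the counts reflect content words only.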
You might also be able to use some of our string comparison tools,
depending on how you're going to compare sentences. There are
several string comparison tools in the com.aliasi.spell package.
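
One common way to compare two sentences is token overlap. The
sketch below computes Jaccard similarity (shared tokens divided
by all distinct tokens) in plain Java. It is an illustration of
the general idea only, written without LingPipe; the class and
method names are made up for the example, and the Javadoc for
com.aliasi.spell is the place to look for the actual classes.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LingPipe.
final class JaccardSimilarity {

    // Distinct lowercase tokens, split on runs of non-letter/digit characters.
    static Set<String> tokenSet(String text) {
        Set<String> tokens = new HashSet<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // |A intersect B| / |A union B|; two empty sentences are treated as identical.
    static double similarity(String a, String b) {
        Set<String> tokensA = tokenSet(a);
        Set<String> tokensB = tokenSet(b);
        if (tokensA.isEmpty() && tokensB.isEmpty()) return 1.0;
        Set<String> shared = new HashSet<>(tokensA);
        shared.retainAll(tokensB);
        int unionSize = tokensA.size() + tokensB.size() - shared.size();
        return (double) shared.size() / unionSize;
    }
}
```

The result ranges from 0.0 (no tokens in common) to 1.0 (same
token set), which makes it easy to threshold when looking for
near-duplicate sentences.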
- Bob Carpenter
On 11/17/11 11:40 AM, pl_rudy wrote:
> Hi, I'm new to LingPipe, so bear with me, and thanks for the help in advance.
> I'm trying to get word or token frequencies in a document so that I can
> score sentences. I'm trying to do three things: first, tokenize the
> document into sentences; then find the token frequencies in that
> particular document; and finally, based on the token frequencies, rank
> the sentences within that document. Any ideas?