
Re: [LingPipe] token frequency in the document

From: Bob Carpenter
Date: Nov 17, 2011 (message 1 of 2)
      To get some background on how the LingPipe bits fit
      together, the best place for you to start would be the
      word counting tutorial:

      http://alias-i.com/lingpipe/demos/tutorial/interestingPhrases/read-me.html

      and the sentence detection tutorial:

      http://alias-i.com/lingpipe/demos/tutorial/sentences/read-me.html
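      To give a feel for the sentence side, here's a minimal sketch
      (the class name SplitSentences and the sample text are mine;
      the rest is the com.aliasi.sentences and com.aliasi.tokenizer
      API):

      import com.aliasi.chunk.Chunk;
      import com.aliasi.chunk.Chunking;
      import com.aliasi.sentences.IndoEuropeanSentenceModel;
      import com.aliasi.sentences.SentenceChunker;
      import com.aliasi.sentences.SentenceModel;
      import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
      import com.aliasi.tokenizer.TokenizerFactory;

      public class SplitSentences {
          public static void main(String[] args) {
              // singleton factory in LingPipe 4.x; use the zero-arg
              // constructor in 3.x
              TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
              SentenceModel model = new IndoEuropeanSentenceModel();
              SentenceChunker chunker = new SentenceChunker(tf, model);

              String text = "This is one sentence. Here is another.";
              Chunking chunking = chunker.chunk(text);

              // each chunk spans one sentence in the input text
              for (Chunk sentence : chunking.chunkSet())
                  System.out.println(
                      text.substring(sentence.start(), sentence.end()));
          }
      }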

      The easiest way to count tokens (or short sequences
      of tokens) is to use a TokenizedLM. You can access its underlying
      counter through callbacks or query the counts for particular
      token sequences. Just use one tokenized LM for the document-wide
      counts and another one for each sentence.
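      In code that's roughly the following (a sketch, not tested; the
      class name DocCounts and the sample text are mine, and the
      method names handle(), symbolTable(), and sequenceCounter() are
      from the 4.x Javadoc as I remember it, so double-check them
      against your version):

      import com.aliasi.lm.TokenizedLM;
      import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
      import com.aliasi.tokenizer.TokenizerFactory;

      public class DocCounts {
          public static void main(String[] args) {
              TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;

              // order-2 LM collects unigram and bigram counts
              TokenizedLM docLm = new TokenizedLM(tf, 2);

              // train on the document text; handle(CharSequence) in
              // 4.x (train(CharSequence) in older releases)
              docLm.handle("The cat sat. The cat napped.");

              // query the count of a particular token sequence by
              // mapping its tokens to symbol IDs
              int[] ids = { docLm.symbolTable().symbolToID("cat") };
              int n = docLm.sequenceCounter().count(ids, 0, ids.length);
              System.out.println("count(cat) = " + n);  // 2 here
          }
      }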

      A TokenizedLM is configured with a TokenizerFactory that breaks
      the text into tokens. IndoEuropeanTokenizerFactory is
      a good default if the text is English or written like English.
      You might want to apply a stop list (removing common words and
      punctuation) and case normalize (so "The", "the", and "THE"
      all count as the same word). The LingPipe book (free) has a long
      tutorial on tokenizers and tokenizer factories:

      http://alias-i.com/lingpipe/web/book.html
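      Stop listing and case normalization are just factory wrappers
      around the base factory, so the pipeline composes. A small
      sketch (the class name NormalizedTokens and the sample text are
      mine); note that lower-casing has to come before the stop list
      so "The" is normalized before it's filtered:

      import com.aliasi.tokenizer.EnglishStopTokenizerFactory;
      import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
      import com.aliasi.tokenizer.LowerCaseTokenizerFactory;
      import com.aliasi.tokenizer.Tokenizer;
      import com.aliasi.tokenizer.TokenizerFactory;

      public class NormalizedTokens {
          public static void main(String[] args) {
              TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
              tf = new LowerCaseTokenizerFactory(tf);    // "The" -> "the"
              tf = new EnglishStopTokenizerFactory(tf);  // drops "the", ...

              char[] cs = "The cat sat on THE mat".toCharArray();
              Tokenizer tokenizer = tf.tokenizer(cs, 0, cs.length);
              String token;
              while ((token = tokenizer.nextToken()) != null)
                  // prints the non-stop tokens, e.g. cat, sat, mat
                  System.out.println(token);
          }
      }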

      You might also be able to use some of our string comparison
      tools, depending on how you're going to compare sentences;
      there are several in the com.aliasi.spell package.
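      For instance, plain edit distance over two sentences (a sketch;
      the class name SentenceCompare is mine, and TfIdfDistance in the
      same package may be a better fit if you want token-level,
      frequency-weighted comparison):

      import com.aliasi.spell.EditDistance;

      public class SentenceCompare {
          public static void main(String[] args) {
              // character-level edit distance, no transpositions;
              // implements com.aliasi.util.Distance<CharSequence>
              EditDistance dist = new EditDistance(false);
              double d = dist.distance("the cat sat", "the cat napped");
              System.out.println("distance = " + d);
          }
      }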

      - Bob Carpenter
      LingPipe


      On 11/17/11 11:40 AM, pl_rudy wrote:
      > Hi, I'm new to LingPipe, so bear with me, and thanks in advance
      > for the help. I'm trying to get word or token frequency in the
      > document; I need the counts so that I can score sentences. So
      > I'm trying to do three things: first tokenize the document into
      > sentences, then find the token frequency in that particular
      > document, and finally rank the sentences within the document by
      > token frequency. Any ideas?