
Re: [LingPipe] Text pre-processing in Language Modeling approach for Classification

  • Bob Carpenter
    Message 1 of 5 , May 31, 2007
      > I am starting to learn Language Modeling approach for Classification.
      >
      > I understand that one of the main advantages of LM is that it does not
      > require text pre-processing. In fact I tried LingPipe's classifier
      > for English text without any pre-processing and received comparably good
      > results.
      >
      > My question is if I should do any pre-processing before classifying text
      > in other languages? For example should I normalize text with umlauts?
      >
      > I would be very grateful for help.

      I'd say the main advantage of the character LM
      approach is that it can learn from parts of tokens
      as well as across tokens, which means it's far
      less sensitive to normalization effects.
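
To make the point concrete, here is a minimal sketch (not LingPipe code) of how character n-grams naturally span token boundaries, so the model picks up evidence from parts of tokens and from the transitions between them:

```java
import java.util.ArrayList;
import java.util.List;

public class CharNgrams {

    // Extract all character n-grams from a string, including
    // n-grams that cross whitespace and token boundaries.
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // "New Y", "ew Yo", "w Yor", " York" -- the last three
        // all span the boundary between the two tokens.
        System.out.println(ngrams("New York", 5));
    }
}
```

Because features like " York" survive changes in tokenization and most normalization decisions, the character LM degrades gracefully where a token-level model would see entirely new vocabulary.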

      Whether you can get any performance improvement
      by normalizing out diacritics, lowercasing everything,
      etc. is really an empirical question. The more
      training data you have and the longer the test
      instances, the less effect normalization will have.
      In many cases, normalization of case or diacritics
      will hurt performance.

      Tokenized models are much more sensitive to
      normalization than the character-based models, too.
      For tokenized models, it often makes sense to case
      normalize, stoplist, remove really infrequent terms,
      etc.
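
For tokenized models, that kind of pipeline might look like the following minimal sketch. The stoplist and the split-on-non-letters tokenization rule are illustrative assumptions, not LingPipe's own tokenizers:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Preprocess {

    // A tiny illustrative stoplist; a real one would be language-specific.
    static final Set<String> STOPLIST =
        new HashSet<>(Arrays.asList("the", "a", "an", "of", "and"));

    // Lowercase, split on runs of non-letter characters, drop stopwords.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String tok : text.toLowerCase().split("[^\\p{L}]+")) {
            if (!tok.isEmpty() && !STOPLIST.contains(tok)) {
                tokens.add(tok);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: [cat, hat]
        System.out.println(preprocess("The Cat and the Hat"));
    }
}
```

Pruning really infrequent terms would be a second pass over the corpus counts; it is omitted here to keep the sketch short.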

      To truly normalize a Unicode sequence, you might
      want to look at IBM's ICU package:

      http://www.icu-project.org/index.html

      It's great stuff, written by the original
      designer of Unicode.
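
As a lightweight illustration of the idea (using the JDK's `java.text.Normalizer` rather than ICU itself), diacritics can be stripped by decomposing to NFD and deleting the combining marks:

```java
import java.text.Normalizer;

public class StripDiacritics {

    // Decompose accented characters into base character plus
    // combining marks (NFD), then remove the marks (\p{M}).
    static String stripDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // prints: Uber schone Madchen
        System.out.println(stripDiacritics("Über schöne Mädchen"));
    }
}
```

Whether this helps or hurts classification accuracy is, as noted above, an empirical question; ICU additionally handles transliteration and locale-sensitive case folding that the JDK does not.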

      - Bob
    • julien_nioche
      Message 2 of 5 , Jun 1, 2007
        Hi Lena,

        A solution could be to embed your LingPipe learner / classifier inside
        a framework like UIMA or GATE. That would allow you to run any number of
        processors before classification. I think someone has mentioned an
        integration with UIMA. Doing the same with GATE would be very easy and
        would give you additional features: visualisation of documents,
        manual annotation, evaluation of performance, etc.

        There are of course other ways to do that, but it could be a good idea
        to leverage existing frameworks and at the same time benefit from
        the LingPipe resources.

        HTH

        Julien


        --- In LingPipe@yahoogroups.com, "lenat79" <lenat79@...> wrote:
        >
        > Hi All!
        >
        > I am starting to learn the Language Modeling approach for Classification.
        >
        > I understand that one of the main advantages of LM is that it does not
        > require text pre-processing. In fact I tried LingPipe's classifier
        > for English text without any pre-processing and received comparably good
        > results.
        >
        > My question is if I should do any pre-processing before classifying text
        > in other languages? For example should I normalize text with umlauts?
        >
        > I would be very grateful for help.
        >
        > Thanks,
        >
        > Lena
      • Bob Carpenter
        Message 3 of 5 , Jun 1, 2007
          > A solution could be to embed your Lingpipe learner / classifier inside
          > a framework like UIMA or GATE.

          LingPipe was designed so that it'd be
          easy to integrate into larger integration
          frameworks like UIMA or GATE. LingPipe's models
          are all serializable and thread safe, and each
          type of annotation hews to a standard API
          (language modeling, tagging, chunking, classification,
          clustering, spell checking, string distance, etc.).

          If folks have integrated LingPipe into
          larger integration frameworks like UIMA or GATE,
          we'd be happy to host them under whatever license
          you'd like in our sandbox so that they could
          easily be shared with others.

          If we actually get enough requests to do
          a UIMA or GATE embedding, we'd take it on
          ourselves.

          Also, if people find things that are useful
          in UIMA or GATE, we'd be happy to take them
          on as feature requests. Especially if you
          have evaluations of chunkers, taggers, classifiers
          or clusterers that we don't currently perform.

          - Bob Carpenter
          Alias-i
        • Florian Laws
          Message 4 of 5 , Jun 1, 2007
            On Fri, Jun 01, 2007 at 12:43:42PM -0400, Bob Carpenter wrote:
            >
            > > A solution could be to embed your Lingpipe learner / classifier inside
            > > a framework like UIMA or GATE.
            >
            > LingPipe was designed so that it'd be
            > easy to integrate into larger integration
            > frameworks like UIMA or GATE. LingPipe's models
            > are all serializable and thread safe, and each
            > type of annotation hews to a standard API
            > (language modeling, tagging, chunking, classification,
            > clustering, spell checking, string distance, etc.)
            >
            > If folks have integrated LingPipe into
            > larger integration frameworks like UIMA or GATE,
            > we'd be happy to host them under whatever license
            > you'd like in our sandbox so that they could
            > easily be shared with others.
            >
            > If we actually get enough requests to do
            > a UIMA or GATE embedding, we'd take it on
            > ourselves.

            Didn't you already publish a UIMA wrapper for at least
            the chunkers? I have successfully used the LingPipe NER
            components with UIMA 1.3.

            Regards,

            Florian